* [PATCH-for-9.1 v2 0/3] rdma: Remove RDMA subsystem and pvrdma device
@ 2024-03-28 13:02 Philippe Mathieu-Daudé
From: Philippe Mathieu-Daudé @ 2024-03-28 13:02 UTC
  To: qemu-devel
  Cc: Yuval Shaia, Kevin Wolf, Prasanna Kumar Kalever, Fabiano Rosas,
	Cornelia Huck, Michael Roth, Li Zhijian, Prasanna Kumar Kalever,
	Peter Xu, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Philippe Mathieu-Daudé

Since v1:
- split in 3 (Thomas)
- justify gluster removal

Philippe Mathieu-Daudé (3):
  hw/rdma: Remove pvrdma device and rdmacm-mux helper
  migration: Remove RDMA protocol handling
  block/gluster: Remove RDMA protocol handling

 MAINTAINERS                                   |   17 -
 docs/about/deprecated.rst                     |    9 -
 docs/about/removed-features.rst               |    4 +
 docs/devel/migration/main.rst                 |    6 -
 docs/pvrdma.txt                               |  345 --
 docs/rdma.txt                                 |  420 --
 docs/system/device-url-syntax.rst.inc         |    4 +-
 docs/system/loongarch/virt.rst                |    2 +-
 docs/system/qemu-block-drivers.rst.inc        |    1 -
 meson.build                                   |   59 -
 qapi/machine.json                             |   17 -
 qapi/migration.json                           |   31 +-
 qapi/qapi-schema.json                         |    1 -
 qapi/rdma.json                                |   38 -
 contrib/rdmacm-mux/rdmacm-mux.h               |   61 -
 hw/rdma/rdma_backend.h                        |  129 -
 hw/rdma/rdma_backend_defs.h                   |   76 -
 hw/rdma/rdma_rm.h                             |   97 -
 hw/rdma/rdma_rm_defs.h                        |  146 -
 hw/rdma/rdma_utils.h                          |   63 -
 hw/rdma/trace.h                               |    1 -
 hw/rdma/vmw/pvrdma.h                          |  144 -
 hw/rdma/vmw/pvrdma_dev_ring.h                 |   46 -
 hw/rdma/vmw/pvrdma_qp_ops.h                   |   28 -
 hw/rdma/vmw/trace.h                           |    1 -
 include/hw/rdma/rdma.h                        |   37 -
 include/monitor/hmp.h                         |    1 -
 .../infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h |  685 ---
 .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.h   |  348 --
 .../standard-headers/rdma/vmw_pvrdma-abi.h    |  310 --
 migration/migration-stats.h                   |    6 +-
 migration/migration.h                         |    9 -
 migration/options.h                           |    2 -
 migration/rdma.h                              |   69 -
 block/gluster.c                               |   39 -
 contrib/rdmacm-mux/main.c                     |  831 ----
 hw/core/machine-qmp-cmds.c                    |   32 -
 hw/rdma/rdma.c                                |   30 -
 hw/rdma/rdma_backend.c                        | 1401 ------
 hw/rdma/rdma_rm.c                             |  812 ----
 hw/rdma/rdma_utils.c                          |  126 -
 hw/rdma/vmw/pvrdma_cmd.c                      |  815 ----
 hw/rdma/vmw/pvrdma_dev_ring.c                 |  141 -
 hw/rdma/vmw/pvrdma_main.c                     |  735 ---
 hw/rdma/vmw/pvrdma_qp_ops.c                   |  298 --
 migration/migration-stats.c                   |    5 +-
 migration/migration.c                         |   31 -
 migration/options.c                           |   16 -
 migration/qemu-file.c                         |    1 -
 migration/ram.c                               |   86 +-
 migration/rdma.c                              | 4184 -----------------
 migration/savevm.c                            |    2 +-
 monitor/qmp-cmds.c                            |    1 -
 Kconfig.host                                  |    3 -
 contrib/rdmacm-mux/meson.build                |    7 -
 hmp-commands-info.hx                          |   13 -
 hw/Kconfig                                    |    1 -
 hw/meson.build                                |    1 -
 hw/rdma/Kconfig                               |    3 -
 hw/rdma/meson.build                           |   12 -
 hw/rdma/trace-events                          |   31 -
 hw/rdma/vmw/trace-events                      |   17 -
 meson_options.txt                             |    4 -
 migration/meson.build                         |    1 -
 migration/trace-events                        |   68 +-
 qapi/meson.build                              |    1 -
 qemu-options.hx                               |    6 -
 .../org.centos/stream/8/build-environment.yml |    1 -
 .../ci/org.centos/stream/8/x86_64/configure   |    3 -
 scripts/ci/setup/build-environment.yml        |    4 -
 scripts/coverity-scan/run-coverity-scan       |    2 +-
 scripts/meson-buildoptions.sh                 |    6 -
 scripts/update-linux-headers.sh               |   27 -
 tests/lcitool/projects/qemu.yml               |    3 -
 tests/migration/guestperf/engine.py           |    4 +-
 75 files changed, 20 insertions(+), 12997 deletions(-)
 delete mode 100644 docs/pvrdma.txt
 delete mode 100644 docs/rdma.txt
 delete mode 100644 qapi/rdma.json
 delete mode 100644 contrib/rdmacm-mux/rdmacm-mux.h
 delete mode 100644 hw/rdma/rdma_backend.h
 delete mode 100644 hw/rdma/rdma_backend_defs.h
 delete mode 100644 hw/rdma/rdma_rm.h
 delete mode 100644 hw/rdma/rdma_rm_defs.h
 delete mode 100644 hw/rdma/rdma_utils.h
 delete mode 100644 hw/rdma/trace.h
 delete mode 100644 hw/rdma/vmw/pvrdma.h
 delete mode 100644 hw/rdma/vmw/pvrdma_dev_ring.h
 delete mode 100644 hw/rdma/vmw/pvrdma_qp_ops.h
 delete mode 100644 hw/rdma/vmw/trace.h
 delete mode 100644 include/hw/rdma/rdma.h
 delete mode 100644 include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h
 delete mode 100644 include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h
 delete mode 100644 include/standard-headers/rdma/vmw_pvrdma-abi.h
 delete mode 100644 migration/rdma.h
 delete mode 100644 contrib/rdmacm-mux/main.c
 delete mode 100644 hw/rdma/rdma.c
 delete mode 100644 hw/rdma/rdma_backend.c
 delete mode 100644 hw/rdma/rdma_rm.c
 delete mode 100644 hw/rdma/rdma_utils.c
 delete mode 100644 hw/rdma/vmw/pvrdma_cmd.c
 delete mode 100644 hw/rdma/vmw/pvrdma_dev_ring.c
 delete mode 100644 hw/rdma/vmw/pvrdma_main.c
 delete mode 100644 hw/rdma/vmw/pvrdma_qp_ops.c
 delete mode 100644 migration/rdma.c
 delete mode 100644 contrib/rdmacm-mux/meson.build
 delete mode 100644 hw/rdma/Kconfig
 delete mode 100644 hw/rdma/meson.build
 delete mode 100644 hw/rdma/trace-events
 delete mode 100644 hw/rdma/vmw/trace-events

-- 
2.41.0




* [PATCH-for-9.1 v2 1/3] hw/rdma: Remove pvrdma device and rdmacm-mux helper
  2024-03-28 13:02 [PATCH-for-9.1 v2 0/3] rdma: Remove RDMA subsystem and pvrdma device Philippe Mathieu-Daudé
@ 2024-03-28 13:02 ` Philippe Mathieu-Daudé
From: Philippe Mathieu-Daudé @ 2024-03-28 13:02 UTC
  To: qemu-devel
  Cc: Yuval Shaia, Kevin Wolf, Prasanna Kumar Kalever, Fabiano Rosas,
	Cornelia Huck, Michael Roth, Li Zhijian, Prasanna Kumar Kalever,
	Peter Xu, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Philippe Mathieu-Daudé,
	Marcel Apfelbaum, Song Gao, Dr. David Alan Gilbert,
	Eduardo Habkost, Yanan Wang, Marc-André Lureau,
	Markus Armbruster, Alex Bennée, Wainer dos Santos Moschetta,
	Beraldo Leal

The whole RDMA subsystem was deprecated in commit e9a54265f5
("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
released in v8.2.

Remove:
 - PVRDMA device
 - generated vmw_pvrdma/ directory from linux-headers
 - rdmacm-mux tool from contrib/

Cc: Yuval Shaia <yuval.shaia.ml@gmail.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
---
 MAINTAINERS                                   |   10 -
 docs/about/deprecated.rst                     |    9 -
 docs/about/removed-features.rst               |    4 +
 docs/pvrdma.txt                               |  345 ----
 docs/system/loongarch/virt.rst                |    2 +-
 meson.build                                   |   36 -
 qapi/machine.json                             |   17 -
 qapi/qapi-schema.json                         |    1 -
 qapi/rdma.json                                |   38 -
 contrib/rdmacm-mux/rdmacm-mux.h               |   61 -
 hw/rdma/rdma_backend.h                        |  129 --
 hw/rdma/rdma_backend_defs.h                   |   76 -
 hw/rdma/rdma_rm.h                             |   97 --
 hw/rdma/rdma_rm_defs.h                        |  146 --
 hw/rdma/rdma_utils.h                          |   63 -
 hw/rdma/trace.h                               |    1 -
 hw/rdma/vmw/pvrdma.h                          |  144 --
 hw/rdma/vmw/pvrdma_dev_ring.h                 |   46 -
 hw/rdma/vmw/pvrdma_qp_ops.h                   |   28 -
 hw/rdma/vmw/trace.h                           |    1 -
 include/hw/rdma/rdma.h                        |   37 -
 include/monitor/hmp.h                         |    1 -
 .../infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h |  685 --------
 .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.h   |  348 ----
 .../standard-headers/rdma/vmw_pvrdma-abi.h    |  310 ----
 contrib/rdmacm-mux/main.c                     |  831 ----------
 hw/core/machine-qmp-cmds.c                    |   32 -
 hw/rdma/rdma.c                                |   30 -
 hw/rdma/rdma_backend.c                        | 1401 -----------------
 hw/rdma/rdma_rm.c                             |  812 ----------
 hw/rdma/rdma_utils.c                          |  126 --
 hw/rdma/vmw/pvrdma_cmd.c                      |  815 ----------
 hw/rdma/vmw/pvrdma_dev_ring.c                 |  141 --
 hw/rdma/vmw/pvrdma_main.c                     |  735 ---------
 hw/rdma/vmw/pvrdma_qp_ops.c                   |  298 ----
 monitor/qmp-cmds.c                            |    1 -
 Kconfig.host                                  |    3 -
 contrib/rdmacm-mux/meson.build                |    7 -
 hmp-commands-info.hx                          |   13 -
 hw/Kconfig                                    |    1 -
 hw/meson.build                                |    1 -
 hw/rdma/Kconfig                               |    3 -
 hw/rdma/meson.build                           |   12 -
 hw/rdma/trace-events                          |   31 -
 hw/rdma/vmw/trace-events                      |   17 -
 meson_options.txt                             |    2 -
 qapi/meson.build                              |    1 -
 qemu-options.hx                               |    3 -
 .../ci/org.centos/stream/8/x86_64/configure   |    1 -
 scripts/meson-buildoptions.sh                 |    3 -
 scripts/update-linux-headers.sh               |   27 -
 51 files changed, 5 insertions(+), 7977 deletions(-)
 delete mode 100644 docs/pvrdma.txt
 delete mode 100644 qapi/rdma.json
 delete mode 100644 contrib/rdmacm-mux/rdmacm-mux.h
 delete mode 100644 hw/rdma/rdma_backend.h
 delete mode 100644 hw/rdma/rdma_backend_defs.h
 delete mode 100644 hw/rdma/rdma_rm.h
 delete mode 100644 hw/rdma/rdma_rm_defs.h
 delete mode 100644 hw/rdma/rdma_utils.h
 delete mode 100644 hw/rdma/trace.h
 delete mode 100644 hw/rdma/vmw/pvrdma.h
 delete mode 100644 hw/rdma/vmw/pvrdma_dev_ring.h
 delete mode 100644 hw/rdma/vmw/pvrdma_qp_ops.h
 delete mode 100644 hw/rdma/vmw/trace.h
 delete mode 100644 include/hw/rdma/rdma.h
 delete mode 100644 include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h
 delete mode 100644 include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h
 delete mode 100644 include/standard-headers/rdma/vmw_pvrdma-abi.h
 delete mode 100644 contrib/rdmacm-mux/main.c
 delete mode 100644 hw/rdma/rdma.c
 delete mode 100644 hw/rdma/rdma_backend.c
 delete mode 100644 hw/rdma/rdma_rm.c
 delete mode 100644 hw/rdma/rdma_utils.c
 delete mode 100644 hw/rdma/vmw/pvrdma_cmd.c
 delete mode 100644 hw/rdma/vmw/pvrdma_dev_ring.c
 delete mode 100644 hw/rdma/vmw/pvrdma_main.c
 delete mode 100644 hw/rdma/vmw/pvrdma_qp_ops.c
 delete mode 100644 contrib/rdmacm-mux/meson.build
 delete mode 100644 hw/rdma/Kconfig
 delete mode 100644 hw/rdma/meson.build
 delete mode 100644 hw/rdma/trace-events
 delete mode 100644 hw/rdma/vmw/trace-events

diff --git a/MAINTAINERS b/MAINTAINERS
index a07af6b9d4..91ab5235b8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4060,16 +4060,6 @@ F: block/replication.c
 F: tests/unit/test-replication.c
 F: docs/block-replication.txt
 
-PVRDMA
-M: Yuval Shaia <yuval.shaia.ml@gmail.com>
-M: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
-S: Odd Fixes
-F: hw/rdma/*
-F: hw/rdma/vmw/*
-F: docs/pvrdma.txt
-F: contrib/rdmacm-mux/*
-F: qapi/rdma.json
-
 Semihosting
 M: Alex Bennée <alex.bennee@linaro.org>
 S: Maintained
diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
index 7b548519b5..29eae69e50 100644
--- a/docs/about/deprecated.rst
+++ b/docs/about/deprecated.rst
@@ -376,15 +376,6 @@ recommending to switch to their stable counterparts:
 - "Zve64f" should be replaced with "zve64f"
 - "Zve64d" should be replaced with "zve64d"
 
-``-device pvrdma`` and the rdma subsystem (since 8.2)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The pvrdma device and the whole rdma subsystem are in a bad shape and
-without active maintenance. The QEMU project intends to remove this
-device and subsystem from the code base in a future release without
-replacement unless somebody steps up and improves the situation.
-
-
 Block device options
 ''''''''''''''''''''
 
diff --git a/docs/about/removed-features.rst b/docs/about/removed-features.rst
index f9cf874f7b..4d5bdc43b4 100644
--- a/docs/about/removed-features.rst
+++ b/docs/about/removed-features.rst
@@ -909,6 +909,10 @@ contains native support for this feature and thus use of the option
 ROM approach was obsolete. The native SeaBIOS support can be activated
 by using ``-machine graphics=off``.
 
+``pvrdma`` and the RDMA subsystem (removed in 9.1)
+''''''''''''''''''''''''''''''''''''''''''''''''''
+
+The 'pvrdma' device and the whole RDMA subsystem have been removed.
 
 Related binaries
 ----------------
diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
deleted file mode 100644
index 5c122fe818..0000000000
--- a/docs/pvrdma.txt
+++ /dev/null
@@ -1,345 +0,0 @@
-Paravirtualized RDMA Device (PVRDMA)
-====================================
-
-
-1. Description
-===============
-PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
-It works with its Linux Kernel driver AS IS, no need for any special guest
-modifications.
-
-While it complies with the VMware device, it can also communicate with bare
-metal RDMA-enabled machines as peers.
-
-It does not require an RDMA HCA in the host, it can work with Soft-RoCE (rxe).
-
-It does not require the whole guest RAM to be pinned allowing memory
-over-commit and, even if not implemented yet, migration support will be
-possible with some HW assistance.
-
-A project presentation accompany this document:
-- https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4730/original/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf
-
-
-
-2. Setup
-========
-
-
-2.1 Guest setup
-===============
-Fedora 27+ kernels work out of the box, older distributions
-require updating the kernel to 4.14 to include the pvrdma driver.
-
-However the libpvrdma library needed by User Level Software is still
-not available as part of the distributions, so the rdma-core library
-needs to be compiled and optionally installed.
-
-Please follow the instructions at:
-  https://github.com/linux-rdma/rdma-core.git
-
-
-2.2 Host Setup
-==============
-The pvrdma backend is an ibdevice interface that can be exposed
-either by a Soft-RoCE(rxe) device on machines with no RDMA device,
-or an HCA SRIOV function(VF/PF).
-Note that ibdevice interfaces can't be shared between pvrdma devices,
-each one requiring a separate instance (rxe or SRIOV VF).
-
-
-2.2.1 Soft-RoCE backend(rxe)
-===========================
-A stable version of rxe is required, Fedora 27+ or a Linux
-Kernel 4.14+ is preferred.
-
-The rdma_rxe module is part of the Linux Kernel but not loaded by default.
-Install the User Level library (librxe) following the instructions from:
-https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home
-
-Associate an ETH interface with rxe by running:
-   rxe_cfg add eth0
-An rxe0 ibdevice interface will be created and can be used as pvrdma backend.
-
-
-2.2.2 RDMA device Virtual Function backend
-==========================================
-Nothing special is required, the pvrdma device can work not only with
-Ethernet Links, but also Infinibands Links.
-All is needed is an ibdevice with an active port, for Mellanox cards
-will be something like mlx5_6 which can be the backend.
-
-
-2.2.3 QEMU setup
-================
-Configure QEMU with --enable-rdma flag, installing
-the required RDMA libraries.
-
-
-
-3. Usage
-========
-
-
-3.1 VM Memory settings
-======================
-Currently the device is working only with memory backed RAM
-and it must be mark as "shared":
-   -m 1G \
-   -object memory-backend-ram,id=mb1,size=1G,share \
-   -numa node,memdev=mb1 \
-
-
-3.2 MAD Multiplexer
-===================
-MAD Multiplexer is a service that exposes MAD-like interface for VMs in
-order to overcome the limitation where only single entity can register with
-MAD layer to send and receive RDMA-CM MAD packets.
-
-To build rdmacm-mux run
-# make rdmacm-mux
-
-Before running the rdmacm-mux make sure that both ib_cm and rdma_cm kernel
-modules aren't loaded, otherwise the rdmacm-mux service will fail to start.
-
-The application accepts 3 command line arguments and exposes a UNIX socket
-to pass control and data to it.
--d rdma-device-name  Name of RDMA device to register with
--s unix-socket-path  Path to unix socket to listen (default /var/run/rdmacm-mux)
--p rdma-device-port  Port number of RDMA device to register with (default 1)
-The final UNIX socket file name is a concatenation of the 3 arguments so
-for example for device mlx5_0 on port 2 this /var/run/rdmacm-mux-mlx5_0-2
-will be created.
-
-pvrdma requires this service.
-
-Please refer to contrib/rdmacm-mux for more details.
-
-
-3.3 Service exposed by libvirt daemon
-=====================================
-The control over the RDMA device's GID table is done by updating the
-device's Ethernet function addresses.
-Usually the first GID entry is determined by the MAC address, the second by
-the first IPv6 address and the third by the IPv4 address. Other entries can
-be added by adding more IP addresses. The opposite is the same, i.e.
-whenever an address is removed, the corresponding GID entry is removed.
-The process is done by the network and RDMA stacks. Whenever an address is
-added the ib_core driver is notified and calls the device driver add_gid
-function which in turn update the device.
-To support this in pvrdma device the device hooks into the create_bind and
-destroy_bind HW commands triggered by pvrdma driver in guest.
-
-Whenever changed is made to the pvrdma port's GID table a special QMP
-messages is sent to be processed by libvirt to update the address of the
-backend Ethernet device.
-
-pvrdma requires that libvirt service will be up.
-
-
-3.4 PCI devices settings
-========================
-RoCE device exposes two functions - an Ethernet and RDMA.
-To support it, pvrdma device is composed of two PCI functions, an Ethernet
-device of type vmxnet3 on PCI slot 0 and a PVRDMA device on PCI slot 1. The
-Ethernet function can be used for other Ethernet purposes such as IP.
-
-
-3.5 Device parameters
-=====================
-- netdev: Specifies the Ethernet device function name on the host for
-  example enp175s0f0. For Soft-RoCE device (rxe) this would be the Ethernet
-  device used to create it.
-- ibdev: The IB device name on host for example rxe0, mlx5_0 etc.
-- mad-chardev: The name of the MAD multiplexer char device.
-- ibport: In case of multi-port device (such as Mellanox's HCA) this
-  specify the port to use. If not set 1 will be used.
-- dev-caps-max-mr-size: The maximum size of MR.
-- dev-caps-max-qp:      Maximum number of QPs.
-- dev-caps-max-cq:      Maximum number of CQs.
-- dev-caps-max-mr:      Maximum number of MRs.
-- dev-caps-max-pd:      Maximum number of PDs.
-- dev-caps-max-ah:      Maximum number of AHs.
-
-Notes:
-- The first 3 parameters are mandatory settings, the rest have their
-  defaults.
-- The last 8 parameters (the ones that prefixed by dev-caps) defines the top
-  limits but the final values is adjusted by the backend device limitations.
-- netdev can be extracted from ibdev's sysfs
-  (/sys/class/infiniband/<ibdev>/device/net/)
-
-
-3.6 Example
-===========
-Define bridge device with vmxnet3 network backend:
-<interface type='bridge'>
-  <mac address='56:b4:44:e9:62:dc'/>
-  <source bridge='bridge1'/>
-  <model type='vmxnet3'/>
-  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
-</interface>
-
-Define pvrdma device:
-<qemu:commandline>
-  <qemu:arg value='-object'/>
-  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
-  <qemu:arg value='-numa'/>
-  <qemu:arg value='node,memdev=mb1'/>
-  <qemu:arg value='-chardev'/>
-  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
-  <qemu:arg value='-device'/>
-  <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
-</qemu:commandline>
-
-
-
-4. Implementation details
-=========================
-
-
-4.1 Overview
-============
-The device acts like a proxy between the Guest Driver and the host
-ibdevice interface.
-On configuration path:
- - For every hardware resource request (PD/QP/CQ/...) the pvrdma will request
-   a resource from the backend interface, maintaining a 1-1 mapping
-   between the guest and host.
-On data path:
- - Every post_send/receive received from the guest will be converted into
-   a post_send/receive for the backend. The buffers data will not be touched
-   or copied resulting in near bare-metal performance for large enough buffers.
- - Completions from the backend interface will result in completions for
-   the pvrdma device.
-
-
-4.2 PCI BARs
-============
-PCI Bars:
-	BAR 0 - MSI-X
-        MSI-X vectors:
-		(0) Command - used when execution of a command is completed.
-		(1) Async - not in use.
-		(2) Completion - used when a completion event is placed in
-		  device's CQ ring.
-	BAR 1 - Registers
-        --------------------------------------------------------
-        | VERSION |  DSR | CTL | REQ | ERR |  ICR | IMR  | MAC |
-        --------------------------------------------------------
-		DSR - Address of driver/device shared memory used
-              for the command channel, used for passing:
-			    - General info such as driver version
-			    - Address of 'command' and 'response'
-			    - Address of async ring
-			    - Address of device's CQ ring
-			    - Device capabilities
-		CTL - Device control operations (activate, reset etc)
-		IMG - Set interrupt mask
-		REQ - Command execution register
-		ERR - Operation status
-
-	BAR 2 - UAR
-        ---------------------------------------------------------
-        | QP_NUM  | SEND/RECV Flag ||  CQ_NUM |   ARM/POLL Flag |
-        ---------------------------------------------------------
-		- Offset 0 used for QP operations (send and recv)
-		- Offset 4 used for CQ operations (arm and poll)
-
-
-4.3 Major flows
-===============
-
-4.3.1 Create CQ
-===============
-    - Guest driver
-        - Allocates pages for CQ ring
-        - Creates page directory (pdir) to hold CQ ring's pages
-        - Initializes CQ ring
-        - Initializes 'Create CQ' command object (cqe, pdir etc)
-        - Copies the command to 'command' address
-        - Writes 0 into REQ register
-    - Device
-        - Reads the request object from the 'command' address
-        - Allocates CQ object and initialize CQ ring based on pdir
-        - Creates the backend CQ
-        - Writes operation status to ERR register
-        - Posts command-interrupt to guest
-    - Guest driver
-        - Reads the HW response code from ERR register
-
-4.3.2 Create QP
-===============
-    - Guest driver
-        - Allocates pages for send and receive rings
-        - Creates page directory(pdir) to hold the ring's pages
-        - Initializes 'Create QP' command object (max_send_wr,
-          send_cq_handle, recv_cq_handle, pdir etc)
-        - Copies the object to 'command' address
-        - Write 0 into REQ register
-    - Device
-        - Reads the request object from 'command' address
-        - Allocates the QP object and initialize
-            - Send and recv rings based on pdir
-            - Send and recv ring state
-        - Creates the backend QP
-        - Writes the operation status to ERR register
-        - Posts command-interrupt to guest
-    - Guest driver
-        - Reads the HW response code from ERR register
-
-4.3.3 Post receive
-==================
-    - Guest driver
-        - Initializes a wqe and place it on recv ring
-        - Write to qpn|qp_recv_bit (31) to QP offset in UAR
-    - Device
-        - Extracts qpn from UAR
-        - Walks through the ring and does the following for each wqe
-            - Prepares the backend CQE context to be used when
-              receiving completion from backend (wr_id, op_code, emu_cq_num)
-            - For each sge prepares backend sge
-            - Calls backend's post_recv
-
-4.3.4 Process backend events
-============================
-    - Done by a dedicated thread used to process backend events;
-      at initialization is attached to the device and creates
-      the communication channel.
-    - Thread main loop:
-        - Polls for completions
-        - Extracts QEMU _cq_num, wr_id and op_code from context
-        - Writes CQE to CQ ring
-        - Writes CQ number to device CQ
-        - Sends completion-interrupt to guest
-        - Deallocates context
-        - Acks the event to backend
-
-
-
-5. Limitations
-==============
-- The device obviously is limited by the Guest Linux Driver features implementation
-  of the VMware device API.
-- Memory registration mechanism requires mremap for every page in the buffer in order
-  to map it to a contiguous virtual address range. Since this is not the data path
-  it should not matter much. If the default max mr size is increased, be aware that
-  memory registration can take up to 0.5 seconds for 1GB of memory.
-- The device requires target page size to be the same as the host page size,
-  otherwise it will fail to init.
-- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is attached,
-  so it can't work with huge pages. The limitation will be addressed in the future,
-  however QEMU allocates Guest RAM with MADV_HUGEPAGE so if there are enough huge
-  pages available, QEMU will use them. QEMU will fail to init if the requirements
-  are not met.
-
-
-
-6. Performance
-==============
-By design the pvrdma device exits on each post-send/receive, so for small buffers
-the performance is affected; however for medium buffers it will became close to
-bare metal and from 1MB buffers and  up it reaches bare metal performance.
-(tested with 2 VMs, the pvrdma devices connected to 2 VFs of the same device)
-
-All the above assumes no memory registration is done on data path.
diff --git a/docs/system/loongarch/virt.rst b/docs/system/loongarch/virt.rst
index c37268b404..06d034b8ef 100644
--- a/docs/system/loongarch/virt.rst
+++ b/docs/system/loongarch/virt.rst
@@ -39,7 +39,7 @@ can be accessed by following steps.
 
 .. code-block:: bash
 
-  ./configure --disable-rdma --disable-pvrdma --prefix=/usr \
+  ./configure --disable-rdma --prefix=/usr \
               --target-list="loongarch64-softmmu" \
               --disable-libiscsi --disable-libnfs --disable-libpmem \
               --disable-glusterfs --enable-libusb --enable-usb-redir \
diff --git a/meson.build b/meson.build
index c9c3217ba4..d6af3cd53a 100644
--- a/meson.build
+++ b/meson.build
@@ -2829,37 +2829,6 @@ config_host_data.set('CONFIG_ARM_AES_BUILTIN', cc.compiles('''
     void foo(uint8x16_t *p) { *p = vaesmcq_u8(*p); }
   '''))
 
-have_pvrdma = get_option('pvrdma') \
-  .require(rdma.found(), error_message: 'PVRDMA requires OpenFabrics libraries') \
-  .require(cc.compiles(gnu_source_prefix + '''
-    #include <sys/mman.h>
-    int main(void)
-    {
-      char buf = 0;
-      void *addr = &buf;
-      addr = mremap(addr, 0, 1, MREMAP_MAYMOVE | MREMAP_FIXED);
-
-      return 0;
-    }'''), error_message: 'PVRDMA requires mremap').allowed()
-
-if have_pvrdma
-  config_host_data.set('LEGACY_RDMA_REG_MR', not cc.links('''
-    #include <infiniband/verbs.h>
-    int main(void)
-    {
-      struct ibv_mr *mr;
-      struct ibv_pd *pd = NULL;
-      size_t length = 10;
-      uint64_t iova = 0;
-      int access = 0;
-      void *addr = NULL;
-
-      mr = ibv_reg_mr_iova(pd, addr, length, iova, access);
-      ibv_dereg_mr(mr);
-      return 0;
-    }'''))
-endif
-
 if get_option('membarrier').disabled()
   have_membarrier = false
 elif host_os == 'windows'
@@ -2993,7 +2962,6 @@ host_kconfig = \
   (have_vhost_kernel ? ['CONFIG_VHOST_KERNEL=y'] : []) + \
   (have_virtfs ? ['CONFIG_VIRTFS=y'] : []) + \
   (host_os == 'linux' ? ['CONFIG_LINUX=y'] : []) + \
-  (have_pvrdma ? ['CONFIG_PVRDMA=y'] : []) + \
   (multiprocess_allowed ? ['CONFIG_MULTIPROCESS_ALLOWED=y'] : []) + \
   (vfio_user_server_allowed ? ['CONFIG_VFIO_USER_SERVER_ALLOWED=y'] : []) + \
   (hv_balloon ? ['CONFIG_HV_BALLOON_POSSIBLE=y'] : [])
@@ -3357,8 +3325,6 @@ if have_system
     'hw/pci',
     'hw/pci-host',
     'hw/ppc',
-    'hw/rdma',
-    'hw/rdma/vmw',
     'hw/rtc',
     'hw/s390x',
     'hw/scsi',
@@ -4028,7 +3994,6 @@ if have_tools
     }]
   endforeach
 
-  subdir('contrib/rdmacm-mux')
   subdir('contrib/elf2dmp')
 
   executable('qemu-edid', files('qemu-edid.c', 'hw/display/edid-generate.c'),
@@ -4434,7 +4399,6 @@ summary_info += {'Linux AIO support': libaio}
 summary_info += {'Linux io_uring support': linux_io_uring}
 summary_info += {'ATTR/XATTR support': libattr}
 summary_info += {'RDMA support':      rdma}
-summary_info += {'PVRDMA support':    have_pvrdma}
 summary_info += {'fdt support':       fdt_opt == 'disabled' ? false : fdt_opt}
 summary_info += {'libcap-ng support': libcap_ng}
 summary_info += {'bpf support':       libbpf}
diff --git a/qapi/machine.json b/qapi/machine.json
index e8b60641f2..e9f0f0c49a 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -1737,23 +1737,6 @@
   'returns': 'HumanReadableText',
   'features': [ 'unstable' ] }
 
-##
-# @x-query-rdma:
-#
-# Query RDMA state
-#
-# Features:
-#
-# @unstable: This command is meant for debugging.
-#
-# Returns: RDMA state
-#
-# Since: 6.2
-##
-{ 'command': 'x-query-rdma',
-  'returns': 'HumanReadableText',
-  'features': [ 'unstable' ] }
-
 ##
 # @x-query-roms:
 #
diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
index 8304d45625..5e33da7228 100644
--- a/qapi/qapi-schema.json
+++ b/qapi/qapi-schema.json
@@ -54,7 +54,6 @@
 { 'include': 'dump.json' }
 { 'include': 'net.json' }
 { 'include': 'ebpf.json' }
-{ 'include': 'rdma.json' }
 { 'include': 'rocker.json' }
 { 'include': 'tpm.json' }
 { 'include': 'ui.json' }
diff --git a/qapi/rdma.json b/qapi/rdma.json
deleted file mode 100644
index 195c001850..0000000000
--- a/qapi/rdma.json
+++ /dev/null
@@ -1,38 +0,0 @@
-# -*- Mode: Python -*-
-# vim: filetype=python
-#
-
-##
-# = RDMA device
-##
-
-##
-# @RDMA_GID_STATUS_CHANGED:
-#
-# Emitted when guest driver adds/deletes GID to/from device
-#
-# @netdev: RoCE Network Device name
-#
-# @gid-status: Add or delete indication
-#
-# @subnet-prefix: Subnet Prefix
-#
-# @interface-id: Interface ID
-#
-# Since: 4.0
-#
-# Example:
-#
-#     <- {"timestamp": {"seconds": 1541579657, "microseconds": 986760},
-#         "event": "RDMA_GID_STATUS_CHANGED",
-#         "data":
-#             {"netdev": "bridge0",
-#             "interface-id": 15880512517475447892,
-#             "gid-status": true,
-#             "subnet-prefix": 33022}}
-##
-{ 'event': 'RDMA_GID_STATUS_CHANGED',
-  'data': { 'netdev'        : 'str',
-            'gid-status'    : 'bool',
-            'subnet-prefix' : 'uint64',
-            'interface-id'  : 'uint64' } }
diff --git a/contrib/rdmacm-mux/rdmacm-mux.h b/contrib/rdmacm-mux/rdmacm-mux.h
deleted file mode 100644
index 07a4722913..0000000000
--- a/contrib/rdmacm-mux/rdmacm-mux.h
+++ /dev/null
@@ -1,61 +0,0 @@
-/*
- * QEMU paravirtual RDMA - rdmacm-mux declarations
- *
- * Copyright (C) 2018 Oracle
- * Copyright (C) 2018 Red Hat Inc
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *     Marcel Apfelbaum <marcel@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#ifndef RDMACM_MUX_H
-#define RDMACM_MUX_H
-
-#include "linux/if.h"
-#include <infiniband/verbs.h>
-#include <infiniband/umad.h>
-#include <rdma/rdma_user_cm.h>
-
-typedef enum RdmaCmMuxMsgType {
-    RDMACM_MUX_MSG_TYPE_REQ   = 0,
-    RDMACM_MUX_MSG_TYPE_RESP  = 1,
-} RdmaCmMuxMsgType;
-
-typedef enum RdmaCmMuxOpCode {
-    RDMACM_MUX_OP_CODE_REG   = 0,
-    RDMACM_MUX_OP_CODE_UNREG = 1,
-    RDMACM_MUX_OP_CODE_MAD   = 2,
-} RdmaCmMuxOpCode;
-
-typedef enum RdmaCmMuxErrCode {
-    RDMACM_MUX_ERR_CODE_OK        = 0,
-    RDMACM_MUX_ERR_CODE_EINVAL    = 1,
-    RDMACM_MUX_ERR_CODE_EEXIST    = 2,
-    RDMACM_MUX_ERR_CODE_EACCES    = 3,
-    RDMACM_MUX_ERR_CODE_ENOTFOUND = 4,
-} RdmaCmMuxErrCode;
-
-typedef struct RdmaCmMuxHdr {
-    RdmaCmMuxMsgType msg_type;
-    RdmaCmMuxOpCode op_code;
-    union ibv_gid sgid;
-    RdmaCmMuxErrCode err_code;
-} RdmaCmUHdr;
-
-typedef struct RdmaCmUMad {
-    struct ib_user_mad hdr;
-    char mad[RDMA_MAX_PRIVATE_DATA];
-} RdmaCmUMad;
-
-typedef struct RdmaCmMuxMsg {
-    RdmaCmUHdr hdr;
-    int umad_len;
-    RdmaCmUMad umad;
-} RdmaCmMuxMsg;
-
-#endif
diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
deleted file mode 100644
index 225af481e0..0000000000
--- a/hw/rdma/rdma_backend.h
+++ /dev/null
@@ -1,129 +0,0 @@
-/*
- *  RDMA device: Definitions of Backend Device functions
- *
- * Copyright (C) 2018 Oracle
- * Copyright (C) 2018 Red Hat Inc
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *     Marcel Apfelbaum <marcel@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#ifndef RDMA_BACKEND_H
-#define RDMA_BACKEND_H
-
-#include "qapi/error.h"
-#include "chardev/char-fe.h"
-
-#include "rdma_rm_defs.h"
-#include "rdma_backend_defs.h"
-
-/* Vendor Errors */
-#define VENDOR_ERR_FAIL_BACKEND     0x201
-#define VENDOR_ERR_TOO_MANY_SGES    0x202
-#define VENDOR_ERR_NOMEM            0x203
-#define VENDOR_ERR_QP0              0x204
-#define VENDOR_ERR_INV_NUM_SGE      0x205
-#define VENDOR_ERR_MAD_SEND         0x206
-#define VENDOR_ERR_INVLKEY          0x207
-#define VENDOR_ERR_MR_SMALL         0x208
-#define VENDOR_ERR_INV_MAD_BUFF     0x209
-#define VENDOR_ERR_INV_GID_IDX      0x210
-
-/* Add definition for QP0 and QP1 as there is no userspace enums for them */
-enum ibv_special_qp_type {
-    IBV_QPT_SMI = 0,
-    IBV_QPT_GSI = 1,
-};
-
-static inline uint32_t rdma_backend_qpn(const RdmaBackendQP *qp)
-{
-    return qp->ibqp ? qp->ibqp->qp_num : 1;
-}
-
-static inline uint32_t rdma_backend_mr_lkey(const RdmaBackendMR *mr)
-{
-    return mr->ibmr ? mr->ibmr->lkey : 0;
-}
-
-static inline uint32_t rdma_backend_mr_rkey(const RdmaBackendMR *mr)
-{
-    return mr->ibmr ? mr->ibmr->rkey : 0;
-}
-
-int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
-                      RdmaDeviceResources *rdma_dev_res,
-                      const char *backend_device_name, uint8_t port_num,
-                      struct ibv_device_attr *dev_attr,
-                      CharBackend *mad_chr_be);
-void rdma_backend_fini(RdmaBackendDev *backend_dev);
-int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname,
-                         union ibv_gid *gid);
-int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname,
-                         union ibv_gid *gid);
-int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
-                               union ibv_gid *gid);
-void rdma_backend_start(RdmaBackendDev *backend_dev);
-void rdma_backend_stop(RdmaBackendDev *backend_dev);
-void rdma_backend_register_comp_handler(void (*handler)(void *ctx,
-                                                        struct ibv_wc *wc));
-void rdma_backend_unregister_comp_handler(void);
-
-int rdma_backend_query_port(RdmaBackendDev *backend_dev,
-                            struct ibv_port_attr *port_attr);
-int rdma_backend_create_pd(RdmaBackendDev *backend_dev, RdmaBackendPD *pd);
-void rdma_backend_destroy_pd(RdmaBackendPD *pd);
-
-int rdma_backend_create_mr(RdmaBackendMR *mr, RdmaBackendPD *pd, void *addr,
-                           size_t length, uint64_t guest_start, int access);
-void rdma_backend_destroy_mr(RdmaBackendMR *mr);
-
-int rdma_backend_create_cq(RdmaBackendDev *backend_dev, RdmaBackendCQ *cq,
-                           int cqe);
-void rdma_backend_destroy_cq(RdmaBackendCQ *cq);
-void rdma_backend_poll_cq(RdmaDeviceResources *rdma_dev_res, RdmaBackendCQ *cq);
-
-int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
-                           RdmaBackendPD *pd, RdmaBackendCQ *scq,
-                           RdmaBackendCQ *rcq, RdmaBackendSRQ *srq,
-                           uint32_t max_send_wr, uint32_t max_recv_wr,
-                           uint32_t max_send_sge, uint32_t max_recv_sge);
-int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
-                               uint8_t qp_type, uint32_t qkey);
-int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
-                              uint8_t qp_type, uint8_t sgid_idx,
-                              union ibv_gid *dgid, uint32_t dqpn,
-                              uint32_t rq_psn, uint32_t qkey, bool use_qkey);
-int rdma_backend_qp_state_rts(RdmaBackendQP *qp, uint8_t qp_type,
-                              uint32_t sq_psn, uint32_t qkey, bool use_qkey);
-int rdma_backend_query_qp(RdmaBackendQP *qp, struct ibv_qp_attr *attr,
-                          int attr_mask, struct ibv_qp_init_attr *init_attr);
-void rdma_backend_destroy_qp(RdmaBackendQP *qp, RdmaDeviceResources *dev_res);
-
-void rdma_backend_post_send(RdmaBackendDev *backend_dev,
-                            RdmaBackendQP *qp, uint8_t qp_type,
-                            struct ibv_sge *sge, uint32_t num_sge,
-                            uint8_t sgid_idx, union ibv_gid *sgid,
-                            union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
-                            void *ctx);
-void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
-                            RdmaBackendQP *qp, uint8_t qp_type,
-                            struct ibv_sge *sge, uint32_t num_sge, void *ctx);
-
-int rdma_backend_create_srq(RdmaBackendSRQ *srq, RdmaBackendPD *pd,
-                            uint32_t max_wr, uint32_t max_sge,
-                            uint32_t srq_limit);
-int rdma_backend_query_srq(RdmaBackendSRQ *srq, struct ibv_srq_attr *srq_attr);
-int rdma_backend_modify_srq(RdmaBackendSRQ *srq, struct ibv_srq_attr *srq_attr,
-                            int srq_attr_mask);
-void rdma_backend_destroy_srq(RdmaBackendSRQ *srq,
-                              RdmaDeviceResources *dev_res);
-void rdma_backend_post_srq_recv(RdmaBackendDev *backend_dev,
-                                RdmaBackendSRQ *srq, struct ibv_sge *sge,
-                                uint32_t num_sge, void *ctx);
-
-#endif
diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h
deleted file mode 100644
index 4e6c0ad695..0000000000
--- a/hw/rdma/rdma_backend_defs.h
+++ /dev/null
@@ -1,76 +0,0 @@
-/*
- *  RDMA device: Definitions of Backend Device structures
- *
- * Copyright (C) 2018 Oracle
- * Copyright (C) 2018 Red Hat Inc
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *     Marcel Apfelbaum <marcel@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#ifndef RDMA_BACKEND_DEFS_H
-#define RDMA_BACKEND_DEFS_H
-
-#include "qemu/thread.h"
-#include "chardev/char-fe.h"
-#include <infiniband/verbs.h>
-#include "contrib/rdmacm-mux/rdmacm-mux.h"
-#include "rdma_utils.h"
-
-typedef struct RdmaDeviceResources RdmaDeviceResources;
-
-typedef struct RdmaBackendThread {
-    QemuThread thread;
-    bool run; /* Set by thread manager to let thread know it should exit */
-    bool is_running; /* Set by the thread to report its status */
-} RdmaBackendThread;
-
-typedef struct RdmaCmMux {
-    CharBackend *chr_be;
-    int can_receive;
-} RdmaCmMux;
-
-typedef struct RdmaBackendDev {
-    RdmaBackendThread comp_thread;
-    PCIDevice *dev;
-    RdmaDeviceResources *rdma_dev_res;
-    struct ibv_device *ib_dev;
-    struct ibv_context *context;
-    struct ibv_comp_channel *channel;
-    uint8_t port_num;
-    RdmaProtectedGQueue recv_mads_list;
-    RdmaCmMux rdmacm_mux;
-} RdmaBackendDev;
-
-typedef struct RdmaBackendPD {
-    struct ibv_pd *ibpd;
-} RdmaBackendPD;
-
-typedef struct RdmaBackendMR {
-    struct ibv_pd *ibpd;
-    struct ibv_mr *ibmr;
-} RdmaBackendMR;
-
-typedef struct RdmaBackendCQ {
-    RdmaBackendDev *backend_dev;
-    struct ibv_cq *ibcq;
-} RdmaBackendCQ;
-
-typedef struct RdmaBackendQP {
-    struct ibv_pd *ibpd;
-    struct ibv_qp *ibqp;
-    uint8_t sgid_idx;
-    RdmaProtectedGSList cqe_ctx_list;
-} RdmaBackendQP;
-
-typedef struct RdmaBackendSRQ {
-    struct ibv_srq *ibsrq;
-    RdmaProtectedGSList cqe_ctx_list;
-} RdmaBackendSRQ;
-
-#endif
diff --git a/hw/rdma/rdma_rm.h b/hw/rdma/rdma_rm.h
deleted file mode 100644
index d69a917795..0000000000
--- a/hw/rdma/rdma_rm.h
+++ /dev/null
@@ -1,97 +0,0 @@
-/*
- * RDMA device: Definitions of Resource Manager functions
- *
- * Copyright (C) 2018 Oracle
- * Copyright (C) 2018 Red Hat Inc
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *     Marcel Apfelbaum <marcel@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#ifndef RDMA_RM_H
-#define RDMA_RM_H
-
-#include "qapi/error.h"
-#include "rdma_backend_defs.h"
-#include "rdma_rm_defs.h"
-
-int rdma_rm_init(RdmaDeviceResources *dev_res,
-                 struct ibv_device_attr *dev_attr);
-void rdma_rm_fini(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                  const char *ifname);
-
-int rdma_rm_alloc_pd(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                     uint32_t *pd_handle, uint32_t ctx_handle);
-RdmaRmPD *rdma_rm_get_pd(RdmaDeviceResources *dev_res, uint32_t pd_handle);
-void rdma_rm_dealloc_pd(RdmaDeviceResources *dev_res, uint32_t pd_handle);
-
-int rdma_rm_alloc_mr(RdmaDeviceResources *dev_res, uint32_t pd_handle,
-                     uint64_t guest_start, uint64_t guest_length,
-                     void *host_virt, int access_flags, uint32_t *mr_handle,
-                     uint32_t *lkey, uint32_t *rkey);
-RdmaRmMR *rdma_rm_get_mr(RdmaDeviceResources *dev_res, uint32_t mr_handle);
-void rdma_rm_dealloc_mr(RdmaDeviceResources *dev_res, uint32_t mr_handle);
-
-int rdma_rm_alloc_uc(RdmaDeviceResources *dev_res, uint32_t pfn,
-                     uint32_t *uc_handle);
-RdmaRmUC *rdma_rm_get_uc(RdmaDeviceResources *dev_res, uint32_t uc_handle);
-void rdma_rm_dealloc_uc(RdmaDeviceResources *dev_res, uint32_t uc_handle);
-
-int rdma_rm_alloc_cq(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                     uint32_t cqe, uint32_t *cq_handle, void *opaque);
-RdmaRmCQ *rdma_rm_get_cq(RdmaDeviceResources *dev_res, uint32_t cq_handle);
-void rdma_rm_req_notify_cq(RdmaDeviceResources *dev_res, uint32_t cq_handle,
-                           bool notify);
-void rdma_rm_dealloc_cq(RdmaDeviceResources *dev_res, uint32_t cq_handle);
-
-int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, uint32_t pd_handle,
-                     uint8_t qp_type, uint32_t max_send_wr,
-                     uint32_t max_send_sge, uint32_t send_cq_handle,
-                     uint32_t max_recv_wr, uint32_t max_recv_sge,
-                     uint32_t recv_cq_handle, void *opaque, uint32_t *qpn,
-                     uint8_t is_srq, uint32_t srq_handle);
-RdmaRmQP *rdma_rm_get_qp(RdmaDeviceResources *dev_res, uint32_t qpn);
-int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                      uint32_t qp_handle, uint32_t attr_mask, uint8_t sgid_idx,
-                      union ibv_gid *dgid, uint32_t dqpn,
-                      enum ibv_qp_state qp_state, uint32_t qkey,
-                      uint32_t rq_psn, uint32_t sq_psn);
-int rdma_rm_query_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                     uint32_t qp_handle, struct ibv_qp_attr *attr,
-                     int attr_mask, struct ibv_qp_init_attr *init_attr);
-void rdma_rm_dealloc_qp(RdmaDeviceResources *dev_res, uint32_t qp_handle);
-
-RdmaRmSRQ *rdma_rm_get_srq(RdmaDeviceResources *dev_res, uint32_t srq_handle);
-int rdma_rm_alloc_srq(RdmaDeviceResources *dev_res, uint32_t pd_handle,
-                      uint32_t max_wr, uint32_t max_sge, uint32_t srq_limit,
-                      uint32_t *srq_handle, void *opaque);
-int rdma_rm_query_srq(RdmaDeviceResources *dev_res, uint32_t srq_handle,
-                      struct ibv_srq_attr *srq_attr);
-int rdma_rm_modify_srq(RdmaDeviceResources *dev_res, uint32_t srq_handle,
-                       struct ibv_srq_attr *srq_attr, int srq_attr_mask);
-void rdma_rm_dealloc_srq(RdmaDeviceResources *dev_res, uint32_t srq_handle);
-
-int rdma_rm_alloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t *cqe_ctx_id,
-                          void *ctx);
-void *rdma_rm_get_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
-void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id);
-
-int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                    const char *ifname, union ibv_gid *gid, int gid_idx);
-int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                    const char *ifname, int gid_idx);
-int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
-                                  RdmaBackendDev *backend_dev, int sgid_idx);
-static inline union ibv_gid *rdma_rm_get_gid(RdmaDeviceResources *dev_res,
-                                             int sgid_idx)
-{
-    return &dev_res->port.gid_tbl[sgid_idx].gid;
-}
-void rdma_format_device_counters(RdmaDeviceResources *dev_res, GString *buf);
-
-#endif
diff --git a/hw/rdma/rdma_rm_defs.h b/hw/rdma/rdma_rm_defs.h
deleted file mode 100644
index 534f2f74d3..0000000000
--- a/hw/rdma/rdma_rm_defs.h
+++ /dev/null
@@ -1,146 +0,0 @@
-/*
- * RDMA device: Definitions of Resource Manager structures
- *
- * Copyright (C) 2018 Oracle
- * Copyright (C) 2018 Red Hat Inc
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *     Marcel Apfelbaum <marcel@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#ifndef RDMA_RM_DEFS_H
-#define RDMA_RM_DEFS_H
-
-#include "rdma_backend_defs.h"
-
-#define MAX_PORTS             1 /* Do not change - we support only one port */
-#define MAX_PORT_GIDS         255
-#define MAX_GIDS              MAX_PORT_GIDS
-#define MAX_PORT_PKEYS        1
-#define MAX_PKEYS             MAX_PORT_PKEYS
-#define MAX_UCS               512
-#define MAX_MR_SIZE           (1UL << 27)
-#define MAX_QP                1024
-#define MAX_SGE               4
-#define MAX_CQ                2048
-#define MAX_MR                1024
-#define MAX_PD                1024
-#define MAX_QP_RD_ATOM        16
-#define MAX_QP_INIT_RD_ATOM   16
-#define MAX_AH                64
-#define MAX_SRQ               512
-
-#define MAX_RM_TBL_NAME             16
-#define MAX_CONSEQ_EMPTY_POLL_CQ    4096 /* considered as error above this */
-
-typedef struct RdmaRmResTbl {
-    char name[MAX_RM_TBL_NAME];
-    QemuMutex lock;
-    unsigned long *bitmap;
-    size_t tbl_sz;
-    size_t res_sz;
-    void *tbl;
-    uint32_t used; /* number of used entries in the table */
-} RdmaRmResTbl;
-
-typedef struct RdmaRmPD {
-    RdmaBackendPD backend_pd;
-    uint32_t ctx_handle;
-} RdmaRmPD;
-
-typedef enum CQNotificationType {
-    CNT_CLEAR,
-    CNT_ARM,
-    CNT_SET,
-} CQNotificationType;
-
-typedef struct RdmaRmCQ {
-    RdmaBackendCQ backend_cq;
-    void *opaque;
-    CQNotificationType notify;
-} RdmaRmCQ;
-
-/* MR (DMA region) */
-typedef struct RdmaRmMR {
-    RdmaBackendMR backend_mr;
-    void *virt;
-    uint64_t start;
-    size_t length;
-    uint32_t pd_handle;
-    uint32_t lkey;
-    uint32_t rkey;
-} RdmaRmMR;
-
-typedef struct RdmaRmUC {
-    uint64_t uc_handle;
-} RdmaRmUC;
-
-typedef struct RdmaRmQP {
-    RdmaBackendQP backend_qp;
-    void *opaque;
-    uint32_t qp_type;
-    uint32_t qpn;
-    uint32_t send_cq_handle;
-    uint32_t recv_cq_handle;
-    enum ibv_qp_state qp_state;
-    uint8_t is_srq;
-} RdmaRmQP;
-
-typedef struct RdmaRmSRQ {
-    RdmaBackendSRQ backend_srq;
-    uint32_t recv_cq_handle;
-    void *opaque;
-} RdmaRmSRQ;
-
-typedef struct RdmaRmGid {
-    union ibv_gid gid;
-    int backend_gid_index;
-} RdmaRmGid;
-
-typedef struct RdmaRmPort {
-    RdmaRmGid gid_tbl[MAX_PORT_GIDS];
-    enum ibv_port_state state;
-} RdmaRmPort;
-
-typedef struct RdmaRmStats {
-    uint64_t tx;
-    uint64_t tx_len;
-    uint64_t tx_err;
-    uint64_t rx_bufs;
-    uint64_t rx_bufs_len;
-    uint64_t rx_bufs_err;
-    uint64_t rx_srq;
-    uint64_t completions;
-    uint64_t mad_tx;
-    uint64_t mad_tx_err;
-    uint64_t mad_rx;
-    uint64_t mad_rx_err;
-    uint64_t mad_rx_bufs;
-    uint64_t mad_rx_bufs_err;
-    uint64_t poll_cq_from_bk;
-    uint64_t poll_cq_from_guest;
-    uint64_t poll_cq_from_guest_empty;
-    uint64_t poll_cq_ppoll_to;
-    uint32_t missing_cqe;
-} RdmaRmStats;
-
-struct RdmaDeviceResources {
-    RdmaRmPort port;
-    RdmaRmResTbl pd_tbl;
-    RdmaRmResTbl mr_tbl;
-    RdmaRmResTbl uc_tbl;
-    RdmaRmResTbl qp_tbl;
-    RdmaRmResTbl cq_tbl;
-    RdmaRmResTbl cqe_ctx_tbl;
-    RdmaRmResTbl srq_tbl;
-    GHashTable *qp_hash; /* Keeps mapping between real and emulated */
-    QemuMutex lock;
-    RdmaRmStats stats;
-};
-
-#endif
diff --git a/hw/rdma/rdma_utils.h b/hw/rdma/rdma_utils.h
deleted file mode 100644
index 54e4f56edd..0000000000
--- a/hw/rdma/rdma_utils.h
+++ /dev/null
@@ -1,63 +0,0 @@
-/*
- * RDMA device: Debug utilities
- *
- * Copyright (C) 2018 Oracle
- * Copyright (C) 2018 Red Hat Inc
- *
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *     Marcel Apfelbaum <marcel@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#ifndef RDMA_UTILS_H
-#define RDMA_UTILS_H
-
-#include "qemu/error-report.h"
-#include "sysemu/dma.h"
-
-#define rdma_error_report(fmt, ...) \
-    error_report("%s: " fmt, "rdma", ## __VA_ARGS__)
-#define rdma_warn_report(fmt, ...) \
-    warn_report("%s: " fmt, "rdma", ## __VA_ARGS__)
-#define rdma_info_report(fmt, ...) \
-    info_report("%s: " fmt, "rdma", ## __VA_ARGS__)
-
-typedef struct RdmaProtectedGQueue {
-    QemuMutex lock;
-    GQueue *list;
-} RdmaProtectedGQueue;
-
-typedef struct RdmaProtectedGSList {
-    QemuMutex lock;
-    GSList *list;
-} RdmaProtectedGSList;
-
-void *rdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t len);
-void rdma_pci_dma_unmap(PCIDevice *dev, void *buffer, dma_addr_t len);
-void rdma_protected_gqueue_init(RdmaProtectedGQueue *list);
-void rdma_protected_gqueue_destroy(RdmaProtectedGQueue *list);
-void rdma_protected_gqueue_append_int64(RdmaProtectedGQueue *list,
-                                        int64_t value);
-int64_t rdma_protected_gqueue_pop_int64(RdmaProtectedGQueue *list);
-void rdma_protected_gslist_init(RdmaProtectedGSList *list);
-void rdma_protected_gslist_destroy(RdmaProtectedGSList *list);
-void rdma_protected_gslist_append_int32(RdmaProtectedGSList *list,
-                                        int32_t value);
-void rdma_protected_gslist_remove_int32(RdmaProtectedGSList *list,
-                                        int32_t value);
-
-static inline void addrconf_addr_eui48(uint8_t *eui, const char *addr)
-{
-    memcpy(eui, addr, 3);
-    eui[3] = 0xFF;
-    eui[4] = 0xFE;
-    memcpy(eui + 5, addr + 3, 3);
-    eui[0] ^= 2;
-}
-
-#endif
diff --git a/hw/rdma/trace.h b/hw/rdma/trace.h
deleted file mode 100644
index b3fa8ebc51..0000000000
--- a/hw/rdma/trace.h
+++ /dev/null
@@ -1 +0,0 @@
-#include "trace/trace-hw_rdma.h"
diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
deleted file mode 100644
index 4cbc10c980..0000000000
--- a/hw/rdma/vmw/pvrdma.h
+++ /dev/null
@@ -1,144 +0,0 @@
-/*
- * QEMU VMWARE paravirtual RDMA device definitions
- *
- * Copyright (C) 2018 Oracle
- * Copyright (C) 2018 Red Hat Inc
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *     Marcel Apfelbaum <marcel@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#ifndef PVRDMA_PVRDMA_H
-#define PVRDMA_PVRDMA_H
-
-#include "qemu/units.h"
-#include "qemu/notify.h"
-#include "hw/pci/msix.h"
-#include "hw/pci/pci_device.h"
-#include "chardev/char-fe.h"
-#include "hw/net/vmxnet3_defs.h"
-
-#include "../rdma_backend_defs.h"
-#include "../rdma_rm_defs.h"
-
-#include "standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h"
-#include "pvrdma_dev_ring.h"
-#include "qom/object.h"
-
-/* BARs */
-#define RDMA_MSIX_BAR_IDX    0
-#define RDMA_REG_BAR_IDX     1
-#define RDMA_UAR_BAR_IDX     2
-#define RDMA_BAR0_MSIX_SIZE  (16 * KiB)
-#define RDMA_BAR1_REGS_SIZE  64
-#define RDMA_BAR2_UAR_SIZE   (0x1000 * MAX_UCS) /* each uc gets page */
-
-/* MSIX */
-#define RDMA_MAX_INTRS       3
-#define RDMA_MSIX_TABLE      0x0000
-#define RDMA_MSIX_PBA        0x2000
-
-/* Interrupts Vectors */
-#define INTR_VEC_CMD_RING            0
-#define INTR_VEC_CMD_ASYNC_EVENTS    1
-#define INTR_VEC_CMD_COMPLETION_Q    2
-
-/* HW attributes */
-#define PVRDMA_HW_NAME       "pvrdma"
-#define PVRDMA_HW_VERSION    17
-#define PVRDMA_FW_VERSION    14
-
-/* Some defaults */
-#define PVRDMA_PKEY          0xFFFF
-
-typedef struct DSRInfo {
-    dma_addr_t dma;
-    struct pvrdma_device_shared_region *dsr;
-
-    union pvrdma_cmd_req *req;
-    union pvrdma_cmd_resp *rsp;
-
-    PvrdmaRingState *async_ring_state;
-    PvrdmaRing async;
-
-    PvrdmaRingState *cq_ring_state;
-    PvrdmaRing cq;
-} DSRInfo;
-
-typedef struct PVRDMADevStats {
-    uint64_t commands;
-    uint64_t regs_reads;
-    uint64_t regs_writes;
-    uint64_t uar_writes;
-    uint64_t interrupts;
-} PVRDMADevStats;
-
-struct PVRDMADev {
-    PCIDevice parent_obj;
-    MemoryRegion msix;
-    MemoryRegion regs;
-    uint32_t regs_data[RDMA_BAR1_REGS_SIZE];
-    MemoryRegion uar;
-    uint32_t uar_data[RDMA_BAR2_UAR_SIZE];
-    DSRInfo dsr_info;
-    int interrupt_mask;
-    struct ibv_device_attr dev_attr;
-    uint64_t node_guid;
-    char *backend_eth_device_name;
-    char *backend_device_name;
-    uint8_t backend_port_num;
-    RdmaBackendDev backend_dev;
-    RdmaDeviceResources rdma_dev_res;
-    CharBackend mad_chr;
-    VMXNET3State *func0;
-    Notifier shutdown_notifier;
-    PVRDMADevStats stats;
-};
-typedef struct PVRDMADev PVRDMADev;
-DECLARE_INSTANCE_CHECKER(PVRDMADev, PVRDMA_DEV,
-                         PVRDMA_HW_NAME)
-
-static inline int get_reg_val(PVRDMADev *dev, hwaddr addr, uint32_t *val)
-{
-    int idx = addr >> 2;
-
-    if (idx >= RDMA_BAR1_REGS_SIZE) {
-        return -EINVAL;
-    }
-
-    *val = dev->regs_data[idx];
-
-    return 0;
-}
-
-static inline int set_reg_val(PVRDMADev *dev, hwaddr addr, uint32_t val)
-{
-    int idx = addr >> 2;
-
-    if (idx >= RDMA_BAR1_REGS_SIZE) {
-        return -EINVAL;
-    }
-
-    dev->regs_data[idx] = val;
-
-    return 0;
-}
-
-static inline void post_interrupt(PVRDMADev *dev, unsigned vector)
-{
-    PCIDevice *pci_dev = PCI_DEVICE(dev);
-
-    if (likely(!dev->interrupt_mask)) {
-        dev->stats.interrupts++;
-        msix_notify(pci_dev, vector);
-    }
-}
-
-int pvrdma_exec_cmd(PVRDMADev *dev);
-
-#endif
diff --git a/hw/rdma/vmw/pvrdma_dev_ring.h b/hw/rdma/vmw/pvrdma_dev_ring.h
deleted file mode 100644
index d231588ce0..0000000000
--- a/hw/rdma/vmw/pvrdma_dev_ring.h
+++ /dev/null
@@ -1,46 +0,0 @@
-/*
- * QEMU VMWARE paravirtual RDMA ring utilities
- *
- * Copyright (C) 2018 Oracle
- * Copyright (C) 2018 Red Hat Inc
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *     Marcel Apfelbaum <marcel@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#ifndef PVRDMA_DEV_RING_H
-#define PVRDMA_DEV_RING_H
-
-
-#define MAX_RING_NAME_SZ 32
-
-typedef struct PvrdmaRingState {
-    int prod_tail; /* producer tail */
-    int cons_head; /* consumer head */
-} PvrdmaRingState;
-
-typedef struct PvrdmaRing {
-    char name[MAX_RING_NAME_SZ];
-    PCIDevice *dev;
-    uint32_t max_elems;
-    size_t elem_sz;
-    PvrdmaRingState *ring_state; /* used only for unmap */
-    int npages;
-    void **pages;
-} PvrdmaRing;
-
-int pvrdma_ring_init(PvrdmaRing *ring, const char *name, PCIDevice *dev,
-                     PvrdmaRingState *ring_state, uint32_t max_elems,
-                     size_t elem_sz, dma_addr_t *tbl, uint32_t npages);
-void *pvrdma_ring_next_elem_read(PvrdmaRing *ring);
-void pvrdma_ring_read_inc(PvrdmaRing *ring);
-void *pvrdma_ring_next_elem_write(PvrdmaRing *ring);
-void pvrdma_ring_write_inc(PvrdmaRing *ring);
-void pvrdma_ring_free(PvrdmaRing *ring);
-
-#endif
diff --git a/hw/rdma/vmw/pvrdma_qp_ops.h b/hw/rdma/vmw/pvrdma_qp_ops.h
deleted file mode 100644
index bf2b15c5ce..0000000000
--- a/hw/rdma/vmw/pvrdma_qp_ops.h
+++ /dev/null
@@ -1,28 +0,0 @@
-/*
- * QEMU VMWARE paravirtual RDMA QP Operations
- *
- * Copyright (C) 2018 Oracle
- * Copyright (C) 2018 Red Hat Inc
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *     Marcel Apfelbaum <marcel@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#ifndef PVRDMA_QP_OPS_H
-#define PVRDMA_QP_OPS_H
-
-#include "pvrdma.h"
-
-int pvrdma_qp_ops_init(void);
-void pvrdma_qp_ops_fini(void);
-void pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle);
-void pvrdma_qp_recv(PVRDMADev *dev, uint32_t qp_handle);
-void pvrdma_srq_recv(PVRDMADev *dev, uint32_t srq_handle);
-void pvrdma_cq_poll(RdmaDeviceResources *dev_res, uint32_t cq_handle);
-
-#endif
diff --git a/hw/rdma/vmw/trace.h b/hw/rdma/vmw/trace.h
deleted file mode 100644
index 3ebc9fb7ad..0000000000
--- a/hw/rdma/vmw/trace.h
+++ /dev/null
@@ -1 +0,0 @@
-#include "trace/trace-hw_rdma_vmw.h"
diff --git a/include/hw/rdma/rdma.h b/include/hw/rdma/rdma.h
deleted file mode 100644
index 80b2e531c4..0000000000
--- a/include/hw/rdma/rdma.h
+++ /dev/null
@@ -1,37 +0,0 @@
-/*
- * RDMA device interface
- *
- * Copyright (C) 2019 Oracle
- * Copyright (C) 2019 Red Hat Inc
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#ifndef RDMA_H
-#define RDMA_H
-
-#include "qom/object.h"
-
-#define INTERFACE_RDMA_PROVIDER "rdma"
-
-typedef struct RdmaProviderClass RdmaProviderClass;
-DECLARE_CLASS_CHECKERS(RdmaProviderClass, RDMA_PROVIDER,
-                       INTERFACE_RDMA_PROVIDER)
-#define RDMA_PROVIDER(obj) \
-    INTERFACE_CHECK(RdmaProvider, (obj), \
-                    INTERFACE_RDMA_PROVIDER)
-
-typedef struct RdmaProvider RdmaProvider;
-
-struct RdmaProviderClass {
-    InterfaceClass parent;
-
-    void (*format_statistics)(RdmaProvider *obj, GString *buf);
-};
-
-#endif
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index 13f9a2dedb..f4cf8f6717 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -37,7 +37,6 @@ void hmp_info_spice(Monitor *mon, const QDict *qdict);
 void hmp_info_balloon(Monitor *mon, const QDict *qdict);
 void hmp_info_irq(Monitor *mon, const QDict *qdict);
 void hmp_info_pic(Monitor *mon, const QDict *qdict);
-void hmp_info_rdma(Monitor *mon, const QDict *qdict);
 void hmp_info_pci(Monitor *mon, const QDict *qdict);
 void hmp_info_tpm(Monitor *mon, const QDict *qdict);
 void hmp_info_iothreads(Monitor *mon, const QDict *qdict);
diff --git a/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h b/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h
deleted file mode 100644
index a5a1c8234e..0000000000
--- a/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h
+++ /dev/null
@@ -1,685 +0,0 @@
-/*
- * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of EITHER the GNU General Public License
- * version 2 as published by the Free Software Foundation or the BSD
- * 2-Clause License. This program is distributed in the hope that it
- * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
- * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
- * See the GNU General Public License version 2 for more details at
- * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program available in the file COPYING in the main
- * directory of this source tree.
- *
- * The BSD 2-Clause License
- *
- *     Redistribution and use in source and binary forms, with or
- *     without modification, are permitted provided that the following
- *     conditions are met:
- *
- *      - Redistributions of source code must retain the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer.
- *
- *      - Redistributions in binary form must reproduce the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer in the documentation and/or other materials
- *        provided with the distribution.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
- * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
- * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
- * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
- * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
- * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
- * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
- * OF THE POSSIBILITY OF SUCH DAMAGE.
- */
-
-#ifndef __PVRDMA_DEV_API_H__
-#define __PVRDMA_DEV_API_H__
-
-#include "standard-headers/linux/types.h"
-
-#include "pvrdma_verbs.h"
-
-/*
- * PVRDMA version macros. Some new features require updates to PVRDMA_VERSION.
- * These macros allow us to check for different features if necessary.
- */
-
-#define PVRDMA_ROCEV1_VERSION		17
-#define PVRDMA_ROCEV2_VERSION		18
-#define PVRDMA_PPN64_VERSION		19
-#define PVRDMA_QPHANDLE_VERSION		20
-#define PVRDMA_VERSION			PVRDMA_QPHANDLE_VERSION
-
-#define PVRDMA_BOARD_ID			1
-#define PVRDMA_REV_ID			1
-
-/*
- * Masks and accessors for page directory, which is a two-level lookup:
- * page directory -> page table -> page. Only one directory for now, but we
- * could expand that easily. 9 bits for tables, 9 bits for pages, gives one
- * gigabyte for memory regions and so forth.
- */
-
-#define PVRDMA_PDIR_SHIFT		18
-#define PVRDMA_PTABLE_SHIFT		9
-#define PVRDMA_PAGE_DIR_DIR(x)		(((x) >> PVRDMA_PDIR_SHIFT) & 0x1)
-#define PVRDMA_PAGE_DIR_TABLE(x)	(((x) >> PVRDMA_PTABLE_SHIFT) & 0x1ff)
-#define PVRDMA_PAGE_DIR_PAGE(x)		((x) & 0x1ff)
-#define PVRDMA_PAGE_DIR_MAX_PAGES	(1 * 512 * 512)
-#define PVRDMA_MAX_FAST_REG_PAGES	128
-
-/*
- * Max MSI-X vectors.
- */
-
-#define PVRDMA_MAX_INTERRUPTS	3
-
-/* Register offsets within PCI resource on BAR1. */
-#define PVRDMA_REG_VERSION	0x00	/* R: Version of device. */
-#define PVRDMA_REG_DSRLOW	0x04	/* W: Device shared region low PA. */
-#define PVRDMA_REG_DSRHIGH	0x08	/* W: Device shared region high PA. */
-#define PVRDMA_REG_CTL		0x0c	/* W: PVRDMA_DEVICE_CTL */
-#define PVRDMA_REG_REQUEST	0x10	/* W: Indicate device request. */
-#define PVRDMA_REG_ERR		0x14	/* R: Device error. */
-#define PVRDMA_REG_ICR		0x18	/* R: Interrupt cause. */
-#define PVRDMA_REG_IMR		0x1c	/* R/W: Interrupt mask. */
-#define PVRDMA_REG_MACL		0x20	/* R/W: MAC address low. */
-#define PVRDMA_REG_MACH		0x24	/* R/W: MAC address high. */
-
-/* Object flags. */
-#define PVRDMA_CQ_FLAG_ARMED_SOL	BIT(0)	/* Armed for solicited-only. */
-#define PVRDMA_CQ_FLAG_ARMED		BIT(1)	/* Armed. */
-#define PVRDMA_MR_FLAG_DMA		BIT(0)	/* DMA region. */
-#define PVRDMA_MR_FLAG_FRMR		BIT(1)	/* Fast reg memory region. */
-
-/*
- * Atomic operation capability (masked versions are extended atomic
- * operations.
- */
-
-#define PVRDMA_ATOMIC_OP_COMP_SWAP	BIT(0)	/* Compare and swap. */
-#define PVRDMA_ATOMIC_OP_FETCH_ADD	BIT(1)	/* Fetch and add. */
-#define PVRDMA_ATOMIC_OP_MASK_COMP_SWAP	BIT(2)	/* Masked compare and swap. */
-#define PVRDMA_ATOMIC_OP_MASK_FETCH_ADD	BIT(3)	/* Masked fetch and add. */
-
-/*
- * Base Memory Management Extension flags to support Fast Reg Memory Regions
- * and Fast Reg Work Requests. Each flag represents a verb operation and we
- * must support all of them to qualify for the BMME device cap.
- */
-
-#define PVRDMA_BMME_FLAG_LOCAL_INV	BIT(0)	/* Local Invalidate. */
-#define PVRDMA_BMME_FLAG_REMOTE_INV	BIT(1)	/* Remote Invalidate. */
-#define PVRDMA_BMME_FLAG_FAST_REG_WR	BIT(2)	/* Fast Reg Work Request. */
-
-/*
- * GID types. The interpretation of the gid_types bit field in the device
- * capabilities will depend on the device mode. For now, the device only
- * supports RoCE as mode, so only the different GID types for RoCE are
- * defined.
- */
-
-#define PVRDMA_GID_TYPE_FLAG_ROCE_V1	BIT(0)
-#define PVRDMA_GID_TYPE_FLAG_ROCE_V2	BIT(1)
-
-/*
- * Version checks. This checks whether each version supports specific
- * capabilities from the device.
- */
-
-#define PVRDMA_IS_VERSION17(_dev)					\
-	(_dev->dsr_version == PVRDMA_ROCEV1_VERSION &&			\
-	 _dev->dsr->caps.gid_types == PVRDMA_GID_TYPE_FLAG_ROCE_V1)
-
-#define PVRDMA_IS_VERSION18(_dev)					\
-	(_dev->dsr_version >= PVRDMA_ROCEV2_VERSION &&			\
-	 (_dev->dsr->caps.gid_types == PVRDMA_GID_TYPE_FLAG_ROCE_V1 ||  \
-	  _dev->dsr->caps.gid_types == PVRDMA_GID_TYPE_FLAG_ROCE_V2))	\
-
-#define PVRDMA_SUPPORTED(_dev)						\
-	((_dev->dsr->caps.mode == PVRDMA_DEVICE_MODE_ROCE) &&		\
-	 (PVRDMA_IS_VERSION17(_dev) || PVRDMA_IS_VERSION18(_dev)))
-
-/*
- * Get capability values based on device version.
- */
-
-#define PVRDMA_GET_CAP(_dev, _old_val, _val) \
-	((PVRDMA_IS_VERSION18(_dev)) ? _val : _old_val)
-
-enum pvrdma_pci_resource {
-	PVRDMA_PCI_RESOURCE_MSIX,	/* BAR0: MSI-X, MMIO. */
-	PVRDMA_PCI_RESOURCE_REG,	/* BAR1: Registers, MMIO. */
-	PVRDMA_PCI_RESOURCE_UAR,	/* BAR2: UAR pages, MMIO, 64-bit. */
-	PVRDMA_PCI_RESOURCE_LAST,	/* Last. */
-};
-
-enum pvrdma_device_ctl {
-	PVRDMA_DEVICE_CTL_ACTIVATE,	/* Activate device. */
-	PVRDMA_DEVICE_CTL_UNQUIESCE,	/* Unquiesce device. */
-	PVRDMA_DEVICE_CTL_RESET,	/* Reset device. */
-};
-
-enum pvrdma_intr_vector {
-	PVRDMA_INTR_VECTOR_RESPONSE,	/* Command response. */
-	PVRDMA_INTR_VECTOR_ASYNC,	/* Async events. */
-	PVRDMA_INTR_VECTOR_CQ,		/* CQ notification. */
-	/* Additional CQ notification vectors. */
-};
-
-enum pvrdma_intr_cause {
-	PVRDMA_INTR_CAUSE_RESPONSE	= (1 << PVRDMA_INTR_VECTOR_RESPONSE),
-	PVRDMA_INTR_CAUSE_ASYNC		= (1 << PVRDMA_INTR_VECTOR_ASYNC),
-	PVRDMA_INTR_CAUSE_CQ		= (1 << PVRDMA_INTR_VECTOR_CQ),
-};
-
-enum pvrdma_gos_bits {
-	PVRDMA_GOS_BITS_UNK,		/* Unknown. */
-	PVRDMA_GOS_BITS_32,		/* 32-bit. */
-	PVRDMA_GOS_BITS_64,		/* 64-bit. */
-};
-
-enum pvrdma_gos_type {
-	PVRDMA_GOS_TYPE_UNK,		/* Unknown. */
-	PVRDMA_GOS_TYPE_LINUX,		/* Linux. */
-};
-
-enum pvrdma_device_mode {
-	PVRDMA_DEVICE_MODE_ROCE,	/* RoCE. */
-	PVRDMA_DEVICE_MODE_IWARP,	/* iWarp. */
-	PVRDMA_DEVICE_MODE_IB,		/* InfiniBand. */
-};
-
-struct pvrdma_gos_info {
-	uint32_t gos_bits:2;			/* W: PVRDMA_GOS_BITS_ */
-	uint32_t gos_type:4;			/* W: PVRDMA_GOS_TYPE_ */
-	uint32_t gos_ver:16;			/* W: Guest OS version. */
-	uint32_t gos_misc:10;		/* W: Other. */
-	uint32_t pad;			/* Pad to 8-byte alignment. */
-};
-
-struct pvrdma_device_caps {
-	uint64_t fw_ver;				/* R: Query device. */
-	uint64_t node_guid;
-	uint64_t sys_image_guid;
-	uint64_t max_mr_size;
-	uint64_t page_size_cap;
-	uint64_t atomic_arg_sizes;			/* EX verbs. */
-	uint32_t ex_comp_mask;			/* EX verbs. */
-	uint32_t device_cap_flags2;			/* EX verbs. */
-	uint32_t max_fa_bit_boundary;		/* EX verbs. */
-	uint32_t log_max_atomic_inline_arg;		/* EX verbs. */
-	uint32_t vendor_id;
-	uint32_t vendor_part_id;
-	uint32_t hw_ver;
-	uint32_t max_qp;
-	uint32_t max_qp_wr;
-	uint32_t device_cap_flags;
-	uint32_t max_sge;
-	uint32_t max_sge_rd;
-	uint32_t max_cq;
-	uint32_t max_cqe;
-	uint32_t max_mr;
-	uint32_t max_pd;
-	uint32_t max_qp_rd_atom;
-	uint32_t max_ee_rd_atom;
-	uint32_t max_res_rd_atom;
-	uint32_t max_qp_init_rd_atom;
-	uint32_t max_ee_init_rd_atom;
-	uint32_t max_ee;
-	uint32_t max_rdd;
-	uint32_t max_mw;
-	uint32_t max_raw_ipv6_qp;
-	uint32_t max_raw_ethy_qp;
-	uint32_t max_mcast_grp;
-	uint32_t max_mcast_qp_attach;
-	uint32_t max_total_mcast_qp_attach;
-	uint32_t max_ah;
-	uint32_t max_fmr;
-	uint32_t max_map_per_fmr;
-	uint32_t max_srq;
-	uint32_t max_srq_wr;
-	uint32_t max_srq_sge;
-	uint32_t max_uar;
-	uint32_t gid_tbl_len;
-	uint16_t max_pkeys;
-	uint8_t  local_ca_ack_delay;
-	uint8_t  phys_port_cnt;
-	uint8_t  mode;				/* PVRDMA_DEVICE_MODE_ */
-	uint8_t  atomic_ops;				/* PVRDMA_ATOMIC_OP_* bits */
-	uint8_t  bmme_flags;				/* FRWR Mem Mgmt Extensions */
-	uint8_t  gid_types;				/* PVRDMA_GID_TYPE_FLAG_ */
-	uint32_t max_fast_reg_page_list_len;
-};
-
-struct pvrdma_ring_page_info {
-	uint32_t num_pages;				/* Num pages incl. header. */
-	uint32_t reserved;				/* Reserved. */
-	uint64_t pdir_dma;				/* Page directory PA. */
-};
-
-#pragma pack(push, 1)
-
-struct pvrdma_device_shared_region {
-	uint32_t driver_version;			/* W: Driver version. */
-	uint32_t pad;				/* Pad to 8-byte align. */
-	struct pvrdma_gos_info gos_info;	/* W: Guest OS information. */
-	uint64_t cmd_slot_dma;			/* W: Command slot address. */
-	uint64_t resp_slot_dma;			/* W: Response slot address. */
-	struct pvrdma_ring_page_info async_ring_pages;
-						/* W: Async ring page info. */
-	struct pvrdma_ring_page_info cq_ring_pages;
-						/* W: CQ ring page info. */
-	union {
-		uint32_t uar_pfn;			/* W: UAR pageframe. */
-		uint64_t uar_pfn64;			/* W: 64-bit UAR page frame. */
-	};
-	struct pvrdma_device_caps caps;		/* R: Device capabilities. */
-};
-
-#pragma pack(pop)
-
-/* Event types. Currently a 1:1 mapping with enum ib_event. */
-enum pvrdma_eqe_type {
-	PVRDMA_EVENT_CQ_ERR,
-	PVRDMA_EVENT_QP_FATAL,
-	PVRDMA_EVENT_QP_REQ_ERR,
-	PVRDMA_EVENT_QP_ACCESS_ERR,
-	PVRDMA_EVENT_COMM_EST,
-	PVRDMA_EVENT_SQ_DRAINED,
-	PVRDMA_EVENT_PATH_MIG,
-	PVRDMA_EVENT_PATH_MIG_ERR,
-	PVRDMA_EVENT_DEVICE_FATAL,
-	PVRDMA_EVENT_PORT_ACTIVE,
-	PVRDMA_EVENT_PORT_ERR,
-	PVRDMA_EVENT_LID_CHANGE,
-	PVRDMA_EVENT_PKEY_CHANGE,
-	PVRDMA_EVENT_SM_CHANGE,
-	PVRDMA_EVENT_SRQ_ERR,
-	PVRDMA_EVENT_SRQ_LIMIT_REACHED,
-	PVRDMA_EVENT_QP_LAST_WQE_REACHED,
-	PVRDMA_EVENT_CLIENT_REREGISTER,
-	PVRDMA_EVENT_GID_CHANGE,
-};
-
-/* Event queue element. */
-struct pvrdma_eqe {
-	uint32_t type;	/* Event type. */
-	uint32_t info;	/* Handle, other. */
-};
-
-/* CQ notification queue element. */
-struct pvrdma_cqne {
-	uint32_t info;	/* Handle */
-};
-
-enum {
-	PVRDMA_CMD_FIRST,
-	PVRDMA_CMD_QUERY_PORT = PVRDMA_CMD_FIRST,
-	PVRDMA_CMD_QUERY_PKEY,
-	PVRDMA_CMD_CREATE_PD,
-	PVRDMA_CMD_DESTROY_PD,
-	PVRDMA_CMD_CREATE_MR,
-	PVRDMA_CMD_DESTROY_MR,
-	PVRDMA_CMD_CREATE_CQ,
-	PVRDMA_CMD_RESIZE_CQ,
-	PVRDMA_CMD_DESTROY_CQ,
-	PVRDMA_CMD_CREATE_QP,
-	PVRDMA_CMD_MODIFY_QP,
-	PVRDMA_CMD_QUERY_QP,
-	PVRDMA_CMD_DESTROY_QP,
-	PVRDMA_CMD_CREATE_UC,
-	PVRDMA_CMD_DESTROY_UC,
-	PVRDMA_CMD_CREATE_BIND,
-	PVRDMA_CMD_DESTROY_BIND,
-	PVRDMA_CMD_CREATE_SRQ,
-	PVRDMA_CMD_MODIFY_SRQ,
-	PVRDMA_CMD_QUERY_SRQ,
-	PVRDMA_CMD_DESTROY_SRQ,
-	PVRDMA_CMD_MAX,
-};
-
-enum {
-	PVRDMA_CMD_FIRST_RESP = (1 << 31),
-	PVRDMA_CMD_QUERY_PORT_RESP = PVRDMA_CMD_FIRST_RESP,
-	PVRDMA_CMD_QUERY_PKEY_RESP,
-	PVRDMA_CMD_CREATE_PD_RESP,
-	PVRDMA_CMD_DESTROY_PD_RESP_NOOP,
-	PVRDMA_CMD_CREATE_MR_RESP,
-	PVRDMA_CMD_DESTROY_MR_RESP_NOOP,
-	PVRDMA_CMD_CREATE_CQ_RESP,
-	PVRDMA_CMD_RESIZE_CQ_RESP,
-	PVRDMA_CMD_DESTROY_CQ_RESP_NOOP,
-	PVRDMA_CMD_CREATE_QP_RESP,
-	PVRDMA_CMD_MODIFY_QP_RESP,
-	PVRDMA_CMD_QUERY_QP_RESP,
-	PVRDMA_CMD_DESTROY_QP_RESP,
-	PVRDMA_CMD_CREATE_UC_RESP,
-	PVRDMA_CMD_DESTROY_UC_RESP_NOOP,
-	PVRDMA_CMD_CREATE_BIND_RESP_NOOP,
-	PVRDMA_CMD_DESTROY_BIND_RESP_NOOP,
-	PVRDMA_CMD_CREATE_SRQ_RESP,
-	PVRDMA_CMD_MODIFY_SRQ_RESP,
-	PVRDMA_CMD_QUERY_SRQ_RESP,
-	PVRDMA_CMD_DESTROY_SRQ_RESP,
-	PVRDMA_CMD_MAX_RESP,
-};
-
-struct pvrdma_cmd_hdr {
-	uint64_t response;		/* Key for response lookup. */
-	uint32_t cmd;		/* PVRDMA_CMD_ */
-	uint32_t reserved;		/* Reserved. */
-};
-
-struct pvrdma_cmd_resp_hdr {
-	uint64_t response;		/* From cmd hdr. */
-	uint32_t ack;		/* PVRDMA_CMD_XXX_RESP */
-	uint8_t err;			/* Error. */
-	uint8_t reserved[3];		/* Reserved. */
-};
-
-struct pvrdma_cmd_query_port {
-	struct pvrdma_cmd_hdr hdr;
-	uint8_t port_num;
-	uint8_t reserved[7];
-};
-
-struct pvrdma_cmd_query_port_resp {
-	struct pvrdma_cmd_resp_hdr hdr;
-	struct pvrdma_port_attr attrs;
-};
-
-struct pvrdma_cmd_query_pkey {
-	struct pvrdma_cmd_hdr hdr;
-	uint8_t port_num;
-	uint8_t index;
-	uint8_t reserved[6];
-};
-
-struct pvrdma_cmd_query_pkey_resp {
-	struct pvrdma_cmd_resp_hdr hdr;
-	uint16_t pkey;
-	uint8_t reserved[6];
-};
-
-struct pvrdma_cmd_create_uc {
-	struct pvrdma_cmd_hdr hdr;
-	union {
-		uint32_t pfn; /* UAR page frame number */
-		uint64_t pfn64; /* 64-bit UAR page frame number */
-	};
-};
-
-struct pvrdma_cmd_create_uc_resp {
-	struct pvrdma_cmd_resp_hdr hdr;
-	uint32_t ctx_handle;
-	uint8_t reserved[4];
-};
-
-struct pvrdma_cmd_destroy_uc {
-	struct pvrdma_cmd_hdr hdr;
-	uint32_t ctx_handle;
-	uint8_t reserved[4];
-};
-
-struct pvrdma_cmd_create_pd {
-	struct pvrdma_cmd_hdr hdr;
-	uint32_t ctx_handle;
-	uint8_t reserved[4];
-};
-
-struct pvrdma_cmd_create_pd_resp {
-	struct pvrdma_cmd_resp_hdr hdr;
-	uint32_t pd_handle;
-	uint8_t reserved[4];
-};
-
-struct pvrdma_cmd_destroy_pd {
-	struct pvrdma_cmd_hdr hdr;
-	uint32_t pd_handle;
-	uint8_t reserved[4];
-};
-
-struct pvrdma_cmd_create_mr {
-	struct pvrdma_cmd_hdr hdr;
-	uint64_t start;
-	uint64_t length;
-	uint64_t pdir_dma;
-	uint32_t pd_handle;
-	uint32_t access_flags;
-	uint32_t flags;
-	uint32_t nchunks;
-};
-
-struct pvrdma_cmd_create_mr_resp {
-	struct pvrdma_cmd_resp_hdr hdr;
-	uint32_t mr_handle;
-	uint32_t lkey;
-	uint32_t rkey;
-	uint8_t reserved[4];
-};
-
-struct pvrdma_cmd_destroy_mr {
-	struct pvrdma_cmd_hdr hdr;
-	uint32_t mr_handle;
-	uint8_t reserved[4];
-};
-
-struct pvrdma_cmd_create_cq {
-	struct pvrdma_cmd_hdr hdr;
-	uint64_t pdir_dma;
-	uint32_t ctx_handle;
-	uint32_t cqe;
-	uint32_t nchunks;
-	uint8_t reserved[4];
-};
-
-struct pvrdma_cmd_create_cq_resp {
-	struct pvrdma_cmd_resp_hdr hdr;
-	uint32_t cq_handle;
-	uint32_t cqe;
-};
-
-struct pvrdma_cmd_resize_cq {
-	struct pvrdma_cmd_hdr hdr;
-	uint32_t cq_handle;
-	uint32_t cqe;
-};
-
-struct pvrdma_cmd_resize_cq_resp {
-	struct pvrdma_cmd_resp_hdr hdr;
-	uint32_t cqe;
-	uint8_t reserved[4];
-};
-
-struct pvrdma_cmd_destroy_cq {
-	struct pvrdma_cmd_hdr hdr;
-	uint32_t cq_handle;
-	uint8_t reserved[4];
-};
-
-struct pvrdma_cmd_create_srq {
-	struct pvrdma_cmd_hdr hdr;
-	uint64_t pdir_dma;
-	uint32_t pd_handle;
-	uint32_t nchunks;
-	struct pvrdma_srq_attr attrs;
-	uint8_t srq_type;
-	uint8_t reserved[7];
-};
-
-struct pvrdma_cmd_create_srq_resp {
-	struct pvrdma_cmd_resp_hdr hdr;
-	uint32_t srqn;
-	uint8_t reserved[4];
-};
-
-struct pvrdma_cmd_modify_srq {
-	struct pvrdma_cmd_hdr hdr;
-	uint32_t srq_handle;
-	uint32_t attr_mask;
-	struct pvrdma_srq_attr attrs;
-};
-
-struct pvrdma_cmd_query_srq {
-	struct pvrdma_cmd_hdr hdr;
-	uint32_t srq_handle;
-	uint8_t reserved[4];
-};
-
-struct pvrdma_cmd_query_srq_resp {
-	struct pvrdma_cmd_resp_hdr hdr;
-	struct pvrdma_srq_attr attrs;
-};
-
-struct pvrdma_cmd_destroy_srq {
-	struct pvrdma_cmd_hdr hdr;
-	uint32_t srq_handle;
-	uint8_t reserved[4];
-};
-
-struct pvrdma_cmd_create_qp {
-	struct pvrdma_cmd_hdr hdr;
-	uint64_t pdir_dma;
-	uint32_t pd_handle;
-	uint32_t send_cq_handle;
-	uint32_t recv_cq_handle;
-	uint32_t srq_handle;
-	uint32_t max_send_wr;
-	uint32_t max_recv_wr;
-	uint32_t max_send_sge;
-	uint32_t max_recv_sge;
-	uint32_t max_inline_data;
-	uint32_t lkey;
-	uint32_t access_flags;
-	uint16_t total_chunks;
-	uint16_t send_chunks;
-	uint16_t max_atomic_arg;
-	uint8_t sq_sig_all;
-	uint8_t qp_type;
-	uint8_t is_srq;
-	uint8_t reserved[3];
-};
-
-struct pvrdma_cmd_create_qp_resp {
-	struct pvrdma_cmd_resp_hdr hdr;
-	uint32_t qpn;
-	uint32_t max_send_wr;
-	uint32_t max_recv_wr;
-	uint32_t max_send_sge;
-	uint32_t max_recv_sge;
-	uint32_t max_inline_data;
-};
-
-struct pvrdma_cmd_create_qp_resp_v2 {
-	struct pvrdma_cmd_resp_hdr hdr;
-	uint32_t qpn;
-	uint32_t qp_handle;
-	uint32_t max_send_wr;
-	uint32_t max_recv_wr;
-	uint32_t max_send_sge;
-	uint32_t max_recv_sge;
-	uint32_t max_inline_data;
-};
-
-struct pvrdma_cmd_modify_qp {
-	struct pvrdma_cmd_hdr hdr;
-	uint32_t qp_handle;
-	uint32_t attr_mask;
-	struct pvrdma_qp_attr attrs;
-};
-
-struct pvrdma_cmd_query_qp {
-	struct pvrdma_cmd_hdr hdr;
-	uint32_t qp_handle;
-	uint32_t attr_mask;
-};
-
-struct pvrdma_cmd_query_qp_resp {
-	struct pvrdma_cmd_resp_hdr hdr;
-	struct pvrdma_qp_attr attrs;
-};
-
-struct pvrdma_cmd_destroy_qp {
-	struct pvrdma_cmd_hdr hdr;
-	uint32_t qp_handle;
-	uint8_t reserved[4];
-};
-
-struct pvrdma_cmd_destroy_qp_resp {
-	struct pvrdma_cmd_resp_hdr hdr;
-	uint32_t events_reported;
-	uint8_t reserved[4];
-};
-
-struct pvrdma_cmd_create_bind {
-	struct pvrdma_cmd_hdr hdr;
-	uint32_t mtu;
-	uint32_t vlan;
-	uint32_t index;
-	uint8_t new_gid[16];
-	uint8_t gid_type;
-	uint8_t reserved[3];
-};
-
-struct pvrdma_cmd_destroy_bind {
-	struct pvrdma_cmd_hdr hdr;
-	uint32_t index;
-	uint8_t dest_gid[16];
-	uint8_t reserved[4];
-};
-
-union pvrdma_cmd_req {
-	struct pvrdma_cmd_hdr hdr;
-	struct pvrdma_cmd_query_port query_port;
-	struct pvrdma_cmd_query_pkey query_pkey;
-	struct pvrdma_cmd_create_uc create_uc;
-	struct pvrdma_cmd_destroy_uc destroy_uc;
-	struct pvrdma_cmd_create_pd create_pd;
-	struct pvrdma_cmd_destroy_pd destroy_pd;
-	struct pvrdma_cmd_create_mr create_mr;
-	struct pvrdma_cmd_destroy_mr destroy_mr;
-	struct pvrdma_cmd_create_cq create_cq;
-	struct pvrdma_cmd_resize_cq resize_cq;
-	struct pvrdma_cmd_destroy_cq destroy_cq;
-	struct pvrdma_cmd_create_qp create_qp;
-	struct pvrdma_cmd_modify_qp modify_qp;
-	struct pvrdma_cmd_query_qp query_qp;
-	struct pvrdma_cmd_destroy_qp destroy_qp;
-	struct pvrdma_cmd_create_bind create_bind;
-	struct pvrdma_cmd_destroy_bind destroy_bind;
-	struct pvrdma_cmd_create_srq create_srq;
-	struct pvrdma_cmd_modify_srq modify_srq;
-	struct pvrdma_cmd_query_srq query_srq;
-	struct pvrdma_cmd_destroy_srq destroy_srq;
-};
-
-union pvrdma_cmd_resp {
-	struct pvrdma_cmd_resp_hdr hdr;
-	struct pvrdma_cmd_query_port_resp query_port_resp;
-	struct pvrdma_cmd_query_pkey_resp query_pkey_resp;
-	struct pvrdma_cmd_create_uc_resp create_uc_resp;
-	struct pvrdma_cmd_create_pd_resp create_pd_resp;
-	struct pvrdma_cmd_create_mr_resp create_mr_resp;
-	struct pvrdma_cmd_create_cq_resp create_cq_resp;
-	struct pvrdma_cmd_resize_cq_resp resize_cq_resp;
-	struct pvrdma_cmd_create_qp_resp create_qp_resp;
-	struct pvrdma_cmd_create_qp_resp_v2 create_qp_resp_v2;
-	struct pvrdma_cmd_query_qp_resp query_qp_resp;
-	struct pvrdma_cmd_destroy_qp_resp destroy_qp_resp;
-	struct pvrdma_cmd_create_srq_resp create_srq_resp;
-	struct pvrdma_cmd_query_srq_resp query_srq_resp;
-};
-
-#endif /* __PVRDMA_DEV_API_H__ */
diff --git a/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h b/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h
deleted file mode 100644
index 94d41b202c..0000000000
--- a/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h
+++ /dev/null
@@ -1,348 +0,0 @@
-/*
- * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of EITHER the GNU General Public License
- * version 2 as published by the Free Software Foundation or the BSD
- * 2-Clause License. This program is distributed in the hope that it
- * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
- * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
- * See the GNU General Public License version 2 for more details at
- * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program available in the file COPYING in the main
- * directory of this source tree.
- *
- * The BSD 2-Clause License
- *
- *     Redistribution and use in source and binary forms, with or
- *     without modification, are permitted provided that the following
- *     conditions are met:
- *
- *      - Redistributions of source code must retain the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer.
- *
- *      - Redistributions in binary form must reproduce the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer in the documentation and/or other materials
- *        provided with the distribution.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
- * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
- * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
- * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
- * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
- * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
- * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
- * OF THE POSSIBILITY OF SUCH DAMAGE.
- */
-
-#ifndef __PVRDMA_VERBS_H__
-#define __PVRDMA_VERBS_H__
-
-#include "standard-headers/linux/types.h"
-
-union pvrdma_gid {
-	uint8_t	raw[16];
-	struct {
-		uint64_t	subnet_prefix;
-		uint64_t	interface_id;
-	} global;
-};
-
-enum pvrdma_link_layer {
-	PVRDMA_LINK_LAYER_UNSPECIFIED,
-	PVRDMA_LINK_LAYER_INFINIBAND,
-	PVRDMA_LINK_LAYER_ETHERNET,
-};
-
-enum pvrdma_mtu {
-	PVRDMA_MTU_256  = 1,
-	PVRDMA_MTU_512  = 2,
-	PVRDMA_MTU_1024 = 3,
-	PVRDMA_MTU_2048 = 4,
-	PVRDMA_MTU_4096 = 5,
-};
-
-enum pvrdma_port_state {
-	PVRDMA_PORT_NOP			= 0,
-	PVRDMA_PORT_DOWN		= 1,
-	PVRDMA_PORT_INIT		= 2,
-	PVRDMA_PORT_ARMED		= 3,
-	PVRDMA_PORT_ACTIVE		= 4,
-	PVRDMA_PORT_ACTIVE_DEFER	= 5,
-};
-
-enum pvrdma_port_cap_flags {
-	PVRDMA_PORT_SM				= 1 <<  1,
-	PVRDMA_PORT_NOTICE_SUP			= 1 <<  2,
-	PVRDMA_PORT_TRAP_SUP			= 1 <<  3,
-	PVRDMA_PORT_OPT_IPD_SUP			= 1 <<  4,
-	PVRDMA_PORT_AUTO_MIGR_SUP		= 1 <<  5,
-	PVRDMA_PORT_SL_MAP_SUP			= 1 <<  6,
-	PVRDMA_PORT_MKEY_NVRAM			= 1 <<  7,
-	PVRDMA_PORT_PKEY_NVRAM			= 1 <<  8,
-	PVRDMA_PORT_LED_INFO_SUP		= 1 <<  9,
-	PVRDMA_PORT_SM_DISABLED			= 1 << 10,
-	PVRDMA_PORT_SYS_IMAGE_GUID_SUP		= 1 << 11,
-	PVRDMA_PORT_PKEY_SW_EXT_PORT_TRAP_SUP	= 1 << 12,
-	PVRDMA_PORT_EXTENDED_SPEEDS_SUP		= 1 << 14,
-	PVRDMA_PORT_CM_SUP			= 1 << 16,
-	PVRDMA_PORT_SNMP_TUNNEL_SUP		= 1 << 17,
-	PVRDMA_PORT_REINIT_SUP			= 1 << 18,
-	PVRDMA_PORT_DEVICE_MGMT_SUP		= 1 << 19,
-	PVRDMA_PORT_VENDOR_CLASS_SUP		= 1 << 20,
-	PVRDMA_PORT_DR_NOTICE_SUP		= 1 << 21,
-	PVRDMA_PORT_CAP_MASK_NOTICE_SUP		= 1 << 22,
-	PVRDMA_PORT_BOOT_MGMT_SUP		= 1 << 23,
-	PVRDMA_PORT_LINK_LATENCY_SUP		= 1 << 24,
-	PVRDMA_PORT_CLIENT_REG_SUP		= 1 << 25,
-	PVRDMA_PORT_IP_BASED_GIDS		= 1 << 26,
-	PVRDMA_PORT_CAP_FLAGS_MAX		= PVRDMA_PORT_IP_BASED_GIDS,
-};
-
-enum pvrdma_port_width {
-	PVRDMA_WIDTH_1X		= 1,
-	PVRDMA_WIDTH_4X		= 2,
-	PVRDMA_WIDTH_8X		= 4,
-	PVRDMA_WIDTH_12X	= 8,
-};
-
-enum pvrdma_port_speed {
-	PVRDMA_SPEED_SDR	= 1,
-	PVRDMA_SPEED_DDR	= 2,
-	PVRDMA_SPEED_QDR	= 4,
-	PVRDMA_SPEED_FDR10	= 8,
-	PVRDMA_SPEED_FDR	= 16,
-	PVRDMA_SPEED_EDR	= 32,
-};
-
-struct pvrdma_port_attr {
-	enum pvrdma_port_state	state;
-	enum pvrdma_mtu		max_mtu;
-	enum pvrdma_mtu		active_mtu;
-	uint32_t			gid_tbl_len;
-	uint32_t			port_cap_flags;
-	uint32_t			max_msg_sz;
-	uint32_t			bad_pkey_cntr;
-	uint32_t			qkey_viol_cntr;
-	uint16_t			pkey_tbl_len;
-	uint16_t			lid;
-	uint16_t			sm_lid;
-	uint8_t			lmc;
-	uint8_t			max_vl_num;
-	uint8_t			sm_sl;
-	uint8_t			subnet_timeout;
-	uint8_t			init_type_reply;
-	uint8_t			active_width;
-	uint8_t			active_speed;
-	uint8_t			phys_state;
-	uint8_t			reserved[2];
-};
-
-struct pvrdma_global_route {
-	union pvrdma_gid	dgid;
-	uint32_t			flow_label;
-	uint8_t			sgid_index;
-	uint8_t			hop_limit;
-	uint8_t			traffic_class;
-	uint8_t			reserved;
-};
-
-struct pvrdma_grh {
-	uint32_t			version_tclass_flow;
-	uint16_t			paylen;
-	uint8_t			next_hdr;
-	uint8_t			hop_limit;
-	union pvrdma_gid	sgid;
-	union pvrdma_gid	dgid;
-};
-
-enum pvrdma_ah_flags {
-	PVRDMA_AH_GRH = 1,
-};
-
-enum pvrdma_rate {
-	PVRDMA_RATE_PORT_CURRENT	= 0,
-	PVRDMA_RATE_2_5_GBPS		= 2,
-	PVRDMA_RATE_5_GBPS		= 5,
-	PVRDMA_RATE_10_GBPS		= 3,
-	PVRDMA_RATE_20_GBPS		= 6,
-	PVRDMA_RATE_30_GBPS		= 4,
-	PVRDMA_RATE_40_GBPS		= 7,
-	PVRDMA_RATE_60_GBPS		= 8,
-	PVRDMA_RATE_80_GBPS		= 9,
-	PVRDMA_RATE_120_GBPS		= 10,
-	PVRDMA_RATE_14_GBPS		= 11,
-	PVRDMA_RATE_56_GBPS		= 12,
-	PVRDMA_RATE_112_GBPS		= 13,
-	PVRDMA_RATE_168_GBPS		= 14,
-	PVRDMA_RATE_25_GBPS		= 15,
-	PVRDMA_RATE_100_GBPS		= 16,
-	PVRDMA_RATE_200_GBPS		= 17,
-	PVRDMA_RATE_300_GBPS		= 18,
-};
-
-struct pvrdma_ah_attr {
-	struct pvrdma_global_route	grh;
-	uint16_t				dlid;
-	uint16_t				vlan_id;
-	uint8_t				sl;
-	uint8_t				src_path_bits;
-	uint8_t				static_rate;
-	uint8_t				ah_flags;
-	uint8_t				port_num;
-	uint8_t				dmac[6];
-	uint8_t				reserved;
-};
-
-enum pvrdma_cq_notify_flags {
-	PVRDMA_CQ_SOLICITED		= 1 << 0,
-	PVRDMA_CQ_NEXT_COMP		= 1 << 1,
-	PVRDMA_CQ_SOLICITED_MASK	= PVRDMA_CQ_SOLICITED |
-					  PVRDMA_CQ_NEXT_COMP,
-	PVRDMA_CQ_REPORT_MISSED_EVENTS	= 1 << 2,
-};
-
-struct pvrdma_qp_cap {
-	uint32_t	max_send_wr;
-	uint32_t	max_recv_wr;
-	uint32_t	max_send_sge;
-	uint32_t	max_recv_sge;
-	uint32_t	max_inline_data;
-	uint32_t	reserved;
-};
-
-enum pvrdma_sig_type {
-	PVRDMA_SIGNAL_ALL_WR,
-	PVRDMA_SIGNAL_REQ_WR,
-};
-
-enum pvrdma_qp_type {
-	PVRDMA_QPT_SMI,
-	PVRDMA_QPT_GSI,
-	PVRDMA_QPT_RC,
-	PVRDMA_QPT_UC,
-	PVRDMA_QPT_UD,
-	PVRDMA_QPT_RAW_IPV6,
-	PVRDMA_QPT_RAW_ETHERTYPE,
-	PVRDMA_QPT_RAW_PACKET = 8,
-	PVRDMA_QPT_XRC_INI = 9,
-	PVRDMA_QPT_XRC_TGT,
-	PVRDMA_QPT_MAX,
-};
-
-enum pvrdma_qp_create_flags {
-	PVRDMA_QP_CREATE_IPOPVRDMA_UD_LSO		= 1 << 0,
-	PVRDMA_QP_CREATE_BLOCK_MULTICAST_LOOPBACK	= 1 << 1,
-};
-
-enum pvrdma_qp_attr_mask {
-	PVRDMA_QP_STATE			= 1 << 0,
-	PVRDMA_QP_CUR_STATE		= 1 << 1,
-	PVRDMA_QP_EN_SQD_ASYNC_NOTIFY	= 1 << 2,
-	PVRDMA_QP_ACCESS_FLAGS		= 1 << 3,
-	PVRDMA_QP_PKEY_INDEX		= 1 << 4,
-	PVRDMA_QP_PORT			= 1 << 5,
-	PVRDMA_QP_QKEY			= 1 << 6,
-	PVRDMA_QP_AV			= 1 << 7,
-	PVRDMA_QP_PATH_MTU		= 1 << 8,
-	PVRDMA_QP_TIMEOUT		= 1 << 9,
-	PVRDMA_QP_RETRY_CNT		= 1 << 10,
-	PVRDMA_QP_RNR_RETRY		= 1 << 11,
-	PVRDMA_QP_RQ_PSN		= 1 << 12,
-	PVRDMA_QP_MAX_QP_RD_ATOMIC	= 1 << 13,
-	PVRDMA_QP_ALT_PATH		= 1 << 14,
-	PVRDMA_QP_MIN_RNR_TIMER		= 1 << 15,
-	PVRDMA_QP_SQ_PSN		= 1 << 16,
-	PVRDMA_QP_MAX_DEST_RD_ATOMIC	= 1 << 17,
-	PVRDMA_QP_PATH_MIG_STATE	= 1 << 18,
-	PVRDMA_QP_CAP			= 1 << 19,
-	PVRDMA_QP_DEST_QPN		= 1 << 20,
-	PVRDMA_QP_ATTR_MASK_MAX		= PVRDMA_QP_DEST_QPN,
-};
-
-enum pvrdma_qp_state {
-	PVRDMA_QPS_RESET,
-	PVRDMA_QPS_INIT,
-	PVRDMA_QPS_RTR,
-	PVRDMA_QPS_RTS,
-	PVRDMA_QPS_SQD,
-	PVRDMA_QPS_SQE,
-	PVRDMA_QPS_ERR,
-};
-
-enum pvrdma_mig_state {
-	PVRDMA_MIG_MIGRATED,
-	PVRDMA_MIG_REARM,
-	PVRDMA_MIG_ARMED,
-};
-
-enum pvrdma_mw_type {
-	PVRDMA_MW_TYPE_1 = 1,
-	PVRDMA_MW_TYPE_2 = 2,
-};
-
-struct pvrdma_srq_attr {
-	uint32_t			max_wr;
-	uint32_t			max_sge;
-	uint32_t			srq_limit;
-	uint32_t			reserved;
-};
-
-struct pvrdma_qp_attr {
-	enum pvrdma_qp_state	qp_state;
-	enum pvrdma_qp_state	cur_qp_state;
-	enum pvrdma_mtu		path_mtu;
-	enum pvrdma_mig_state	path_mig_state;
-	uint32_t			qkey;
-	uint32_t			rq_psn;
-	uint32_t			sq_psn;
-	uint32_t			dest_qp_num;
-	uint32_t			qp_access_flags;
-	uint16_t			pkey_index;
-	uint16_t			alt_pkey_index;
-	uint8_t			en_sqd_async_notify;
-	uint8_t			sq_draining;
-	uint8_t			max_rd_atomic;
-	uint8_t			max_dest_rd_atomic;
-	uint8_t			min_rnr_timer;
-	uint8_t			port_num;
-	uint8_t			timeout;
-	uint8_t			retry_cnt;
-	uint8_t			rnr_retry;
-	uint8_t			alt_port_num;
-	uint8_t			alt_timeout;
-	uint8_t			reserved[5];
-	struct pvrdma_qp_cap	cap;
-	struct pvrdma_ah_attr	ah_attr;
-	struct pvrdma_ah_attr	alt_ah_attr;
-};
-
-enum pvrdma_send_flags {
-	PVRDMA_SEND_FENCE	= 1 << 0,
-	PVRDMA_SEND_SIGNALED	= 1 << 1,
-	PVRDMA_SEND_SOLICITED	= 1 << 2,
-	PVRDMA_SEND_INLINE	= 1 << 3,
-	PVRDMA_SEND_IP_CSUM	= 1 << 4,
-	PVRDMA_SEND_FLAGS_MAX	= PVRDMA_SEND_IP_CSUM,
-};
-
-enum pvrdma_access_flags {
-	PVRDMA_ACCESS_LOCAL_WRITE	= 1 << 0,
-	PVRDMA_ACCESS_REMOTE_WRITE	= 1 << 1,
-	PVRDMA_ACCESS_REMOTE_READ	= 1 << 2,
-	PVRDMA_ACCESS_REMOTE_ATOMIC	= 1 << 3,
-	PVRDMA_ACCESS_MW_BIND		= 1 << 4,
-	PVRDMA_ZERO_BASED		= 1 << 5,
-	PVRDMA_ACCESS_ON_DEMAND		= 1 << 6,
-	PVRDMA_ACCESS_FLAGS_MAX		= PVRDMA_ACCESS_ON_DEMAND,
-};
-
-#endif /* __PVRDMA_VERBS_H__ */
diff --git a/include/standard-headers/rdma/vmw_pvrdma-abi.h b/include/standard-headers/rdma/vmw_pvrdma-abi.h
deleted file mode 100644
index c30182a7ae..0000000000
--- a/include/standard-headers/rdma/vmw_pvrdma-abi.h
+++ /dev/null
@@ -1,310 +0,0 @@
-/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
-/*
- * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of EITHER the GNU General Public License
- * version 2 as published by the Free Software Foundation or the BSD
- * 2-Clause License. This program is distributed in the hope that it
- * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
- * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
- * See the GNU General Public License version 2 for more details at
- * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program available in the file COPYING in the main
- * directory of this source tree.
- *
- * The BSD 2-Clause License
- *
- *     Redistribution and use in source and binary forms, with or
- *     without modification, are permitted provided that the following
- *     conditions are met:
- *
- *      - Redistributions of source code must retain the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer.
- *
- *      - Redistributions in binary form must reproduce the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer in the documentation and/or other materials
- *        provided with the distribution.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
- * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
- * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
- * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
- * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
- * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
- * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
- * OF THE POSSIBILITY OF SUCH DAMAGE.
- */
-
-#ifndef __VMW_PVRDMA_ABI_H__
-#define __VMW_PVRDMA_ABI_H__
-
-#include "standard-headers/linux/types.h"
-
-#define PVRDMA_UVERBS_ABI_VERSION	3		/* ABI Version. */
-#define PVRDMA_UAR_HANDLE_MASK		0x00FFFFFF	/* Bottom 24 bits. */
-#define PVRDMA_UAR_QP_OFFSET		0		/* QP doorbell. */
-#define PVRDMA_UAR_QP_SEND		(1 << 30)	/* Send bit. */
-#define PVRDMA_UAR_QP_RECV		(1 << 31)	/* Recv bit. */
-#define PVRDMA_UAR_CQ_OFFSET		4		/* CQ doorbell. */
-#define PVRDMA_UAR_CQ_ARM_SOL		(1 << 29)	/* Arm solicited bit. */
-#define PVRDMA_UAR_CQ_ARM		(1 << 30)	/* Arm bit. */
-#define PVRDMA_UAR_CQ_POLL		(1 << 31)	/* Poll bit. */
-#define PVRDMA_UAR_SRQ_OFFSET		8		/* SRQ doorbell. */
-#define PVRDMA_UAR_SRQ_RECV		(1 << 30)	/* Recv bit. */
-
-enum pvrdma_wr_opcode {
-	PVRDMA_WR_RDMA_WRITE,
-	PVRDMA_WR_RDMA_WRITE_WITH_IMM,
-	PVRDMA_WR_SEND,
-	PVRDMA_WR_SEND_WITH_IMM,
-	PVRDMA_WR_RDMA_READ,
-	PVRDMA_WR_ATOMIC_CMP_AND_SWP,
-	PVRDMA_WR_ATOMIC_FETCH_AND_ADD,
-	PVRDMA_WR_LSO,
-	PVRDMA_WR_SEND_WITH_INV,
-	PVRDMA_WR_RDMA_READ_WITH_INV,
-	PVRDMA_WR_LOCAL_INV,
-	PVRDMA_WR_FAST_REG_MR,
-	PVRDMA_WR_MASKED_ATOMIC_CMP_AND_SWP,
-	PVRDMA_WR_MASKED_ATOMIC_FETCH_AND_ADD,
-	PVRDMA_WR_BIND_MW,
-	PVRDMA_WR_REG_SIG_MR,
-	PVRDMA_WR_ERROR,
-};
-
-enum pvrdma_wc_status {
-	PVRDMA_WC_SUCCESS,
-	PVRDMA_WC_LOC_LEN_ERR,
-	PVRDMA_WC_LOC_QP_OP_ERR,
-	PVRDMA_WC_LOC_EEC_OP_ERR,
-	PVRDMA_WC_LOC_PROT_ERR,
-	PVRDMA_WC_WR_FLUSH_ERR,
-	PVRDMA_WC_MW_BIND_ERR,
-	PVRDMA_WC_BAD_RESP_ERR,
-	PVRDMA_WC_LOC_ACCESS_ERR,
-	PVRDMA_WC_REM_INV_REQ_ERR,
-	PVRDMA_WC_REM_ACCESS_ERR,
-	PVRDMA_WC_REM_OP_ERR,
-	PVRDMA_WC_RETRY_EXC_ERR,
-	PVRDMA_WC_RNR_RETRY_EXC_ERR,
-	PVRDMA_WC_LOC_RDD_VIOL_ERR,
-	PVRDMA_WC_REM_INV_RD_REQ_ERR,
-	PVRDMA_WC_REM_ABORT_ERR,
-	PVRDMA_WC_INV_EECN_ERR,
-	PVRDMA_WC_INV_EEC_STATE_ERR,
-	PVRDMA_WC_FATAL_ERR,
-	PVRDMA_WC_RESP_TIMEOUT_ERR,
-	PVRDMA_WC_GENERAL_ERR,
-};
-
-enum pvrdma_wc_opcode {
-	PVRDMA_WC_SEND,
-	PVRDMA_WC_RDMA_WRITE,
-	PVRDMA_WC_RDMA_READ,
-	PVRDMA_WC_COMP_SWAP,
-	PVRDMA_WC_FETCH_ADD,
-	PVRDMA_WC_BIND_MW,
-	PVRDMA_WC_LSO,
-	PVRDMA_WC_LOCAL_INV,
-	PVRDMA_WC_FAST_REG_MR,
-	PVRDMA_WC_MASKED_COMP_SWAP,
-	PVRDMA_WC_MASKED_FETCH_ADD,
-	PVRDMA_WC_RECV = 1 << 7,
-	PVRDMA_WC_RECV_RDMA_WITH_IMM,
-};
-
-enum pvrdma_wc_flags {
-	PVRDMA_WC_GRH			= 1 << 0,
-	PVRDMA_WC_WITH_IMM		= 1 << 1,
-	PVRDMA_WC_WITH_INVALIDATE	= 1 << 2,
-	PVRDMA_WC_IP_CSUM_OK		= 1 << 3,
-	PVRDMA_WC_WITH_SMAC		= 1 << 4,
-	PVRDMA_WC_WITH_VLAN		= 1 << 5,
-	PVRDMA_WC_WITH_NETWORK_HDR_TYPE	= 1 << 6,
-	PVRDMA_WC_FLAGS_MAX		= PVRDMA_WC_WITH_NETWORK_HDR_TYPE,
-};
-
-enum pvrdma_network_type {
-	PVRDMA_NETWORK_IB,
-	PVRDMA_NETWORK_ROCE_V1 = PVRDMA_NETWORK_IB,
-	PVRDMA_NETWORK_IPV4,
-	PVRDMA_NETWORK_IPV6
-};
-
-struct pvrdma_alloc_ucontext_resp {
-	uint32_t qp_tab_size;
-	uint32_t reserved;
-};
-
-struct pvrdma_alloc_pd_resp {
-	uint32_t pdn;
-	uint32_t reserved;
-};
-
-struct pvrdma_create_cq {
-	uint64_t __attribute__((aligned(8))) buf_addr;
-	uint32_t buf_size;
-	uint32_t reserved;
-};
-
-struct pvrdma_create_cq_resp {
-	uint32_t cqn;
-	uint32_t reserved;
-};
-
-struct pvrdma_resize_cq {
-	uint64_t __attribute__((aligned(8))) buf_addr;
-	uint32_t buf_size;
-	uint32_t reserved;
-};
-
-struct pvrdma_create_srq {
-	uint64_t __attribute__((aligned(8))) buf_addr;
-	uint32_t buf_size;
-	uint32_t reserved;
-};
-
-struct pvrdma_create_srq_resp {
-	uint32_t srqn;
-	uint32_t reserved;
-};
-
-struct pvrdma_create_qp {
-	uint64_t __attribute__((aligned(8))) rbuf_addr;
-	uint64_t __attribute__((aligned(8))) sbuf_addr;
-	uint32_t rbuf_size;
-	uint32_t sbuf_size;
-	uint64_t __attribute__((aligned(8))) qp_addr;
-};
-
-struct pvrdma_create_qp_resp {
-	uint32_t qpn;
-	uint32_t qp_handle;
-};
-
-/* PVRDMA masked atomic compare and swap */
-struct pvrdma_ex_cmp_swap {
-	uint64_t __attribute__((aligned(8))) swap_val;
-	uint64_t __attribute__((aligned(8))) compare_val;
-	uint64_t __attribute__((aligned(8))) swap_mask;
-	uint64_t __attribute__((aligned(8))) compare_mask;
-};
-
-/* PVRDMA masked atomic fetch and add */
-struct pvrdma_ex_fetch_add {
-	uint64_t __attribute__((aligned(8))) add_val;
-	uint64_t __attribute__((aligned(8))) field_boundary;
-};
-
-/* PVRDMA address vector. */
-struct pvrdma_av {
-	uint32_t port_pd;
-	uint32_t sl_tclass_flowlabel;
-	uint8_t dgid[16];
-	uint8_t src_path_bits;
-	uint8_t gid_index;
-	uint8_t stat_rate;
-	uint8_t hop_limit;
-	uint8_t dmac[6];
-	uint8_t reserved[6];
-};
-
-/* PVRDMA scatter/gather entry */
-struct pvrdma_sge {
-	uint64_t __attribute__((aligned(8))) addr;
-	uint32_t   length;
-	uint32_t   lkey;
-};
-
-/* PVRDMA receive queue work request */
-struct pvrdma_rq_wqe_hdr {
-	uint64_t __attribute__((aligned(8))) wr_id;		/* wr id */
-	uint32_t num_sge;		/* size of s/g array */
-	uint32_t total_len;	/* reserved */
-};
-/* Use pvrdma_sge (ib_sge) for receive queue s/g array elements. */
-
-/* PVRDMA send queue work request */
-struct pvrdma_sq_wqe_hdr {
-	uint64_t __attribute__((aligned(8))) wr_id;		/* wr id */
-	uint32_t num_sge;		/* size of s/g array */
-	uint32_t total_len;	/* reserved */
-	uint32_t opcode;		/* operation type */
-	uint32_t send_flags;	/* wr flags */
-	union {
-		uint32_t imm_data;
-		uint32_t invalidate_rkey;
-	} ex;
-	uint32_t reserved;
-	union {
-		struct {
-			uint64_t __attribute__((aligned(8))) remote_addr;
-			uint32_t rkey;
-			uint8_t reserved[4];
-		} rdma;
-		struct {
-			uint64_t __attribute__((aligned(8))) remote_addr;
-			uint64_t __attribute__((aligned(8))) compare_add;
-			uint64_t __attribute__((aligned(8))) swap;
-			uint32_t rkey;
-			uint32_t reserved;
-		} atomic;
-		struct {
-			uint64_t __attribute__((aligned(8))) remote_addr;
-			uint32_t log_arg_sz;
-			uint32_t rkey;
-			union {
-				struct pvrdma_ex_cmp_swap  cmp_swap;
-				struct pvrdma_ex_fetch_add fetch_add;
-			} wr_data;
-		} masked_atomics;
-		struct {
-			uint64_t __attribute__((aligned(8))) iova_start;
-			uint64_t __attribute__((aligned(8))) pl_pdir_dma;
-			uint32_t page_shift;
-			uint32_t page_list_len;
-			uint32_t length;
-			uint32_t access_flags;
-			uint32_t rkey;
-			uint32_t reserved;
-		} fast_reg;
-		struct {
-			uint32_t remote_qpn;
-			uint32_t remote_qkey;
-			struct pvrdma_av av;
-		} ud;
-	} wr;
-};
-/* Use pvrdma_sge (ib_sge) for send queue s/g array elements. */
-
-/* Completion queue element. */
-struct pvrdma_cqe {
-	uint64_t __attribute__((aligned(8))) wr_id;
-	uint64_t __attribute__((aligned(8))) qp;
-	uint32_t opcode;
-	uint32_t status;
-	uint32_t byte_len;
-	uint32_t imm_data;
-	uint32_t src_qp;
-	uint32_t wc_flags;
-	uint32_t vendor_err;
-	uint16_t pkey_index;
-	uint16_t slid;
-	uint8_t sl;
-	uint8_t dlid_path_bits;
-	uint8_t port_num;
-	uint8_t smac[6];
-	uint8_t network_hdr_type;
-	uint8_t reserved2[6]; /* Pad to next power of 2 (64). */
-};
-
-#endif /* __VMW_PVRDMA_ABI_H__ */
diff --git a/contrib/rdmacm-mux/main.c b/contrib/rdmacm-mux/main.c
deleted file mode 100644
index 771ca01e03..0000000000
--- a/contrib/rdmacm-mux/main.c
+++ /dev/null
@@ -1,831 +0,0 @@
-/*
- * QEMU paravirtual RDMA - rdmacm-mux implementation
- *
- * Copyright (C) 2018 Oracle
- * Copyright (C) 2018 Red Hat Inc
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *     Marcel Apfelbaum <marcel@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#include "qemu/osdep.h"
-#include <sys/poll.h>
-#include <sys/ioctl.h>
-#include <pthread.h>
-#include <syslog.h>
-
-#include <infiniband/verbs.h>
-#include <infiniband/umad.h>
-#include <infiniband/umad_types.h>
-#include <infiniband/umad_sa.h>
-#include <infiniband/umad_cm.h>
-
-#include "rdmacm-mux.h"
-
-#define SCALE_US 1000
-#define COMMID_TTL 2 /* How many SCALE_US a context of MAD session is saved */
-#define SLEEP_SECS 5 /* This is used both in poll() and thread */
-#define SERVER_LISTEN_BACKLOG 10
-#define MAX_CLIENTS 4096
-#define MAD_RMPP_VERSION 0
-#define MAD_METHOD_MASK0 0x8
-
-#define IB_USER_MAD_LONGS_PER_METHOD_MASK (128 / (8 * sizeof(long)))
-
-#define CM_REQ_DGID_POS      80
-#define CM_SIDR_REQ_DGID_POS 44
-
-/* The below can be override by command line parameter */
-#define UNIX_SOCKET_PATH "/var/run/rdmacm-mux"
-/* Has format %s-%s-%d" <path>-<rdma-dev--name>-<port> */
-#define SOCKET_PATH_MAX (PATH_MAX - NAME_MAX - sizeof(int) - 2)
-#define RDMA_PORT_NUM 1
-
-typedef struct RdmaCmServerArgs {
-    char unix_socket_path[PATH_MAX];
-    char rdma_dev_name[NAME_MAX];
-    int rdma_port_num;
-} RdmaCMServerArgs;
-
-typedef struct CommId2FdEntry {
-    int fd;
-    int ttl; /* Initialized to 2, decrement each timeout, entry delete when 0 */
-    __be64 gid_ifid;
-} CommId2FdEntry;
-
-typedef struct RdmaCmUMadAgent {
-    int port_id;
-    int agent_id;
-    GHashTable *gid2fd; /* Used to find fd of a given gid */
-    GHashTable *commid2fd; /* Used to find fd on of a given comm_id */
-} RdmaCmUMadAgent;
-
-typedef struct RdmaCmServer {
-    bool run;
-    RdmaCMServerArgs args;
-    struct pollfd fds[MAX_CLIENTS];
-    int nfds;
-    RdmaCmUMadAgent umad_agent;
-    pthread_t umad_recv_thread;
-    pthread_rwlock_t lock;
-} RdmaCMServer;
-
-static RdmaCMServer server = {0};
-
-static void usage(const char *progname)
-{
-    printf("Usage: %s [OPTION]...\n"
-           "Start a RDMA-CM multiplexer\n"
-           "\n"
-           "\t-h                    Show this help\n"
-           "\t-d rdma-device-name   Name of RDMA device to register with\n"
-           "\t-s unix-socket-path   Path to unix socket to listen on (default %s)\n"
-           "\t-p rdma-device-port   Port number of RDMA device to register with (default %d)\n",
-           progname, UNIX_SOCKET_PATH, RDMA_PORT_NUM);
-}
-
-static void help(const char *progname)
-{
-    fprintf(stderr, "Try '%s -h' for more information.\n", progname);
-}
-
-static void parse_args(int argc, char *argv[])
-{
-    int c;
-    char unix_socket_path[SOCKET_PATH_MAX];
-
-    strcpy(server.args.rdma_dev_name, "");
-    strcpy(unix_socket_path, UNIX_SOCKET_PATH);
-    server.args.rdma_port_num = RDMA_PORT_NUM;
-
-    while ((c = getopt(argc, argv, "hs:d:p:")) != -1) {
-        switch (c) {
-        case 'h':
-            usage(argv[0]);
-            exit(0);
-
-        case 'd':
-            strncpy(server.args.rdma_dev_name, optarg, NAME_MAX - 1);
-            break;
-
-        case 's':
-            /* This is temporary, final name will build below */
-            strncpy(unix_socket_path, optarg, SOCKET_PATH_MAX - 1);
-            break;
-
-        case 'p':
-            server.args.rdma_port_num = atoi(optarg);
-            break;
-
-        default:
-            help(argv[0]);
-            exit(1);
-        }
-    }
-
-    if (!strcmp(server.args.rdma_dev_name, "")) {
-        fprintf(stderr, "Missing RDMA device name\n");
-        help(argv[0]);
-        exit(1);
-    }
-
-    /* Build unique unix-socket file name */
-    snprintf(server.args.unix_socket_path, PATH_MAX, "%s-%s-%d",
-             unix_socket_path, server.args.rdma_dev_name,
-             server.args.rdma_port_num);
-
-    syslog(LOG_INFO, "unix_socket_path=%s", server.args.unix_socket_path);
-    syslog(LOG_INFO, "rdma-device-name=%s", server.args.rdma_dev_name);
-    syslog(LOG_INFO, "rdma-device-port=%d", server.args.rdma_port_num);
-}
-
-static void hash_tbl_alloc(void)
-{
-
-    server.umad_agent.gid2fd = g_hash_table_new_full(g_int64_hash,
-                                                     g_int64_equal,
-                                                     g_free, g_free);
-    server.umad_agent.commid2fd = g_hash_table_new_full(g_int_hash,
-                                                        g_int_equal,
-                                                        g_free, g_free);
-}
-
-static void hash_tbl_free(void)
-{
-    if (server.umad_agent.commid2fd) {
-        g_hash_table_destroy(server.umad_agent.commid2fd);
-    }
-    if (server.umad_agent.gid2fd) {
-        g_hash_table_destroy(server.umad_agent.gid2fd);
-    }
-}
-
-
-static int _hash_tbl_search_fd_by_ifid(__be64 *gid_ifid)
-{
-    int *fd;
-
-    fd = g_hash_table_lookup(server.umad_agent.gid2fd, gid_ifid);
-    if (!fd) {
-        /* Let's try IPv4 */
-        *gid_ifid |= 0x00000000ffff0000;
-        fd = g_hash_table_lookup(server.umad_agent.gid2fd, gid_ifid);
-    }
-
-    return fd ? *fd : 0;
-}
-
-static int hash_tbl_search_fd_by_ifid(int *fd, __be64 *gid_ifid)
-{
-    pthread_rwlock_rdlock(&server.lock);
-    *fd = _hash_tbl_search_fd_by_ifid(gid_ifid);
-    pthread_rwlock_unlock(&server.lock);
-
-    if (!*fd) {
-        syslog(LOG_WARNING, "Can't find matching for ifid 0x%llx\n", *gid_ifid);
-        return -ENOENT;
-    }
-
-    return 0;
-}
-
-static int hash_tbl_search_fd_by_comm_id(uint32_t comm_id, int *fd,
-                                         __be64 *gid_idid)
-{
-    CommId2FdEntry *fde;
-
-    pthread_rwlock_rdlock(&server.lock);
-    fde = g_hash_table_lookup(server.umad_agent.commid2fd, &comm_id);
-    pthread_rwlock_unlock(&server.lock);
-
-    if (!fde) {
-        syslog(LOG_WARNING, "Can't find matching for comm_id 0x%x\n", comm_id);
-        return -ENOENT;
-    }
-
-    *fd = fde->fd;
-    *gid_idid = fde->gid_ifid;
-
-    return 0;
-}
-
-static RdmaCmMuxErrCode add_fd_ifid_pair(int fd, __be64 gid_ifid)
-{
-    int fd1;
-
-    pthread_rwlock_wrlock(&server.lock);
-
-    fd1 = _hash_tbl_search_fd_by_ifid(&gid_ifid);
-    if (fd1) { /* record already exist - an error */
-        pthread_rwlock_unlock(&server.lock);
-        return fd == fd1 ? RDMACM_MUX_ERR_CODE_EEXIST :
-                           RDMACM_MUX_ERR_CODE_EACCES;
-    }
-
-    g_hash_table_insert(server.umad_agent.gid2fd, g_memdup(&gid_ifid,
-                        sizeof(gid_ifid)), g_memdup(&fd, sizeof(fd)));
-
-    pthread_rwlock_unlock(&server.lock);
-
-    syslog(LOG_INFO, "0x%lx registered on socket %d",
-           be64toh((uint64_t)gid_ifid), fd);
-
-    return RDMACM_MUX_ERR_CODE_OK;
-}
-
-static RdmaCmMuxErrCode delete_fd_ifid_pair(int fd, __be64 gid_ifid)
-{
-    int fd1;
-
-    pthread_rwlock_wrlock(&server.lock);
-
-    fd1 = _hash_tbl_search_fd_by_ifid(&gid_ifid);
-    if (!fd1) { /* record not exist - an error */
-        pthread_rwlock_unlock(&server.lock);
-        return RDMACM_MUX_ERR_CODE_ENOTFOUND;
-    }
-
-    g_hash_table_remove(server.umad_agent.gid2fd, g_memdup(&gid_ifid,
-                        sizeof(gid_ifid)));
-    pthread_rwlock_unlock(&server.lock);
-
-    syslog(LOG_INFO, "0x%lx unregistered on socket %d",
-           be64toh((uint64_t)gid_ifid), fd);
-
-    return RDMACM_MUX_ERR_CODE_OK;
-}
-
-static void hash_tbl_save_fd_comm_id_pair(int fd, uint32_t comm_id,
-                                          uint64_t gid_ifid)
-{
-    CommId2FdEntry fde = {fd, COMMID_TTL, gid_ifid};
-
-    pthread_rwlock_wrlock(&server.lock);
-    g_hash_table_insert(server.umad_agent.commid2fd,
-                        g_memdup(&comm_id, sizeof(comm_id)),
-                        g_memdup(&fde, sizeof(fde)));
-    pthread_rwlock_unlock(&server.lock);
-}
-
-static gboolean remove_old_comm_ids(gpointer key, gpointer value,
-                                    gpointer user_data)
-{
-    CommId2FdEntry *fde = (CommId2FdEntry *)value;
-
-    return !fde->ttl--;
-}
-
-static gboolean remove_entry_from_gid2fd(gpointer key, gpointer value,
-                                         gpointer user_data)
-{
-    if (*(int *)value == *(int *)user_data) {
-        syslog(LOG_INFO, "0x%lx unregistered on socket %d",
-               be64toh(*(uint64_t *)key), *(int *)value);
-        return true;
-    }
-
-    return false;
-}
-
-static void hash_tbl_remove_fd_ifid_pair(int fd)
-{
-    pthread_rwlock_wrlock(&server.lock);
-    g_hash_table_foreach_remove(server.umad_agent.gid2fd,
-                                remove_entry_from_gid2fd, (gpointer)&fd);
-    pthread_rwlock_unlock(&server.lock);
-}
-
-static int get_fd(const char *mad, int umad_len, int *fd, __be64 *gid_ifid)
-{
-    struct umad_hdr *hdr = (struct umad_hdr *)mad;
-    char *data = (char *)hdr + sizeof(*hdr);
-    int32_t comm_id = 0;
-    uint16_t attr_id = be16toh(hdr->attr_id);
-    int rc = 0;
-
-    if (umad_len <= sizeof(*hdr)) {
-        rc = -EINVAL;
-        syslog(LOG_DEBUG, "Ignoring MAD packets with header only\n");
-        goto out;
-    }
-
-    switch (attr_id) {
-    case UMAD_CM_ATTR_REQ:
-        if (unlikely(umad_len < sizeof(*hdr) + CM_REQ_DGID_POS +
-            sizeof(*gid_ifid))) {
-            rc = -EINVAL;
-            syslog(LOG_WARNING,
-                   "Invalid MAD packet size (%d) for attr_id 0x%x\n", umad_len,
-                    attr_id);
-            goto out;
-        }
-        memcpy(gid_ifid, data + CM_REQ_DGID_POS, sizeof(*gid_ifid));
-        rc = hash_tbl_search_fd_by_ifid(fd, gid_ifid);
-        break;
-
-    case UMAD_CM_ATTR_SIDR_REQ:
-        if (unlikely(umad_len < sizeof(*hdr) + CM_SIDR_REQ_DGID_POS +
-            sizeof(*gid_ifid))) {
-            rc = -EINVAL;
-            syslog(LOG_WARNING,
-                   "Invalid MAD packet size (%d) for attr_id 0x%x\n", umad_len,
-                    attr_id);
-            goto out;
-        }
-        memcpy(gid_ifid, data + CM_SIDR_REQ_DGID_POS, sizeof(*gid_ifid));
-        rc = hash_tbl_search_fd_by_ifid(fd, gid_ifid);
-        break;
-
-    case UMAD_CM_ATTR_REP:
-        /* Fall through */
-    case UMAD_CM_ATTR_REJ:
-        /* Fall through */
-    case UMAD_CM_ATTR_DREQ:
-        /* Fall through */
-    case UMAD_CM_ATTR_DREP:
-        /* Fall through */
-    case UMAD_CM_ATTR_RTU:
-        data += sizeof(comm_id);
-        /* Fall through */
-    case UMAD_CM_ATTR_SIDR_REP:
-        if (unlikely(umad_len < sizeof(*hdr) + sizeof(comm_id))) {
-            rc = -EINVAL;
-            syslog(LOG_WARNING,
-                   "Invalid MAD packet size (%d) for attr_id 0x%x\n", umad_len,
-                   attr_id);
-            goto out;
-        }
-        memcpy(&comm_id, data, sizeof(comm_id));
-        if (comm_id) {
-            rc = hash_tbl_search_fd_by_comm_id(comm_id, fd, gid_ifid);
-        }
-        break;
-
-    default:
-        rc = -EINVAL;
-        syslog(LOG_WARNING, "Unsupported attr_id 0x%x\n", attr_id);
-    }
-
-    syslog(LOG_DEBUG, "mad_to_vm: %d 0x%x 0x%x\n", *fd, attr_id, comm_id);
-
-out:
-    return rc;
-}
-
-static void *umad_recv_thread_func(void *args)
-{
-    int rc;
-    RdmaCmMuxMsg msg = {};
-    int fd = -2;
-
-    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_REQ;
-    msg.hdr.op_code = RDMACM_MUX_OP_CODE_MAD;
-
-    while (server.run) {
-        do {
-            msg.umad_len = sizeof(msg.umad.mad);
-            rc = umad_recv(server.umad_agent.port_id, &msg.umad, &msg.umad_len,
-                           SLEEP_SECS * SCALE_US);
-            if ((rc == -EIO) || (rc == -EINVAL)) {
-                syslog(LOG_CRIT, "Fatal error while trying to read MAD");
-            }
-
-            if (rc == -ETIMEDOUT) {
-                g_hash_table_foreach_remove(server.umad_agent.commid2fd,
-                                            remove_old_comm_ids, NULL);
-            }
-        } while (rc && server.run);
-
-        if (server.run) {
-            rc = get_fd(msg.umad.mad, msg.umad_len, &fd,
-                        &msg.hdr.sgid.global.interface_id);
-            if (rc) {
-                continue;
-            }
-
-            send(fd, &msg, sizeof(msg), 0);
-        }
-    }
-
-    return NULL;
-}
-
-static int read_and_process(int fd)
-{
-    int rc;
-    RdmaCmMuxMsg msg = {};
-    struct umad_hdr *hdr;
-    uint32_t *comm_id = 0;
-    uint16_t attr_id;
-
-    rc = recv(fd, &msg, sizeof(msg), 0);
-    syslog(LOG_DEBUG, "Socket %d, recv %d\n", fd, rc);
-
-    if (rc < 0 && errno != EWOULDBLOCK) {
-        syslog(LOG_ERR, "Fail to read from socket %d\n", fd);
-        return -EIO;
-    }
-
-    if (!rc) {
-        syslog(LOG_ERR, "Fail to read from socket %d\n", fd);
-        return -EPIPE;
-    }
-
-    if (msg.hdr.msg_type != RDMACM_MUX_MSG_TYPE_REQ) {
-        syslog(LOG_WARNING, "Got non-request message (%d) from socket %d\n",
-               msg.hdr.msg_type, fd);
-        return -EPERM;
-    }
-
-    switch (msg.hdr.op_code) {
-    case RDMACM_MUX_OP_CODE_REG:
-        rc = add_fd_ifid_pair(fd, msg.hdr.sgid.global.interface_id);
-        break;
-
-    case RDMACM_MUX_OP_CODE_UNREG:
-        rc = delete_fd_ifid_pair(fd, msg.hdr.sgid.global.interface_id);
-        break;
-
-    case RDMACM_MUX_OP_CODE_MAD:
-        /* If this is REQ or REP then store the pair comm_id,fd to be later
-         * used for other messages where gid is unknown */
-        hdr = (struct umad_hdr *)msg.umad.mad;
-        attr_id = be16toh(hdr->attr_id);
-        if ((attr_id == UMAD_CM_ATTR_REQ) || (attr_id == UMAD_CM_ATTR_DREQ) ||
-            (attr_id == UMAD_CM_ATTR_SIDR_REQ) ||
-            (attr_id == UMAD_CM_ATTR_REP) || (attr_id == UMAD_CM_ATTR_DREP)) {
-            comm_id = (uint32_t *)(msg.umad.mad + sizeof(*hdr));
-            hash_tbl_save_fd_comm_id_pair(fd, *comm_id,
-                                          msg.hdr.sgid.global.interface_id);
-        }
-
-        syslog(LOG_DEBUG, "vm_to_mad: %d 0x%x 0x%x\n", fd, attr_id,
-               comm_id ? *comm_id : 0);
-        rc = umad_send(server.umad_agent.port_id, server.umad_agent.agent_id,
-                       &msg.umad, msg.umad_len, 1, 0);
-        if (rc) {
-            syslog(LOG_ERR,
-                  "Fail to send MAD message (0x%x) from socket %d, err=%d",
-                  attr_id, fd, rc);
-        }
-        break;
-
-    default:
-        syslog(LOG_ERR, "Got invalid op_code (%d) from socket %d",
-               msg.hdr.msg_type, fd);
-        rc = RDMACM_MUX_ERR_CODE_EINVAL;
-    }
-
-    msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_RESP;
-    msg.hdr.err_code = rc;
-    rc = send(fd, &msg, sizeof(msg), 0);
-
-    return rc == sizeof(msg) ? 0 : -EPIPE;
-}
-
-static int accept_all(void)
-{
-    int fd, rc = 0;
-
-    pthread_rwlock_wrlock(&server.lock);
-
-    do {
-        if ((server.nfds + 1) > MAX_CLIENTS) {
-            syslog(LOG_WARNING, "Too many clients (%d)", server.nfds);
-            rc = -EIO;
-            goto out;
-        }
-
-        fd = accept(server.fds[0].fd, NULL, NULL);
-        if (fd < 0) {
-            if (errno != EWOULDBLOCK) {
-                syslog(LOG_WARNING, "accept() failed");
-                rc = -EIO;
-                goto out;
-            }
-            break;
-        }
-
-        syslog(LOG_INFO, "Client connected on socket %d\n", fd);
-        server.fds[server.nfds].fd = fd;
-        server.fds[server.nfds].events = POLLIN;
-        server.nfds++;
-    } while (fd != -1);
-
-out:
-    pthread_rwlock_unlock(&server.lock);
-    return rc;
-}
-
-static void compress_fds(void)
-{
-    int i, j;
-    int closed = 0;
-
-    pthread_rwlock_wrlock(&server.lock);
-
-    for (i = 1; i < server.nfds; i++) {
-        if (!server.fds[i].fd) {
-            closed++;
-            for (j = i; j < server.nfds - 1; j++) {
-                server.fds[j] = server.fds[j + 1];
-            }
-        }
-    }
-
-    server.nfds -= closed;
-
-    pthread_rwlock_unlock(&server.lock);
-}
-
-static void close_fd(int idx)
-{
-    close(server.fds[idx].fd);
-    syslog(LOG_INFO, "Socket %d closed\n", server.fds[idx].fd);
-    hash_tbl_remove_fd_ifid_pair(server.fds[idx].fd);
-    server.fds[idx].fd = 0;
-}
-
-static void run(void)
-{
-    int rc, nfds, i;
-    bool compress = false;
-
-    syslog(LOG_INFO, "Service started");
-
-    while (server.run) {
-        rc = poll(server.fds, server.nfds, SLEEP_SECS * SCALE_US);
-        if (rc < 0) {
-            if (errno != EINTR) {
-                syslog(LOG_WARNING, "poll() failed");
-            }
-            continue;
-        }
-
-        if (rc == 0) {
-            continue;
-        }
-
-        nfds = server.nfds;
-        for (i = 0; i < nfds; i++) {
-            syslog(LOG_DEBUG, "pollfd[%d]: revents 0x%x, events 0x%x\n", i,
-                   server.fds[i].revents, server.fds[i].events);
-            if (server.fds[i].revents == 0) {
-                continue;
-            }
-
-            if (server.fds[i].revents != POLLIN) {
-                if (i == 0) {
-                    syslog(LOG_NOTICE, "Unexpected poll() event (0x%x)\n",
-                           server.fds[i].revents);
-                } else {
-                    close_fd(i);
-                    compress = true;
-                }
-                continue;
-            }
-
-            if (i == 0) {
-                rc = accept_all();
-                if (rc) {
-                    continue;
-                }
-            } else {
-                rc = read_and_process(server.fds[i].fd);
-                if (rc) {
-                    close_fd(i);
-                    compress = true;
-                }
-            }
-        }
-
-        if (compress) {
-            compress = false;
-            compress_fds();
-        }
-    }
-}
-
-static void fini_listener(void)
-{
-    int i;
-
-    if (server.fds[0].fd <= 0) {
-        return;
-    }
-
-    for (i = server.nfds - 1; i >= 0; i--) {
-        if (server.fds[i].fd) {
-            close(server.fds[i].fd);
-        }
-    }
-
-    unlink(server.args.unix_socket_path);
-}
-
-static void fini_umad(void)
-{
-    if (server.umad_agent.agent_id) {
-        umad_unregister(server.umad_agent.port_id, server.umad_agent.agent_id);
-    }
-
-    if (server.umad_agent.port_id) {
-        umad_close_port(server.umad_agent.port_id);
-    }
-
-    hash_tbl_free();
-}
-
-static void fini(void)
-{
-    if (server.umad_recv_thread) {
-        pthread_join(server.umad_recv_thread, NULL);
-        server.umad_recv_thread = 0;
-    }
-    fini_umad();
-    fini_listener();
-    pthread_rwlock_destroy(&server.lock);
-
-    syslog(LOG_INFO, "Service going down");
-}
-
-static int init_listener(void)
-{
-    struct sockaddr_un sun;
-    int rc, on = 1;
-
-    server.fds[0].fd = socket(AF_UNIX, SOCK_STREAM, 0);
-    if (server.fds[0].fd < 0) {
-        syslog(LOG_ALERT, "socket() failed");
-        return -EIO;
-    }
-
-    rc = setsockopt(server.fds[0].fd, SOL_SOCKET, SO_REUSEADDR, (char *)&on,
-                    sizeof(on));
-    if (rc < 0) {
-        syslog(LOG_ALERT, "setsockopt() failed");
-        rc = -EIO;
-        goto err;
-    }
-
-    rc = ioctl(server.fds[0].fd, FIONBIO, (char *)&on);
-    if (rc < 0) {
-        syslog(LOG_ALERT, "ioctl() failed");
-        rc = -EIO;
-        goto err;
-    }
-
-    if (strlen(server.args.unix_socket_path) >= sizeof(sun.sun_path)) {
-        syslog(LOG_ALERT,
-               "Invalid unix_socket_path, size must be less than %ld\n",
-               sizeof(sun.sun_path));
-        rc = -EINVAL;
-        goto err;
-    }
-
-    sun.sun_family = AF_UNIX;
-    rc = snprintf(sun.sun_path, sizeof(sun.sun_path), "%s",
-                  server.args.unix_socket_path);
-    if (rc < 0 || rc >= sizeof(sun.sun_path)) {
-        syslog(LOG_ALERT, "Could not copy unix socket path\n");
-        rc = -EINVAL;
-        goto err;
-    }
-
-    rc = bind(server.fds[0].fd, (struct sockaddr *)&sun, sizeof(sun));
-    if (rc < 0) {
-        syslog(LOG_ALERT, "bind() failed");
-        rc = -EIO;
-        goto err;
-    }
-
-    rc = listen(server.fds[0].fd, SERVER_LISTEN_BACKLOG);
-    if (rc < 0) {
-        syslog(LOG_ALERT, "listen() failed");
-        rc = -EIO;
-        goto err;
-    }
-
-    server.fds[0].events = POLLIN;
-    server.nfds = 1;
-    server.run = true;
-
-    return 0;
-
-err:
-    close(server.fds[0].fd);
-    return rc;
-}
-
-static int init_umad(void)
-{
-    long method_mask[IB_USER_MAD_LONGS_PER_METHOD_MASK];
-
-    server.umad_agent.port_id = umad_open_port(server.args.rdma_dev_name,
-                                               server.args.rdma_port_num);
-
-    if (server.umad_agent.port_id < 0) {
-        syslog(LOG_WARNING, "umad_open_port() failed");
-        return -EIO;
-    }
-
-    memset(&method_mask, 0, sizeof(method_mask));
-    method_mask[0] = MAD_METHOD_MASK0;
-    server.umad_agent.agent_id = umad_register(server.umad_agent.port_id,
-                                               UMAD_CLASS_CM,
-                                               UMAD_SA_CLASS_VERSION,
-                                               MAD_RMPP_VERSION, method_mask);
-    if (server.umad_agent.agent_id < 0) {
-        syslog(LOG_WARNING, "umad_register() failed");
-        return -EIO;
-    }
-
-    hash_tbl_alloc();
-
-    return 0;
-}
-
-static void signal_handler(int sig, siginfo_t *siginfo, void *context)
-{
-    static bool warned;
-
-    /* Prevent stop if clients are connected */
-    if (server.nfds != 1) {
-        if (!warned) {
-            syslog(LOG_WARNING,
-                   "Can't stop while active client exist, resend SIGINT to overid");
-            warned = true;
-            return;
-        }
-    }
-
-    if (sig == SIGINT) {
-        server.run = false;
-        fini();
-    }
-
-    exit(0);
-}
-
-static int init(void)
-{
-    int rc;
-    struct sigaction sig = {};
-
-    rc = init_listener();
-    if (rc) {
-        return rc;
-    }
-
-    rc = init_umad();
-    if (rc) {
-        return rc;
-    }
-
-    pthread_rwlock_init(&server.lock, 0);
-
-    rc = pthread_create(&server.umad_recv_thread, NULL, umad_recv_thread_func,
-                        NULL);
-    if (rc) {
-        syslog(LOG_ERR, "Fail to create UMAD receiver thread (%d)\n", rc);
-        return rc;
-    }
-
-    sig.sa_sigaction = &signal_handler;
-    sig.sa_flags = SA_SIGINFO;
-    rc = sigaction(SIGINT, &sig, NULL);
-    if (rc < 0) {
-        syslog(LOG_ERR, "Fail to install SIGINT handler (%d)\n", errno);
-        return rc;
-    }
-
-    return 0;
-}
-
-int main(int argc, char *argv[])
-{
-    int rc;
-
-    memset(&server, 0, sizeof(server));
-
-    parse_args(argc, argv);
-
-    rc = init();
-    if (rc) {
-        syslog(LOG_ERR, "Fail to initialize server (%d)\n", rc);
-        rc = -EAGAIN;
-        goto out;
-    }
-
-    run();
-
-out:
-    fini();
-
-    return rc;
-}
diff --git a/hw/core/machine-qmp-cmds.c b/hw/core/machine-qmp-cmds.c
index 4b72009cd3..c20829b9ae 100644
--- a/hw/core/machine-qmp-cmds.c
+++ b/hw/core/machine-qmp-cmds.c
@@ -12,7 +12,6 @@
 #include "hw/boards.h"
 #include "hw/intc/intc.h"
 #include "hw/mem/memory-device.h"
-#include "hw/rdma/rdma.h"
 #include "qapi/error.h"
 #include "qapi/qapi-builtin-visit.h"
 #include "qapi/qapi-commands-machine.h"
@@ -291,37 +290,6 @@ MemoryInfo *qmp_query_memory_size_summary(Error **errp)
     return mem_info;
 }
 
-static int qmp_x_query_rdma_foreach(Object *obj, void *opaque)
-{
-    RdmaProvider *rdma;
-    RdmaProviderClass *k;
-    GString *buf = opaque;
-
-    if (object_dynamic_cast(obj, INTERFACE_RDMA_PROVIDER)) {
-        rdma = RDMA_PROVIDER(obj);
-        k = RDMA_PROVIDER_GET_CLASS(obj);
-        if (k->format_statistics) {
-            k->format_statistics(rdma, buf);
-        } else {
-            g_string_append_printf(buf,
-                                   "RDMA statistics not available for %s.\n",
-                                   object_get_typename(obj));
-        }
-    }
-
-    return 0;
-}
-
-HumanReadableText *qmp_x_query_rdma(Error **errp)
-{
-    g_autoptr(GString) buf = g_string_new("");
-
-    object_child_foreach_recursive(object_get_root(),
-                                   qmp_x_query_rdma_foreach, buf);
-
-    return human_readable_text_from_str(buf);
-}
-
 HumanReadableText *qmp_x_query_ramblock(Error **errp)
 {
     g_autoptr(GString) buf = ram_block_format();
diff --git a/hw/rdma/rdma.c b/hw/rdma/rdma.c
deleted file mode 100644
index 7bec0d0d2c..0000000000
--- a/hw/rdma/rdma.c
+++ /dev/null
@@ -1,30 +0,0 @@
-/*
- * RDMA device interface
- *
- * Copyright (C) 2018 Oracle
- * Copyright (C) 2018 Red Hat Inc
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#include "qemu/osdep.h"
-#include "hw/rdma/rdma.h"
-#include "qemu/module.h"
-
-static const TypeInfo rdma_hmp_info = {
-    .name = INTERFACE_RDMA_PROVIDER,
-    .parent = TYPE_INTERFACE,
-    .class_size = sizeof(RdmaProviderClass),
-};
-
-static void rdma_register_types(void)
-{
-    type_register_static(&rdma_hmp_info);
-}
-
-type_init(rdma_register_types)
diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
deleted file mode 100644
index 6dcdfbbbe2..0000000000
--- a/hw/rdma/rdma_backend.c
+++ /dev/null
@@ -1,1401 +0,0 @@
-/*
- * QEMU paravirtual RDMA - Generic RDMA backend
- *
- * Copyright (C) 2018 Oracle
- * Copyright (C) 2018 Red Hat Inc
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *     Marcel Apfelbaum <marcel@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#include "qemu/osdep.h"
-#include "qapi/qapi-events-rdma.h"
-
-#include <infiniband/verbs.h>
-
-#include "contrib/rdmacm-mux/rdmacm-mux.h"
-#include "trace.h"
-#include "rdma_utils.h"
-#include "rdma_rm.h"
-#include "rdma_backend.h"
-
-#define THR_NAME_LEN 16
-#define THR_POLL_TO  5000
-
-#define MAD_HDR_SIZE sizeof(struct ibv_grh)
-
-typedef struct BackendCtx {
-    void *up_ctx;
-    struct ibv_sge sge; /* Used to save MAD recv buffer */
-    RdmaBackendQP *backend_qp; /* To maintain recv buffers */
-    RdmaBackendSRQ *backend_srq;
-} BackendCtx;
-
-struct backend_umad {
-    struct ib_user_mad hdr;
-    char mad[RDMA_MAX_PRIVATE_DATA];
-};
-
-static void (*comp_handler)(void *ctx, struct ibv_wc *wc);
-
-static void dummy_comp_handler(void *ctx, struct ibv_wc *wc)
-{
-    rdma_error_report("No completion handler is registered");
-}
-
-static inline void complete_work(enum ibv_wc_status status, uint32_t vendor_err,
-                                 void *ctx)
-{
-    struct ibv_wc wc = {};
-
-    wc.status = status;
-    wc.vendor_err = vendor_err;
-
-    comp_handler(ctx, &wc);
-}
-
-static void free_cqe_ctx(gpointer data, gpointer user_data)
-{
-    BackendCtx *bctx;
-    RdmaDeviceResources *rdma_dev_res = user_data;
-    unsigned long cqe_ctx_id = GPOINTER_TO_INT(data);
-
-    bctx = rdma_rm_get_cqe_ctx(rdma_dev_res, cqe_ctx_id);
-    if (bctx) {
-        rdma_rm_dealloc_cqe_ctx(rdma_dev_res, cqe_ctx_id);
-        qatomic_dec(&rdma_dev_res->stats.missing_cqe);
-    }
-    g_free(bctx);
-}
-
-static void clean_recv_mads(RdmaBackendDev *backend_dev)
-{
-    unsigned long cqe_ctx_id;
-
-    do {
-        cqe_ctx_id = rdma_protected_gqueue_pop_int64(&backend_dev->
-                                                    recv_mads_list);
-        if (cqe_ctx_id != -ENOENT) {
-            qatomic_inc(&backend_dev->rdma_dev_res->stats.missing_cqe);
-            free_cqe_ctx(GINT_TO_POINTER(cqe_ctx_id),
-                         backend_dev->rdma_dev_res);
-        }
-    } while (cqe_ctx_id != -ENOENT);
-}
-
-static int rdma_poll_cq(RdmaDeviceResources *rdma_dev_res, struct ibv_cq *ibcq)
-{
-    int i, ne, total_ne = 0;
-    BackendCtx *bctx;
-    struct ibv_wc wc[2];
-    RdmaProtectedGSList *cqe_ctx_list;
-
-    WITH_QEMU_LOCK_GUARD(&rdma_dev_res->lock) {
-        do {
-            ne = ibv_poll_cq(ibcq, ARRAY_SIZE(wc), wc);
-
-            trace_rdma_poll_cq(ne, ibcq);
-
-            for (i = 0; i < ne; i++) {
-                bctx = rdma_rm_get_cqe_ctx(rdma_dev_res, wc[i].wr_id);
-                if (unlikely(!bctx)) {
-                    rdma_error_report("No matching ctx for req %"PRId64,
-                                      wc[i].wr_id);
-                    continue;
-                }
-
-                comp_handler(bctx->up_ctx, &wc[i]);
-
-                if (bctx->backend_qp) {
-                    cqe_ctx_list = &bctx->backend_qp->cqe_ctx_list;
-                } else {
-                    cqe_ctx_list = &bctx->backend_srq->cqe_ctx_list;
-                }
-
-                rdma_protected_gslist_remove_int32(cqe_ctx_list, wc[i].wr_id);
-                rdma_rm_dealloc_cqe_ctx(rdma_dev_res, wc[i].wr_id);
-                g_free(bctx);
-            }
-            total_ne += ne;
-        } while (ne > 0);
-        qatomic_sub(&rdma_dev_res->stats.missing_cqe, total_ne);
-    }
-
-    if (ne < 0) {
-        rdma_error_report("ibv_poll_cq fail, rc=%d, errno=%d", ne, errno);
-    }
-
-    rdma_dev_res->stats.completions += total_ne;
-
-    return total_ne;
-}
-
-static void *comp_handler_thread(void *arg)
-{
-    RdmaBackendDev *backend_dev = (RdmaBackendDev *)arg;
-    int rc;
-    struct ibv_cq *ev_cq;
-    void *ev_ctx;
-    int flags;
-    GPollFD pfds[1];
-
-    /* Change to non-blocking mode */
-    flags = fcntl(backend_dev->channel->fd, F_GETFL);
-    rc = fcntl(backend_dev->channel->fd, F_SETFL, flags | O_NONBLOCK);
-    if (rc < 0) {
-        rdma_error_report("Failed to change backend channel FD to non-blocking");
-        return NULL;
-    }
-
-    pfds[0].fd = backend_dev->channel->fd;
-    pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
-
-    backend_dev->comp_thread.is_running = true;
-
-    while (backend_dev->comp_thread.run) {
-        do {
-            rc = qemu_poll_ns(pfds, 1, THR_POLL_TO * (int64_t)SCALE_MS);
-            if (!rc) {
-                backend_dev->rdma_dev_res->stats.poll_cq_ppoll_to++;
-            }
-        } while (!rc && backend_dev->comp_thread.run);
-
-        if (backend_dev->comp_thread.run) {
-            rc = ibv_get_cq_event(backend_dev->channel, &ev_cq, &ev_ctx);
-            if (unlikely(rc)) {
-                rdma_error_report("ibv_get_cq_event fail, rc=%d, errno=%d", rc,
-                                  errno);
-                continue;
-            }
-
-            rc = ibv_req_notify_cq(ev_cq, 0);
-            if (unlikely(rc)) {
-                rdma_error_report("ibv_req_notify_cq fail, rc=%d, errno=%d", rc,
-                                  errno);
-            }
-
-            backend_dev->rdma_dev_res->stats.poll_cq_from_bk++;
-            rdma_poll_cq(backend_dev->rdma_dev_res, ev_cq);
-
-            ibv_ack_cq_events(ev_cq, 1);
-        }
-    }
-
-    backend_dev->comp_thread.is_running = false;
-
-    qemu_thread_exit(0);
-
-    return NULL;
-}
-
-static inline void disable_rdmacm_mux_async(RdmaBackendDev *backend_dev)
-{
-    qatomic_set(&backend_dev->rdmacm_mux.can_receive, 0);
-}
-
-static inline void enable_rdmacm_mux_async(RdmaBackendDev *backend_dev)
-{
-    qatomic_set(&backend_dev->rdmacm_mux.can_receive, sizeof(RdmaCmMuxMsg));
-}
-
-static inline int rdmacm_mux_can_process_async(RdmaBackendDev *backend_dev)
-{
-    return qatomic_read(&backend_dev->rdmacm_mux.can_receive);
-}
-
-static int rdmacm_mux_check_op_status(CharBackend *mad_chr_be)
-{
-    RdmaCmMuxMsg msg = {};
-    int ret;
-
-    ret = qemu_chr_fe_read_all(mad_chr_be, (uint8_t *)&msg, sizeof(msg));
-    if (ret != sizeof(msg)) {
-        rdma_error_report("Got invalid message from mux: size %d, expecting %d",
-                          ret, (int)sizeof(msg));
-        return -EIO;
-    }
-
-    trace_rdmacm_mux_check_op_status(msg.hdr.msg_type, msg.hdr.op_code,
-                                     msg.hdr.err_code);
-
-    if (msg.hdr.msg_type != RDMACM_MUX_MSG_TYPE_RESP) {
-        rdma_error_report("Got invalid message type %d", msg.hdr.msg_type);
-        return -EIO;
-    }
-
-    if (msg.hdr.err_code != RDMACM_MUX_ERR_CODE_OK) {
-        rdma_error_report("Operation failed in mux, error code %d",
-                          msg.hdr.err_code);
-        return -EIO;
-    }
-
-    return 0;
-}
-
-static int rdmacm_mux_send(RdmaBackendDev *backend_dev, RdmaCmMuxMsg *msg)
-{
-    int rc = 0;
-
-    msg->hdr.msg_type = RDMACM_MUX_MSG_TYPE_REQ;
-    trace_rdmacm_mux("send", msg->hdr.msg_type, msg->hdr.op_code);
-    disable_rdmacm_mux_async(backend_dev);
-    rc = qemu_chr_fe_write(backend_dev->rdmacm_mux.chr_be,
-                           (const uint8_t *)msg, sizeof(*msg));
-    if (rc != sizeof(*msg)) {
-        enable_rdmacm_mux_async(backend_dev);
-        rdma_error_report("Failed to send request to rdmacm_mux (rc=%d)", rc);
-        return -EIO;
-    }
-
-    rc = rdmacm_mux_check_op_status(backend_dev->rdmacm_mux.chr_be);
-    if (rc) {
-        rdma_error_report("Failed to execute rdmacm_mux request %d (rc=%d)",
-                          msg->hdr.op_code, rc);
-    }
-
-    enable_rdmacm_mux_async(backend_dev);
-
-    return 0;
-}
-
-static void stop_backend_thread(RdmaBackendThread *thread)
-{
-    thread->run = false;
-    while (thread->is_running) {
-        sleep(THR_POLL_TO / SCALE_US / 2);
-    }
-}
-
-static void start_comp_thread(RdmaBackendDev *backend_dev)
-{
-    char thread_name[THR_NAME_LEN] = {};
-
-    stop_backend_thread(&backend_dev->comp_thread);
-
-    snprintf(thread_name, sizeof(thread_name), "rdma_comp_%s",
-             ibv_get_device_name(backend_dev->ib_dev));
-    backend_dev->comp_thread.run = true;
-    qemu_thread_create(&backend_dev->comp_thread.thread, thread_name,
-                       comp_handler_thread, backend_dev, QEMU_THREAD_DETACHED);
-}
-
-void rdma_backend_register_comp_handler(void (*handler)(void *ctx,
-                                                         struct ibv_wc *wc))
-{
-    comp_handler = handler;
-}
-
-void rdma_backend_unregister_comp_handler(void)
-{
-    rdma_backend_register_comp_handler(dummy_comp_handler);
-}
-
-int rdma_backend_query_port(RdmaBackendDev *backend_dev,
-                            struct ibv_port_attr *port_attr)
-{
-    int rc;
-
-    rc = ibv_query_port(backend_dev->context, backend_dev->port_num, port_attr);
-    if (rc) {
-        rdma_error_report("ibv_query_port fail, rc=%d, errno=%d", rc, errno);
-        return -EIO;
-    }
-
-    return 0;
-}
-
-void rdma_backend_poll_cq(RdmaDeviceResources *rdma_dev_res, RdmaBackendCQ *cq)
-{
-    int polled;
-
-    rdma_dev_res->stats.poll_cq_from_guest++;
-    polled = rdma_poll_cq(rdma_dev_res, cq->ibcq);
-    if (!polled) {
-        rdma_dev_res->stats.poll_cq_from_guest_empty++;
-    }
-}
-
-static GHashTable *ah_hash;
-
-static struct ibv_ah *create_ah(RdmaBackendDev *backend_dev, struct ibv_pd *pd,
-                                uint8_t sgid_idx, union ibv_gid *dgid)
-{
-    GBytes *ah_key = g_bytes_new(dgid, sizeof(*dgid));
-    struct ibv_ah *ah = g_hash_table_lookup(ah_hash, ah_key);
-
-    if (ah) {
-        trace_rdma_create_ah_cache_hit(be64_to_cpu(dgid->global.subnet_prefix),
-                                       be64_to_cpu(dgid->global.interface_id));
-        g_bytes_unref(ah_key);
-    } else {
-        struct ibv_ah_attr ah_attr = {
-            .is_global     = 1,
-            .port_num      = backend_dev->port_num,
-            .grh.hop_limit = 1,
-        };
-
-        ah_attr.grh.dgid = *dgid;
-        ah_attr.grh.sgid_index = sgid_idx;
-
-        ah = ibv_create_ah(pd, &ah_attr);
-        if (ah) {
-            g_hash_table_insert(ah_hash, ah_key, ah);
-        } else {
-            g_bytes_unref(ah_key);
-            rdma_error_report("Failed to create AH for gid <0x%" PRIx64", 0x%"PRIx64">",
-                              be64_to_cpu(dgid->global.subnet_prefix),
-                              be64_to_cpu(dgid->global.interface_id));
-        }
-
-        trace_rdma_create_ah_cache_miss(be64_to_cpu(dgid->global.subnet_prefix),
-                                        be64_to_cpu(dgid->global.interface_id));
-    }
-
-    return ah;
-}
-
-static void destroy_ah_hash_key(gpointer data)
-{
-    g_bytes_unref(data);
-}
-
-static void destroy_ah_hast_data(gpointer data)
-{
-    struct ibv_ah *ah = data;
-
-    ibv_destroy_ah(ah);
-}
-
-static void ah_cache_init(void)
-{
-    ah_hash = g_hash_table_new_full(g_bytes_hash, g_bytes_equal,
-                                    destroy_ah_hash_key, destroy_ah_hast_data);
-}
-
-#ifdef LEGACY_RDMA_REG_MR
-static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
-                                struct ibv_sge *sge, uint8_t num_sge,
-                                uint64_t *total_length)
-{
-    RdmaRmMR *mr;
-    int idx;
-
-    for (idx = 0; idx < num_sge; idx++) {
-        mr = rdma_rm_get_mr(rdma_dev_res, sge[idx].lkey);
-        if (unlikely(!mr)) {
-            rdma_error_report("Invalid lkey 0x%x", sge[idx].lkey);
-            return VENDOR_ERR_INVLKEY | sge[idx].lkey;
-        }
-
-        sge[idx].addr = (uintptr_t)mr->virt + sge[idx].addr - mr->start;
-        sge[idx].lkey = rdma_backend_mr_lkey(&mr->backend_mr);
-
-        *total_length += sge[idx].length;
-    }
-
-    return 0;
-}
-#else
-static inline int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
-                                       struct ibv_sge *sge, uint8_t num_sge,
-                                       uint64_t *total_length)
-{
-    int idx;
-
-    for (idx = 0; idx < num_sge; idx++) {
-        *total_length += sge[idx].length;
-    }
-    return 0;
-}
-#endif
-
-static void trace_mad_message(const char *title, char *buf, int len)
-{
-    int i;
-    char *b = g_malloc0(len * 3 + 1);
-    char b1[4];
-
-    for (i = 0; i < len; i++) {
-        sprintf(b1, "%.2X ", buf[i] & 0x000000FF);
-        strcat(b, b1);
-    }
-
-    trace_rdma_mad_message(title, len, b);
-
-    g_free(b);
-}
-
-static int mad_send(RdmaBackendDev *backend_dev, uint8_t sgid_idx,
-                    union ibv_gid *sgid, struct ibv_sge *sge, uint32_t num_sge)
-{
-    RdmaCmMuxMsg msg = {};
-    char *hdr, *data;
-    int ret;
-
-    if (num_sge != 2) {
-        return -EINVAL;
-    }
-
-    msg.hdr.op_code = RDMACM_MUX_OP_CODE_MAD;
-    memcpy(msg.hdr.sgid.raw, sgid->raw, sizeof(msg.hdr.sgid));
-
-    msg.umad_len = sge[0].length + sge[1].length;
-
-    if (msg.umad_len > sizeof(msg.umad.mad)) {
-        return -ENOMEM;
-    }
-
-    msg.umad.hdr.addr.qpn = htobe32(1);
-    msg.umad.hdr.addr.grh_present = 1;
-    msg.umad.hdr.addr.gid_index = sgid_idx;
-    memcpy(msg.umad.hdr.addr.gid, sgid->raw, sizeof(msg.umad.hdr.addr.gid));
-    msg.umad.hdr.addr.hop_limit = 0xFF;
-
-    hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
-    if (!hdr) {
-        return -ENOMEM;
-    }
-    data = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
-    if (!data) {
-        rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
-        return -ENOMEM;
-    }
-
-    memcpy(&msg.umad.mad[0], hdr, sge[0].length);
-    memcpy(&msg.umad.mad[sge[0].length], data, sge[1].length);
-
-    rdma_pci_dma_unmap(backend_dev->dev, data, sge[1].length);
-    rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
-
-    trace_mad_message("send", msg.umad.mad, msg.umad_len);
-
-    ret = rdmacm_mux_send(backend_dev, &msg);
-    if (ret) {
-        rdma_error_report("Failed to send MAD to rdma_umadmux (%d)", ret);
-        return -EIO;
-    }
-
-    return 0;
-}
-
-void rdma_backend_post_send(RdmaBackendDev *backend_dev,
-                            RdmaBackendQP *qp, uint8_t qp_type,
-                            struct ibv_sge *sge, uint32_t num_sge,
-                            uint8_t sgid_idx, union ibv_gid *sgid,
-                            union ibv_gid *dgid, uint32_t dqpn, uint32_t dqkey,
-                            void *ctx)
-{
-    BackendCtx *bctx;
-    uint32_t bctx_id;
-    int rc;
-    struct ibv_send_wr wr = {}, *bad_wr;
-
-    if (!qp->ibqp) { /* This field is not initialized for QP0 and QP1 */
-        if (qp_type == IBV_QPT_SMI) {
-            rdma_error_report("Got QP0 request");
-            complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
-        } else if (qp_type == IBV_QPT_GSI) {
-            rc = mad_send(backend_dev, sgid_idx, sgid, sge, num_sge);
-            if (rc) {
-                complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
-                backend_dev->rdma_dev_res->stats.mad_tx_err++;
-            } else {
-                complete_work(IBV_WC_SUCCESS, 0, ctx);
-                backend_dev->rdma_dev_res->stats.mad_tx++;
-            }
-        }
-        return;
-    }
-
-    bctx = g_malloc0(sizeof(*bctx));
-    bctx->up_ctx = ctx;
-    bctx->backend_qp = qp;
-
-    rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
-    if (unlikely(rc)) {
-        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
-        goto err_free_bctx;
-    }
-
-    rdma_protected_gslist_append_int32(&qp->cqe_ctx_list, bctx_id);
-
-    rc = build_host_sge_array(backend_dev->rdma_dev_res, sge, num_sge,
-                              &backend_dev->rdma_dev_res->stats.tx_len);
-    if (rc) {
-        complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
-        goto err_dealloc_cqe_ctx;
-    }
-
-    if (qp_type == IBV_QPT_UD) {
-        wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd, sgid_idx, dgid);
-        if (!wr.wr.ud.ah) {
-            complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
-            goto err_dealloc_cqe_ctx;
-        }
-        wr.wr.ud.remote_qpn = dqpn;
-        wr.wr.ud.remote_qkey = dqkey;
-    }
-
-    wr.num_sge = num_sge;
-    wr.opcode = IBV_WR_SEND;
-    wr.send_flags = IBV_SEND_SIGNALED;
-    wr.sg_list = sge;
-    wr.wr_id = bctx_id;
-
-    rc = ibv_post_send(qp->ibqp, &wr, &bad_wr);
-    if (rc) {
-        rdma_error_report("ibv_post_send fail, qpn=0x%x, rc=%d, errno=%d",
-                          qp->ibqp->qp_num, rc, errno);
-        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
-        goto err_dealloc_cqe_ctx;
-    }
-
-    qatomic_inc(&backend_dev->rdma_dev_res->stats.missing_cqe);
-    backend_dev->rdma_dev_res->stats.tx++;
-
-    return;
-
-err_dealloc_cqe_ctx:
-    backend_dev->rdma_dev_res->stats.tx_err++;
-    rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, bctx_id);
-
-err_free_bctx:
-    g_free(bctx);
-}
-
-static unsigned int save_mad_recv_buffer(RdmaBackendDev *backend_dev,
-                                         struct ibv_sge *sge, uint32_t num_sge,
-                                         void *ctx)
-{
-    BackendCtx *bctx;
-    int rc;
-    uint32_t bctx_id;
-
-    if (num_sge != 1) {
-        rdma_error_report("Invalid num_sge (%d), expecting 1", num_sge);
-        return VENDOR_ERR_INV_NUM_SGE;
-    }
-
-    if (sge[0].length < RDMA_MAX_PRIVATE_DATA + sizeof(struct ibv_grh)) {
-        rdma_error_report("Too small buffer for MAD");
-        return VENDOR_ERR_INV_MAD_BUFF;
-    }
-
-    bctx = g_malloc0(sizeof(*bctx));
-
-    rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
-    if (unlikely(rc)) {
-        g_free(bctx);
-        return VENDOR_ERR_NOMEM;
-    }
-
-    bctx->up_ctx = ctx;
-    bctx->sge = *sge;
-
-    rdma_protected_gqueue_append_int64(&backend_dev->recv_mads_list, bctx_id);
-
-    return 0;
-}
-
-void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
-                            RdmaBackendQP *qp, uint8_t qp_type,
-                            struct ibv_sge *sge, uint32_t num_sge, void *ctx)
-{
-    BackendCtx *bctx;
-    uint32_t bctx_id;
-    int rc;
-    struct ibv_recv_wr wr = {}, *bad_wr;
-
-    if (!qp->ibqp) { /* This field does not get initialized for QP0 and QP1 */
-        if (qp_type == IBV_QPT_SMI) {
-            rdma_error_report("Got QP0 request");
-            complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
-        }
-        if (qp_type == IBV_QPT_GSI) {
-            rc = save_mad_recv_buffer(backend_dev, sge, num_sge, ctx);
-            if (rc) {
-                complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
-                backend_dev->rdma_dev_res->stats.mad_rx_bufs_err++;
-            } else {
-                backend_dev->rdma_dev_res->stats.mad_rx_bufs++;
-            }
-        }
-        return;
-    }
-
-    bctx = g_malloc0(sizeof(*bctx));
-    bctx->up_ctx = ctx;
-    bctx->backend_qp = qp;
-
-    rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
-    if (unlikely(rc)) {
-        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
-        goto err_free_bctx;
-    }
-
-    rdma_protected_gslist_append_int32(&qp->cqe_ctx_list, bctx_id);
-
-    rc = build_host_sge_array(backend_dev->rdma_dev_res, sge, num_sge,
-                              &backend_dev->rdma_dev_res->stats.rx_bufs_len);
-    if (rc) {
-        complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
-        goto err_dealloc_cqe_ctx;
-    }
-
-    wr.num_sge = num_sge;
-    wr.sg_list = sge;
-    wr.wr_id = bctx_id;
-    rc = ibv_post_recv(qp->ibqp, &wr, &bad_wr);
-    if (rc) {
-        rdma_error_report("ibv_post_recv fail, qpn=0x%x, rc=%d, errno=%d",
-                          qp->ibqp->qp_num, rc, errno);
-        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
-        goto err_dealloc_cqe_ctx;
-    }
-
-    qatomic_inc(&backend_dev->rdma_dev_res->stats.missing_cqe);
-    backend_dev->rdma_dev_res->stats.rx_bufs++;
-
-    return;
-
-err_dealloc_cqe_ctx:
-    backend_dev->rdma_dev_res->stats.rx_bufs_err++;
-    rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, bctx_id);
-
-err_free_bctx:
-    g_free(bctx);
-}
-
-void rdma_backend_post_srq_recv(RdmaBackendDev *backend_dev,
-                                RdmaBackendSRQ *srq, struct ibv_sge *sge,
-                                uint32_t num_sge, void *ctx)
-{
-    BackendCtx *bctx;
-    uint32_t bctx_id;
-    int rc;
-    struct ibv_recv_wr wr = {}, *bad_wr;
-
-    bctx = g_malloc0(sizeof(*bctx));
-    bctx->up_ctx = ctx;
-    bctx->backend_srq = srq;
-
-    rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
-    if (unlikely(rc)) {
-        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
-        goto err_free_bctx;
-    }
-
-    rdma_protected_gslist_append_int32(&srq->cqe_ctx_list, bctx_id);
-
-    rc = build_host_sge_array(backend_dev->rdma_dev_res, sge, num_sge,
-                              &backend_dev->rdma_dev_res->stats.rx_bufs_len);
-    if (rc) {
-        complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
-        goto err_dealloc_cqe_ctx;
-    }
-
-    wr.num_sge = num_sge;
-    wr.sg_list = sge;
-    wr.wr_id = bctx_id;
-    rc = ibv_post_srq_recv(srq->ibsrq, &wr, &bad_wr);
-    if (rc) {
-        rdma_error_report("ibv_post_srq_recv fail, srqn=0x%x, rc=%d, errno=%d",
-                          srq->ibsrq->handle, rc, errno);
-        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
-        goto err_dealloc_cqe_ctx;
-    }
-
-    qatomic_inc(&backend_dev->rdma_dev_res->stats.missing_cqe);
-    backend_dev->rdma_dev_res->stats.rx_bufs++;
-    backend_dev->rdma_dev_res->stats.rx_srq++;
-
-    return;
-
-err_dealloc_cqe_ctx:
-    backend_dev->rdma_dev_res->stats.rx_bufs_err++;
-    rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, bctx_id);
-
-err_free_bctx:
-    g_free(bctx);
-}
-
-int rdma_backend_create_pd(RdmaBackendDev *backend_dev, RdmaBackendPD *pd)
-{
-    pd->ibpd = ibv_alloc_pd(backend_dev->context);
-
-    if (!pd->ibpd) {
-        rdma_error_report("ibv_alloc_pd fail, errno=%d", errno);
-        return -EIO;
-    }
-
-    return 0;
-}
-
-void rdma_backend_destroy_pd(RdmaBackendPD *pd)
-{
-    if (pd->ibpd) {
-        ibv_dealloc_pd(pd->ibpd);
-    }
-}
-
-int rdma_backend_create_mr(RdmaBackendMR *mr, RdmaBackendPD *pd, void *addr,
-                           size_t length, uint64_t guest_start, int access)
-{
-#ifdef LEGACY_RDMA_REG_MR
-    mr->ibmr = ibv_reg_mr(pd->ibpd, addr, length, access);
-#else
-    mr->ibmr = ibv_reg_mr_iova(pd->ibpd, addr, length, guest_start, access);
-#endif
-    if (!mr->ibmr) {
-        rdma_error_report("ibv_reg_mr fail, errno=%d", errno);
-        return -EIO;
-    }
-
-    mr->ibpd = pd->ibpd;
-
-    return 0;
-}
-
-void rdma_backend_destroy_mr(RdmaBackendMR *mr)
-{
-    if (mr->ibmr) {
-        ibv_dereg_mr(mr->ibmr);
-    }
-}
-
-int rdma_backend_create_cq(RdmaBackendDev *backend_dev, RdmaBackendCQ *cq,
-                           int cqe)
-{
-    int rc;
-
-    cq->ibcq = ibv_create_cq(backend_dev->context, cqe + 1, NULL,
-                             backend_dev->channel, 0);
-    if (!cq->ibcq) {
-        rdma_error_report("ibv_create_cq fail, errno=%d", errno);
-        return -EIO;
-    }
-
-    rc = ibv_req_notify_cq(cq->ibcq, 0);
-    if (rc) {
-        rdma_warn_report("ibv_req_notify_cq fail, rc=%d, errno=%d", rc, errno);
-    }
-
-    cq->backend_dev = backend_dev;
-
-    return 0;
-}
-
-void rdma_backend_destroy_cq(RdmaBackendCQ *cq)
-{
-    if (cq->ibcq) {
-        ibv_destroy_cq(cq->ibcq);
-    }
-}
-
-int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
-                           RdmaBackendPD *pd, RdmaBackendCQ *scq,
-                           RdmaBackendCQ *rcq, RdmaBackendSRQ *srq,
-                           uint32_t max_send_wr, uint32_t max_recv_wr,
-                           uint32_t max_send_sge, uint32_t max_recv_sge)
-{
-    struct ibv_qp_init_attr attr = {};
-
-    qp->ibqp = 0;
-
-    switch (qp_type) {
-    case IBV_QPT_GSI:
-        return 0;
-
-    case IBV_QPT_RC:
-        /* fall through */
-    case IBV_QPT_UD:
-        /* do nothing */
-        break;
-
-    default:
-        rdma_error_report("Unsupported QP type %d", qp_type);
-        return -EIO;
-    }
-
-    attr.qp_type = qp_type;
-    attr.send_cq = scq->ibcq;
-    attr.recv_cq = rcq->ibcq;
-    attr.cap.max_send_wr = max_send_wr;
-    attr.cap.max_recv_wr = max_recv_wr;
-    attr.cap.max_send_sge = max_send_sge;
-    attr.cap.max_recv_sge = max_recv_sge;
-    if (srq) {
-        attr.srq = srq->ibsrq;
-    }
-
-    qp->ibqp = ibv_create_qp(pd->ibpd, &attr);
-    if (!qp->ibqp) {
-        rdma_error_report("ibv_create_qp fail, errno=%d", errno);
-        return -EIO;
-    }
-
-    rdma_protected_gslist_init(&qp->cqe_ctx_list);
-
-    qp->ibpd = pd->ibpd;
-
-    /* TODO: Query QP to get max_inline_data and save it to be used in send */
-
-    return 0;
-}
-
-int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
-                               uint8_t qp_type, uint32_t qkey)
-{
-    struct ibv_qp_attr attr = {};
-    int rc, attr_mask;
-
-    attr_mask = IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT;
-    attr.qp_state        = IBV_QPS_INIT;
-    attr.pkey_index      = 0;
-    attr.port_num        = backend_dev->port_num;
-
-    switch (qp_type) {
-    case IBV_QPT_RC:
-        attr_mask |= IBV_QP_ACCESS_FLAGS;
-        trace_rdma_backend_rc_qp_state_init(qp->ibqp->qp_num);
-        break;
-
-    case IBV_QPT_UD:
-        attr.qkey = qkey;
-        attr_mask |= IBV_QP_QKEY;
-        trace_rdma_backend_ud_qp_state_init(qp->ibqp->qp_num, qkey);
-        break;
-
-    default:
-        rdma_error_report("Unsupported QP type %d", qp_type);
-        return -EIO;
-    }
-
-    rc = ibv_modify_qp(qp->ibqp, &attr, attr_mask);
-    if (rc) {
-        rdma_error_report("ibv_modify_qp fail, rc=%d, errno=%d", rc, errno);
-        return -EIO;
-    }
-
-    return 0;
-}
-
-int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
-                              uint8_t qp_type, uint8_t sgid_idx,
-                              union ibv_gid *dgid, uint32_t dqpn,
-                              uint32_t rq_psn, uint32_t qkey, bool use_qkey)
-{
-    struct ibv_qp_attr attr = {};
-    union ibv_gid ibv_gid = {
-        .global.interface_id = dgid->global.interface_id,
-        .global.subnet_prefix = dgid->global.subnet_prefix
-    };
-    int rc, attr_mask;
-
-    attr.qp_state = IBV_QPS_RTR;
-    attr_mask = IBV_QP_STATE;
-
-    qp->sgid_idx = sgid_idx;
-
-    switch (qp_type) {
-    case IBV_QPT_RC:
-        attr.path_mtu               = IBV_MTU_1024;
-        attr.dest_qp_num            = dqpn;
-        attr.max_dest_rd_atomic     = 1;
-        attr.min_rnr_timer          = 12;
-        attr.ah_attr.port_num       = backend_dev->port_num;
-        attr.ah_attr.is_global      = 1;
-        attr.ah_attr.grh.hop_limit  = 1;
-        attr.ah_attr.grh.dgid       = ibv_gid;
-        attr.ah_attr.grh.sgid_index = qp->sgid_idx;
-        attr.rq_psn                 = rq_psn;
-
-        attr_mask |= IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
-                     IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC |
-                     IBV_QP_MIN_RNR_TIMER;
-
-        trace_rdma_backend_rc_qp_state_rtr(qp->ibqp->qp_num,
-                                           be64_to_cpu(ibv_gid.global.
-                                                       subnet_prefix),
-                                           be64_to_cpu(ibv_gid.global.
-                                                       interface_id),
-                                           qp->sgid_idx, dqpn, rq_psn);
-        break;
-
-    case IBV_QPT_UD:
-        if (use_qkey) {
-            attr.qkey = qkey;
-            attr_mask |= IBV_QP_QKEY;
-        }
-        trace_rdma_backend_ud_qp_state_rtr(qp->ibqp->qp_num, use_qkey ? qkey :
-                                           0);
-        break;
-    }
-
-    rc = ibv_modify_qp(qp->ibqp, &attr, attr_mask);
-    if (rc) {
-        rdma_error_report("ibv_modify_qp fail, rc=%d, errno=%d", rc, errno);
-        return -EIO;
-    }
-
-    return 0;
-}
-
-int rdma_backend_qp_state_rts(RdmaBackendQP *qp, uint8_t qp_type,
-                              uint32_t sq_psn, uint32_t qkey, bool use_qkey)
-{
-    struct ibv_qp_attr attr = {};
-    int rc, attr_mask;
-
-    attr.qp_state = IBV_QPS_RTS;
-    attr.sq_psn = sq_psn;
-    attr_mask = IBV_QP_STATE | IBV_QP_SQ_PSN;
-
-    switch (qp_type) {
-    case IBV_QPT_RC:
-        attr.timeout       = 14;
-        attr.retry_cnt     = 7;
-        attr.rnr_retry     = 7;
-        attr.max_rd_atomic = 1;
-
-        attr_mask |= IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
-                     IBV_QP_MAX_QP_RD_ATOMIC;
-        trace_rdma_backend_rc_qp_state_rts(qp->ibqp->qp_num, sq_psn);
-        break;
-
-    case IBV_QPT_UD:
-        if (use_qkey) {
-            attr.qkey = qkey;
-            attr_mask |= IBV_QP_QKEY;
-        }
-        trace_rdma_backend_ud_qp_state_rts(qp->ibqp->qp_num, sq_psn,
-                                           use_qkey ? qkey : 0);
-        break;
-    }
-
-    rc = ibv_modify_qp(qp->ibqp, &attr, attr_mask);
-    if (rc) {
-        rdma_error_report("ibv_modify_qp fail, rc=%d, errno=%d", rc, errno);
-        return -EIO;
-    }
-
-    return 0;
-}
-
-int rdma_backend_query_qp(RdmaBackendQP *qp, struct ibv_qp_attr *attr,
-                          int attr_mask, struct ibv_qp_init_attr *init_attr)
-{
-    if (!qp->ibqp) {
-        attr->qp_state = IBV_QPS_RTS;
-        return 0;
-    }
-
-    return ibv_query_qp(qp->ibqp, attr, attr_mask, init_attr);
-}
-
-void rdma_backend_destroy_qp(RdmaBackendQP *qp, RdmaDeviceResources *dev_res)
-{
-    if (qp->ibqp) {
-        ibv_destroy_qp(qp->ibqp);
-    }
-    g_slist_foreach(qp->cqe_ctx_list.list, free_cqe_ctx, dev_res);
-    rdma_protected_gslist_destroy(&qp->cqe_ctx_list);
-}
-
-int rdma_backend_create_srq(RdmaBackendSRQ *srq, RdmaBackendPD *pd,
-                            uint32_t max_wr, uint32_t max_sge,
-                            uint32_t srq_limit)
-{
-    struct ibv_srq_init_attr srq_init_attr = {};
-
-    srq_init_attr.attr.max_wr = max_wr;
-    srq_init_attr.attr.max_sge = max_sge;
-    srq_init_attr.attr.srq_limit = srq_limit;
-
-    srq->ibsrq = ibv_create_srq(pd->ibpd, &srq_init_attr);
-    if (!srq->ibsrq) {
-        rdma_error_report("ibv_create_srq failed, errno=%d", errno);
-        return -EIO;
-    }
-
-    rdma_protected_gslist_init(&srq->cqe_ctx_list);
-
-    return 0;
-}
-
-int rdma_backend_query_srq(RdmaBackendSRQ *srq, struct ibv_srq_attr *srq_attr)
-{
-    if (!srq->ibsrq) {
-        return -EINVAL;
-    }
-
-    return ibv_query_srq(srq->ibsrq, srq_attr);
-}
-
-int rdma_backend_modify_srq(RdmaBackendSRQ *srq, struct ibv_srq_attr *srq_attr,
-                int srq_attr_mask)
-{
-    if (!srq->ibsrq) {
-        return -EINVAL;
-    }
-
-    return ibv_modify_srq(srq->ibsrq, srq_attr, srq_attr_mask);
-}
-
-void rdma_backend_destroy_srq(RdmaBackendSRQ *srq, RdmaDeviceResources *dev_res)
-{
-    if (srq->ibsrq) {
-        ibv_destroy_srq(srq->ibsrq);
-    }
-    g_slist_foreach(srq->cqe_ctx_list.list, free_cqe_ctx, dev_res);
-    rdma_protected_gslist_destroy(&srq->cqe_ctx_list);
-}
-
-#define CHK_ATTR(req, dev, member, fmt) ({ \
-    trace_rdma_check_dev_attr(#member, dev.member, req->member); \
-    if (req->member > dev.member) { \
-        rdma_warn_report("%s = "fmt" is higher than host device capability "fmt, \
-                         #member, req->member, dev.member); \
-        req->member = dev.member; \
-    } \
-})
-
-static int init_device_caps(RdmaBackendDev *backend_dev,
-                            struct ibv_device_attr *dev_attr)
-{
-    struct ibv_device_attr bk_dev_attr;
-    int rc;
-
-    rc = ibv_query_device(backend_dev->context, &bk_dev_attr);
-    if (rc) {
-        rdma_error_report("ibv_query_device fail, rc=%d, errno=%d", rc, errno);
-        return -EIO;
-    }
-
-    dev_attr->max_sge = MAX_SGE;
-    dev_attr->max_srq_sge = MAX_SGE;
-
-    CHK_ATTR(dev_attr, bk_dev_attr, max_mr_size, "%" PRId64);
-    CHK_ATTR(dev_attr, bk_dev_attr, max_qp, "%d");
-    CHK_ATTR(dev_attr, bk_dev_attr, max_sge, "%d");
-    CHK_ATTR(dev_attr, bk_dev_attr, max_cq, "%d");
-    CHK_ATTR(dev_attr, bk_dev_attr, max_mr, "%d");
-    CHK_ATTR(dev_attr, bk_dev_attr, max_pd, "%d");
-    CHK_ATTR(dev_attr, bk_dev_attr, max_qp_rd_atom, "%d");
-    CHK_ATTR(dev_attr, bk_dev_attr, max_qp_init_rd_atom, "%d");
-    CHK_ATTR(dev_attr, bk_dev_attr, max_ah, "%d");
-    CHK_ATTR(dev_attr, bk_dev_attr, max_srq, "%d");
-
-    return 0;
-}
-
-static inline void build_mad_hdr(struct ibv_grh *grh, union ibv_gid *sgid,
-                                 union ibv_gid *my_gid, int paylen)
-{
-    grh->paylen = htons(paylen);
-    grh->sgid = *sgid;
-    grh->dgid = *my_gid;
-}
-
-static void process_incoming_mad_req(RdmaBackendDev *backend_dev,
-                                     RdmaCmMuxMsg *msg)
-{
-    unsigned long cqe_ctx_id;
-    BackendCtx *bctx;
-    char *mad;
-
-    trace_mad_message("recv", msg->umad.mad, msg->umad_len);
-
-    cqe_ctx_id = rdma_protected_gqueue_pop_int64(&backend_dev->recv_mads_list);
-    if (cqe_ctx_id == -ENOENT) {
-        rdma_warn_report("No more free MADs buffers, waiting for a while");
-        sleep(THR_POLL_TO);
-        return;
-    }
-
-    bctx = rdma_rm_get_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
-    if (unlikely(!bctx)) {
-        rdma_error_report("No matching ctx for req %ld", cqe_ctx_id);
-        backend_dev->rdma_dev_res->stats.mad_rx_err++;
-        return;
-    }
-
-    mad = rdma_pci_dma_map(backend_dev->dev, bctx->sge.addr,
-                           bctx->sge.length);
-    if (!mad || bctx->sge.length < msg->umad_len + MAD_HDR_SIZE) {
-        backend_dev->rdma_dev_res->stats.mad_rx_err++;
-        complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_MAD_BUFF,
-                      bctx->up_ctx);
-    } else {
-        struct ibv_wc wc = {};
-        memset(mad, 0, bctx->sge.length);
-        build_mad_hdr((struct ibv_grh *)mad,
-                      (union ibv_gid *)&msg->umad.hdr.addr.gid, &msg->hdr.sgid,
-                      msg->umad_len);
-        memcpy(&mad[MAD_HDR_SIZE], msg->umad.mad, msg->umad_len);
-        rdma_pci_dma_unmap(backend_dev->dev, mad, bctx->sge.length);
-
-        wc.byte_len = msg->umad_len;
-        wc.status = IBV_WC_SUCCESS;
-        wc.wc_flags = IBV_WC_GRH;
-        backend_dev->rdma_dev_res->stats.mad_rx++;
-        comp_handler(bctx->up_ctx, &wc);
-    }
-
-    g_free(bctx);
-    rdma_rm_dealloc_cqe_ctx(backend_dev->rdma_dev_res, cqe_ctx_id);
-}
-
-static inline int rdmacm_mux_can_receive(void *opaque)
-{
-    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
-
-    return rdmacm_mux_can_process_async(backend_dev);
-}
-
-static void rdmacm_mux_read(void *opaque, const uint8_t *buf, int size)
-{
-    RdmaBackendDev *backend_dev = (RdmaBackendDev *)opaque;
-    RdmaCmMuxMsg *msg = (RdmaCmMuxMsg *)buf;
-
-    trace_rdmacm_mux("read", msg->hdr.msg_type, msg->hdr.op_code);
-
-    if (msg->hdr.msg_type != RDMACM_MUX_MSG_TYPE_REQ &&
-        msg->hdr.op_code != RDMACM_MUX_OP_CODE_MAD) {
-            rdma_error_report("Error: Not a MAD request, skipping");
-            return;
-    }
-    process_incoming_mad_req(backend_dev, msg);
-}
-
-static int mad_init(RdmaBackendDev *backend_dev, CharBackend *mad_chr_be)
-{
-    int ret;
-
-    backend_dev->rdmacm_mux.chr_be = mad_chr_be;
-
-    ret = qemu_chr_fe_backend_connected(backend_dev->rdmacm_mux.chr_be);
-    if (!ret) {
-        rdma_error_report("Missing chardev for MAD multiplexer");
-        return -EIO;
-    }
-
-    rdma_protected_gqueue_init(&backend_dev->recv_mads_list);
-
-    enable_rdmacm_mux_async(backend_dev);
-
-    qemu_chr_fe_set_handlers(backend_dev->rdmacm_mux.chr_be,
-                             rdmacm_mux_can_receive, rdmacm_mux_read, NULL,
-                             NULL, backend_dev, NULL, true);
-
-    return 0;
-}
-
-static void mad_stop(RdmaBackendDev *backend_dev)
-{
-    clean_recv_mads(backend_dev);
-}
-
-static void mad_fini(RdmaBackendDev *backend_dev)
-{
-    disable_rdmacm_mux_async(backend_dev);
-    qemu_chr_fe_disconnect(backend_dev->rdmacm_mux.chr_be);
-    rdma_protected_gqueue_destroy(&backend_dev->recv_mads_list);
-}
-
-int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev,
-                               union ibv_gid *gid)
-{
-    union ibv_gid sgid;
-    int ret;
-    int i = 0;
-
-    do {
-        ret = ibv_query_gid(backend_dev->context, backend_dev->port_num, i,
-                            &sgid);
-        i++;
-    } while (!ret && (memcmp(&sgid, gid, sizeof(*gid))));
-
-    trace_rdma_backend_get_gid_index(be64_to_cpu(gid->global.subnet_prefix),
-                                     be64_to_cpu(gid->global.interface_id),
-                                     i - 1);
-
-    return ret ? ret : i - 1;
-}
-
-int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname,
-                         union ibv_gid *gid)
-{
-    RdmaCmMuxMsg msg = {};
-    int ret;
-
-    trace_rdma_backend_gid_change("add", be64_to_cpu(gid->global.subnet_prefix),
-                                  be64_to_cpu(gid->global.interface_id));
-
-    msg.hdr.op_code = RDMACM_MUX_OP_CODE_REG;
-    memcpy(msg.hdr.sgid.raw, gid->raw, sizeof(msg.hdr.sgid));
-
-    ret = rdmacm_mux_send(backend_dev, &msg);
-    if (ret) {
-        rdma_error_report("Failed to register GID to rdma_umadmux (%d)", ret);
-        return -EIO;
-    }
-
-    qapi_event_send_rdma_gid_status_changed(ifname, true,
-                                            gid->global.subnet_prefix,
-                                            gid->global.interface_id);
-
-    return ret;
-}
-
-int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname,
-                         union ibv_gid *gid)
-{
-    RdmaCmMuxMsg msg = {};
-    int ret;
-
-    trace_rdma_backend_gid_change("del", be64_to_cpu(gid->global.subnet_prefix),
-                                  be64_to_cpu(gid->global.interface_id));
-
-    msg.hdr.op_code = RDMACM_MUX_OP_CODE_UNREG;
-    memcpy(msg.hdr.sgid.raw, gid->raw, sizeof(msg.hdr.sgid));
-
-    ret = rdmacm_mux_send(backend_dev, &msg);
-    if (ret) {
-        rdma_error_report("Failed to unregister GID from rdma_umadmux (%d)",
-                          ret);
-        return -EIO;
-    }
-
-    qapi_event_send_rdma_gid_status_changed(ifname, false,
-                                            gid->global.subnet_prefix,
-                                            gid->global.interface_id);
-
-    return 0;
-}
-
-int rdma_backend_init(RdmaBackendDev *backend_dev, PCIDevice *pdev,
-                      RdmaDeviceResources *rdma_dev_res,
-                      const char *backend_device_name, uint8_t port_num,
-                      struct ibv_device_attr *dev_attr, CharBackend *mad_chr_be)
-{
-    int i;
-    int ret = 0;
-    int num_ibv_devices;
-    struct ibv_device **dev_list;
-
-    memset(backend_dev, 0, sizeof(*backend_dev));
-
-    backend_dev->dev = pdev;
-    backend_dev->port_num = port_num;
-    backend_dev->rdma_dev_res = rdma_dev_res;
-
-    rdma_backend_register_comp_handler(dummy_comp_handler);
-
-    dev_list = ibv_get_device_list(&num_ibv_devices);
-    if (!dev_list) {
-        rdma_error_report("Failed to get IB devices list");
-        return -EIO;
-    }
-
-    if (num_ibv_devices == 0) {
-        rdma_error_report("No IB devices were found");
-        ret = -ENXIO;
-        goto out_free_dev_list;
-    }
-
-    if (backend_device_name) {
-        for (i = 0; dev_list[i]; ++i) {
-            if (!strcmp(ibv_get_device_name(dev_list[i]),
-                        backend_device_name)) {
-                break;
-            }
-        }
-
-        backend_dev->ib_dev = dev_list[i];
-        if (!backend_dev->ib_dev) {
-            rdma_error_report("Failed to find IB device %s",
-                              backend_device_name);
-            ret = -EIO;
-            goto out_free_dev_list;
-        }
-    } else {
-        backend_dev->ib_dev = *dev_list;
-    }
-
-    rdma_info_report("uverb device %s", backend_dev->ib_dev->dev_name);
-
-    backend_dev->context = ibv_open_device(backend_dev->ib_dev);
-    if (!backend_dev->context) {
-        rdma_error_report("Failed to open IB device %s",
-                          ibv_get_device_name(backend_dev->ib_dev));
-        ret = -EIO;
-        goto out;
-    }
-
-    backend_dev->channel = ibv_create_comp_channel(backend_dev->context);
-    if (!backend_dev->channel) {
-        rdma_error_report("Failed to create IB communication channel");
-        ret = -EIO;
-        goto out_close_device;
-    }
-
-    ret = init_device_caps(backend_dev, dev_attr);
-    if (ret) {
-        rdma_error_report("Failed to initialize device capabilities");
-        ret = -EIO;
-        goto out_destroy_comm_channel;
-    }
-
-
-    ret = mad_init(backend_dev, mad_chr_be);
-    if (ret) {
-        rdma_error_report("Failed to initialize mad");
-        ret = -EIO;
-        goto out_destroy_comm_channel;
-    }
-
-    backend_dev->comp_thread.run = false;
-    backend_dev->comp_thread.is_running = false;
-
-    ah_cache_init();
-
-    goto out_free_dev_list;
-
-out_destroy_comm_channel:
-    ibv_destroy_comp_channel(backend_dev->channel);
-
-out_close_device:
-    ibv_close_device(backend_dev->context);
-
-out_free_dev_list:
-    ibv_free_device_list(dev_list);
-
-out:
-    return ret;
-}
-
-
-void rdma_backend_start(RdmaBackendDev *backend_dev)
-{
-    start_comp_thread(backend_dev);
-}
-
-void rdma_backend_stop(RdmaBackendDev *backend_dev)
-{
-    mad_stop(backend_dev);
-    stop_backend_thread(&backend_dev->comp_thread);
-}
-
-void rdma_backend_fini(RdmaBackendDev *backend_dev)
-{
-    mad_fini(backend_dev);
-    g_hash_table_destroy(ah_hash);
-    ibv_destroy_comp_channel(backend_dev->channel);
-    ibv_close_device(backend_dev->context);
-}
diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
deleted file mode 100644
index 038d564433..0000000000
--- a/hw/rdma/rdma_rm.c
+++ /dev/null
@@ -1,812 +0,0 @@
-/*
- * QEMU paravirtual RDMA - Resource Manager Implementation
- *
- * Copyright (C) 2018 Oracle
- * Copyright (C) 2018 Red Hat Inc
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *     Marcel Apfelbaum <marcel@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#include "qemu/osdep.h"
-#include "qapi/error.h"
-#include "cpu.h"
-#include "monitor/monitor.h"
-
-#include "trace.h"
-#include "rdma_utils.h"
-#include "rdma_backend.h"
-#include "rdma_rm.h"
-
-void rdma_format_device_counters(RdmaDeviceResources *dev_res, GString *buf)
-{
-    g_string_append_printf(buf, "\ttx               : %" PRId64 "\n",
-                           dev_res->stats.tx);
-    g_string_append_printf(buf, "\ttx_len           : %" PRId64 "\n",
-                           dev_res->stats.tx_len);
-    g_string_append_printf(buf, "\ttx_err           : %" PRId64 "\n",
-                           dev_res->stats.tx_err);
-    g_string_append_printf(buf, "\trx_bufs          : %" PRId64 "\n",
-                           dev_res->stats.rx_bufs);
-    g_string_append_printf(buf, "\trx_srq           : %" PRId64 "\n",
-                           dev_res->stats.rx_srq);
-    g_string_append_printf(buf, "\trx_bufs_len      : %" PRId64 "\n",
-                           dev_res->stats.rx_bufs_len);
-    g_string_append_printf(buf, "\trx_bufs_err      : %" PRId64 "\n",
-                           dev_res->stats.rx_bufs_err);
-    g_string_append_printf(buf, "\tcomps            : %" PRId64 "\n",
-                           dev_res->stats.completions);
-    g_string_append_printf(buf, "\tmissing_comps    : %" PRId32 "\n",
-                           dev_res->stats.missing_cqe);
-    g_string_append_printf(buf, "\tpoll_cq (bk)     : %" PRId64 "\n",
-                           dev_res->stats.poll_cq_from_bk);
-    g_string_append_printf(buf, "\tpoll_cq_ppoll_to : %" PRId64 "\n",
-                           dev_res->stats.poll_cq_ppoll_to);
-    g_string_append_printf(buf, "\tpoll_cq (fe)     : %" PRId64 "\n",
-                           dev_res->stats.poll_cq_from_guest);
-    g_string_append_printf(buf, "\tpoll_cq_empty    : %" PRId64 "\n",
-                           dev_res->stats.poll_cq_from_guest_empty);
-    g_string_append_printf(buf, "\tmad_tx           : %" PRId64 "\n",
-                           dev_res->stats.mad_tx);
-    g_string_append_printf(buf, "\tmad_tx_err       : %" PRId64 "\n",
-                           dev_res->stats.mad_tx_err);
-    g_string_append_printf(buf, "\tmad_rx           : %" PRId64 "\n",
-                           dev_res->stats.mad_rx);
-    g_string_append_printf(buf, "\tmad_rx_err       : %" PRId64 "\n",
-                           dev_res->stats.mad_rx_err);
-    g_string_append_printf(buf, "\tmad_rx_bufs      : %" PRId64 "\n",
-                           dev_res->stats.mad_rx_bufs);
-    g_string_append_printf(buf, "\tmad_rx_bufs_err  : %" PRId64 "\n",
-                           dev_res->stats.mad_rx_bufs_err);
-    g_string_append_printf(buf, "\tPDs              : %" PRId32 "\n",
-                           dev_res->pd_tbl.used);
-    g_string_append_printf(buf, "\tMRs              : %" PRId32 "\n",
-                           dev_res->mr_tbl.used);
-    g_string_append_printf(buf, "\tUCs              : %" PRId32 "\n",
-                           dev_res->uc_tbl.used);
-    g_string_append_printf(buf, "\tQPs              : %" PRId32 "\n",
-                           dev_res->qp_tbl.used);
-    g_string_append_printf(buf, "\tCQs              : %" PRId32 "\n",
-                           dev_res->cq_tbl.used);
-    g_string_append_printf(buf, "\tCEQ_CTXs         : %" PRId32 "\n",
-                           dev_res->cqe_ctx_tbl.used);
-}
-
-static inline void res_tbl_init(const char *name, RdmaRmResTbl *tbl,
-                                uint32_t tbl_sz, uint32_t res_sz)
-{
-    tbl->tbl = g_malloc(tbl_sz * res_sz);
-
-    strncpy(tbl->name, name, MAX_RM_TBL_NAME);
-    tbl->name[MAX_RM_TBL_NAME - 1] = 0;
-
-    tbl->bitmap = bitmap_new(tbl_sz);
-    tbl->tbl_sz = tbl_sz;
-    tbl->res_sz = res_sz;
-    tbl->used = 0;
-    qemu_mutex_init(&tbl->lock);
-}
-
-static inline void res_tbl_free(RdmaRmResTbl *tbl)
-{
-    if (!tbl->bitmap) {
-        return;
-    }
-    qemu_mutex_destroy(&tbl->lock);
-    g_free(tbl->tbl);
-    g_free(tbl->bitmap);
-}
-
-static inline void *rdma_res_tbl_get(RdmaRmResTbl *tbl, uint32_t handle)
-{
-    trace_rdma_res_tbl_get(tbl->name, handle);
-
-    if ((handle < tbl->tbl_sz) && (test_bit(handle, tbl->bitmap))) {
-        return tbl->tbl + handle * tbl->res_sz;
-    } else {
-        rdma_error_report("Table %s, invalid handle %d", tbl->name, handle);
-        return NULL;
-    }
-}
-
-static inline void *rdma_res_tbl_alloc(RdmaRmResTbl *tbl, uint32_t *handle)
-{
-    qemu_mutex_lock(&tbl->lock);
-
-    *handle = find_first_zero_bit(tbl->bitmap, tbl->tbl_sz);
-    if (*handle > tbl->tbl_sz) {
-        rdma_error_report("Table %s, failed to allocate, bitmap is full",
-                          tbl->name);
-        qemu_mutex_unlock(&tbl->lock);
-        return NULL;
-    }
-
-    set_bit(*handle, tbl->bitmap);
-
-    tbl->used++;
-
-    qemu_mutex_unlock(&tbl->lock);
-
-    memset(tbl->tbl + *handle * tbl->res_sz, 0, tbl->res_sz);
-
-    trace_rdma_res_tbl_alloc(tbl->name, *handle);
-
-    return tbl->tbl + *handle * tbl->res_sz;
-}
-
-static inline void rdma_res_tbl_dealloc(RdmaRmResTbl *tbl, uint32_t handle)
-{
-    trace_rdma_res_tbl_dealloc(tbl->name, handle);
-
-    QEMU_LOCK_GUARD(&tbl->lock);
-
-    if (handle < tbl->tbl_sz) {
-        clear_bit(handle, tbl->bitmap);
-        tbl->used--;
-    }
-
-}
-
-int rdma_rm_alloc_pd(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                     uint32_t *pd_handle, uint32_t ctx_handle)
-{
-    RdmaRmPD *pd;
-    int ret = -ENOMEM;
-
-    pd = rdma_res_tbl_alloc(&dev_res->pd_tbl, pd_handle);
-    if (!pd) {
-        goto out;
-    }
-
-    ret = rdma_backend_create_pd(backend_dev, &pd->backend_pd);
-    if (ret) {
-        ret = -EIO;
-        goto out_tbl_dealloc;
-    }
-
-    pd->ctx_handle = ctx_handle;
-
-    return 0;
-
-out_tbl_dealloc:
-    rdma_res_tbl_dealloc(&dev_res->pd_tbl, *pd_handle);
-
-out:
-    return ret;
-}
-
-RdmaRmPD *rdma_rm_get_pd(RdmaDeviceResources *dev_res, uint32_t pd_handle)
-{
-    return rdma_res_tbl_get(&dev_res->pd_tbl, pd_handle);
-}
-
-void rdma_rm_dealloc_pd(RdmaDeviceResources *dev_res, uint32_t pd_handle)
-{
-    RdmaRmPD *pd = rdma_rm_get_pd(dev_res, pd_handle);
-
-    if (pd) {
-        rdma_backend_destroy_pd(&pd->backend_pd);
-        rdma_res_tbl_dealloc(&dev_res->pd_tbl, pd_handle);
-    }
-}
-
-int rdma_rm_alloc_mr(RdmaDeviceResources *dev_res, uint32_t pd_handle,
-                     uint64_t guest_start, uint64_t guest_length,
-                     void *host_virt, int access_flags, uint32_t *mr_handle,
-                     uint32_t *lkey, uint32_t *rkey)
-{
-    RdmaRmMR *mr;
-    int ret = 0;
-    RdmaRmPD *pd;
-
-    pd = rdma_rm_get_pd(dev_res, pd_handle);
-    if (!pd) {
-        return -EINVAL;
-    }
-
-    mr = rdma_res_tbl_alloc(&dev_res->mr_tbl, mr_handle);
-    if (!mr) {
-        return -ENOMEM;
-    }
-    trace_rdma_rm_alloc_mr(*mr_handle, host_virt, guest_start, guest_length,
-                           access_flags);
-
-    if (host_virt) {
-        mr->virt = host_virt;
-        mr->start = guest_start;
-        mr->length = guest_length;
-        mr->virt += (mr->start & (TARGET_PAGE_SIZE - 1));
-
-        ret = rdma_backend_create_mr(&mr->backend_mr, &pd->backend_pd, mr->virt,
-                                     mr->length, guest_start, access_flags);
-        if (ret) {
-            ret = -EIO;
-            goto out_dealloc_mr;
-        }
-#ifdef LEGACY_RDMA_REG_MR
-        /* We keep mr_handle in lkey so send and recv get get mr ptr */
-        *lkey = *mr_handle;
-#else
-        *lkey = rdma_backend_mr_lkey(&mr->backend_mr);
-#endif
-    }
-
-    *rkey = -1;
-
-    mr->pd_handle = pd_handle;
-
-    return 0;
-
-out_dealloc_mr:
-    rdma_res_tbl_dealloc(&dev_res->mr_tbl, *mr_handle);
-
-    return ret;
-}
-
-RdmaRmMR *rdma_rm_get_mr(RdmaDeviceResources *dev_res, uint32_t mr_handle)
-{
-    return rdma_res_tbl_get(&dev_res->mr_tbl, mr_handle);
-}
-
-void rdma_rm_dealloc_mr(RdmaDeviceResources *dev_res, uint32_t mr_handle)
-{
-    RdmaRmMR *mr = rdma_rm_get_mr(dev_res, mr_handle);
-
-    if (mr) {
-        rdma_backend_destroy_mr(&mr->backend_mr);
-        trace_rdma_rm_dealloc_mr(mr_handle, mr->start);
-        if (mr->start) {
-            mr->virt -= (mr->start & (TARGET_PAGE_SIZE - 1));
-            munmap(mr->virt, mr->length);
-        }
-        rdma_res_tbl_dealloc(&dev_res->mr_tbl, mr_handle);
-    }
-}
-
-int rdma_rm_alloc_uc(RdmaDeviceResources *dev_res, uint32_t pfn,
-                     uint32_t *uc_handle)
-{
-    RdmaRmUC *uc;
-
-    /* TODO: Need to make sure pfn is between bar start address and
-     * bsd+RDMA_BAR2_UAR_SIZE
-    if (pfn > RDMA_BAR2_UAR_SIZE) {
-        rdma_error_report("pfn out of range (%d > %d)", pfn,
-                          RDMA_BAR2_UAR_SIZE);
-        return -ENOMEM;
-    }
-    */
-
-    uc = rdma_res_tbl_alloc(&dev_res->uc_tbl, uc_handle);
-    if (!uc) {
-        return -ENOMEM;
-    }
-
-    return 0;
-}
-
-RdmaRmUC *rdma_rm_get_uc(RdmaDeviceResources *dev_res, uint32_t uc_handle)
-{
-    return rdma_res_tbl_get(&dev_res->uc_tbl, uc_handle);
-}
-
-void rdma_rm_dealloc_uc(RdmaDeviceResources *dev_res, uint32_t uc_handle)
-{
-    RdmaRmUC *uc = rdma_rm_get_uc(dev_res, uc_handle);
-
-    if (uc) {
-        rdma_res_tbl_dealloc(&dev_res->uc_tbl, uc_handle);
-    }
-}
-
-RdmaRmCQ *rdma_rm_get_cq(RdmaDeviceResources *dev_res, uint32_t cq_handle)
-{
-    return rdma_res_tbl_get(&dev_res->cq_tbl, cq_handle);
-}
-
-int rdma_rm_alloc_cq(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                     uint32_t cqe, uint32_t *cq_handle, void *opaque)
-{
-    int rc;
-    RdmaRmCQ *cq;
-
-    cq = rdma_res_tbl_alloc(&dev_res->cq_tbl, cq_handle);
-    if (!cq) {
-        return -ENOMEM;
-    }
-
-    cq->opaque = opaque;
-    cq->notify = CNT_CLEAR;
-
-    rc = rdma_backend_create_cq(backend_dev, &cq->backend_cq, cqe);
-    if (rc) {
-        rc = -EIO;
-        goto out_dealloc_cq;
-    }
-
-    return 0;
-
-out_dealloc_cq:
-    rdma_rm_dealloc_cq(dev_res, *cq_handle);
-
-    return rc;
-}
-
-void rdma_rm_req_notify_cq(RdmaDeviceResources *dev_res, uint32_t cq_handle,
-                           bool notify)
-{
-    RdmaRmCQ *cq;
-
-    cq = rdma_rm_get_cq(dev_res, cq_handle);
-    if (!cq) {
-        return;
-    }
-
-    if (cq->notify != CNT_SET) {
-        cq->notify = notify ? CNT_ARM : CNT_CLEAR;
-    }
-}
-
-void rdma_rm_dealloc_cq(RdmaDeviceResources *dev_res, uint32_t cq_handle)
-{
-    RdmaRmCQ *cq;
-
-    cq = rdma_rm_get_cq(dev_res, cq_handle);
-    if (!cq) {
-        return;
-    }
-
-    rdma_backend_destroy_cq(&cq->backend_cq);
-
-    rdma_res_tbl_dealloc(&dev_res->cq_tbl, cq_handle);
-}
-
-RdmaRmQP *rdma_rm_get_qp(RdmaDeviceResources *dev_res, uint32_t qpn)
-{
-    GBytes *key = g_bytes_new(&qpn, sizeof(qpn));
-
-    RdmaRmQP *qp = g_hash_table_lookup(dev_res->qp_hash, key);
-
-    g_bytes_unref(key);
-
-    if (!qp) {
-        rdma_error_report("Invalid QP handle %d", qpn);
-    }
-
-    return qp;
-}
-
-int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, uint32_t pd_handle,
-                     uint8_t qp_type, uint32_t max_send_wr,
-                     uint32_t max_send_sge, uint32_t send_cq_handle,
-                     uint32_t max_recv_wr, uint32_t max_recv_sge,
-                     uint32_t recv_cq_handle, void *opaque, uint32_t *qpn,
-                     uint8_t is_srq, uint32_t srq_handle)
-{
-    int rc;
-    RdmaRmQP *qp;
-    RdmaRmCQ *scq, *rcq;
-    RdmaRmPD *pd;
-    RdmaRmSRQ *srq = NULL;
-    uint32_t rm_qpn;
-
-    pd = rdma_rm_get_pd(dev_res, pd_handle);
-    if (!pd) {
-        return -EINVAL;
-    }
-
-    scq = rdma_rm_get_cq(dev_res, send_cq_handle);
-    rcq = rdma_rm_get_cq(dev_res, recv_cq_handle);
-
-    if (!scq || !rcq) {
-        rdma_error_report("Invalid send_cqn or recv_cqn (%d, %d)",
-                          send_cq_handle, recv_cq_handle);
-        return -EINVAL;
-    }
-
-    if (is_srq) {
-        srq = rdma_rm_get_srq(dev_res, srq_handle);
-        if (!srq) {
-            rdma_error_report("Invalid srqn %d", srq_handle);
-            return -EINVAL;
-        }
-
-        srq->recv_cq_handle = recv_cq_handle;
-    }
-
-    if (qp_type == IBV_QPT_GSI) {
-        scq->notify = CNT_SET;
-        rcq->notify = CNT_SET;
-    }
-
-    qp = rdma_res_tbl_alloc(&dev_res->qp_tbl, &rm_qpn);
-    if (!qp) {
-        return -ENOMEM;
-    }
-
-    qp->qpn = rm_qpn;
-    qp->qp_state = IBV_QPS_RESET;
-    qp->qp_type = qp_type;
-    qp->send_cq_handle = send_cq_handle;
-    qp->recv_cq_handle = recv_cq_handle;
-    qp->opaque = opaque;
-    qp->is_srq = is_srq;
-
-    rc = rdma_backend_create_qp(&qp->backend_qp, qp_type, &pd->backend_pd,
-                                &scq->backend_cq, &rcq->backend_cq,
-                                is_srq ? &srq->backend_srq : NULL,
-                                max_send_wr, max_recv_wr, max_send_sge,
-                                max_recv_sge);
-
-    if (rc) {
-        rc = -EIO;
-        goto out_dealloc_qp;
-    }
-
-    *qpn = rdma_backend_qpn(&qp->backend_qp);
-    trace_rdma_rm_alloc_qp(rm_qpn, *qpn, qp_type);
-    g_hash_table_insert(dev_res->qp_hash, g_bytes_new(qpn, sizeof(*qpn)), qp);
-
-    return 0;
-
-out_dealloc_qp:
-    rdma_res_tbl_dealloc(&dev_res->qp_tbl, qp->qpn);
-
-    return rc;
-}
-
-int rdma_rm_modify_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                      uint32_t qp_handle, uint32_t attr_mask, uint8_t sgid_idx,
-                      union ibv_gid *dgid, uint32_t dqpn,
-                      enum ibv_qp_state qp_state, uint32_t qkey,
-                      uint32_t rq_psn, uint32_t sq_psn)
-{
-    RdmaRmQP *qp;
-    int ret;
-
-    qp = rdma_rm_get_qp(dev_res, qp_handle);
-    if (!qp) {
-        return -EINVAL;
-    }
-
-    if (qp->qp_type == IBV_QPT_SMI) {
-        rdma_error_report("Got QP0 request");
-        return -EPERM;
-    } else if (qp->qp_type == IBV_QPT_GSI) {
-        return 0;
-    }
-
-    trace_rdma_rm_modify_qp(qp_handle, attr_mask, qp_state, sgid_idx);
-
-    if (attr_mask & IBV_QP_STATE) {
-        qp->qp_state = qp_state;
-
-        if (qp->qp_state == IBV_QPS_INIT) {
-            ret = rdma_backend_qp_state_init(backend_dev, &qp->backend_qp,
-                                             qp->qp_type, qkey);
-            if (ret) {
-                return -EIO;
-            }
-        }
-
-        if (qp->qp_state == IBV_QPS_RTR) {
-            /* Get backend gid index */
-            sgid_idx = rdma_rm_get_backend_gid_index(dev_res, backend_dev,
-                                                     sgid_idx);
-            if (sgid_idx <= 0) { /* TODO check also less than bk.max_sgid */
-                rdma_error_report("Failed to get bk sgid_idx for sgid_idx %d",
-                                  sgid_idx);
-                return -EIO;
-            }
-
-            ret = rdma_backend_qp_state_rtr(backend_dev, &qp->backend_qp,
-                                            qp->qp_type, sgid_idx, dgid, dqpn,
-                                            rq_psn, qkey,
-                                            attr_mask & IBV_QP_QKEY);
-            if (ret) {
-                return -EIO;
-            }
-        }
-
-        if (qp->qp_state == IBV_QPS_RTS) {
-            ret = rdma_backend_qp_state_rts(&qp->backend_qp, qp->qp_type,
-                                            sq_psn, qkey,
-                                            attr_mask & IBV_QP_QKEY);
-            if (ret) {
-                return -EIO;
-            }
-        }
-    }
-
-    return 0;
-}
-
-int rdma_rm_query_qp(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                     uint32_t qp_handle, struct ibv_qp_attr *attr,
-                     int attr_mask, struct ibv_qp_init_attr *init_attr)
-{
-    RdmaRmQP *qp;
-
-    qp = rdma_rm_get_qp(dev_res, qp_handle);
-    if (!qp) {
-        return -EINVAL;
-    }
-
-    return rdma_backend_query_qp(&qp->backend_qp, attr, attr_mask, init_attr);
-}
-
-void rdma_rm_dealloc_qp(RdmaDeviceResources *dev_res, uint32_t qp_handle)
-{
-    RdmaRmQP *qp;
-    GBytes *key;
-
-    key = g_bytes_new(&qp_handle, sizeof(qp_handle));
-    qp = g_hash_table_lookup(dev_res->qp_hash, key);
-    g_hash_table_remove(dev_res->qp_hash, key);
-    g_bytes_unref(key);
-
-    if (!qp) {
-        return;
-    }
-
-    rdma_backend_destroy_qp(&qp->backend_qp, dev_res);
-
-    rdma_res_tbl_dealloc(&dev_res->qp_tbl, qp->qpn);
-}
-
-RdmaRmSRQ *rdma_rm_get_srq(RdmaDeviceResources *dev_res, uint32_t srq_handle)
-{
-    return rdma_res_tbl_get(&dev_res->srq_tbl, srq_handle);
-}
-
-int rdma_rm_alloc_srq(RdmaDeviceResources *dev_res, uint32_t pd_handle,
-                      uint32_t max_wr, uint32_t max_sge, uint32_t srq_limit,
-                      uint32_t *srq_handle, void *opaque)
-{
-    RdmaRmSRQ *srq;
-    RdmaRmPD *pd;
-    int rc;
-
-    pd = rdma_rm_get_pd(dev_res, pd_handle);
-    if (!pd) {
-        return -EINVAL;
-    }
-
-    srq = rdma_res_tbl_alloc(&dev_res->srq_tbl, srq_handle);
-    if (!srq) {
-        return -ENOMEM;
-    }
-
-    rc = rdma_backend_create_srq(&srq->backend_srq, &pd->backend_pd,
-                                 max_wr, max_sge, srq_limit);
-    if (rc) {
-        rc = -EIO;
-        goto out_dealloc_srq;
-    }
-
-    srq->opaque = opaque;
-
-    return 0;
-
-out_dealloc_srq:
-    rdma_res_tbl_dealloc(&dev_res->srq_tbl, *srq_handle);
-
-    return rc;
-}
-
-int rdma_rm_query_srq(RdmaDeviceResources *dev_res, uint32_t srq_handle,
-                      struct ibv_srq_attr *srq_attr)
-{
-    RdmaRmSRQ *srq;
-
-    srq = rdma_rm_get_srq(dev_res, srq_handle);
-    if (!srq) {
-        return -EINVAL;
-    }
-
-    return rdma_backend_query_srq(&srq->backend_srq, srq_attr);
-}
-
-int rdma_rm_modify_srq(RdmaDeviceResources *dev_res, uint32_t srq_handle,
-                       struct ibv_srq_attr *srq_attr, int srq_attr_mask)
-{
-    RdmaRmSRQ *srq;
-
-    srq = rdma_rm_get_srq(dev_res, srq_handle);
-    if (!srq) {
-        return -EINVAL;
-    }
-
-    if ((srq_attr_mask & IBV_SRQ_LIMIT) &&
-        (srq_attr->srq_limit == 0)) {
-        return -EINVAL;
-    }
-
-    if ((srq_attr_mask & IBV_SRQ_MAX_WR) &&
-        (srq_attr->max_wr == 0)) {
-        return -EINVAL;
-    }
-
-    return rdma_backend_modify_srq(&srq->backend_srq, srq_attr,
-                                   srq_attr_mask);
-}
-
-void rdma_rm_dealloc_srq(RdmaDeviceResources *dev_res, uint32_t srq_handle)
-{
-    RdmaRmSRQ *srq;
-
-    srq = rdma_rm_get_srq(dev_res, srq_handle);
-    if (!srq) {
-        return;
-    }
-
-    rdma_backend_destroy_srq(&srq->backend_srq, dev_res);
-    rdma_res_tbl_dealloc(&dev_res->srq_tbl, srq_handle);
-}
-
-void *rdma_rm_get_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id)
-{
-    void **cqe_ctx;
-
-    cqe_ctx = rdma_res_tbl_get(&dev_res->cqe_ctx_tbl, cqe_ctx_id);
-    if (!cqe_ctx) {
-        return NULL;
-    }
-
-    return *cqe_ctx;
-}
-
-int rdma_rm_alloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t *cqe_ctx_id,
-                          void *ctx)
-{
-    void **cqe_ctx;
-
-    cqe_ctx = rdma_res_tbl_alloc(&dev_res->cqe_ctx_tbl, cqe_ctx_id);
-    if (!cqe_ctx) {
-        return -ENOMEM;
-    }
-
-    *cqe_ctx = ctx;
-
-    return 0;
-}
-
-void rdma_rm_dealloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id)
-{
-    rdma_res_tbl_dealloc(&dev_res->cqe_ctx_tbl, cqe_ctx_id);
-}
-
-int rdma_rm_add_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                    const char *ifname, union ibv_gid *gid, int gid_idx)
-{
-    int rc;
-
-    rc = rdma_backend_add_gid(backend_dev, ifname, gid);
-    if (rc) {
-        return -EINVAL;
-    }
-
-    memcpy(&dev_res->port.gid_tbl[gid_idx].gid, gid, sizeof(*gid));
-
-    return 0;
-}
-
-int rdma_rm_del_gid(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                    const char *ifname, int gid_idx)
-{
-    int rc;
-
-    if (!dev_res->port.gid_tbl[gid_idx].gid.global.interface_id) {
-        return 0;
-    }
-
-    rc = rdma_backend_del_gid(backend_dev, ifname,
-                              &dev_res->port.gid_tbl[gid_idx].gid);
-    if (rc) {
-        return -EINVAL;
-    }
-
-    memset(dev_res->port.gid_tbl[gid_idx].gid.raw, 0,
-           sizeof(dev_res->port.gid_tbl[gid_idx].gid));
-    dev_res->port.gid_tbl[gid_idx].backend_gid_index = -1;
-
-    return 0;
-}
-
-int rdma_rm_get_backend_gid_index(RdmaDeviceResources *dev_res,
-                                  RdmaBackendDev *backend_dev, int sgid_idx)
-{
-    if (unlikely(sgid_idx < 0 || sgid_idx >= MAX_PORT_GIDS)) {
-        rdma_error_report("Got invalid sgid_idx %d", sgid_idx);
-        return -EINVAL;
-    }
-
-    if (unlikely(dev_res->port.gid_tbl[sgid_idx].backend_gid_index == -1)) {
-        dev_res->port.gid_tbl[sgid_idx].backend_gid_index =
-        rdma_backend_get_gid_index(backend_dev,
-                                   &dev_res->port.gid_tbl[sgid_idx].gid);
-    }
-
-    return dev_res->port.gid_tbl[sgid_idx].backend_gid_index;
-}
-
-static void destroy_qp_hash_key(gpointer data)
-{
-    g_bytes_unref(data);
-}
-
-static void init_ports(RdmaDeviceResources *dev_res)
-{
-    int i;
-
-    memset(&dev_res->port, 0, sizeof(dev_res->port));
-
-    dev_res->port.state = IBV_PORT_DOWN;
-    for (i = 0; i < MAX_PORT_GIDS; i++) {
-        dev_res->port.gid_tbl[i].backend_gid_index = -1;
-    }
-}
-
-static void fini_ports(RdmaDeviceResources *dev_res,
-                       RdmaBackendDev *backend_dev, const char *ifname)
-{
-    int i;
-
-    dev_res->port.state = IBV_PORT_DOWN;
-    for (i = 0; i < MAX_PORT_GIDS; i++) {
-        rdma_rm_del_gid(dev_res, backend_dev, ifname, i);
-    }
-}
-
-int rdma_rm_init(RdmaDeviceResources *dev_res, struct ibv_device_attr *dev_attr)
-{
-    dev_res->qp_hash = g_hash_table_new_full(g_bytes_hash, g_bytes_equal,
-                                             destroy_qp_hash_key, NULL);
-    if (!dev_res->qp_hash) {
-        return -ENOMEM;
-    }
-
-    res_tbl_init("PD", &dev_res->pd_tbl, dev_attr->max_pd, sizeof(RdmaRmPD));
-    res_tbl_init("CQ", &dev_res->cq_tbl, dev_attr->max_cq, sizeof(RdmaRmCQ));
-    res_tbl_init("MR", &dev_res->mr_tbl, dev_attr->max_mr, sizeof(RdmaRmMR));
-    res_tbl_init("QP", &dev_res->qp_tbl, dev_attr->max_qp, sizeof(RdmaRmQP));
-    res_tbl_init("CQE_CTX", &dev_res->cqe_ctx_tbl, dev_attr->max_qp *
-                       dev_attr->max_qp_wr, sizeof(void *));
-    res_tbl_init("UC", &dev_res->uc_tbl, MAX_UCS, sizeof(RdmaRmUC));
-    res_tbl_init("SRQ", &dev_res->srq_tbl, dev_attr->max_srq,
-                 sizeof(RdmaRmSRQ));
-
-    init_ports(dev_res);
-
-    qemu_mutex_init(&dev_res->lock);
-
-    memset(&dev_res->stats, 0, sizeof(dev_res->stats));
-    qatomic_set(&dev_res->stats.missing_cqe, 0);
-
-    return 0;
-}
-
-void rdma_rm_fini(RdmaDeviceResources *dev_res, RdmaBackendDev *backend_dev,
-                  const char *ifname)
-{
-    qemu_mutex_destroy(&dev_res->lock);
-
-    fini_ports(dev_res, backend_dev, ifname);
-
-    res_tbl_free(&dev_res->srq_tbl);
-    res_tbl_free(&dev_res->uc_tbl);
-    res_tbl_free(&dev_res->cqe_ctx_tbl);
-    res_tbl_free(&dev_res->qp_tbl);
-    res_tbl_free(&dev_res->mr_tbl);
-    res_tbl_free(&dev_res->cq_tbl);
-    res_tbl_free(&dev_res->pd_tbl);
-
-    if (dev_res->qp_hash) {
-        g_hash_table_destroy(dev_res->qp_hash);
-    }
-}
diff --git a/hw/rdma/rdma_utils.c b/hw/rdma/rdma_utils.c
deleted file mode 100644
index c948baf052..0000000000
--- a/hw/rdma/rdma_utils.c
+++ /dev/null
@@ -1,126 +0,0 @@
-/*
- * QEMU paravirtual RDMA - Generic RDMA backend
- *
- * Copyright (C) 2018 Oracle
- * Copyright (C) 2018 Red Hat Inc
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *     Marcel Apfelbaum <marcel@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#include "qemu/osdep.h"
-#include "hw/pci/pci_device.h"
-#include "trace.h"
-#include "rdma_utils.h"
-
-void *rdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t len)
-{
-    void *p;
-    dma_addr_t pci_len = len;
-
-    if (!addr) {
-        rdma_error_report("addr is NULL");
-        return NULL;
-    }
-
-    p = pci_dma_map(dev, addr, &pci_len, DMA_DIRECTION_TO_DEVICE);
-    if (!p) {
-        rdma_error_report("pci_dma_map fail, addr=0x%"PRIx64", len=%"PRId64,
-                          addr, pci_len);
-        return NULL;
-    }
-
-    if (pci_len != len) {
-        rdma_pci_dma_unmap(dev, p, pci_len);
-        return NULL;
-    }
-
-    trace_rdma_pci_dma_map(addr, p, pci_len);
-
-    return p;
-}
-
-void rdma_pci_dma_unmap(PCIDevice *dev, void *buffer, dma_addr_t len)
-{
-    trace_rdma_pci_dma_unmap(buffer);
-    if (buffer) {
-        pci_dma_unmap(dev, buffer, len, DMA_DIRECTION_TO_DEVICE, 0);
-    }
-}
-
-void rdma_protected_gqueue_init(RdmaProtectedGQueue *list)
-{
-    qemu_mutex_init(&list->lock);
-    list->list = g_queue_new();
-}
-
-void rdma_protected_gqueue_destroy(RdmaProtectedGQueue *list)
-{
-    if (list->list) {
-        g_queue_free_full(list->list, g_free);
-        qemu_mutex_destroy(&list->lock);
-        list->list = NULL;
-    }
-}
-
-void rdma_protected_gqueue_append_int64(RdmaProtectedGQueue *list,
-                                        int64_t value)
-{
-    qemu_mutex_lock(&list->lock);
-    g_queue_push_tail(list->list, g_memdup(&value, sizeof(value)));
-    qemu_mutex_unlock(&list->lock);
-}
-
-int64_t rdma_protected_gqueue_pop_int64(RdmaProtectedGQueue *list)
-{
-    int64_t *valp;
-    int64_t val;
-
-    qemu_mutex_lock(&list->lock);
-
-    valp = g_queue_pop_head(list->list);
-    qemu_mutex_unlock(&list->lock);
-
-    if (!valp) {
-        return -ENOENT;
-    }
-
-    val = *valp;
-    g_free(valp);
-    return val;
-}
-
-void rdma_protected_gslist_init(RdmaProtectedGSList *list)
-{
-    qemu_mutex_init(&list->lock);
-}
-
-void rdma_protected_gslist_destroy(RdmaProtectedGSList *list)
-{
-    if (list->list) {
-        g_slist_free(list->list);
-        qemu_mutex_destroy(&list->lock);
-        list->list = NULL;
-    }
-}
-
-void rdma_protected_gslist_append_int32(RdmaProtectedGSList *list,
-                                        int32_t value)
-{
-    qemu_mutex_lock(&list->lock);
-    list->list = g_slist_prepend(list->list, GINT_TO_POINTER(value));
-    qemu_mutex_unlock(&list->lock);
-}
-
-void rdma_protected_gslist_remove_int32(RdmaProtectedGSList *list,
-                                        int32_t value)
-{
-    qemu_mutex_lock(&list->lock);
-    list->list = g_slist_remove(list->list, GINT_TO_POINTER(value));
-    qemu_mutex_unlock(&list->lock);
-}
diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
deleted file mode 100644
index d385d18d9c..0000000000
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ /dev/null
@@ -1,815 +0,0 @@
-/*
- * QEMU paravirtual RDMA - Command channel
- *
- * Copyright (C) 2018 Oracle
- * Copyright (C) 2018 Red Hat Inc
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *     Marcel Apfelbaum <marcel@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#include "qemu/osdep.h"
-#include "cpu.h"
-#include "hw/pci/pci.h"
-#include "hw/pci/pci_ids.h"
-
-#include "../rdma_backend.h"
-#include "../rdma_rm.h"
-#include "../rdma_utils.h"
-
-#include "trace.h"
-#include "pvrdma.h"
-#include "standard-headers/rdma/vmw_pvrdma-abi.h"
-
-static void *pvrdma_map_to_pdir(PCIDevice *pdev, uint64_t pdir_dma,
-                                uint32_t nchunks, size_t length)
-{
-    uint64_t *dir, *tbl;
-    int tbl_idx, dir_idx, addr_idx;
-    void *host_virt = NULL, *curr_page;
-
-    if (!nchunks) {
-        rdma_error_report("Got nchunks=0");
-        return NULL;
-    }
-
-    length = ROUND_UP(length, TARGET_PAGE_SIZE);
-    if (nchunks * TARGET_PAGE_SIZE != length) {
-        rdma_error_report("Invalid nchunks/length (%u, %lu)", nchunks,
-                          (unsigned long)length);
-        return NULL;
-    }
-
-    dir = rdma_pci_dma_map(pdev, pdir_dma, TARGET_PAGE_SIZE);
-    if (!dir) {
-        rdma_error_report("Failed to map to page directory");
-        return NULL;
-    }
-
-    tbl = rdma_pci_dma_map(pdev, dir[0], TARGET_PAGE_SIZE);
-    if (!tbl) {
-        rdma_error_report("Failed to map to page table 0");
-        goto out_unmap_dir;
-    }
-
-    curr_page = rdma_pci_dma_map(pdev, (dma_addr_t)tbl[0], TARGET_PAGE_SIZE);
-    if (!curr_page) {
-        rdma_error_report("Failed to map the page 0");
-        goto out_unmap_tbl;
-    }
-
-    host_virt = mremap(curr_page, 0, length, MREMAP_MAYMOVE);
-    if (host_virt == MAP_FAILED) {
-        host_virt = NULL;
-        rdma_error_report("Failed to remap memory for host_virt");
-        goto out_unmap_tbl;
-    }
-    trace_pvrdma_map_to_pdir_host_virt(curr_page, host_virt);
-
-    rdma_pci_dma_unmap(pdev, curr_page, TARGET_PAGE_SIZE);
-
-    dir_idx = 0;
-    tbl_idx = 1;
-    addr_idx = 1;
-    while (addr_idx < nchunks) {
-        if (tbl_idx == TARGET_PAGE_SIZE / sizeof(uint64_t)) {
-            tbl_idx = 0;
-            dir_idx++;
-            rdma_pci_dma_unmap(pdev, tbl, TARGET_PAGE_SIZE);
-            tbl = rdma_pci_dma_map(pdev, dir[dir_idx], TARGET_PAGE_SIZE);
-            if (!tbl) {
-                rdma_error_report("Failed to map to page table %d", dir_idx);
-                goto out_unmap_host_virt;
-            }
-        }
-
-        curr_page = rdma_pci_dma_map(pdev, (dma_addr_t)tbl[tbl_idx],
-                                     TARGET_PAGE_SIZE);
-        if (!curr_page) {
-            rdma_error_report("Failed to map to page %d, dir %d", tbl_idx,
-                              dir_idx);
-            goto out_unmap_host_virt;
-        }
-
-        mremap(curr_page, 0, TARGET_PAGE_SIZE, MREMAP_MAYMOVE | MREMAP_FIXED,
-               host_virt + TARGET_PAGE_SIZE * addr_idx);
-
-        trace_pvrdma_map_to_pdir_next_page(addr_idx, curr_page, host_virt +
-                                           TARGET_PAGE_SIZE * addr_idx);
-
-        rdma_pci_dma_unmap(pdev, curr_page, TARGET_PAGE_SIZE);
-
-        addr_idx++;
-
-        tbl_idx++;
-    }
-
-    goto out_unmap_tbl;
-
-out_unmap_host_virt:
-    munmap(host_virt, length);
-    host_virt = NULL;
-
-out_unmap_tbl:
-    rdma_pci_dma_unmap(pdev, tbl, TARGET_PAGE_SIZE);
-
-out_unmap_dir:
-    rdma_pci_dma_unmap(pdev, dir, TARGET_PAGE_SIZE);
-
-    return host_virt;
-}
-
-static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                      union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_query_port *cmd = &req->query_port;
-    struct pvrdma_cmd_query_port_resp *resp = &rsp->query_port_resp;
-    struct ibv_port_attr attrs = {};
-
-    if (cmd->port_num > MAX_PORTS) {
-        return -EINVAL;
-    }
-
-    if (rdma_backend_query_port(&dev->backend_dev, &attrs)) {
-        return -ENOMEM;
-    }
-
-    memset(resp, 0, sizeof(*resp));
-
-    /*
-     * The state, max_mtu and active_mtu fields are enums; the values
-     * for pvrdma_port_state and pvrdma_mtu match those for
-     * ibv_port_state and ibv_mtu, so we can cast them safely.
-     */
-    resp->attrs.state = dev->func0->device_active ?
-        (enum pvrdma_port_state)attrs.state : PVRDMA_PORT_DOWN;
-    resp->attrs.max_mtu = (enum pvrdma_mtu)attrs.max_mtu;
-    resp->attrs.active_mtu = (enum pvrdma_mtu)attrs.active_mtu;
-    resp->attrs.phys_state = attrs.phys_state;
-    resp->attrs.gid_tbl_len = MIN(MAX_PORT_GIDS, attrs.gid_tbl_len);
-    resp->attrs.max_msg_sz = 1024;
-    resp->attrs.pkey_tbl_len = MIN(MAX_PORT_PKEYS, attrs.pkey_tbl_len);
-    resp->attrs.active_width = 1;
-    resp->attrs.active_speed = 1;
-
-    return 0;
-}
-
-static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                      union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_query_pkey *cmd = &req->query_pkey;
-    struct pvrdma_cmd_query_pkey_resp *resp = &rsp->query_pkey_resp;
-
-    if (cmd->port_num > MAX_PORTS) {
-        return -EINVAL;
-    }
-
-    if (cmd->index > MAX_PKEYS) {
-        return -EINVAL;
-    }
-
-    memset(resp, 0, sizeof(*resp));
-
-    resp->pkey = PVRDMA_PKEY;
-
-    return 0;
-}
-
-static int create_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                     union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_create_pd *cmd = &req->create_pd;
-    struct pvrdma_cmd_create_pd_resp *resp = &rsp->create_pd_resp;
-
-    memset(resp, 0, sizeof(*resp));
-    return rdma_rm_alloc_pd(&dev->rdma_dev_res, &dev->backend_dev,
-                            &resp->pd_handle, cmd->ctx_handle);
-}
-
-static int destroy_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                      union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_destroy_pd *cmd = &req->destroy_pd;
-
-    rdma_rm_dealloc_pd(&dev->rdma_dev_res, cmd->pd_handle);
-
-    return 0;
-}
-
-static int create_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                     union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_create_mr *cmd = &req->create_mr;
-    struct pvrdma_cmd_create_mr_resp *resp = &rsp->create_mr_resp;
-    PCIDevice *pci_dev = PCI_DEVICE(dev);
-    void *host_virt = NULL;
-    int rc = 0;
-
-    memset(resp, 0, sizeof(*resp));
-
-    if (!(cmd->flags & PVRDMA_MR_FLAG_DMA)) {
-        host_virt = pvrdma_map_to_pdir(pci_dev, cmd->pdir_dma, cmd->nchunks,
-                                       cmd->length);
-        if (!host_virt) {
-            rdma_error_report("Failed to map to pdir");
-            return -EINVAL;
-        }
-    }
-
-    rc = rdma_rm_alloc_mr(&dev->rdma_dev_res, cmd->pd_handle, cmd->start,
-                          cmd->length, host_virt, cmd->access_flags,
-                          &resp->mr_handle, &resp->lkey, &resp->rkey);
-    if (rc && host_virt) {
-        munmap(host_virt, cmd->length);
-    }
-
-    return rc;
-}
-
-static int destroy_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                      union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_destroy_mr *cmd = &req->destroy_mr;
-
-    rdma_rm_dealloc_mr(&dev->rdma_dev_res, cmd->mr_handle);
-
-    return 0;
-}
-
-static int create_cq_ring(PCIDevice *pci_dev , PvrdmaRing **ring,
-                          uint64_t pdir_dma, uint32_t nchunks, uint32_t cqe)
-{
-    uint64_t *dir = NULL, *tbl = NULL;
-    PvrdmaRing *r;
-    int rc = -EINVAL;
-    char ring_name[MAX_RING_NAME_SZ];
-
-    if (!nchunks || nchunks > PVRDMA_MAX_FAST_REG_PAGES) {
-        rdma_error_report("Got invalid nchunks: %d", nchunks);
-        return rc;
-    }
-
-    dir = rdma_pci_dma_map(pci_dev, pdir_dma, TARGET_PAGE_SIZE);
-    if (!dir) {
-        rdma_error_report("Failed to map to CQ page directory");
-        goto out;
-    }
-
-    tbl = rdma_pci_dma_map(pci_dev, dir[0], TARGET_PAGE_SIZE);
-    if (!tbl) {
-        rdma_error_report("Failed to map to CQ page table");
-        goto out;
-    }
-
-    r = g_malloc(sizeof(*r));
-    *ring = r;
-
-    r->ring_state = rdma_pci_dma_map(pci_dev, tbl[0], TARGET_PAGE_SIZE);
-
-    if (!r->ring_state) {
-        rdma_error_report("Failed to map to CQ ring state");
-        goto out_free_ring;
-    }
-
-    sprintf(ring_name, "cq_ring_%" PRIx64, pdir_dma);
-    rc = pvrdma_ring_init(r, ring_name, pci_dev, &r->ring_state[1],
-                          cqe, sizeof(struct pvrdma_cqe),
-                          /* first page is ring state */
-                          (dma_addr_t *)&tbl[1], nchunks - 1);
-    if (rc) {
-        goto out_unmap_ring_state;
-    }
-
-    goto out;
-
-out_unmap_ring_state:
-    /* ring_state was in slot 1, not 0 so need to jump back */
-    rdma_pci_dma_unmap(pci_dev, --r->ring_state, TARGET_PAGE_SIZE);
-
-out_free_ring:
-    g_free(r);
-
-out:
-    rdma_pci_dma_unmap(pci_dev, tbl, TARGET_PAGE_SIZE);
-    rdma_pci_dma_unmap(pci_dev, dir, TARGET_PAGE_SIZE);
-
-    return rc;
-}
-
-static void destroy_cq_ring(PvrdmaRing *ring)
-{
-    pvrdma_ring_free(ring);
-    /* ring_state was in slot 1, not 0 so need to jump back */
-    rdma_pci_dma_unmap(ring->dev, --ring->ring_state, TARGET_PAGE_SIZE);
-    g_free(ring);
-}
-
-static int create_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                     union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_create_cq *cmd = &req->create_cq;
-    struct pvrdma_cmd_create_cq_resp *resp = &rsp->create_cq_resp;
-    PvrdmaRing *ring = NULL;
-    int rc;
-
-    memset(resp, 0, sizeof(*resp));
-
-    resp->cqe = cmd->cqe;
-
-    rc = create_cq_ring(PCI_DEVICE(dev), &ring, cmd->pdir_dma, cmd->nchunks,
-                        cmd->cqe);
-    if (rc) {
-        return rc;
-    }
-
-    rc = rdma_rm_alloc_cq(&dev->rdma_dev_res, &dev->backend_dev, cmd->cqe,
-                          &resp->cq_handle, ring);
-    if (rc) {
-        destroy_cq_ring(ring);
-    }
-
-    resp->cqe = cmd->cqe;
-
-    return rc;
-}
-
-static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                      union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_destroy_cq *cmd = &req->destroy_cq;
-    RdmaRmCQ *cq;
-    PvrdmaRing *ring;
-
-    cq = rdma_rm_get_cq(&dev->rdma_dev_res, cmd->cq_handle);
-    if (!cq) {
-        rdma_error_report("Got invalid CQ handle");
-        return -EINVAL;
-    }
-
-    ring = (PvrdmaRing *)cq->opaque;
-    destroy_cq_ring(ring);
-
-    rdma_rm_dealloc_cq(&dev->rdma_dev_res, cmd->cq_handle);
-
-    return 0;
-}
-
-static int create_qp_rings(PCIDevice *pci_dev, uint64_t pdir_dma,
-                           PvrdmaRing **rings, uint32_t scqe, uint32_t smax_sge,
-                           uint32_t spages, uint32_t rcqe, uint32_t rmax_sge,
-                           uint32_t rpages, uint8_t is_srq)
-{
-    uint64_t *dir = NULL, *tbl = NULL;
-    PvrdmaRing *sr, *rr;
-    int rc = -EINVAL;
-    char ring_name[MAX_RING_NAME_SZ];
-    uint32_t wqe_sz;
-
-    if (!spages || spages > PVRDMA_MAX_FAST_REG_PAGES) {
-        rdma_error_report("Got invalid send page count for QP ring: %d",
-                          spages);
-        return rc;
-    }
-
-    if (!is_srq && (!rpages || rpages > PVRDMA_MAX_FAST_REG_PAGES)) {
-        rdma_error_report("Got invalid recv page count for QP ring: %d",
-                          rpages);
-        return rc;
-    }
-
-    dir = rdma_pci_dma_map(pci_dev, pdir_dma, TARGET_PAGE_SIZE);
-    if (!dir) {
-        rdma_error_report("Failed to map to QP page directory");
-        goto out;
-    }
-
-    tbl = rdma_pci_dma_map(pci_dev, dir[0], TARGET_PAGE_SIZE);
-    if (!tbl) {
-        rdma_error_report("Failed to map to QP page table");
-        goto out;
-    }
-
-    if (!is_srq) {
-        sr = g_malloc(2 * sizeof(*rr));
-        rr = &sr[1];
-    } else {
-        sr = g_malloc(sizeof(*sr));
-    }
-
-    *rings = sr;
-
-    /* Create send ring */
-    sr->ring_state = rdma_pci_dma_map(pci_dev, tbl[0], TARGET_PAGE_SIZE);
-    if (!sr->ring_state) {
-        rdma_error_report("Failed to map to QP ring state");
-        goto out_free_sr_mem;
-    }
-
-    wqe_sz = pow2ceil(sizeof(struct pvrdma_sq_wqe_hdr) +
-                      sizeof(struct pvrdma_sge) * smax_sge - 1);
-
-    sprintf(ring_name, "qp_sring_%" PRIx64, pdir_dma);
-    rc = pvrdma_ring_init(sr, ring_name, pci_dev, sr->ring_state,
-                          scqe, wqe_sz, (dma_addr_t *)&tbl[1], spages);
-    if (rc) {
-        goto out_unmap_ring_state;
-    }
-
-    if (!is_srq) {
-        /* Create recv ring */
-        rr->ring_state = &sr->ring_state[1];
-        wqe_sz = pow2ceil(sizeof(struct pvrdma_rq_wqe_hdr) +
-                          sizeof(struct pvrdma_sge) * rmax_sge - 1);
-        sprintf(ring_name, "qp_rring_%" PRIx64, pdir_dma);
-        rc = pvrdma_ring_init(rr, ring_name, pci_dev, rr->ring_state,
-                              rcqe, wqe_sz, (dma_addr_t *)&tbl[1 + spages],
-                              rpages);
-        if (rc) {
-            goto out_free_sr;
-        }
-    }
-
-    goto out;
-
-out_free_sr:
-    pvrdma_ring_free(sr);
-
-out_unmap_ring_state:
-    rdma_pci_dma_unmap(pci_dev, sr->ring_state, TARGET_PAGE_SIZE);
-
-out_free_sr_mem:
-    g_free(sr);
-
-out:
-    rdma_pci_dma_unmap(pci_dev, tbl, TARGET_PAGE_SIZE);
-    rdma_pci_dma_unmap(pci_dev, dir, TARGET_PAGE_SIZE);
-
-    return rc;
-}
-
-static void destroy_qp_rings(PvrdmaRing *ring, uint8_t is_srq)
-{
-    pvrdma_ring_free(&ring[0]);
-    if (!is_srq) {
-        pvrdma_ring_free(&ring[1]);
-    }
-
-    rdma_pci_dma_unmap(ring->dev, ring->ring_state, TARGET_PAGE_SIZE);
-    g_free(ring);
-}
-
-static int create_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                     union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_create_qp *cmd = &req->create_qp;
-    struct pvrdma_cmd_create_qp_resp *resp = &rsp->create_qp_resp;
-    PvrdmaRing *rings = NULL;
-    int rc;
-
-    memset(resp, 0, sizeof(*resp));
-
-    rc = create_qp_rings(PCI_DEVICE(dev), cmd->pdir_dma, &rings,
-                         cmd->max_send_wr, cmd->max_send_sge, cmd->send_chunks,
-                         cmd->max_recv_wr, cmd->max_recv_sge,
-                         cmd->total_chunks - cmd->send_chunks - 1, cmd->is_srq);
-    if (rc) {
-        return rc;
-    }
-
-    rc = rdma_rm_alloc_qp(&dev->rdma_dev_res, cmd->pd_handle, cmd->qp_type,
-                          cmd->max_send_wr, cmd->max_send_sge,
-                          cmd->send_cq_handle, cmd->max_recv_wr,
-                          cmd->max_recv_sge, cmd->recv_cq_handle, rings,
-                          &resp->qpn, cmd->is_srq, cmd->srq_handle);
-    if (rc) {
-        destroy_qp_rings(rings, cmd->is_srq);
-        return rc;
-    }
-
-    resp->max_send_wr = cmd->max_send_wr;
-    resp->max_recv_wr = cmd->max_recv_wr;
-    resp->max_send_sge = cmd->max_send_sge;
-    resp->max_recv_sge = cmd->max_recv_sge;
-    resp->max_inline_data = cmd->max_inline_data;
-
-    return 0;
-}
-
-static int modify_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                     union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_modify_qp *cmd = &req->modify_qp;
-
-    /* No need to verify sgid_index since it is u8 */
-
-    return rdma_rm_modify_qp(&dev->rdma_dev_res, &dev->backend_dev,
-                             cmd->qp_handle, cmd->attr_mask,
-                             cmd->attrs.ah_attr.grh.sgid_index,
-                             (union ibv_gid *)&cmd->attrs.ah_attr.grh.dgid,
-                             cmd->attrs.dest_qp_num,
-                             (enum ibv_qp_state)cmd->attrs.qp_state,
-                             cmd->attrs.qkey, cmd->attrs.rq_psn,
-                             cmd->attrs.sq_psn);
-}
-
-static int query_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                     union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_query_qp *cmd = &req->query_qp;
-    struct pvrdma_cmd_query_qp_resp *resp = &rsp->query_qp_resp;
-    struct ibv_qp_init_attr init_attr;
-
-    memset(resp, 0, sizeof(*resp));
-
-    return rdma_rm_query_qp(&dev->rdma_dev_res, &dev->backend_dev,
-                            cmd->qp_handle,
-                            (struct ibv_qp_attr *)&resp->attrs,
-                            cmd->attr_mask,
-                            &init_attr);
-}
-
-static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                      union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_destroy_qp *cmd = &req->destroy_qp;
-    RdmaRmQP *qp;
-    PvrdmaRing *ring;
-
-    qp = rdma_rm_get_qp(&dev->rdma_dev_res, cmd->qp_handle);
-    if (!qp) {
-        return -EINVAL;
-    }
-
-    ring = (PvrdmaRing *)qp->opaque;
-    destroy_qp_rings(ring, qp->is_srq);
-    rdma_rm_dealloc_qp(&dev->rdma_dev_res, cmd->qp_handle);
-
-    return 0;
-}
-
-static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                       union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_create_bind *cmd = &req->create_bind;
-    union ibv_gid *gid = (union ibv_gid *)&cmd->new_gid;
-
-    if (cmd->index >= MAX_PORT_GIDS) {
-        return -EINVAL;
-    }
-
-    return rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
-                           dev->backend_eth_device_name, gid, cmd->index);
-}
-
-static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                        union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_destroy_bind *cmd = &req->destroy_bind;
-
-    if (cmd->index >= MAX_PORT_GIDS) {
-        return -EINVAL;
-    }
-
-    return rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
-                           dev->backend_eth_device_name, cmd->index);
-}
-
-static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                     union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_create_uc *cmd = &req->create_uc;
-    struct pvrdma_cmd_create_uc_resp *resp = &rsp->create_uc_resp;
-
-    memset(resp, 0, sizeof(*resp));
-    return rdma_rm_alloc_uc(&dev->rdma_dev_res, cmd->pfn, &resp->ctx_handle);
-}
-
-static int destroy_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                      union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_destroy_uc *cmd = &req->destroy_uc;
-
-    rdma_rm_dealloc_uc(&dev->rdma_dev_res, cmd->ctx_handle);
-
-    return 0;
-}
-
-static int create_srq_ring(PCIDevice *pci_dev, PvrdmaRing **ring,
-                           uint64_t pdir_dma, uint32_t max_wr,
-                           uint32_t max_sge, uint32_t nchunks)
-{
-    uint64_t *dir = NULL, *tbl = NULL;
-    PvrdmaRing *r;
-    int rc = -EINVAL;
-    char ring_name[MAX_RING_NAME_SZ];
-    uint32_t wqe_sz;
-
-    if (!nchunks || nchunks > PVRDMA_MAX_FAST_REG_PAGES) {
-        rdma_error_report("Got invalid page count for SRQ ring: %d",
-                          nchunks);
-        return rc;
-    }
-
-    dir = rdma_pci_dma_map(pci_dev, pdir_dma, TARGET_PAGE_SIZE);
-    if (!dir) {
-        rdma_error_report("Failed to map to SRQ page directory");
-        goto out;
-    }
-
-    tbl = rdma_pci_dma_map(pci_dev, dir[0], TARGET_PAGE_SIZE);
-    if (!tbl) {
-        rdma_error_report("Failed to map to SRQ page table");
-        goto out;
-    }
-
-    r = g_malloc(sizeof(*r));
-    *ring = r;
-
-    r->ring_state = rdma_pci_dma_map(pci_dev, tbl[0], TARGET_PAGE_SIZE);
-    if (!r->ring_state) {
-        rdma_error_report("Failed to map tp SRQ ring state");
-        goto out_free_ring_mem;
-    }
-
-    wqe_sz = pow2ceil(sizeof(struct pvrdma_rq_wqe_hdr) +
-                      sizeof(struct pvrdma_sge) * max_sge - 1);
-    sprintf(ring_name, "srq_ring_%" PRIx64, pdir_dma);
-    rc = pvrdma_ring_init(r, ring_name, pci_dev, &r->ring_state[1], max_wr,
-                          wqe_sz, (dma_addr_t *)&tbl[1], nchunks - 1);
-    if (rc) {
-        goto out_unmap_ring_state;
-    }
-
-    goto out;
-
-out_unmap_ring_state:
-    rdma_pci_dma_unmap(pci_dev, r->ring_state, TARGET_PAGE_SIZE);
-
-out_free_ring_mem:
-    g_free(r);
-
-out:
-    rdma_pci_dma_unmap(pci_dev, tbl, TARGET_PAGE_SIZE);
-    rdma_pci_dma_unmap(pci_dev, dir, TARGET_PAGE_SIZE);
-
-    return rc;
-}
-
-static void destroy_srq_ring(PvrdmaRing *ring)
-{
-    pvrdma_ring_free(ring);
-    rdma_pci_dma_unmap(ring->dev, ring->ring_state, TARGET_PAGE_SIZE);
-    g_free(ring);
-}
-
-static int create_srq(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                      union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_create_srq *cmd = &req->create_srq;
-    struct pvrdma_cmd_create_srq_resp *resp = &rsp->create_srq_resp;
-    PvrdmaRing *ring = NULL;
-    int rc;
-
-    memset(resp, 0, sizeof(*resp));
-
-    rc = create_srq_ring(PCI_DEVICE(dev), &ring, cmd->pdir_dma,
-                         cmd->attrs.max_wr, cmd->attrs.max_sge,
-                         cmd->nchunks);
-    if (rc) {
-        return rc;
-    }
-
-    rc = rdma_rm_alloc_srq(&dev->rdma_dev_res, cmd->pd_handle,
-                           cmd->attrs.max_wr, cmd->attrs.max_sge,
-                           cmd->attrs.srq_limit, &resp->srqn, ring);
-    if (rc) {
-        destroy_srq_ring(ring);
-        return rc;
-    }
-
-    return 0;
-}
-
-static int query_srq(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                     union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_query_srq *cmd = &req->query_srq;
-    struct pvrdma_cmd_query_srq_resp *resp = &rsp->query_srq_resp;
-
-    memset(resp, 0, sizeof(*resp));
-
-    return rdma_rm_query_srq(&dev->rdma_dev_res, cmd->srq_handle,
-                             (struct ibv_srq_attr *)&resp->attrs);
-}
-
-static int modify_srq(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                      union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_modify_srq *cmd = &req->modify_srq;
-
-    /* Only support SRQ limit */
-    if (!(cmd->attr_mask & IBV_SRQ_LIMIT) ||
-        (cmd->attr_mask & IBV_SRQ_MAX_WR))
-            return -EINVAL;
-
-    return rdma_rm_modify_srq(&dev->rdma_dev_res, cmd->srq_handle,
-                              (struct ibv_srq_attr *)&cmd->attrs,
-                              cmd->attr_mask);
-}
-
-static int destroy_srq(PVRDMADev *dev, union pvrdma_cmd_req *req,
-                       union pvrdma_cmd_resp *rsp)
-{
-    struct pvrdma_cmd_destroy_srq *cmd = &req->destroy_srq;
-    RdmaRmSRQ *srq;
-    PvrdmaRing *ring;
-
-    srq = rdma_rm_get_srq(&dev->rdma_dev_res, cmd->srq_handle);
-    if (!srq) {
-        return -EINVAL;
-    }
-
-    ring = (PvrdmaRing *)srq->opaque;
-    destroy_srq_ring(ring);
-    rdma_rm_dealloc_srq(&dev->rdma_dev_res, cmd->srq_handle);
-
-    return 0;
-}
-
-struct cmd_handler {
-    uint32_t cmd;
-    uint32_t ack;
-    int (*exec)(PVRDMADev *dev, union pvrdma_cmd_req *req,
-            union pvrdma_cmd_resp *rsp);
-};
-
-static struct cmd_handler cmd_handlers[] = {
-    {PVRDMA_CMD_QUERY_PORT,   PVRDMA_CMD_QUERY_PORT_RESP,        query_port},
-    {PVRDMA_CMD_QUERY_PKEY,   PVRDMA_CMD_QUERY_PKEY_RESP,        query_pkey},
-    {PVRDMA_CMD_CREATE_PD,    PVRDMA_CMD_CREATE_PD_RESP,         create_pd},
-    {PVRDMA_CMD_DESTROY_PD,   PVRDMA_CMD_DESTROY_PD_RESP_NOOP,   destroy_pd},
-    {PVRDMA_CMD_CREATE_MR,    PVRDMA_CMD_CREATE_MR_RESP,         create_mr},
-    {PVRDMA_CMD_DESTROY_MR,   PVRDMA_CMD_DESTROY_MR_RESP_NOOP,   destroy_mr},
-    {PVRDMA_CMD_CREATE_CQ,    PVRDMA_CMD_CREATE_CQ_RESP,         create_cq},
-    {PVRDMA_CMD_RESIZE_CQ,    PVRDMA_CMD_RESIZE_CQ_RESP,         NULL},
-    {PVRDMA_CMD_DESTROY_CQ,   PVRDMA_CMD_DESTROY_CQ_RESP_NOOP,   destroy_cq},
-    {PVRDMA_CMD_CREATE_QP,    PVRDMA_CMD_CREATE_QP_RESP,         create_qp},
-    {PVRDMA_CMD_MODIFY_QP,    PVRDMA_CMD_MODIFY_QP_RESP,         modify_qp},
-    {PVRDMA_CMD_QUERY_QP,     PVRDMA_CMD_QUERY_QP_RESP,          query_qp},
-    {PVRDMA_CMD_DESTROY_QP,   PVRDMA_CMD_DESTROY_QP_RESP,        destroy_qp},
-    {PVRDMA_CMD_CREATE_UC,    PVRDMA_CMD_CREATE_UC_RESP,         create_uc},
-    {PVRDMA_CMD_DESTROY_UC,   PVRDMA_CMD_DESTROY_UC_RESP_NOOP,   destroy_uc},
-    {PVRDMA_CMD_CREATE_BIND,  PVRDMA_CMD_CREATE_BIND_RESP_NOOP,  create_bind},
-    {PVRDMA_CMD_DESTROY_BIND, PVRDMA_CMD_DESTROY_BIND_RESP_NOOP, destroy_bind},
-    {PVRDMA_CMD_CREATE_SRQ,   PVRDMA_CMD_CREATE_SRQ_RESP,        create_srq},
-    {PVRDMA_CMD_QUERY_SRQ,    PVRDMA_CMD_QUERY_SRQ_RESP,         query_srq},
-    {PVRDMA_CMD_MODIFY_SRQ,   PVRDMA_CMD_MODIFY_SRQ_RESP,        modify_srq},
-    {PVRDMA_CMD_DESTROY_SRQ,  PVRDMA_CMD_DESTROY_SRQ_RESP,       destroy_srq},
-};
-
-int pvrdma_exec_cmd(PVRDMADev *dev)
-{
-    int err = 0xFFFF;
-    DSRInfo *dsr_info;
-
-    dsr_info = &dev->dsr_info;
-
-    if (!dsr_info->dsr) {
-            /* Buggy or malicious guest driver */
-            rdma_error_report("Exec command without dsr, req or rsp buffers");
-            goto out;
-    }
-
-    if (dsr_info->req->hdr.cmd >= sizeof(cmd_handlers) /
-                      sizeof(struct cmd_handler)) {
-        rdma_error_report("Unsupported command");
-        goto out;
-    }
-
-    if (!cmd_handlers[dsr_info->req->hdr.cmd].exec) {
-        rdma_error_report("Unsupported command (not implemented yet)");
-        goto out;
-    }
-
-    err = cmd_handlers[dsr_info->req->hdr.cmd].exec(dev, dsr_info->req,
-                                                    dsr_info->rsp);
-    dsr_info->rsp->hdr.response = dsr_info->req->hdr.response;
-    dsr_info->rsp->hdr.ack = cmd_handlers[dsr_info->req->hdr.cmd].ack;
-    dsr_info->rsp->hdr.err = err < 0 ? -err : 0;
-
-    trace_pvrdma_exec_cmd(dsr_info->req->hdr.cmd, dsr_info->rsp->hdr.err);
-
-    dev->stats.commands++;
-
-out:
-    set_reg_val(dev, PVRDMA_REG_ERR, err);
-    post_interrupt(dev, INTR_VEC_CMD_RING);
-
-    return (err == 0) ? 0 : -EINVAL;
-}
diff --git a/hw/rdma/vmw/pvrdma_dev_ring.c b/hw/rdma/vmw/pvrdma_dev_ring.c
deleted file mode 100644
index 30ce22a5be..0000000000
--- a/hw/rdma/vmw/pvrdma_dev_ring.c
+++ /dev/null
@@ -1,141 +0,0 @@
-/*
- * QEMU paravirtual RDMA - Device rings
- *
- * Copyright (C) 2018 Oracle
- * Copyright (C) 2018 Red Hat Inc
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *     Marcel Apfelbaum <marcel@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#include "qemu/osdep.h"
-#include "hw/pci/pci.h"
-#include "cpu.h"
-#include "qemu/cutils.h"
-
-#include "trace.h"
-
-#include "../rdma_utils.h"
-#include "pvrdma_dev_ring.h"
-
-int pvrdma_ring_init(PvrdmaRing *ring, const char *name, PCIDevice *dev,
-                     PvrdmaRingState *ring_state, uint32_t max_elems,
-                     size_t elem_sz, dma_addr_t *tbl, uint32_t npages)
-{
-    int i;
-    int rc = 0;
-
-    pstrcpy(ring->name, MAX_RING_NAME_SZ, name);
-    ring->dev = dev;
-    ring->ring_state = ring_state;
-    ring->max_elems = max_elems;
-    ring->elem_sz = elem_sz;
-    /* TODO: Give a moment to think if we want to redo driver settings
-    qatomic_set(&ring->ring_state->prod_tail, 0);
-    qatomic_set(&ring->ring_state->cons_head, 0);
-    */
-    ring->npages = npages;
-    ring->pages = g_new0(void *, npages);
-
-    for (i = 0; i < npages; i++) {
-        if (!tbl[i]) {
-            rdma_error_report("npages=%d but tbl[%d] is NULL", npages, i);
-            continue;
-        }
-
-        ring->pages[i] = rdma_pci_dma_map(dev, tbl[i], TARGET_PAGE_SIZE);
-        if (!ring->pages[i]) {
-            rc = -ENOMEM;
-            rdma_error_report("Failed to map to page %d in ring %s", i, name);
-            goto out_free;
-        }
-        memset(ring->pages[i], 0, TARGET_PAGE_SIZE);
-    }
-
-    goto out;
-
-out_free:
-    while (i--) {
-        rdma_pci_dma_unmap(dev, ring->pages[i], TARGET_PAGE_SIZE);
-    }
-    g_free(ring->pages);
-
-out:
-    return rc;
-}
-
-void *pvrdma_ring_next_elem_read(PvrdmaRing *ring)
-{
-    unsigned int idx, offset;
-    const uint32_t tail = qatomic_read(&ring->ring_state->prod_tail);
-    const uint32_t head = qatomic_read(&ring->ring_state->cons_head);
-
-    if (tail & ~((ring->max_elems << 1) - 1) ||
-        head & ~((ring->max_elems << 1) - 1) ||
-        tail == head) {
-        trace_pvrdma_ring_next_elem_read_no_data(ring->name);
-        return NULL;
-    }
-
-    idx = head & (ring->max_elems - 1);
-    offset = idx * ring->elem_sz;
-    return ring->pages[offset / TARGET_PAGE_SIZE] + (offset % TARGET_PAGE_SIZE);
-}
-
-void pvrdma_ring_read_inc(PvrdmaRing *ring)
-{
-    uint32_t idx = qatomic_read(&ring->ring_state->cons_head);
-
-    idx = (idx + 1) & ((ring->max_elems << 1) - 1);
-    qatomic_set(&ring->ring_state->cons_head, idx);
-}
-
-void *pvrdma_ring_next_elem_write(PvrdmaRing *ring)
-{
-    unsigned int idx, offset;
-    const uint32_t tail = qatomic_read(&ring->ring_state->prod_tail);
-    const uint32_t head = qatomic_read(&ring->ring_state->cons_head);
-
-    if (tail & ~((ring->max_elems << 1) - 1) ||
-        head & ~((ring->max_elems << 1) - 1) ||
-        tail == (head ^ ring->max_elems)) {
-        rdma_error_report("CQ is full");
-        return NULL;
-    }
-
-    idx = tail & (ring->max_elems - 1);
-    offset = idx * ring->elem_sz;
-    return ring->pages[offset / TARGET_PAGE_SIZE] + (offset % TARGET_PAGE_SIZE);
-}
-
-void pvrdma_ring_write_inc(PvrdmaRing *ring)
-{
-    uint32_t idx = qatomic_read(&ring->ring_state->prod_tail);
-
-    idx = (idx + 1) & ((ring->max_elems << 1) - 1);
-    qatomic_set(&ring->ring_state->prod_tail, idx);
-}
-
-void pvrdma_ring_free(PvrdmaRing *ring)
-{
-    if (!ring) {
-        return;
-    }
-
-    if (!ring->pages) {
-        return;
-    }
-
-    while (ring->npages--) {
-        rdma_pci_dma_unmap(ring->dev, ring->pages[ring->npages],
-                           TARGET_PAGE_SIZE);
-    }
-
-    g_free(ring->pages);
-    ring->pages = NULL;
-}
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
deleted file mode 100644
index e735ff97eb..0000000000
--- a/hw/rdma/vmw/pvrdma_main.c
+++ /dev/null
@@ -1,735 +0,0 @@
-/*
- * QEMU paravirtual RDMA
- *
- * Copyright (C) 2018 Oracle
- * Copyright (C) 2018 Red Hat Inc
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *     Marcel Apfelbaum <marcel@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#include "qemu/osdep.h"
-#include "qapi/error.h"
-#include "qemu/module.h"
-#include "hw/pci/pci.h"
-#include "hw/pci/pci_ids.h"
-#include "hw/pci/msi.h"
-#include "hw/pci/msix.h"
-#include "hw/qdev-properties.h"
-#include "hw/qdev-properties-system.h"
-#include "cpu.h"
-#include "trace.h"
-#include "monitor/monitor.h"
-#include "hw/rdma/rdma.h"
-
-#include "../rdma_rm.h"
-#include "../rdma_backend.h"
-#include "../rdma_utils.h"
-
-#include <infiniband/verbs.h>
-#include "pvrdma.h"
-#include "standard-headers/rdma/vmw_pvrdma-abi.h"
-#include "sysemu/runstate.h"
-#include "standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h"
-#include "pvrdma_qp_ops.h"
-
-static Property pvrdma_dev_properties[] = {
-    DEFINE_PROP_STRING("netdev", PVRDMADev, backend_eth_device_name),
-    DEFINE_PROP_STRING("ibdev", PVRDMADev, backend_device_name),
-    DEFINE_PROP_UINT8("ibport", PVRDMADev, backend_port_num, 1),
-    DEFINE_PROP_UINT64("dev-caps-max-mr-size", PVRDMADev, dev_attr.max_mr_size,
-                       MAX_MR_SIZE),
-    DEFINE_PROP_INT32("dev-caps-max-qp", PVRDMADev, dev_attr.max_qp, MAX_QP),
-    DEFINE_PROP_INT32("dev-caps-max-cq", PVRDMADev, dev_attr.max_cq, MAX_CQ),
-    DEFINE_PROP_INT32("dev-caps-max-mr", PVRDMADev, dev_attr.max_mr, MAX_MR),
-    DEFINE_PROP_INT32("dev-caps-max-pd", PVRDMADev, dev_attr.max_pd, MAX_PD),
-    DEFINE_PROP_INT32("dev-caps-qp-rd-atom", PVRDMADev, dev_attr.max_qp_rd_atom,
-                      MAX_QP_RD_ATOM),
-    DEFINE_PROP_INT32("dev-caps-max-qp-init-rd-atom", PVRDMADev,
-                      dev_attr.max_qp_init_rd_atom, MAX_QP_INIT_RD_ATOM),
-    DEFINE_PROP_INT32("dev-caps-max-ah", PVRDMADev, dev_attr.max_ah, MAX_AH),
-    DEFINE_PROP_INT32("dev-caps-max-srq", PVRDMADev, dev_attr.max_srq, MAX_SRQ),
-    DEFINE_PROP_CHR("mad-chardev", PVRDMADev, mad_chr),
-    DEFINE_PROP_END_OF_LIST(),
-};
-
-static void pvrdma_format_statistics(RdmaProvider *obj, GString *buf)
-{
-    PVRDMADev *dev = PVRDMA_DEV(obj);
-    PCIDevice *pdev = PCI_DEVICE(dev);
-
-    g_string_append_printf(buf, "%s, %x.%x\n",
-                           pdev->name, PCI_SLOT(pdev->devfn),
-                           PCI_FUNC(pdev->devfn));
-    g_string_append_printf(buf, "\tcommands         : %" PRId64 "\n",
-                           dev->stats.commands);
-    g_string_append_printf(buf, "\tregs_reads       : %" PRId64 "\n",
-                           dev->stats.regs_reads);
-    g_string_append_printf(buf, "\tregs_writes      : %" PRId64 "\n",
-                           dev->stats.regs_writes);
-    g_string_append_printf(buf, "\tuar_writes       : %" PRId64 "\n",
-                           dev->stats.uar_writes);
-    g_string_append_printf(buf, "\tinterrupts       : %" PRId64 "\n",
-                           dev->stats.interrupts);
-    rdma_format_device_counters(&dev->rdma_dev_res, buf);
-}
-
-static void free_dev_ring(PCIDevice *pci_dev, PvrdmaRing *ring,
-                          void *ring_state)
-{
-    pvrdma_ring_free(ring);
-    rdma_pci_dma_unmap(pci_dev, ring_state, TARGET_PAGE_SIZE);
-}
-
-static int init_dev_ring(PvrdmaRing *ring, PvrdmaRingState **ring_state,
-                         const char *name, PCIDevice *pci_dev,
-                         dma_addr_t dir_addr, uint32_t num_pages)
-{
-    uint64_t *dir, *tbl;
-    int max_pages, rc = 0;
-
-    if (!num_pages) {
-        rdma_error_report("Ring pages count must be strictly positive");
-        return -EINVAL;
-    }
-
-    /*
-     * Make sure we can satisfy the requested number of pages in a single
-     * TARGET_PAGE_SIZE sized page table (taking into account that first entry
-     * is reserved for ring-state)
-     */
-    max_pages = TARGET_PAGE_SIZE / sizeof(dma_addr_t) - 1;
-    if (num_pages > max_pages) {
-        rdma_error_report("Maximum pages on a single directory must not exceed %d\n",
-                          max_pages);
-        return -EINVAL;
-    }
-
-    dir = rdma_pci_dma_map(pci_dev, dir_addr, TARGET_PAGE_SIZE);
-    if (!dir) {
-        rdma_error_report("Failed to map to page directory (ring %s)", name);
-        rc = -ENOMEM;
-        goto out;
-    }
-
-    /* We support only one page table for a ring */
-    tbl = rdma_pci_dma_map(pci_dev, dir[0], TARGET_PAGE_SIZE);
-    if (!tbl) {
-        rdma_error_report("Failed to map to page table (ring %s)", name);
-        rc = -ENOMEM;
-        goto out_free_dir;
-    }
-
-    *ring_state = rdma_pci_dma_map(pci_dev, tbl[0], TARGET_PAGE_SIZE);
-    if (!*ring_state) {
-        rdma_error_report("Failed to map to ring state (ring %s)", name);
-        rc = -ENOMEM;
-        goto out_free_tbl;
-    }
-    /* RX ring is the second */
-    (*ring_state)++;
-    rc = pvrdma_ring_init(ring, name, pci_dev,
-                          (PvrdmaRingState *)*ring_state,
-                          (num_pages - 1) * TARGET_PAGE_SIZE /
-                          sizeof(struct pvrdma_cqne),
-                          sizeof(struct pvrdma_cqne),
-                          (dma_addr_t *)&tbl[1], (dma_addr_t)num_pages - 1);
-    if (rc) {
-        rc = -ENOMEM;
-        goto out_free_ring_state;
-    }
-
-    goto out_free_tbl;
-
-out_free_ring_state:
-    rdma_pci_dma_unmap(pci_dev, *ring_state, TARGET_PAGE_SIZE);
-
-out_free_tbl:
-    rdma_pci_dma_unmap(pci_dev, tbl, TARGET_PAGE_SIZE);
-
-out_free_dir:
-    rdma_pci_dma_unmap(pci_dev, dir, TARGET_PAGE_SIZE);
-
-out:
-    return rc;
-}
-
-static void free_dsr(PVRDMADev *dev)
-{
-    PCIDevice *pci_dev = PCI_DEVICE(dev);
-
-    if (!dev->dsr_info.dsr) {
-        return;
-    }
-
-    free_dev_ring(pci_dev, &dev->dsr_info.async,
-                  dev->dsr_info.async_ring_state);
-
-    free_dev_ring(pci_dev, &dev->dsr_info.cq, dev->dsr_info.cq_ring_state);
-
-    rdma_pci_dma_unmap(pci_dev, dev->dsr_info.req,
-                       sizeof(union pvrdma_cmd_req));
-
-    rdma_pci_dma_unmap(pci_dev, dev->dsr_info.rsp,
-                       sizeof(union pvrdma_cmd_resp));
-
-    rdma_pci_dma_unmap(pci_dev, dev->dsr_info.dsr,
-                       sizeof(struct pvrdma_device_shared_region));
-
-    dev->dsr_info.dsr = NULL;
-}
-
-static int load_dsr(PVRDMADev *dev)
-{
-    int rc = 0;
-    PCIDevice *pci_dev = PCI_DEVICE(dev);
-    DSRInfo *dsr_info;
-    struct pvrdma_device_shared_region *dsr;
-
-    free_dsr(dev);
-
-    /* Map to DSR */
-    dev->dsr_info.dsr = rdma_pci_dma_map(pci_dev, dev->dsr_info.dma,
-                              sizeof(struct pvrdma_device_shared_region));
-    if (!dev->dsr_info.dsr) {
-        rdma_error_report("Failed to map to DSR");
-        rc = -ENOMEM;
-        goto out;
-    }
-
-    /* Shortcuts */
-    dsr_info = &dev->dsr_info;
-    dsr = dsr_info->dsr;
-
-    /* Map to command slot */
-    dsr_info->req = rdma_pci_dma_map(pci_dev, dsr->cmd_slot_dma,
-                                     sizeof(union pvrdma_cmd_req));
-    if (!dsr_info->req) {
-        rdma_error_report("Failed to map to command slot address");
-        rc = -ENOMEM;
-        goto out_free_dsr;
-    }
-
-    /* Map to response slot */
-    dsr_info->rsp = rdma_pci_dma_map(pci_dev, dsr->resp_slot_dma,
-                                     sizeof(union pvrdma_cmd_resp));
-    if (!dsr_info->rsp) {
-        rdma_error_report("Failed to map to response slot address");
-        rc = -ENOMEM;
-        goto out_free_req;
-    }
-
-    /* Map to CQ notification ring */
-    rc = init_dev_ring(&dsr_info->cq, &dsr_info->cq_ring_state, "dev_cq",
-                       pci_dev, dsr->cq_ring_pages.pdir_dma,
-                       dsr->cq_ring_pages.num_pages);
-    if (rc) {
-        rc = -ENOMEM;
-        goto out_free_rsp;
-    }
-
-    /* Map to event notification ring */
-    rc = init_dev_ring(&dsr_info->async, &dsr_info->async_ring_state,
-                       "dev_async", pci_dev, dsr->async_ring_pages.pdir_dma,
-                       dsr->async_ring_pages.num_pages);
-    if (rc) {
-        rc = -ENOMEM;
-        goto out_free_rsp;
-    }
-
-    goto out;
-
-out_free_rsp:
-    rdma_pci_dma_unmap(pci_dev, dsr_info->rsp, sizeof(union pvrdma_cmd_resp));
-
-out_free_req:
-    rdma_pci_dma_unmap(pci_dev, dsr_info->req, sizeof(union pvrdma_cmd_req));
-
-out_free_dsr:
-    rdma_pci_dma_unmap(pci_dev, dsr_info->dsr,
-                       sizeof(struct pvrdma_device_shared_region));
-    dsr_info->dsr = NULL;
-
-out:
-    return rc;
-}
-
-static void init_dsr_dev_caps(PVRDMADev *dev)
-{
-    struct pvrdma_device_shared_region *dsr;
-
-    if (!dev->dsr_info.dsr) {
-        /* Buggy or malicious guest driver */
-        rdma_error_report("Can't initialized DSR");
-        return;
-    }
-
-    dsr = dev->dsr_info.dsr;
-    dsr->caps.fw_ver = PVRDMA_FW_VERSION;
-    dsr->caps.mode = PVRDMA_DEVICE_MODE_ROCE;
-    dsr->caps.gid_types |= PVRDMA_GID_TYPE_FLAG_ROCE_V1;
-    dsr->caps.max_uar = RDMA_BAR2_UAR_SIZE;
-    dsr->caps.max_mr_size = dev->dev_attr.max_mr_size;
-    dsr->caps.max_qp = dev->dev_attr.max_qp;
-    dsr->caps.max_qp_wr = dev->dev_attr.max_qp_wr;
-    dsr->caps.max_sge = dev->dev_attr.max_sge;
-    dsr->caps.max_cq = dev->dev_attr.max_cq;
-    dsr->caps.max_cqe = dev->dev_attr.max_cqe;
-    dsr->caps.max_mr = dev->dev_attr.max_mr;
-    dsr->caps.max_pd = dev->dev_attr.max_pd;
-    dsr->caps.max_ah = dev->dev_attr.max_ah;
-    dsr->caps.max_srq = dev->dev_attr.max_srq;
-    dsr->caps.max_srq_wr = dev->dev_attr.max_srq_wr;
-    dsr->caps.max_srq_sge = dev->dev_attr.max_srq_sge;
-    dsr->caps.gid_tbl_len = MAX_GIDS;
-    dsr->caps.sys_image_guid = 0;
-    dsr->caps.node_guid = dev->node_guid;
-    dsr->caps.phys_port_cnt = MAX_PORTS;
-    dsr->caps.max_pkeys = MAX_PKEYS;
-}
-
-static void uninit_msix(PCIDevice *pdev, int used_vectors)
-{
-    PVRDMADev *dev = PVRDMA_DEV(pdev);
-    int i;
-
-    for (i = 0; i < used_vectors; i++) {
-        msix_vector_unuse(pdev, i);
-    }
-
-    msix_uninit(pdev, &dev->msix, &dev->msix);
-}
-
-static int init_msix(PCIDevice *pdev)
-{
-    PVRDMADev *dev = PVRDMA_DEV(pdev);
-    int i;
-    int rc;
-
-    rc = msix_init(pdev, RDMA_MAX_INTRS, &dev->msix, RDMA_MSIX_BAR_IDX,
-                   RDMA_MSIX_TABLE, &dev->msix, RDMA_MSIX_BAR_IDX,
-                   RDMA_MSIX_PBA, 0, NULL);
-
-    if (rc < 0) {
-        rdma_error_report("Failed to initialize MSI-X");
-        return rc;
-    }
-
-    for (i = 0; i < RDMA_MAX_INTRS; i++) {
-        msix_vector_use(PCI_DEVICE(dev), i);
-    }
-
-    return 0;
-}
-
-static void pvrdma_fini(PCIDevice *pdev)
-{
-    PVRDMADev *dev = PVRDMA_DEV(pdev);
-
-    notifier_remove(&dev->shutdown_notifier);
-
-    pvrdma_qp_ops_fini();
-
-    rdma_backend_stop(&dev->backend_dev);
-
-    rdma_rm_fini(&dev->rdma_dev_res, &dev->backend_dev,
-                 dev->backend_eth_device_name);
-
-    rdma_backend_fini(&dev->backend_dev);
-
-    free_dsr(dev);
-
-    if (msix_enabled(pdev)) {
-        uninit_msix(pdev, RDMA_MAX_INTRS);
-    }
-
-    rdma_info_report("Device %s %x.%x is down", pdev->name,
-                     PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
-}
-
-static void pvrdma_stop(PVRDMADev *dev)
-{
-    rdma_backend_stop(&dev->backend_dev);
-}
-
-static void pvrdma_start(PVRDMADev *dev)
-{
-    rdma_backend_start(&dev->backend_dev);
-}
-
-static void activate_device(PVRDMADev *dev)
-{
-    pvrdma_start(dev);
-    set_reg_val(dev, PVRDMA_REG_ERR, 0);
-}
-
-static int unquiesce_device(PVRDMADev *dev)
-{
-    return 0;
-}
-
-static void reset_device(PVRDMADev *dev)
-{
-    pvrdma_stop(dev);
-}
-
-static uint64_t pvrdma_regs_read(void *opaque, hwaddr addr, unsigned size)
-{
-    PVRDMADev *dev = opaque;
-    uint32_t val;
-
-    dev->stats.regs_reads++;
-
-    if (get_reg_val(dev, addr, &val)) {
-        rdma_error_report("Failed to read REG value from address 0x%x",
-                          (uint32_t)addr);
-        return -EINVAL;
-    }
-
-    trace_pvrdma_regs_read(addr, val);
-
-    return val;
-}
-
-static void pvrdma_regs_write(void *opaque, hwaddr addr, uint64_t val,
-                              unsigned size)
-{
-    PVRDMADev *dev = opaque;
-
-    dev->stats.regs_writes++;
-
-    if (set_reg_val(dev, addr, val)) {
-        rdma_error_report("Failed to set REG value, addr=0x%"PRIx64 ", val=0x%"PRIx64,
-                          addr, val);
-        return;
-    }
-
-    switch (addr) {
-    case PVRDMA_REG_DSRLOW:
-        trace_pvrdma_regs_write(addr, val, "DSRLOW", "");
-        dev->dsr_info.dma = val;
-        break;
-    case PVRDMA_REG_DSRHIGH:
-        trace_pvrdma_regs_write(addr, val, "DSRHIGH", "");
-        dev->dsr_info.dma |= val << 32;
-        load_dsr(dev);
-        init_dsr_dev_caps(dev);
-        break;
-    case PVRDMA_REG_CTL:
-        switch (val) {
-        case PVRDMA_DEVICE_CTL_ACTIVATE:
-            trace_pvrdma_regs_write(addr, val, "CTL", "ACTIVATE");
-            activate_device(dev);
-            break;
-        case PVRDMA_DEVICE_CTL_UNQUIESCE:
-            trace_pvrdma_regs_write(addr, val, "CTL", "UNQUIESCE");
-            unquiesce_device(dev);
-            break;
-        case PVRDMA_DEVICE_CTL_RESET:
-            trace_pvrdma_regs_write(addr, val, "CTL", "URESET");
-            reset_device(dev);
-            break;
-        }
-        break;
-    case PVRDMA_REG_IMR:
-        trace_pvrdma_regs_write(addr, val, "INTR_MASK", "");
-        dev->interrupt_mask = val;
-        break;
-    case PVRDMA_REG_REQUEST:
-        if (val == 0) {
-            trace_pvrdma_regs_write(addr, val, "REQUEST", "");
-            pvrdma_exec_cmd(dev);
-        }
-        break;
-    default:
-        break;
-    }
-}
-
-static const MemoryRegionOps regs_ops = {
-    .read = pvrdma_regs_read,
-    .write = pvrdma_regs_write,
-    .endianness = DEVICE_LITTLE_ENDIAN,
-    .impl = {
-        .min_access_size = sizeof(uint32_t),
-        .max_access_size = sizeof(uint32_t),
-    },
-};
-
-static uint64_t pvrdma_uar_read(void *opaque, hwaddr addr, unsigned size)
-{
-    return 0xffffffff;
-}
-
-static void pvrdma_uar_write(void *opaque, hwaddr addr, uint64_t val,
-                             unsigned size)
-{
-    PVRDMADev *dev = opaque;
-
-    dev->stats.uar_writes++;
-
-    switch (addr & 0xFFF) { /* Mask with 0xFFF as each UC gets page */
-    case PVRDMA_UAR_QP_OFFSET:
-        if (val & PVRDMA_UAR_QP_SEND) {
-            trace_pvrdma_uar_write(addr, val, "QP", "SEND",
-                                   val & PVRDMA_UAR_HANDLE_MASK, 0);
-            pvrdma_qp_send(dev, val & PVRDMA_UAR_HANDLE_MASK);
-        }
-        if (val & PVRDMA_UAR_QP_RECV) {
-            trace_pvrdma_uar_write(addr, val, "QP", "RECV",
-                                   val & PVRDMA_UAR_HANDLE_MASK, 0);
-            pvrdma_qp_recv(dev, val & PVRDMA_UAR_HANDLE_MASK);
-        }
-        break;
-    case PVRDMA_UAR_CQ_OFFSET:
-        if (val & PVRDMA_UAR_CQ_ARM) {
-            trace_pvrdma_uar_write(addr, val, "CQ", "ARM",
-                                   val & PVRDMA_UAR_HANDLE_MASK,
-                                   !!(val & PVRDMA_UAR_CQ_ARM_SOL));
-            rdma_rm_req_notify_cq(&dev->rdma_dev_res,
-                                  val & PVRDMA_UAR_HANDLE_MASK,
-                                  !!(val & PVRDMA_UAR_CQ_ARM_SOL));
-        }
-        if (val & PVRDMA_UAR_CQ_ARM_SOL) {
-            trace_pvrdma_uar_write(addr, val, "CQ", "ARMSOL - not supported", 0,
-                                   0);
-        }
-        if (val & PVRDMA_UAR_CQ_POLL) {
-            trace_pvrdma_uar_write(addr, val, "CQ", "POLL",
-                                   val & PVRDMA_UAR_HANDLE_MASK, 0);
-            pvrdma_cq_poll(&dev->rdma_dev_res, val & PVRDMA_UAR_HANDLE_MASK);
-        }
-        break;
-    case PVRDMA_UAR_SRQ_OFFSET:
-        if (val & PVRDMA_UAR_SRQ_RECV) {
-            trace_pvrdma_uar_write(addr, val, "QP", "SRQ",
-                                   val & PVRDMA_UAR_HANDLE_MASK, 0);
-            pvrdma_srq_recv(dev, val & PVRDMA_UAR_HANDLE_MASK);
-        }
-        break;
-    default:
-        rdma_error_report("Unsupported command, addr=0x%"PRIx64", val=0x%"PRIx64,
-                          addr, val);
-        break;
-    }
-}
-
-static const MemoryRegionOps uar_ops = {
-    .read = pvrdma_uar_read,
-    .write = pvrdma_uar_write,
-    .endianness = DEVICE_LITTLE_ENDIAN,
-    .impl = {
-        .min_access_size = sizeof(uint32_t),
-        .max_access_size = sizeof(uint32_t),
-    },
-};
-
-static void init_pci_config(PCIDevice *pdev)
-{
-    pdev->config[PCI_INTERRUPT_PIN] = 1;
-}
-
-static void init_bars(PCIDevice *pdev)
-{
-    PVRDMADev *dev = PVRDMA_DEV(pdev);
-
-    /* BAR 0 - MSI-X */
-    memory_region_init(&dev->msix, OBJECT(dev), "pvrdma-msix",
-                       RDMA_BAR0_MSIX_SIZE);
-    pci_register_bar(pdev, RDMA_MSIX_BAR_IDX, PCI_BASE_ADDRESS_SPACE_MEMORY,
-                     &dev->msix);
-
-    /* BAR 1 - Registers */
-    memset(&dev->regs_data, 0, sizeof(dev->regs_data));
-    memory_region_init_io(&dev->regs, OBJECT(dev), &regs_ops, dev,
-                          "pvrdma-regs", sizeof(dev->regs_data));
-    pci_register_bar(pdev, RDMA_REG_BAR_IDX, PCI_BASE_ADDRESS_SPACE_MEMORY,
-                     &dev->regs);
-
-    /* BAR 2 - UAR */
-    memset(&dev->uar_data, 0, sizeof(dev->uar_data));
-    memory_region_init_io(&dev->uar, OBJECT(dev), &uar_ops, dev, "rdma-uar",
-                          sizeof(dev->uar_data));
-    pci_register_bar(pdev, RDMA_UAR_BAR_IDX, PCI_BASE_ADDRESS_SPACE_MEMORY,
-                     &dev->uar);
-}
-
-static void init_regs(PCIDevice *pdev)
-{
-    PVRDMADev *dev = PVRDMA_DEV(pdev);
-
-    set_reg_val(dev, PVRDMA_REG_VERSION, PVRDMA_HW_VERSION);
-    set_reg_val(dev, PVRDMA_REG_ERR, 0xFFFF);
-}
-
-static void init_dev_caps(PVRDMADev *dev)
-{
-    size_t pg_tbl_bytes = TARGET_PAGE_SIZE *
-                          (TARGET_PAGE_SIZE / sizeof(uint64_t));
-    size_t wr_sz = MAX(sizeof(struct pvrdma_sq_wqe_hdr),
-                       sizeof(struct pvrdma_rq_wqe_hdr));
-
-    dev->dev_attr.max_qp_wr = pg_tbl_bytes /
-                              (wr_sz + sizeof(struct pvrdma_sge) *
-                              dev->dev_attr.max_sge) - TARGET_PAGE_SIZE;
-                              /* First page is ring state  ^^^^ */
-
-    dev->dev_attr.max_cqe = pg_tbl_bytes / sizeof(struct pvrdma_cqe) -
-                            TARGET_PAGE_SIZE; /* First page is ring state */
-
-    dev->dev_attr.max_srq_wr = pg_tbl_bytes /
-                                ((sizeof(struct pvrdma_rq_wqe_hdr) +
-                                sizeof(struct pvrdma_sge)) *
-                                dev->dev_attr.max_sge) - TARGET_PAGE_SIZE;
-}
-
-static int pvrdma_check_ram_shared(Object *obj, void *opaque)
-{
-    bool *shared = opaque;
-
-    if (object_dynamic_cast(obj, "memory-backend-ram")) {
-        *shared = object_property_get_bool(obj, "share", NULL);
-    }
-
-    return 0;
-}
-
-static void pvrdma_shutdown_notifier(Notifier *n, void *opaque)
-{
-    PVRDMADev *dev = container_of(n, PVRDMADev, shutdown_notifier);
-    PCIDevice *pci_dev = PCI_DEVICE(dev);
-
-    pvrdma_fini(pci_dev);
-}
-
-static void pvrdma_realize(PCIDevice *pdev, Error **errp)
-{
-    int rc = 0;
-    PVRDMADev *dev = PVRDMA_DEV(pdev);
-    Object *memdev_root;
-    bool ram_shared = false;
-    PCIDevice *func0;
-
-    warn_report_once("pvrdma is deprecated and will be removed in a future release");
-
-    rdma_info_report("Initializing device %s %x.%x", pdev->name,
-                     PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
-
-    if (TARGET_PAGE_SIZE != qemu_real_host_page_size()) {
-        error_setg(errp, "Target page size must be the same as host page size");
-        return;
-    }
-
-    func0 = pci_get_function_0(pdev);
-    /* Break if not vmxnet3 device in slot 0 */
-    if (strcmp(object_get_typename(OBJECT(func0)), TYPE_VMXNET3)) {
-        error_setg(errp, "Device on %x.0 must be %s", PCI_SLOT(pdev->devfn),
-                   TYPE_VMXNET3);
-        return;
-    }
-    dev->func0 = VMXNET3(func0);
-
-    addrconf_addr_eui48((unsigned char *)&dev->node_guid,
-                        (const char *)&dev->func0->conf.macaddr.a);
-
-    memdev_root = object_resolve_path("/objects", NULL);
-    if (memdev_root) {
-        object_child_foreach(memdev_root, pvrdma_check_ram_shared, &ram_shared);
-    }
-    if (!ram_shared) {
-        error_setg(errp, "Only shared memory backed ram is supported");
-        return;
-    }
-
-    dev->dsr_info.dsr = NULL;
-
-    init_pci_config(pdev);
-
-    init_bars(pdev);
-
-    init_regs(pdev);
-
-    rc = init_msix(pdev);
-    if (rc) {
-        goto out;
-    }
-
-    rc = rdma_backend_init(&dev->backend_dev, pdev, &dev->rdma_dev_res,
-                           dev->backend_device_name, dev->backend_port_num,
-                           &dev->dev_attr, &dev->mad_chr);
-    if (rc) {
-        goto out;
-    }
-
-    init_dev_caps(dev);
-
-    rc = rdma_rm_init(&dev->rdma_dev_res, &dev->dev_attr);
-    if (rc) {
-        goto out;
-    }
-
-    rc = pvrdma_qp_ops_init();
-    if (rc) {
-        goto out;
-    }
-
-    memset(&dev->stats, 0, sizeof(dev->stats));
-
-    dev->shutdown_notifier.notify = pvrdma_shutdown_notifier;
-    qemu_register_shutdown_notifier(&dev->shutdown_notifier);
-
-#ifdef LEGACY_RDMA_REG_MR
-    rdma_info_report("Using legacy reg_mr");
-#else
-    rdma_info_report("Using iova reg_mr");
-#endif
-
-out:
-    if (rc) {
-        pvrdma_fini(pdev);
-        error_append_hint(errp, "Device failed to load\n");
-    }
-}
-
-static void pvrdma_class_init(ObjectClass *klass, void *data)
-{
-    DeviceClass *dc = DEVICE_CLASS(klass);
-    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
-    RdmaProviderClass *ir = RDMA_PROVIDER_CLASS(klass);
-
-    k->realize = pvrdma_realize;
-    k->vendor_id = PCI_VENDOR_ID_VMWARE;
-    k->device_id = PCI_DEVICE_ID_VMWARE_PVRDMA;
-    k->revision = 0x00;
-    k->class_id = PCI_CLASS_NETWORK_OTHER;
-
-    dc->desc = "RDMA Device";
-    device_class_set_props(dc, pvrdma_dev_properties);
-    set_bit(DEVICE_CATEGORY_NETWORK, dc->categories);
-
-    ir->format_statistics = pvrdma_format_statistics;
-}
-
-static const TypeInfo pvrdma_info = {
-    .name = PVRDMA_HW_NAME,
-    .parent = TYPE_PCI_DEVICE,
-    .instance_size = sizeof(PVRDMADev),
-    .class_init = pvrdma_class_init,
-    .interfaces = (InterfaceInfo[]) {
-        { INTERFACE_CONVENTIONAL_PCI_DEVICE },
-        { INTERFACE_RDMA_PROVIDER },
-        { }
-    }
-};
-
-static void register_types(void)
-{
-    type_register_static(&pvrdma_info);
-}
-
-type_init(register_types)
diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
deleted file mode 100644
index c30c8344f6..0000000000
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ /dev/null
@@ -1,298 +0,0 @@
-/*
- * QEMU paravirtual RDMA - QP implementation
- *
- * Copyright (C) 2018 Oracle
- * Copyright (C) 2018 Red Hat Inc
- *
- * Authors:
- *     Yuval Shaia <yuval.shaia@oracle.com>
- *     Marcel Apfelbaum <marcel@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#include "qemu/osdep.h"
-
-#include "../rdma_utils.h"
-#include "../rdma_rm.h"
-#include "../rdma_backend.h"
-
-#include "trace.h"
-
-#include "pvrdma.h"
-#include "standard-headers/rdma/vmw_pvrdma-abi.h"
-#include "pvrdma_qp_ops.h"
-
-typedef struct CompHandlerCtx {
-    PVRDMADev *dev;
-    uint32_t cq_handle;
-    struct pvrdma_cqe cqe;
-} CompHandlerCtx;
-
-/* Send Queue WQE */
-typedef struct PvrdmaSqWqe {
-    struct pvrdma_sq_wqe_hdr hdr;
-    struct pvrdma_sge sge[];
-} PvrdmaSqWqe;
-
-/* Recv Queue WQE */
-typedef struct PvrdmaRqWqe {
-    struct pvrdma_rq_wqe_hdr hdr;
-    struct pvrdma_sge sge[];
-} PvrdmaRqWqe;
-
-/*
- * 1. Put CQE on send CQ ring
- * 2. Put CQ number on dsr completion ring
- * 3. Interrupt host
- */
-static int pvrdma_post_cqe(PVRDMADev *dev, uint32_t cq_handle,
-                           struct pvrdma_cqe *cqe, struct ibv_wc *wc)
-{
-    struct pvrdma_cqe *cqe1;
-    struct pvrdma_cqne *cqne;
-    PvrdmaRing *ring;
-    RdmaRmCQ *cq = rdma_rm_get_cq(&dev->rdma_dev_res, cq_handle);
-
-    if (unlikely(!cq)) {
-        return -EINVAL;
-    }
-
-    ring = (PvrdmaRing *)cq->opaque;
-
-    /* Step #1: Put CQE on CQ ring */
-    cqe1 = pvrdma_ring_next_elem_write(ring);
-    if (unlikely(!cqe1)) {
-        return -EINVAL;
-    }
-
-    memset(cqe1, 0, sizeof(*cqe1));
-    cqe1->wr_id = cqe->wr_id;
-    cqe1->qp = cqe->qp ? cqe->qp : wc->qp_num;
-    cqe1->opcode = cqe->opcode;
-    cqe1->status = wc->status;
-    cqe1->byte_len = wc->byte_len;
-    cqe1->src_qp = wc->src_qp;
-    cqe1->wc_flags = wc->wc_flags;
-    cqe1->vendor_err = wc->vendor_err;
-
-    trace_pvrdma_post_cqe(cq_handle, cq->notify, cqe1->wr_id, cqe1->qp,
-                          cqe1->opcode, cqe1->status, cqe1->byte_len,
-                          cqe1->src_qp, cqe1->wc_flags, cqe1->vendor_err);
-
-    pvrdma_ring_write_inc(ring);
-
-    /* Step #2: Put CQ number on dsr completion ring */
-    cqne = pvrdma_ring_next_elem_write(&dev->dsr_info.cq);
-    if (unlikely(!cqne)) {
-        return -EINVAL;
-    }
-
-    cqne->info = cq_handle;
-    pvrdma_ring_write_inc(&dev->dsr_info.cq);
-
-    if (cq->notify != CNT_CLEAR) {
-        if (cq->notify == CNT_ARM) {
-            cq->notify = CNT_CLEAR;
-        }
-        post_interrupt(dev, INTR_VEC_CMD_COMPLETION_Q);
-    }
-
-    return 0;
-}
-
-static void pvrdma_qp_ops_comp_handler(void *ctx, struct ibv_wc *wc)
-{
-    CompHandlerCtx *comp_ctx = (CompHandlerCtx *)ctx;
-
-    pvrdma_post_cqe(comp_ctx->dev, comp_ctx->cq_handle, &comp_ctx->cqe, wc);
-
-    g_free(ctx);
-}
-
-static void complete_with_error(uint32_t vendor_err, void *ctx)
-{
-    struct ibv_wc wc = {};
-
-    wc.status = IBV_WC_GENERAL_ERR;
-    wc.vendor_err = vendor_err;
-
-    pvrdma_qp_ops_comp_handler(ctx, &wc);
-}
-
-void pvrdma_qp_ops_fini(void)
-{
-    rdma_backend_unregister_comp_handler();
-}
-
-int pvrdma_qp_ops_init(void)
-{
-    rdma_backend_register_comp_handler(pvrdma_qp_ops_comp_handler);
-
-    return 0;
-}
-
-void pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
-{
-    RdmaRmQP *qp;
-    PvrdmaSqWqe *wqe;
-    PvrdmaRing *ring;
-    int sgid_idx;
-    union ibv_gid *sgid;
-
-    qp = rdma_rm_get_qp(&dev->rdma_dev_res, qp_handle);
-    if (unlikely(!qp)) {
-        return;
-    }
-
-    ring = (PvrdmaRing *)qp->opaque;
-
-    wqe = pvrdma_ring_next_elem_read(ring);
-    while (wqe) {
-        CompHandlerCtx *comp_ctx;
-
-        /* Prepare CQE */
-        comp_ctx = g_new(CompHandlerCtx, 1);
-        comp_ctx->dev = dev;
-        comp_ctx->cq_handle = qp->send_cq_handle;
-        comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
-        comp_ctx->cqe.qp = qp_handle;
-        comp_ctx->cqe.opcode = IBV_WC_SEND;
-
-        sgid = rdma_rm_get_gid(&dev->rdma_dev_res, wqe->hdr.wr.ud.av.gid_index);
-        if (!sgid) {
-            rdma_error_report("Failed to get gid for idx %d",
-                              wqe->hdr.wr.ud.av.gid_index);
-            complete_with_error(VENDOR_ERR_INV_GID_IDX, comp_ctx);
-            continue;
-        }
-
-        sgid_idx = rdma_rm_get_backend_gid_index(&dev->rdma_dev_res,
-                                                 &dev->backend_dev,
-                                                 wqe->hdr.wr.ud.av.gid_index);
-        if (sgid_idx <= 0) {
-            rdma_error_report("Failed to get bk sgid_idx for sgid_idx %d",
-                              wqe->hdr.wr.ud.av.gid_index);
-            complete_with_error(VENDOR_ERR_INV_GID_IDX, comp_ctx);
-            continue;
-        }
-
-        if (wqe->hdr.num_sge > dev->dev_attr.max_sge) {
-            rdma_error_report("Invalid num_sge=%d (max %d)", wqe->hdr.num_sge,
-                              dev->dev_attr.max_sge);
-            complete_with_error(VENDOR_ERR_INV_NUM_SGE, comp_ctx);
-            continue;
-        }
-
-        rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
-                               (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,
-                               sgid_idx, sgid,
-                               (union ibv_gid *)wqe->hdr.wr.ud.av.dgid,
-                               wqe->hdr.wr.ud.remote_qpn,
-                               wqe->hdr.wr.ud.remote_qkey, comp_ctx);
-
-        pvrdma_ring_read_inc(ring);
-
-        wqe = pvrdma_ring_next_elem_read(ring);
-    }
-}
-
-void pvrdma_qp_recv(PVRDMADev *dev, uint32_t qp_handle)
-{
-    RdmaRmQP *qp;
-    PvrdmaRqWqe *wqe;
-    PvrdmaRing *ring;
-
-    qp = rdma_rm_get_qp(&dev->rdma_dev_res, qp_handle);
-    if (unlikely(!qp)) {
-        return;
-    }
-
-    ring = &((PvrdmaRing *)qp->opaque)[1];
-
-    wqe = pvrdma_ring_next_elem_read(ring);
-    while (wqe) {
-        CompHandlerCtx *comp_ctx;
-
-        /* Prepare CQE */
-        comp_ctx = g_new(CompHandlerCtx, 1);
-        comp_ctx->dev = dev;
-        comp_ctx->cq_handle = qp->recv_cq_handle;
-        comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
-        comp_ctx->cqe.qp = qp_handle;
-        comp_ctx->cqe.opcode = IBV_WC_RECV;
-
-        if (wqe->hdr.num_sge > dev->dev_attr.max_sge) {
-            rdma_error_report("Invalid num_sge=%d (max %d)", wqe->hdr.num_sge,
-                              dev->dev_attr.max_sge);
-            complete_with_error(VENDOR_ERR_INV_NUM_SGE, comp_ctx);
-            continue;
-        }
-
-        rdma_backend_post_recv(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
-                               (struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,
-                               comp_ctx);
-
-        pvrdma_ring_read_inc(ring);
-
-        wqe = pvrdma_ring_next_elem_read(ring);
-    }
-}
-
-void pvrdma_srq_recv(PVRDMADev *dev, uint32_t srq_handle)
-{
-    RdmaRmSRQ *srq;
-    PvrdmaRqWqe *wqe;
-    PvrdmaRing *ring;
-
-    srq = rdma_rm_get_srq(&dev->rdma_dev_res, srq_handle);
-    if (unlikely(!srq)) {
-        return;
-    }
-
-    ring = (PvrdmaRing *)srq->opaque;
-
-    wqe = pvrdma_ring_next_elem_read(ring);
-    while (wqe) {
-        CompHandlerCtx *comp_ctx;
-
-        /* Prepare CQE */
-        comp_ctx = g_new(CompHandlerCtx, 1);
-        comp_ctx->dev = dev;
-        comp_ctx->cq_handle = srq->recv_cq_handle;
-        comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
-        comp_ctx->cqe.qp = 0;
-        comp_ctx->cqe.opcode = IBV_WC_RECV;
-
-        if (wqe->hdr.num_sge > dev->dev_attr.max_sge) {
-            rdma_error_report("Invalid num_sge=%d (max %d)", wqe->hdr.num_sge,
-                              dev->dev_attr.max_sge);
-            complete_with_error(VENDOR_ERR_INV_NUM_SGE, comp_ctx);
-            continue;
-        }
-
-        rdma_backend_post_srq_recv(&dev->backend_dev, &srq->backend_srq,
-                                   (struct ibv_sge *)&wqe->sge[0],
-                                   wqe->hdr.num_sge,
-                                   comp_ctx);
-
-        pvrdma_ring_read_inc(ring);
-
-        wqe = pvrdma_ring_next_elem_read(ring);
-    }
-
-}
-
-void pvrdma_cq_poll(RdmaDeviceResources *dev_res, uint32_t cq_handle)
-{
-    RdmaRmCQ *cq;
-
-    cq = rdma_rm_get_cq(dev_res, cq_handle);
-    if (!cq) {
-        return;
-    }
-
-    rdma_backend_poll_cq(dev_res, &cq->backend_cq);
-}
diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c
index b0f948d337..f84a0dc523 100644
--- a/monitor/qmp-cmds.c
+++ b/monitor/qmp-cmds.c
@@ -31,7 +31,6 @@
 #include "qapi/type-helpers.h"
 #include "hw/mem/memory-device.h"
 #include "hw/intc/intc.h"
-#include "hw/rdma/rdma.h"
 
 NameInfo *qmp_query_name(Error **errp)
 {
diff --git a/Kconfig.host b/Kconfig.host
index f496475f8e..f6a2a131e6 100644
--- a/Kconfig.host
+++ b/Kconfig.host
@@ -35,9 +35,6 @@ config VHOST_KERNEL
 config VIRTFS
     bool
 
-config PVRDMA
-    bool
-
 config MULTIPROCESS_ALLOWED
     bool
     imply MULTIPROCESS
diff --git a/contrib/rdmacm-mux/meson.build b/contrib/rdmacm-mux/meson.build
deleted file mode 100644
index 36c9c89630..0000000000
--- a/contrib/rdmacm-mux/meson.build
+++ /dev/null
@@ -1,7 +0,0 @@
-if have_pvrdma
-  # FIXME: broken on big endian architectures
-  executable('rdmacm-mux', files('main.c'), genh,
-             dependencies: [glib, libumad],
-             build_by_default: false,
-             install: false)
-endif
diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index ad1b1306e3..20a9835ea8 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -182,19 +182,6 @@ SRST
     Show PIC state.
 ERST
 
-    {
-        .name       = "rdma",
-        .args_type  = "",
-        .params     = "",
-        .help       = "show RDMA state",
-        .cmd_info_hrt = qmp_x_query_rdma,
-    },
-
-SRST
-  ``info rdma``
-    Show RDMA state.
-ERST
-
     {
         .name       = "pci",
         .args_type  = "",
diff --git a/hw/Kconfig b/hw/Kconfig
index 2c00936c28..32f876deb0 100644
--- a/hw/Kconfig
+++ b/hw/Kconfig
@@ -29,7 +29,6 @@ source pci-bridge/Kconfig
 source pci-host/Kconfig
 source pcmcia/Kconfig
 source pci/Kconfig
-source rdma/Kconfig
 source remote/Kconfig
 source rtc/Kconfig
 source scsi/Kconfig
diff --git a/hw/meson.build b/hw/meson.build
index 463d702683..3049a6fab0 100644
--- a/hw/meson.build
+++ b/hw/meson.build
@@ -28,7 +28,6 @@ subdir('pci')
 subdir('pci-bridge')
 subdir('pci-host')
 subdir('pcmcia')
-subdir('rdma')
 subdir('rtc')
 subdir('scsi')
 subdir('sd')
diff --git a/hw/rdma/Kconfig b/hw/rdma/Kconfig
deleted file mode 100644
index 840320bdc0..0000000000
--- a/hw/rdma/Kconfig
+++ /dev/null
@@ -1,3 +0,0 @@
-config VMW_PVRDMA
-    default y if PCI_DEVICES
-    depends on PVRDMA && MSI_NONBROKEN && VMXNET3_PCI
diff --git a/hw/rdma/meson.build b/hw/rdma/meson.build
deleted file mode 100644
index 363c9b8c83..0000000000
--- a/hw/rdma/meson.build
+++ /dev/null
@@ -1,12 +0,0 @@
-system_ss.add(when: 'CONFIG_VMW_PVRDMA', if_true: files(
-  'rdma.c',
-  'rdma_backend.c',
-  'rdma_utils.c',
-  'vmw/pvrdma_qp_ops.c',
-))
-specific_ss.add(when: 'CONFIG_VMW_PVRDMA', if_true: files(
-  'rdma_rm.c',
-  'vmw/pvrdma_cmd.c',
-  'vmw/pvrdma_dev_ring.c',
-  'vmw/pvrdma_main.c',
-))
diff --git a/hw/rdma/trace-events b/hw/rdma/trace-events
deleted file mode 100644
index c23175120e..0000000000
--- a/hw/rdma/trace-events
+++ /dev/null
@@ -1,31 +0,0 @@
-# See docs/devel/tracing.rst for syntax documentation.
-
-# rdma_backend.c
-rdma_check_dev_attr(const char *name, int max_bk, int max_fe) "%s: be=%d, fe=%d"
-rdma_create_ah_cache_hit(uint64_t subnet, uint64_t if_id) "subnet=0x%"PRIx64",if_id=0x%"PRIx64
-rdma_create_ah_cache_miss(uint64_t subnet, uint64_t if_id) "subnet=0x%"PRIx64",if_id=0x%"PRIx64
-rdma_poll_cq(int ne, void *ibcq) "Got %d completion(s) from cq %p"
-rdmacm_mux(const char *title, int msg_type, int op_code) "%s: msg_type=%d, op_code=%d"
-rdmacm_mux_check_op_status(int msg_type, int op_code, int err_code) "resp: msg_type=%d, op_code=%d, err_code=%d"
-rdma_mad_message(const char *title, int len, char *data) "mad %s (%d): %s"
-rdma_backend_rc_qp_state_init(uint32_t qpn) "RC QP 0x%x switch to INIT"
-rdma_backend_ud_qp_state_init(uint32_t qpn, uint32_t qkey) "UD QP 0x%x switch to INIT, qkey=0x%x"
-rdma_backend_rc_qp_state_rtr(uint32_t qpn, uint64_t subnet, uint64_t ifid, uint8_t sgid_idx, uint32_t dqpn, uint32_t rq_psn) "RC QP 0x%x switch to RTR, subnet = 0x%"PRIx64", ifid = 0x%"PRIx64 ", sgid_idx=%d, dqpn=0x%x, rq_psn=0x%x"
-rdma_backend_ud_qp_state_rtr(uint32_t qpn, uint32_t qkey) "UD QP 0x%x switch to RTR, qkey=0x%x"
-rdma_backend_rc_qp_state_rts(uint32_t qpn, uint32_t sq_psn) "RC QP 0x%x switch to RTS, sq_psn=0x%x, "
-rdma_backend_ud_qp_state_rts(uint32_t qpn, uint32_t sq_psn, uint32_t qkey) "UD QP 0x%x switch to RTS, sq_psn=0x%x, qkey=0x%x"
-rdma_backend_get_gid_index(uint64_t subnet, uint64_t ifid, int gid_idx) "subnet=0x%"PRIx64", ifid=0x%"PRIx64 ", gid_idx=%d"
-rdma_backend_gid_change(const char *op, uint64_t subnet, uint64_t ifid) "%s subnet=0x%"PRIx64", ifid=0x%"PRIx64
-
-# rdma_rm.c
-rdma_res_tbl_get(char *name, uint32_t handle) "tbl %s, handle %d"
-rdma_res_tbl_alloc(char *name, uint32_t handle) "tbl %s, handle %d"
-rdma_res_tbl_dealloc(char *name, uint32_t handle) "tbl %s, handle %d"
-rdma_rm_alloc_mr(uint32_t mr_handle, void *host_virt, uint64_t guest_start, uint64_t guest_length, int access_flags) "mr_handle=%d, host_virt=%p, guest_start=0x%"PRIx64", length=%" PRId64", access_flags=0x%x"
-rdma_rm_dealloc_mr(uint32_t mr_handle, uint64_t guest_start) "mr_handle=%d, guest_start=0x%"PRIx64
-rdma_rm_alloc_qp(uint32_t rm_qpn, uint32_t backend_qpn, uint8_t qp_type) "rm_qpn=%d, backend_qpn=0x%x, qp_type=%d"
-rdma_rm_modify_qp(uint32_t qpn, uint32_t attr_mask, int qp_state, uint8_t sgid_idx) "qpn=0x%x, attr_mask=0x%x, qp_state=%d, sgid_idx=%d"
-
-# rdma_utils.c
-rdma_pci_dma_map(uint64_t addr, void *vaddr, uint64_t len) "0x%"PRIx64" -> %p (len=%" PRIu64")"
-rdma_pci_dma_unmap(void *vaddr) "%p"
diff --git a/hw/rdma/vmw/trace-events b/hw/rdma/vmw/trace-events
deleted file mode 100644
index a6c77e1e10..0000000000
--- a/hw/rdma/vmw/trace-events
+++ /dev/null
@@ -1,17 +0,0 @@
-# See docs/devel/tracing.rst for syntax documentation.
-
-# pvrdma_main.c
-pvrdma_regs_read(uint64_t addr, uint64_t val) "pvrdma.regs[0x%"PRIx64"]=0x%"PRIx64
-pvrdma_regs_write(uint64_t addr, uint64_t val, const char *reg_name, const char *val_name) "pvrdma.regs[0x%"PRIx64"]=0x%"PRIx64" (%s %s)"
-pvrdma_uar_write(uint64_t addr, uint64_t val, const char *reg_name, const char *val_name, int val1, int val2) "uar[0x%"PRIx64"]=0x%"PRIx64" (cls=%s, op=%s, obj=%d, val=%d)"
-
-# pvrdma_cmd.c
-pvrdma_map_to_pdir_host_virt(void *vfirst, void *vremaped) "mremap %p -> %p"
-pvrdma_map_to_pdir_next_page(int page_idx, void *vnext, void *vremaped) "mremap [%d] %p -> %p"
-pvrdma_exec_cmd(int cmd, int err) "cmd=%d, err=%d"
-
-# pvrdma_dev_ring.c
-pvrdma_ring_next_elem_read_no_data(char *ring_name) "pvrdma_ring %s is empty"
-
-# pvrdma_qp_ops.c
-pvrdma_post_cqe(uint32_t cq_handle, int notify, uint64_t wr_id, uint64_t qpn, uint32_t op_code, uint32_t status, uint32_t byte_len, uint32_t src_qp, uint32_t wc_flags, uint32_t vendor_err) "cq_handle=%d, notify=%d, wr_id=0x%"PRIx64", qpn=0x%"PRIx64", opcode=%d, status=%d, byte_len=%d, src_qp=%d, wc_flags=%d, vendor_err=%d"
diff --git a/meson_options.txt b/meson_options.txt
index 0a99a059ec..b5c0bad9e7 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -198,8 +198,6 @@ option('opengl', type : 'feature', value : 'auto',
        description: 'OpenGL support')
 option('rdma', type : 'feature', value : 'auto',
        description: 'Enable RDMA-based migration')
-option('pvrdma', type : 'feature', value : 'auto',
-       description: 'Enable PVRDMA support')
 option('gtk', type : 'feature', value : 'auto',
        description: 'GTK+ user interface')
 option('sdl', type : 'feature', value : 'auto',
diff --git a/qapi/meson.build b/qapi/meson.build
index 375d564277..c92af6e063 100644
--- a/qapi/meson.build
+++ b/qapi/meson.build
@@ -62,7 +62,6 @@ if have_system
     'cryptodev',
     'qdev',
     'pci',
-    'rdma',
     'rocker',
     'tpm',
   ]
diff --git a/qemu-options.hx b/qemu-options.hx
index 7fd1713fa8..f7ef9b4e41 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -5113,9 +5113,6 @@ SRST
         allows a co-operating external process to access the QEMU memory
         region.
 
-        The ``share`` is also required for pvrdma devices due to
-        limitations in the RDMA API provided by Linux.
-
         Setting share=on might affect the ability to configure NUMA
         bindings for the memory backend under some circumstances, see
         Documentation/vm/numa\_memory\_policy.txt on the Linux kernel
diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure b/scripts/ci/org.centos/stream/8/x86_64/configure
index 76781f17f4..868db665f6 100755
--- a/scripts/ci/org.centos/stream/8/x86_64/configure
+++ b/scripts/ci/org.centos/stream/8/x86_64/configure
@@ -99,7 +99,6 @@
 --disable-opengl \
 --disable-parallels \
 --disable-pie \
---disable-pvrdma \
 --disable-qcow1 \
 --disable-qed \
 --disable-qom-cast-debug \
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index 680fa3f581..5ace33f167 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -163,7 +163,6 @@ meson_options_help() {
   printf "%s\n" '  pixman          pixman support'
   printf "%s\n" '  plugins         TCG plugins via shared library loading'
   printf "%s\n" '  png             PNG support with libpng'
-  printf "%s\n" '  pvrdma          Enable PVRDMA support'
   printf "%s\n" '  qcow1           qcow1 image format support'
   printf "%s\n" '  qed             qed image format support'
   printf "%s\n" '  qga-vss         build QGA VSS support (broken with MinGW)'
@@ -428,8 +427,6 @@ _meson_option_parse() {
     --enable-png) printf "%s" -Dpng=enabled ;;
     --disable-png) printf "%s" -Dpng=disabled ;;
     --prefix=*) quote_sh "-Dprefix=$2" ;;
-    --enable-pvrdma) printf "%s" -Dpvrdma=enabled ;;
-    --disable-pvrdma) printf "%s" -Dpvrdma=disabled ;;
     --enable-qcow1) printf "%s" -Dqcow1=enabled ;;
     --disable-qcow1) printf "%s" -Dqcow1=disabled ;;
     --enable-qed) printf "%s" -Dqed=enabled ;;
diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index a0006eec6f..73c292bbac 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -55,7 +55,6 @@ cp_portable() {
                                      -e 'linux/if_ether' \
                                      -e 'input-event-codes' \
                                      -e 'sys/' \
-                                     -e 'pvrdma_verbs' \
                                      -e 'drm.h' \
                                      -e 'limits' \
                                      -e 'linux/const' \
@@ -226,32 +225,6 @@ mkdir -p "$output/include/standard-headers/drm"
 cp_portable "$tmpdir/include/drm/drm_fourcc.h" \
             "$output/include/standard-headers/drm"
 
-rm -rf "$output/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma"
-mkdir -p "$output/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma"
-
-# Remove the unused functions from pvrdma_verbs.h avoiding the unnecessary
-# import of several infiniband/networking/other headers
-tmp_pvrdma_verbs="$tmpdir/pvrdma_verbs.h"
-# Parse the entire file instead of single lines to match
-# function declarations expanding over multiple lines
-# and strip the declarations starting with pvrdma prefix.
-sed  -e '1h;2,$H;$!d;g'  -e 's/[^};]*pvrdma[^(| ]*([^)]*);//g' \
-    "$linux/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h" > \
-    "$tmp_pvrdma_verbs";
-
-for i in "$linux/drivers/infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h" \
-         "$tmp_pvrdma_verbs"; do \
-    cp_portable "$i" \
-         "$output/include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/"
-done
-
-rm -rf "$output/include/standard-headers/rdma/"
-mkdir -p "$output/include/standard-headers/rdma/"
-for i in "$tmpdir/include/rdma/vmw_pvrdma-abi.h"; do
-    cp_portable "$i" \
-         "$output/include/standard-headers/rdma/"
-done
-
 cat <<EOF >$output/include/standard-headers/linux/types.h
 /* For QEMU all types are already defined via osdep.h, so this
  * header does not need to do anything.
-- 
2.41.0




* [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-03-28 13:02 [PATCH-for-9.1 v2 0/3] rdma: Remove RDMA subsystem and pvrdma device Philippe Mathieu-Daudé
  2024-03-28 13:02 ` [PATCH-for-9.1 v2 1/3] hw/rdma: Remove pvrdma device and rdmacm-mux helper Philippe Mathieu-Daudé
@ 2024-03-28 13:02 ` Philippe Mathieu-Daudé
  2024-03-28 14:18   ` Fabiano Rosas
  2024-03-28 13:02 ` [PATCH-for-9.1 v2 3/3] block/gluster: " Philippe Mathieu-Daudé
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 52+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-03-28 13:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: Yuval Shaia, Kevin Wolf, Prasanna Kumar Kalever, Fabiano Rosas,
	Cornelia Huck, Michael Roth, Li Zhijian, Prasanna Kumar Kalever,
	Peter Xu, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Philippe Mathieu-Daudé,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	Peter Maydell

The whole RDMA subsystem was deprecated in commit e9a54265f5
("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
released in v8.2.

Remove:
 - RDMA handling from migration
 - dependencies on libibumad, libibverbs and librdmacm

Keep the RAM_SAVE_FLAG_HOOK definition since it might appear
in old migration streams.

Cc: Peter Xu <peterx@redhat.com>
Cc: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
---
 MAINTAINERS                                   |    7 -
 docs/devel/migration/main.rst                 |    6 -
 docs/rdma.txt                                 |  420 --
 docs/system/loongarch/virt.rst                |    2 +-
 meson.build                                   |   23 -
 qapi/migration.json                           |   31 +-
 migration/migration-stats.h                   |    6 +-
 migration/migration.h                         |    9 -
 migration/options.h                           |    2 -
 migration/rdma.h                              |   69 -
 migration/migration-stats.c                   |    5 +-
 migration/migration.c                         |   31 -
 migration/options.c                           |   16 -
 migration/qemu-file.c                         |    1 -
 migration/ram.c                               |   86 +-
 migration/rdma.c                              | 4184 -----------------
 migration/savevm.c                            |    2 +-
 meson_options.txt                             |    2 -
 migration/meson.build                         |    1 -
 migration/trace-events                        |   68 +-
 qemu-options.hx                               |    3 -
 .../org.centos/stream/8/build-environment.yml |    1 -
 .../ci/org.centos/stream/8/x86_64/configure   |    2 -
 scripts/ci/setup/build-environment.yml        |    4 -
 scripts/coverity-scan/run-coverity-scan       |    2 +-
 scripts/meson-buildoptions.sh                 |    3 -
 tests/lcitool/projects/qemu.yml               |    3 -
 tests/migration/guestperf/engine.py           |    4 +-
 28 files changed, 14 insertions(+), 4979 deletions(-)
 delete mode 100644 docs/rdma.txt
 delete mode 100644 migration/rdma.h
 delete mode 100644 migration/rdma.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 91ab5235b8..05226cea0a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3426,13 +3426,6 @@ F: docs/devel/migration.rst
 F: qapi/migration.json
 F: tests/migration/
 F: util/userfaultfd.c
-X: migration/rdma*
-
-RDMA Migration
-R: Li Zhijian <lizhijian@fujitsu.com>
-R: Peter Xu <peterx@redhat.com>
-S: Odd Fixes
-F: migration/rdma*
 
 Migration dirty limit and dirty page rate
 M: Hyman Huang <yong.huang@smartx.com>
diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst
index 54385a23e5..70278ce1e3 100644
--- a/docs/devel/migration/main.rst
+++ b/docs/devel/migration/main.rst
@@ -47,12 +47,6 @@ over any transport.
   QEMU interference. Note that QEMU does not flush cached file
   data/metadata at the end of migration.
 
-In addition, support is included for migration using RDMA, which
-transports the page data using ``RDMA``, where the hardware takes care of
-transporting the pages, and the load on the CPU is much lower.  While the
-internals of RDMA migration are a bit different, this isn't really visible
-outside the RAM migration code.
-
 All these migration protocols use the same infrastructure to
 save/restore state devices.  This infrastructure is shared with the
 savevm/loadvm functionality.
diff --git a/docs/rdma.txt b/docs/rdma.txt
deleted file mode 100644
index bd8dd799a9..0000000000
--- a/docs/rdma.txt
+++ /dev/null
@@ -1,420 +0,0 @@
-(RDMA: Remote Direct Memory Access)
-RDMA Live Migration Specification, Version # 1
-==============================================
-Wiki: https://wiki.qemu.org/Features/RDMALiveMigration
-Github: git@github.com:hinesmr/qemu.git, 'rdma' branch
-
-Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
-
-An *exhaustive* paper (2010) shows additional performance details
-linked on the QEMU wiki above.
-
-Contents:
-=========
-* Introduction
-* Before running
-* Running
-* Performance
-* RDMA Migration Protocol Description
-* Versioning and Capabilities
-* QEMUFileRDMA Interface
-* Migration of VM's ram
-* Error handling
-* TODO
-
-Introduction:
-=============
-
-RDMA helps make your migration more deterministic under heavy load because
-of the significantly lower latency and higher throughput over TCP/IP. This is
-because the RDMA I/O architecture reduces the number of interrupts and
-data copies by bypassing the host networking stack. In particular, a TCP-based
-migration, under certain types of memory-bound workloads, may take a more
-unpredictable amount of time to complete the migration if the amount of
-memory tracked during each live migration iteration round cannot keep pace
-with the rate of dirty memory produced by the workload.
-
-RDMA currently comes in two flavors: both Ethernet based (RoCE, or RDMA
-over Converged Ethernet) as well as Infiniband-based. This implementation of
-migration using RDMA is capable of using both technologies because of
-the use of the OpenFabrics OFED software stack that abstracts out the
-programming model irrespective of the underlying hardware.
-
-Refer to openfabrics.org or your respective RDMA hardware vendor for
-an understanding on how to verify that you have the OFED software stack
-installed in your environment. You should be able to successfully link
-against the "librdmacm" and "libibverbs" libraries and development headers
-for a working build of QEMU to run successfully using RDMA Migration.
-
-BEFORE RUNNING:
-===============
-
-Use of RDMA during migration requires pinning and registering memory
-with the hardware. This means that memory must be physically resident
-before the hardware can transmit that memory to another machine.
-If this is not acceptable for your application or product, then the use
-of RDMA migration may in fact be harmful to co-located VMs or other
-software on the machine if there is not sufficient memory available to
-relocate the entire footprint of the virtual machine. If so, then the
-use of RDMA is discouraged and it is recommended to use standard TCP migration.
-
-Experimental: Next, decide if you want dynamic page registration.
-For example, if you have an 8GB RAM virtual machine, but only 1GB
-is in active use, then enabling this feature will cause all 8GB to
-be pinned and resident in memory. This feature mostly affects the
-bulk-phase round of the migration and can be enabled for extremely
-high-performance RDMA hardware using the following command:
-
-QEMU Monitor Command:
-$ migrate_set_capability rdma-pin-all on # disabled by default
-
-Performing this action will cause all 8GB to be pinned, so if that's
-not what you want, then please ignore this step altogether.
-
-On the other hand, this will also significantly speed up the bulk round
-of the migration, which can greatly reduce the "total" time of your migration.
-Example performance of this using an idle VM in the previous example
-can be found in the "Performance" section.
-
-Note: for very large virtual machines (hundreds of GBs), pinning all
-*all* of the memory of your virtual machine in the kernel is very expensive
-may extend the initial bulk iteration time by many seconds,
-and thus extending the total migration time. However, this will not
-affect the determinism or predictability of your migration you will
-still gain from the benefits of advanced pinning with RDMA.
-
-RUNNING:
-========
-
-First, set the migration speed to match your hardware's capabilities:
-
-QEMU Monitor Command:
-$ migrate_set_parameter max-bandwidth 40g # or whatever is the MAX of your RDMA device
-
-Next, on the destination machine, add the following to the QEMU command line:
-
-qemu ..... -incoming rdma:host:port
-
-Finally, perform the actual migration on the source machine:
-
-QEMU Monitor Command:
-$ migrate -d rdma:host:port
-
-PERFORMANCE
-===========
-
-Here is a brief summary of total migration time and downtime using RDMA:
-Using a 40gbps infiniband link performing a worst-case stress test,
-using an 8GB RAM virtual machine:
-
-Using the following command:
-$ apt-get install stress
-$ stress --vm-bytes 7500M --vm 1 --vm-keep
-
-1. Migration throughput: 26 gigabits/second.
-2. Downtime (stop time) varies between 15 and 100 milliseconds.
-
-EFFECTS of memory registration on bulk phase round:
-
-For example, in the same 8GB RAM example with all 8GB of memory in
-active use and the VM itself is completely idle using the same 40 gbps
-infiniband link:
-
-1. rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
-2. rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
-
-These numbers would of course scale up to whatever size virtual machine
-you have to migrate using RDMA.
-
-Enabling this feature does *not* have any measurable affect on
-migration *downtime*. This is because, without this feature, all of the
-memory will have already been registered already in advance during
-the bulk round and does not need to be re-registered during the successive
-iteration rounds.
-
-RDMA Protocol Description:
-==========================
-
-Migration with RDMA is separated into two parts:
-
-1. The transmission of the pages using RDMA
-2. Everything else (a control channel is introduced)
-
-"Everything else" is transmitted using a formal
-protocol now, consisting of infiniband SEND messages.
-
-An infiniband SEND message is the standard ibverbs
-message used by applications of infiniband hardware.
-The only difference between a SEND message and an RDMA
-message is that SEND messages cause notifications
-to be posted to the completion queue (CQ) on the
-infiniband receiver side, whereas RDMA messages (used
-for VM's ram) do not (to behave like an actual DMA).
-
-Messages in infiniband require two things:
-
-1. registration of the memory that will be transmitted
-2. (SEND only) work requests to be posted on both
-   sides of the network before the actual transmission
-   can occur.
-
-RDMA messages are much easier to deal with. Once the memory
-on the receiver side is registered and pinned, we're
-basically done. All that is required is for the sender
-side to start dumping bytes onto the link.
-
-(Memory is not released from pinning until the migration
-completes, given that RDMA migrations are very fast.)
-
-SEND messages require more coordination because the
-receiver must have reserved space (using a receive
-work request) on the receive queue (RQ) before QEMUFileRDMA
-can start using them to carry all the bytes as
-a control transport for migration of device state.
-
-To begin the migration, the initial connection setup is
-as follows (migration-rdma.c):
-
-1. Receiver and Sender are started (command line or libvirt):
-2. Both sides post two RQ work requests
-3. Receiver does listen()
-4. Sender does connect()
-5. Receiver accept()
-6. Check versioning and capabilities (described later)
-
-At this point, we define a control channel on top of SEND messages
-which is described by a formal protocol. Each SEND message has a
-header portion and a data portion (but together are transmitted
-as a single SEND message).
-
-Header:
-    * Length               (of the data portion, uint32, network byte order)
-    * Type                 (what command to perform, uint32, network byte order)
-    * Repeat               (Number of commands in data portion, same type only)
-
-The 'Repeat' field is here to support future multiple page registrations
-in a single message without any need to change the protocol itself
-so that the protocol is compatible against multiple versions of QEMU.
-Version #1 requires that all server implementations of the protocol must
-check this field and register all requests found in the array of commands located
-in the data portion and return an equal number of results in the response.
-The maximum number of repeats is hard-coded to 4096. This is a conservative
-limit based on the maximum size of a SEND message along with empirical
-observations on the maximum future benefit of simultaneous page registrations.
-
-The 'type' field has 12 different command values:
-     1. Unused
-     2. Error                      (sent to the source during bad things)
-     3. Ready                      (control-channel is available)
-     4. QEMU File                  (for sending non-live device state)
-     5. RAM Blocks request         (used right after connection setup)
-     6. RAM Blocks result          (used right after connection setup)
-     7. Compress page              (zap zero page and skip registration)
-     8. Register request           (dynamic chunk registration)
-     9. Register result            ('rkey' to be used by sender)
-    10. Register finished          (registration for current iteration finished)
-    11. Unregister request         (unpin previously registered memory)
-    12. Unregister finished        (confirmation that unpin completed)
-
-A single control message, as hinted above, can contain within the data
-portion an array of many commands of the same type. If there is more than
-one command, then the 'repeat' field will be greater than 1.
-
-After connection setup, message 5 & 6 are used to exchange ram block
-information and optionally pin all the memory if requested by the user.
-
-After ram block exchange is completed, we have two protocol-level
-functions, responsible for communicating control-channel commands
-using the above list of values:
-
-Logically:
-
-qemu_rdma_exchange_recv(header, expected command type)
-
-1. We transmit a READY command to let the sender know that
-   we are *ready* to receive some data bytes on the control channel.
-2. Before attempting to receive the expected command, we post another
-   RQ work request to replace the one we just used up.
-3. Block on a CQ event channel and wait for the SEND to arrive.
-4. When the send arrives, librdmacm will unblock us.
-5. Verify that the command-type and version received matches the one we expected.
-
-qemu_rdma_exchange_send(header, data, optional response header & data):
-
-1. Block on the CQ event channel waiting for a READY command
-   from the receiver to tell us that the receiver
-   is *ready* for us to transmit some new bytes.
-2. Optionally: if we are expecting a response from the command
-   (that we have not yet transmitted), let's post an RQ
-   work request to receive that data a few moments later.
-3. When the READY arrives, librdmacm will
-   unblock us and we immediately post a RQ work request
-   to replace the one we just used up.
-4. Now, we can actually post the work request to SEND
-   the requested command type of the header we were asked for.
-5. Optionally, if we are expecting a response (as before),
-   we block again and wait for that response using the additional
-   work request we previously posted. (This is used to carry
-   'Register result' commands #6 back to the sender which
-   hold the rkey need to perform RDMA. Note that the virtual address
-   corresponding to this rkey was already exchanged at the beginning
-   of the connection (described below).
-
-All of the remaining command types (not including 'ready')
-described above all use the aforementioned two functions to do the hard work:
-
-1. After connection setup, RAMBlock information is exchanged using
-   this protocol before the actual migration begins. This information includes
-   a description of each RAMBlock on the server side as well as the virtual addresses
-   and lengths of each RAMBlock. This is used by the client to determine the
-   start and stop locations of chunks and how to register them dynamically
-   before performing the RDMA operations.
-2. During runtime, once a 'chunk' becomes full of pages ready to
-   be sent with RDMA, the registration commands are used to ask the
-   other side to register the memory for this chunk and respond
-   with the result (rkey) of the registration.
-3. Also, the QEMUFile interfaces also call these functions (described below)
-   when transmitting non-live state, such as devices or to send
-   its own protocol information during the migration process.
-4. Finally, zero pages are only checked if a page has not yet been registered
-   using chunk registration (or not checked at all and unconditionally
-   written if chunk registration is disabled. This is accomplished using
-   the "Compress" command listed above. If the page *has* been registered
-   then we check the entire chunk for zero. Only if the entire chunk is
-   zero, then we send a compress command to zap the page on the other side.
-
-Versioning and Capabilities
-===========================
-Current version of the protocol is version #1.
-
-The same version applies to both for protocol traffic and capabilities
-negotiation. (i.e. There is only one version number that is referred to
-by all communication).
-
-librdmacm provides the user with a 'private data' area to be exchanged
-at connection-setup time before any infiniband traffic is generated.
-
-Header:
-    * Version (protocol version validated before send/recv occurs),
-                                               uint32, network byte order
-    * Flags   (bitwise OR of each capability),
-                                               uint32, network byte order
-
-There is no data portion of this header right now, so there is
-no length field. The maximum size of the 'private data' section
-is only 192 bytes per the Infiniband specification, so it's not
-very useful for data anyway. This structure needs to remain small.
-
-This private data area is a convenient place to check for protocol
-versioning because the user does not need to register memory to
-transmit a few bytes of version information.
-
-This is also a convenient place to negotiate capabilities
-(like dynamic page registration).
-
-If the version is invalid, we throw an error.
-
-If the version is new, we only negotiate the capabilities that the
-requested version is able to perform and ignore the rest.
-
-Currently there is only one capability in Version #1: dynamic page registration
-
-Finally: Negotiation happens with the Flags field: If the primary-VM
-sets a flag, but the destination does not support this capability, it
-will return a zero-bit for that flag and the primary-VM will understand
-that as not being an available capability and will thus disable that
-capability on the primary-VM side.
-
-QEMUFileRDMA Interface:
-=======================
-
-QEMUFileRDMA introduces a couple of new functions:
-
-1. qemu_rdma_get_buffer()               (QEMUFileOps rdma_read_ops)
-2. qemu_rdma_put_buffer()               (QEMUFileOps rdma_write_ops)
-
-These two functions are very short and simply use the protocol
-describe above to deliver bytes without changing the upper-level
-users of QEMUFile that depend on a bytestream abstraction.
-
-Finally, how do we handoff the actual bytes to get_buffer()?
-
-Again, because we're trying to "fake" a bytestream abstraction
-using an analogy not unlike individual UDP frames, we have
-to hold on to the bytes received from control-channel's SEND
-messages in memory.
-
-Each time we receive a complete "QEMU File" control-channel
-message, the bytes from SEND are copied into a small local holding area.
-
-Then, we return the number of bytes requested by get_buffer()
-and leave the remaining bytes in the holding area until get_buffer()
-comes around for another pass.
-
-If the buffer is empty, then we follow the same steps
-listed above and issue another "QEMU File" protocol command,
-asking for a new SEND message to re-fill the buffer.
-
-Migration of VM's ram:
-====================
-
-At the beginning of the migration, (migration-rdma.c),
-the sender and the receiver populate the list of RAMBlocks
-to be registered with each other into a structure.
-Then, using the aforementioned protocol, they exchange a
-description of these blocks with each other, to be used later
-during the iteration of main memory. This description includes
-a list of all the RAMBlocks, their offsets and lengths, virtual
-addresses and possibly includes pre-registered RDMA keys in case dynamic
-page registration was disabled on the server-side, otherwise not.
-
-Main memory is not migrated with the aforementioned protocol,
-but is instead migrated with normal RDMA Write operations.
-
-Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
-Chunk size is not dynamic, but it could be in a future implementation.
-There's nothing to indicate that this is useful right now.
-
-When a chunk is full (or a flush() occurs), the memory backed by
-the chunk is registered with librdmacm is pinned in memory on
-both sides using the aforementioned protocol.
-After pinning, an RDMA Write is generated and transmitted
-for the entire chunk.
-
-Chunks are also transmitted in batches: This means that we
-do not request that the hardware signal the completion queue
-for the completion of *every* chunk. The current batch size
-is about 64 chunks (corresponding to 64 MB of memory).
-Only the last chunk in a batch must be signaled.
-This helps keep everything as asynchronous as possible
-and helps keep the hardware busy performing RDMA operations.
-
-Error-handling:
-===============
-
-Infiniband has what is called a "Reliable, Connected"
-link (one of 4 choices). This is the mode in which
-we use for RDMA migration.
-
-If a *single* message fails,
-the decision is to abort the migration entirely and
-cleanup all the RDMA descriptors and unregister all
-the memory.
-
-After cleanup, the Virtual Machine is returned to normal
-operation the same way that would happen if the TCP
-socket is broken during a non-RDMA based migration.
-
-TODO:
-=====
-1. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
-   are not compatible with infiniband memory pinning and will result in
-   an aborted migration (but with the source VM left unaffected).
-2. Use of the recent /proc/<pid>/pagemap would likely speed up
-   the use of KSM and ballooning while using RDMA.
-3. Also, some form of balloon-device usage tracking would also
-   help alleviate some issues.
-4. Use LRU to provide more fine-grained direction of UNREGISTER
-   requests for unpinning memory in an overcommitted environment.
-5. Expose UNREGISTER support to the user by way of workload-specific
-   hints about application behavior.
diff --git a/docs/system/loongarch/virt.rst b/docs/system/loongarch/virt.rst
index 06d034b8ef..0a8e0766e4 100644
--- a/docs/system/loongarch/virt.rst
+++ b/docs/system/loongarch/virt.rst
@@ -39,7 +39,7 @@ can be accessed by following steps.
 
 .. code-block:: bash
 
-  ./configure --disable-rdma --prefix=/usr \
+  ./configure --prefix=/usr \
               --target-list="loongarch64-softmmu" \
               --disable-libiscsi --disable-libnfs --disable-libpmem \
               --disable-glusterfs --enable-libusb --enable-usb-redir \
diff --git a/meson.build b/meson.build
index d6af3cd53a..bd65abad13 100644
--- a/meson.build
+++ b/meson.build
@@ -1854,21 +1854,6 @@ if numa.found() and not cc.links('''
   endif
 endif
 
-rdma = not_found
-if not get_option('rdma').auto() or have_system
-  libumad = cc.find_library('ibumad', required: get_option('rdma'))
-  rdma_libs = [cc.find_library('rdmacm', has_headers: ['rdma/rdma_cma.h'],
-                               required: get_option('rdma')),
-               cc.find_library('ibverbs', required: get_option('rdma')),
-               libumad]
-  rdma = declare_dependency(dependencies: rdma_libs)
-  foreach lib: rdma_libs
-    if not lib.found()
-      rdma = not_found
-    endif
-  endforeach
-endif
-
 cacard = not_found
 if not get_option('smartcard').auto() or have_system
   cacard = dependency('libcacard', required: get_option('smartcard'),
@@ -2246,7 +2231,6 @@ endif
 config_host_data.set('CONFIG_OPENGL', opengl.found())
 config_host_data.set('CONFIG_PLUGIN', get_option('plugins'))
 config_host_data.set('CONFIG_RBD', rbd.found())
-config_host_data.set('CONFIG_RDMA', rdma.found())
 config_host_data.set('CONFIG_RELOCATABLE', get_option('relocatable'))
 config_host_data.set('CONFIG_SAFESTACK', get_option('safe_stack'))
 config_host_data.set('CONFIG_SDL', sdl.found())
@@ -2399,12 +2383,6 @@ if rbd.found()
                                        dependencies: rbd,
                                        prefix: '#include <rbd/librbd.h>'))
 endif
-if rdma.found()
-  config_host_data.set('HAVE_IBV_ADVISE_MR',
-                       cc.has_function('ibv_advise_mr',
-                                       dependencies: rdma,
-                                       prefix: '#include <infiniband/verbs.h>'))
-endif
 
 have_asan_fiber = false
 if get_option('sanitizers') and \
@@ -4398,7 +4376,6 @@ summary_info += {'Multipath support': mpathpersist}
 summary_info += {'Linux AIO support': libaio}
 summary_info += {'Linux io_uring support': linux_io_uring}
 summary_info += {'ATTR/XATTR support': libattr}
-summary_info += {'RDMA support':      rdma}
 summary_info += {'fdt support':       fdt_opt == 'disabled' ? false : fdt_opt}
 summary_info += {'libcap-ng support': libcap_ng}
 summary_info += {'bpf support':       libbpf}
diff --git a/qapi/migration.json b/qapi/migration.json
index 8c65b90328..9a56d403be 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -221,8 +221,8 @@
 #
 # @setup-time: amount of setup time in milliseconds *before* the
 #     iterations begin but *after* the QMP command is issued.  This is
-#     designed to provide an accounting of any activities (such as
-#     RDMA pinning) which may be expensive, but do not actually occur
+#     designed to provide an accounting of any activities which may be
+#     expensive, but do not actually occur
 #     during the iterative migration rounds themselves.  (since 1.6)
 #
 # @cpu-throttle-percentage: percentage of time guest cpus are being
@@ -430,10 +430,6 @@
 #     for certain work loads, by sending compressed difference of the
 #     pages
 #
-# @rdma-pin-all: Controls whether or not the entire VM memory
-#     footprint is mlock()'d on demand or all at once.  Refer to
-#     docs/rdma.txt for usage.  Disabled by default.  (since 2.0)
-#
 # @zero-blocks: During storage migration encode blocks of zeroes
 #     efficiently.  This essentially saves 1MB of zeroes per block on
 #     the wire.  Enabling requires source and target VM to support
@@ -547,7 +543,7 @@
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
-  'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
+  'data': ['xbzrle', 'auto-converge', 'zero-blocks',
            { 'name': 'compress', 'features': [ 'deprecated' ] },
            'events', 'postcopy-ram',
            { 'name': 'x-colo', 'features': [ 'unstable' ] },
@@ -606,7 +602,6 @@
 #     -> { "execute": "query-migrate-capabilities" }
 #     <- { "return": [
 #           {"state": false, "capability": "xbzrle"},
-#           {"state": false, "capability": "rdma-pin-all"},
 #           {"state": false, "capability": "auto-converge"},
 #           {"state": false, "capability": "zero-blocks"},
 #           {"state": false, "capability": "compress"},
@@ -1654,14 +1649,12 @@
 #
 # @exec: Direct the migration stream to another process.
 #
-# @rdma: Migrate via RDMA.
-#
 # @file: Direct the migration stream to a file.
 #
 # Since: 8.2
 ##
 { 'enum': 'MigrationAddressType',
-  'data': [ 'socket', 'exec', 'rdma', 'file' ] }
+  'data': [ 'socket', 'exec', 'file' ] }
 
 ##
 # @FileMigrationArgs:
@@ -1701,7 +1694,6 @@
   'data': {
     'socket': 'SocketAddress',
     'exec': 'MigrationExecCommand',
-    'rdma': 'InetSocketAddress',
     'file': 'FileMigrationArgs' } }
 
 ##
@@ -1804,14 +1796,6 @@
 #     -> { "execute": "migrate",
 #          "arguments": {
 #              "channels": [ { "channel-type": "main",
-#                              "addr": { "transport": "rdma",
-#                                        "host": "10.12.34.9",
-#                                        "port": "1050" } } ] } }
-#     <- { "return": {} }
-#
-#     -> { "execute": "migrate",
-#          "arguments": {
-#              "channels": [ { "channel-type": "main",
 #                              "addr": { "transport": "file",
 #                                        "filename": "/tmp/migfile",
 #                                        "offset": "0x1000" } } ] } }
@@ -1879,13 +1863,6 @@
 #                                                  "/some/sock" ] } } ] } }
 #     <- { "return": {} }
 #
-#     -> { "execute": "migrate-incoming",
-#          "arguments": {
-#              "channels": [ { "channel-type": "main",
-#                              "addr": { "transport": "rdma",
-#                                        "host": "10.12.34.9",
-#                                        "port": "1050" } } ] } }
-#     <- { "return": {} }
 ##
 { 'command': 'migrate-incoming',
              'data': {'*uri': 'str',
diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index 05290ade76..817c53559a 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -93,10 +93,6 @@ typedef struct {
      * Maximum amount of data we can send in a cycle.
      */
     Stat64 rate_limit_max;
-    /*
-     * Number of bytes sent through RDMA.
-     */
-    Stat64 rdma_bytes;
     /*
      * Number of pages transferred that were full of zeros.
      */
@@ -133,7 +129,7 @@ void migration_rate_set(uint64_t new_rate);
  *
  * Returns how many bytes have we transferred since the beginning of
  * the migration.  It accounts for bytes sent through any migration
- * channel, multifd, qemu_file, rdma, ....
+ * channel, multifd, qemu_file, ....
  */
 uint64_t migration_transferred_bytes(void);
 #endif
diff --git a/migration/migration.h b/migration/migration.h
index 8045e39c26..d097828580 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -162,13 +162,6 @@ struct MigrationIncomingState {
 
     int state;
 
-    /*
-     * The incoming migration coroutine, non-NULL during qemu_loadvm_state().
-     * Used to wake the migration incoming coroutine from rdma code. How much is
-     * it safe - it's a question.
-     */
-    Coroutine *loadvm_co;
-
     /* The coroutine we should enter (back) after failover */
     Coroutine *colo_incoming_co;
     QemuSemaphore colo_incoming_sem;
@@ -463,8 +456,6 @@ struct MigrationState {
      * switchover has been received.
      */
     bool switchover_acked;
-    /* Is this a rdma migration */
-    bool rdma_migration;
 };
 
 void migrate_set_state(int *state, int old_state, int new_state);
diff --git a/migration/options.h b/migration/options.h
index ab8199e207..c00213973e 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -37,7 +37,6 @@ bool migrate_multifd(void);
 bool migrate_pause_before_switchover(void);
 bool migrate_postcopy_blocktime(void);
 bool migrate_postcopy_preempt(void);
-bool migrate_rdma_pin_all(void);
 bool migrate_release_ram(void);
 bool migrate_return_path(void);
 bool migrate_validate_uuid(void);
@@ -54,7 +53,6 @@ bool migrate_zero_copy_send(void);
 
 bool migrate_multifd_flush_after_each_section(void);
 bool migrate_postcopy(void);
-bool migrate_rdma(void);
 bool migrate_tls(void);
 
 /* capabilities helpers */
diff --git a/migration/rdma.h b/migration/rdma.h
deleted file mode 100644
index a8d27f33b8..0000000000
--- a/migration/rdma.h
+++ /dev/null
@@ -1,69 +0,0 @@
-/*
- * RDMA protocol and interfaces
- *
- * Copyright IBM, Corp. 2010-2013
- * Copyright Red Hat, Inc. 2015-2016
- *
- * Authors:
- *  Michael R. Hines <mrhines@us.ibm.com>
- *  Jiuxing Liu <jl@us.ibm.com>
- *  Daniel P. Berrange <berrange@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or
- * later.  See the COPYING file in the top-level directory.
- *
- */
-
-#include "qemu/sockets.h"
-
-#ifndef QEMU_MIGRATION_RDMA_H
-#define QEMU_MIGRATION_RDMA_H
-
-#include "exec/memory.h"
-
-void rdma_start_outgoing_migration(void *opaque, InetSocketAddress *host_port,
-                                   Error **errp);
-
-void rdma_start_incoming_migration(InetSocketAddress *host_port, Error **errp);
-
-/*
- * Constants used by rdma return codes
- */
-#define RAM_CONTROL_SETUP     0
-#define RAM_CONTROL_ROUND     1
-#define RAM_CONTROL_FINISH    3
-
-/*
- * Whenever this is found in the data stream, the flags
- * will be passed to rdma functions in the incoming-migration
- * side.
- */
-#define RAM_SAVE_FLAG_HOOK     0x80
-
-#define RAM_SAVE_CONTROL_NOT_SUPP -1000
-#define RAM_SAVE_CONTROL_DELAYED  -2000
-
-#ifdef CONFIG_RDMA
-int rdma_registration_handle(QEMUFile *f);
-int rdma_registration_start(QEMUFile *f, uint64_t flags);
-int rdma_registration_stop(QEMUFile *f, uint64_t flags);
-int rdma_block_notification_handle(QEMUFile *f, const char *name);
-int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
-                           ram_addr_t offset, size_t size);
-#else
-static inline
-int rdma_registration_handle(QEMUFile *f) { return 0; }
-static inline
-int rdma_registration_start(QEMUFile *f, uint64_t flags) { return 0; }
-static inline
-int rdma_registration_stop(QEMUFile *f, uint64_t flags) { return 0; }
-static inline
-int rdma_block_notification_handle(QEMUFile *f, const char *name) { return 0; }
-static inline
-int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
-                           ram_addr_t offset, size_t size)
-{
-    return RAM_SAVE_CONTROL_NOT_SUPP;
-}
-#endif
-#endif
diff --git a/migration/migration-stats.c b/migration/migration-stats.c
index f690b98a03..9bc8d7018f 100644
--- a/migration/migration-stats.c
+++ b/migration/migration-stats.c
@@ -62,9 +62,8 @@ void migration_rate_reset(void)
 uint64_t migration_transferred_bytes(void)
 {
     uint64_t multifd = stat64_get(&mig_stats.multifd_bytes);
-    uint64_t rdma = stat64_get(&mig_stats.rdma_bytes);
     uint64_t qemu_file = stat64_get(&mig_stats.qemu_file_transferred);
 
-    trace_migration_transferred_bytes(qemu_file, multifd, rdma);
-    return qemu_file + multifd + rdma;
+    trace_migration_transferred_bytes(qemu_file, multifd);
+    return qemu_file + multifd;
 }
diff --git a/migration/migration.c b/migration/migration.c
index 9fe8fd2afd..8e17914c8b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -25,7 +25,6 @@
 #include "sysemu/runstate.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/cpu-throttle.h"
-#include "rdma.h"
 #include "ram.h"
 #include "ram-compress.h"
 #include "migration/global_state.h"
@@ -545,7 +544,6 @@ bool migrate_uri_parse(const char *uri, MigrationChannel **channel,
 {
     g_autoptr(MigrationChannel) val = g_new0(MigrationChannel, 1);
     g_autoptr(MigrationAddress) addr = g_new0(MigrationAddress, 1);
-    InetSocketAddress *isock = &addr->u.rdma;
     strList **tail = &addr->u.exec.args;
 
     if (strstart(uri, "exec:", NULL)) {
@@ -558,12 +556,6 @@ bool migrate_uri_parse(const char *uri, MigrationChannel **channel,
         QAPI_LIST_APPEND(tail, g_strdup("-c"));
 #endif
         QAPI_LIST_APPEND(tail, g_strdup(uri + strlen("exec:")));
-    } else if (strstart(uri, "rdma:", NULL)) {
-        if (inet_parse(isock, uri + strlen("rdma:"), errp)) {
-            qapi_free_InetSocketAddress(isock);
-            return false;
-        }
-        addr->transport = MIGRATION_ADDRESS_TYPE_RDMA;
     } else if (strstart(uri, "tcp:", NULL) ||
                 strstart(uri, "unix:", NULL) ||
                 strstart(uri, "vsock:", NULL) ||
@@ -645,22 +637,6 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
         } else if (saddr->type == SOCKET_ADDRESS_TYPE_FD) {
             fd_start_incoming_migration(saddr->u.fd.str, errp);
         }
-#ifdef CONFIG_RDMA
-    } else if (addr->transport == MIGRATION_ADDRESS_TYPE_RDMA) {
-        if (migrate_compress()) {
-            error_setg(errp, "RDMA and compression can't be used together");
-            return;
-        }
-        if (migrate_xbzrle()) {
-            error_setg(errp, "RDMA and XBZRLE can't be used together");
-            return;
-        }
-        if (migrate_multifd()) {
-            error_setg(errp, "RDMA and multifd can't be used together");
-            return;
-        }
-        rdma_start_incoming_migration(&addr->u.rdma, errp);
-#endif
     } else if (addr->transport == MIGRATION_ADDRESS_TYPE_EXEC) {
         exec_start_incoming_migration(addr->u.exec.args, errp);
     } else if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
@@ -751,9 +727,7 @@ process_incoming_migration_co(void *opaque)
     migrate_set_state(&mis->state, MIGRATION_STATUS_SETUP,
                       MIGRATION_STATUS_ACTIVE);
 
-    mis->loadvm_co = qemu_coroutine_self();
     ret = qemu_loadvm_state(mis->from_src_file);
-    mis->loadvm_co = NULL;
 
     trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
 
@@ -1679,7 +1653,6 @@ int migrate_init(MigrationState *s, Error **errp)
     s->iteration_initial_bytes = 0;
     s->threshold_size = 0;
     s->switchover_acked = false;
-    s->rdma_migration = false;
     /*
      * set mig_stats memory to zero for a new migration
      */
@@ -2100,10 +2073,6 @@ void qmp_migrate(const char *uri, bool has_channels,
         } else if (saddr->type == SOCKET_ADDRESS_TYPE_FD) {
             fd_start_outgoing_migration(s, saddr->u.fd.str, &local_err);
         }
-#ifdef CONFIG_RDMA
-    } else if (addr->transport == MIGRATION_ADDRESS_TYPE_RDMA) {
-        rdma_start_outgoing_migration(s, &addr->u.rdma, &local_err);
-#endif
     } else if (addr->transport == MIGRATION_ADDRESS_TYPE_EXEC) {
         exec_start_outgoing_migration(s, addr->u.exec.args, &local_err);
     } else if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
diff --git a/migration/options.c b/migration/options.c
index bfd7753b69..02fc0b9ae8 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -185,7 +185,6 @@ Property migration_properties[] = {
 
     /* Migration capabilities */
     DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
-    DEFINE_PROP_MIG_CAP("x-rdma-pin-all", MIGRATION_CAPABILITY_RDMA_PIN_ALL),
     DEFINE_PROP_MIG_CAP("x-auto-converge", MIGRATION_CAPABILITY_AUTO_CONVERGE),
     DEFINE_PROP_MIG_CAP("x-zero-blocks", MIGRATION_CAPABILITY_ZERO_BLOCKS),
     DEFINE_PROP_MIG_CAP("x-compress", MIGRATION_CAPABILITY_COMPRESS),
@@ -323,13 +322,6 @@ bool migrate_postcopy_ram(void)
     return s->capabilities[MIGRATION_CAPABILITY_POSTCOPY_RAM];
 }
 
-bool migrate_rdma_pin_all(void)
-{
-    MigrationState *s = migrate_get_current();
-
-    return s->capabilities[MIGRATION_CAPABILITY_RDMA_PIN_ALL];
-}
-
 bool migrate_release_ram(void)
 {
     MigrationState *s = migrate_get_current();
@@ -393,13 +385,6 @@ bool migrate_postcopy(void)
     return migrate_postcopy_ram() || migrate_dirty_bitmaps();
 }
 
-bool migrate_rdma(void)
-{
-    MigrationState *s = migrate_get_current();
-
-    return s->rdma_migration;
-}
-
 bool migrate_tls(void)
 {
     MigrationState *s = migrate_get_current();
@@ -458,7 +443,6 @@ INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
     MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER,
     MIGRATION_CAPABILITY_AUTO_CONVERGE,
     MIGRATION_CAPABILITY_RELEASE_RAM,
-    MIGRATION_CAPABILITY_RDMA_PIN_ALL,
     MIGRATION_CAPABILITY_COMPRESS,
     MIGRATION_CAPABILITY_XBZRLE,
     MIGRATION_CAPABILITY_X_COLO,
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index a10882d47f..ad2efb332e 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -32,7 +32,6 @@
 #include "trace.h"
 #include "options.h"
 #include "qapi/error.h"
-#include "rdma.h"
 #include "io/channel-file.h"
 
 #define IO_BUF_SIZE 32768
diff --git a/migration/ram.c b/migration/ram.c
index 8deb84984f..c81c8a7cff 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -59,7 +59,6 @@
 #include "qemu/iov.h"
 #include "multifd.h"
 #include "sysemu/runstate.h"
-#include "rdma.h"
 #include "options.h"
 #include "sysemu/dirtylimit.h"
 #include "sysemu/kvm.h"
@@ -89,7 +88,7 @@
 #define RAM_SAVE_FLAG_EOS      0x10
 #define RAM_SAVE_FLAG_CONTINUE 0x20
 #define RAM_SAVE_FLAG_XBZRLE   0x40
-/* 0x80 is reserved in rdma.h for RAM_SAVE_FLAG_HOOK */
+#define RAM_SAVE_FLAG_HOOK     0x80 /* was reserved by RDMA */
 #define RAM_SAVE_FLAG_COMPRESS_PAGE    0x100
 #define RAM_SAVE_FLAG_MULTIFD_FLUSH    0x200
 /* We can't use any flag that is bigger than 0x200 */
@@ -1175,32 +1174,6 @@ static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
     return len;
 }
 
-/*
- * @pages: the number of pages written by the control path,
- *        < 0 - error
- *        > 0 - number of pages written
- *
- * Return true if the pages has been saved, otherwise false is returned.
- */
-static bool control_save_page(PageSearchStatus *pss,
-                              ram_addr_t offset, int *pages)
-{
-    int ret;
-
-    ret = rdma_control_save_page(pss->pss_channel, pss->block->offset, offset,
-                                 TARGET_PAGE_SIZE);
-    if (ret == RAM_SAVE_CONTROL_NOT_SUPP) {
-        return false;
-    }
-
-    if (ret == RAM_SAVE_CONTROL_DELAYED) {
-        *pages = 1;
-        return true;
-    }
-    *pages = ret;
-    return true;
-}
-
 /*
  * directly send the page to the stream
  *
@@ -2080,11 +2053,6 @@ static bool save_compress_page(RAMState *rs, PageSearchStatus *pss,
 static int ram_save_target_page_legacy(RAMState *rs, PageSearchStatus *pss)
 {
     ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
-    int res;
-
-    if (control_save_page(pss, offset, &res)) {
-        return res;
-    }
 
     if (save_compress_page(rs, pss, offset)) {
         return 1;
@@ -3114,18 +3082,6 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
         }
     }
 
-    ret = rdma_registration_start(f, RAM_CONTROL_SETUP);
-    if (ret < 0) {
-        qemu_file_set_error(f, ret);
-        return ret;
-    }
-
-    ret = rdma_registration_stop(f, RAM_CONTROL_SETUP);
-    if (ret < 0) {
-        qemu_file_set_error(f, ret);
-        return ret;
-    }
-
     migration_ops = g_malloc0(sizeof(MigrationOps));
 
     if (migrate_multifd()) {
@@ -3221,12 +3177,6 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
             /* Read version before ram_list.blocks */
             smp_rmb();
 
-            ret = rdma_registration_start(f, RAM_CONTROL_ROUND);
-            if (ret < 0) {
-                qemu_file_set_error(f, ret);
-                goto out;
-            }
-
             t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
             i = 0;
             while ((ret = migration_rate_exceeded(f)) == 0 ||
@@ -3278,15 +3228,6 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
         }
     }
 
-    /*
-     * Must occur before EOS (or any QEMUFile operation)
-     * because of RDMA protocol.
-     */
-    ret = rdma_registration_stop(f, RAM_CONTROL_ROUND);
-    if (ret < 0) {
-        qemu_file_set_error(f, ret);
-    }
-
 out:
     if (ret >= 0
         && migration_is_setup_or_active()) {
@@ -3332,12 +3273,6 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
             migration_bitmap_sync_precopy(rs, true);
         }
 
-        ret = rdma_registration_start(f, RAM_CONTROL_FINISH);
-        if (ret < 0) {
-            qemu_file_set_error(f, ret);
-            return ret;
-        }
-
         /* try transferring iterative blocks of memory */
 
         /* flush all remaining blocks regardless of rate limiting */
@@ -3358,12 +3293,6 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
         qemu_mutex_unlock(&rs->bitmap_mutex);
 
         compress_flush_data();
-
-        ret = rdma_registration_stop(f, RAM_CONTROL_FINISH);
-        if (ret < 0) {
-            qemu_file_set_error(f, ret);
-            return ret;
-        }
     }
 
     ret = multifd_send_sync_main();
@@ -3576,8 +3505,7 @@ static inline void *colo_cache_from_block_offset(RAMBlock *block,
 /**
  * ram_handle_zero: handle the zero page case
  *
- * If a page (or a whole RDMA chunk) has been
- * determined to be zero, then zap it.
+ * If a page has been determined to be zero, then zap it.
  *
  * @host: host address for the zero page
  * @ch: what the page is filled from.  We only support zero
@@ -4161,10 +4089,6 @@ static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
             return -EINVAL;
         }
     }
-    ret = rdma_block_notification_handle(f, block->idstr);
-    if (ret < 0) {
-        qemu_file_set_error(f, ret);
-    }
 
     return ret;
 }
@@ -4363,12 +4287,6 @@ static int ram_load_precopy(QEMUFile *f)
                 multifd_recv_sync_main();
             }
             break;
-        case RAM_SAVE_FLAG_HOOK:
-            ret = rdma_registration_handle(f);
-            if (ret < 0) {
-                qemu_file_set_error(f, ret);
-            }
-            break;
         default:
             error_report("Unknown combination of migration flags: 0x%x", flags);
             ret = -EINVAL;
diff --git a/migration/rdma.c b/migration/rdma.c
deleted file mode 100644
index 855753c671..0000000000
--- a/migration/rdma.c
+++ /dev/null
@@ -1,4184 +0,0 @@
-/*
- * RDMA protocol and interfaces
- *
- * Copyright IBM, Corp. 2010-2013
- * Copyright Red Hat, Inc. 2015-2016
- *
- * Authors:
- *  Michael R. Hines <mrhines@us.ibm.com>
- *  Jiuxing Liu <jl@us.ibm.com>
- *  Daniel P. Berrange <berrange@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or
- * later.  See the COPYING file in the top-level directory.
- *
- */
-
-#include "qemu/osdep.h"
-#include "qapi/error.h"
-#include "qemu/cutils.h"
-#include "exec/target_page.h"
-#include "rdma.h"
-#include "migration.h"
-#include "migration-stats.h"
-#include "qemu-file.h"
-#include "ram.h"
-#include "qemu/error-report.h"
-#include "qemu/main-loop.h"
-#include "qemu/module.h"
-#include "qemu/rcu.h"
-#include "qemu/sockets.h"
-#include "qemu/bitmap.h"
-#include "qemu/coroutine.h"
-#include "exec/memory.h"
-#include <sys/socket.h>
-#include <netdb.h>
-#include <arpa/inet.h>
-#include <rdma/rdma_cma.h>
-#include "trace.h"
-#include "qom/object.h"
-#include "options.h"
-#include <poll.h>
-
-#define RDMA_RESOLVE_TIMEOUT_MS 10000
-
-/* Do not merge data if larger than this. */
-#define RDMA_MERGE_MAX (2 * 1024 * 1024)
-#define RDMA_SIGNALED_SEND_MAX (RDMA_MERGE_MAX / 4096)
-
-#define RDMA_REG_CHUNK_SHIFT 20 /* 1 MB */
-
-/*
- * This is only for non-live state being migrated.
- * Instead of RDMA_WRITE messages, we use RDMA_SEND
- * messages for that state, which requires a different
- * delivery design than main memory.
- */
-#define RDMA_SEND_INCREMENT 32768
-
-/*
- * Maximum size infiniband SEND message
- */
-#define RDMA_CONTROL_MAX_BUFFER (512 * 1024)
-#define RDMA_CONTROL_MAX_COMMANDS_PER_MESSAGE 4096
-
-#define RDMA_CONTROL_VERSION_CURRENT 1
-/*
- * Capabilities for negotiation.
- */
-#define RDMA_CAPABILITY_PIN_ALL 0x01
-
-/*
- * Add the other flags above to this list of known capabilities
- * as they are introduced.
- */
-static uint32_t known_capabilities = RDMA_CAPABILITY_PIN_ALL;
-
-/*
- * A work request ID is 64-bits and we split up these bits
- * into 3 parts:
- *
- * bits 0-15 : type of control message, 2^16
- * bits 16-29: ram block index, 2^14
- * bits 30-63: ram block chunk number, 2^34
- *
- * The last two bit ranges are only used for RDMA writes,
- * in order to track their completion and potentially
- * also track unregistration status of the message.
- */
-#define RDMA_WRID_TYPE_SHIFT  0UL
-#define RDMA_WRID_BLOCK_SHIFT 16UL
-#define RDMA_WRID_CHUNK_SHIFT 30UL
-
-#define RDMA_WRID_TYPE_MASK \
-    ((1UL << RDMA_WRID_BLOCK_SHIFT) - 1UL)
-
-#define RDMA_WRID_BLOCK_MASK \
-    (~RDMA_WRID_TYPE_MASK & ((1UL << RDMA_WRID_CHUNK_SHIFT) - 1UL))
-
-#define RDMA_WRID_CHUNK_MASK (~RDMA_WRID_BLOCK_MASK & ~RDMA_WRID_TYPE_MASK)
-
-/*
- * RDMA migration protocol:
- * 1. RDMA Writes (data messages, i.e. RAM)
- * 2. IB Send/Recv (control channel messages)
- */
-enum {
-    RDMA_WRID_NONE = 0,
-    RDMA_WRID_RDMA_WRITE = 1,
-    RDMA_WRID_SEND_CONTROL = 2000,
-    RDMA_WRID_RECV_CONTROL = 4000,
-};
-
-/*
- * Work request IDs for IB SEND messages only (not RDMA writes).
- * This is used by the migration protocol to transmit
- * control messages (such as device state and registration commands)
- *
- * We could use more WRs, but we have enough for now.
- */
-enum {
-    RDMA_WRID_READY = 0,
-    RDMA_WRID_DATA,
-    RDMA_WRID_CONTROL,
-    RDMA_WRID_MAX,
-};
-
-/*
- * SEND/RECV IB Control Messages.
- */
-enum {
-    RDMA_CONTROL_NONE = 0,
-    RDMA_CONTROL_ERROR,
-    RDMA_CONTROL_READY,               /* ready to receive */
-    RDMA_CONTROL_QEMU_FILE,           /* QEMUFile-transmitted bytes */
-    RDMA_CONTROL_RAM_BLOCKS_REQUEST,  /* RAMBlock synchronization */
-    RDMA_CONTROL_RAM_BLOCKS_RESULT,   /* RAMBlock synchronization */
-    RDMA_CONTROL_COMPRESS,            /* page contains repeat values */
-    RDMA_CONTROL_REGISTER_REQUEST,    /* dynamic page registration */
-    RDMA_CONTROL_REGISTER_RESULT,     /* key to use after registration */
-    RDMA_CONTROL_REGISTER_FINISHED,   /* current iteration finished */
-    RDMA_CONTROL_UNREGISTER_REQUEST,  /* dynamic UN-registration */
-    RDMA_CONTROL_UNREGISTER_FINISHED, /* unpinning finished */
-};
-
-
-/*
- * Memory and MR structures used to represent an IB Send/Recv work request.
- * This is *not* used for RDMA writes, only IB Send/Recv.
- */
-typedef struct {
-    uint8_t  control[RDMA_CONTROL_MAX_BUFFER]; /* actual buffer to register */
-    struct   ibv_mr *control_mr;               /* registration metadata */
-    size_t   control_len;                      /* length of the message */
-    uint8_t *control_curr;                     /* start of unconsumed bytes */
-} RDMAWorkRequestData;
-
-/*
- * Negotiate RDMA capabilities during connection-setup time.
- */
-typedef struct {
-    uint32_t version;
-    uint32_t flags;
-} RDMACapabilities;
-
-static void caps_to_network(RDMACapabilities *cap)
-{
-    cap->version = htonl(cap->version);
-    cap->flags = htonl(cap->flags);
-}
-
-static void network_to_caps(RDMACapabilities *cap)
-{
-    cap->version = ntohl(cap->version);
-    cap->flags = ntohl(cap->flags);
-}
-
-/*
- * Representation of a RAMBlock from an RDMA perspective.
- * This is not transmitted, only local.
- * This and subsequent structures cannot be linked lists
- * because we're using a single IB message to transmit
- * the information. It's small anyway, so a list is overkill.
- */
-typedef struct RDMALocalBlock {
-    char          *block_name;
-    uint8_t       *local_host_addr; /* local virtual address */
-    uint64_t       remote_host_addr; /* remote virtual address */
-    uint64_t       offset;
-    uint64_t       length;
-    struct         ibv_mr **pmr;    /* MRs for chunk-level registration */
-    struct         ibv_mr *mr;      /* MR for non-chunk-level registration */
-    uint32_t      *remote_keys;     /* rkeys for chunk-level registration */
-    uint32_t       remote_rkey;     /* rkeys for non-chunk-level registration */
-    int            index;           /* which block are we */
-    unsigned int   src_index;       /* (Only used on dest) */
-    bool           is_ram_block;
-    int            nb_chunks;
-    unsigned long *transit_bitmap;
-    unsigned long *unregister_bitmap;
-} RDMALocalBlock;
-
-/*
- * Also represents a RAMblock, but only on the dest.
- * This gets transmitted by the dest during connection-time
- * to the source VM and then is used to populate the
- * corresponding RDMALocalBlock with
- * the information needed to perform the actual RDMA.
- */
-typedef struct QEMU_PACKED RDMADestBlock {
-    uint64_t remote_host_addr;
-    uint64_t offset;
-    uint64_t length;
-    uint32_t remote_rkey;
-    uint32_t padding;
-} RDMADestBlock;
-
-static const char *control_desc(unsigned int rdma_control)
-{
-    static const char *strs[] = {
-        [RDMA_CONTROL_NONE] = "NONE",
-        [RDMA_CONTROL_ERROR] = "ERROR",
-        [RDMA_CONTROL_READY] = "READY",
-        [RDMA_CONTROL_QEMU_FILE] = "QEMU FILE",
-        [RDMA_CONTROL_RAM_BLOCKS_REQUEST] = "RAM BLOCKS REQUEST",
-        [RDMA_CONTROL_RAM_BLOCKS_RESULT] = "RAM BLOCKS RESULT",
-        [RDMA_CONTROL_COMPRESS] = "COMPRESS",
-        [RDMA_CONTROL_REGISTER_REQUEST] = "REGISTER REQUEST",
-        [RDMA_CONTROL_REGISTER_RESULT] = "REGISTER RESULT",
-        [RDMA_CONTROL_REGISTER_FINISHED] = "REGISTER FINISHED",
-        [RDMA_CONTROL_UNREGISTER_REQUEST] = "UNREGISTER REQUEST",
-        [RDMA_CONTROL_UNREGISTER_FINISHED] = "UNREGISTER FINISHED",
-    };
-
-    if (rdma_control > RDMA_CONTROL_UNREGISTER_FINISHED) {
-        return "??BAD CONTROL VALUE??";
-    }
-
-    return strs[rdma_control];
-}
-
-#if !defined(htonll)
-static uint64_t htonll(uint64_t v)
-{
-    union { uint32_t lv[2]; uint64_t llv; } u;
-    u.lv[0] = htonl(v >> 32);
-    u.lv[1] = htonl(v & 0xFFFFFFFFULL);
-    return u.llv;
-}
-#endif
-
-#if !defined(ntohll)
-static uint64_t ntohll(uint64_t v)
-{
-    union { uint32_t lv[2]; uint64_t llv; } u;
-    u.llv = v;
-    return ((uint64_t)ntohl(u.lv[0]) << 32) | (uint64_t) ntohl(u.lv[1]);
-}
-#endif
-
-static void dest_block_to_network(RDMADestBlock *db)
-{
-    db->remote_host_addr = htonll(db->remote_host_addr);
-    db->offset = htonll(db->offset);
-    db->length = htonll(db->length);
-    db->remote_rkey = htonl(db->remote_rkey);
-}
-
-static void network_to_dest_block(RDMADestBlock *db)
-{
-    db->remote_host_addr = ntohll(db->remote_host_addr);
-    db->offset = ntohll(db->offset);
-    db->length = ntohll(db->length);
-    db->remote_rkey = ntohl(db->remote_rkey);
-}
-
-/*
- * Virtual address of the above structures used for transmitting
- * the RAMBlock descriptions at connection-time.
- * This structure is *not* transmitted.
- */
-typedef struct RDMALocalBlocks {
-    int nb_blocks;
-    bool     init;             /* main memory init complete */
-    RDMALocalBlock *block;
-} RDMALocalBlocks;
-
-/*
- * Main data structure for RDMA state.
- * While there is only one copy of this structure being allocated right now,
- * this is the place where one would start if you wanted to consider
- * having more than one RDMA connection open at the same time.
- */
-typedef struct RDMAContext {
-    char *host;
-    int port;
-
-    RDMAWorkRequestData wr_data[RDMA_WRID_MAX];
-
-    /*
-     * This is used by *_exchange_send() to figure out whether or not
-     * the initial "READY" message has already been received or not.
-     * This is because other functions may potentially poll() and detect
-     * the READY message before send() does, in which case we need to
-     * know if it completed.
-     */
-    int control_ready_expected;
-
-    /* number of outstanding writes */
-    int nb_sent;
-
-    /* store info about current buffer so that we can
-       merge it with future sends */
-    uint64_t current_addr;
-    uint64_t current_length;
-    /* index of ram block the current buffer belongs to */
-    int current_index;
-    /* index of the chunk in the current ram block */
-    int current_chunk;
-
-    bool pin_all;
-
-    /*
-     * infiniband-specific variables for opening the device
-     * and maintaining connection state and so forth.
-     *
-     * cm_id also has ibv_context, rdma_event_channel, and ibv_qp in
-     * cm_id->verbs, cm_id->channel, and cm_id->qp.
-     */
-    struct rdma_cm_id *cm_id;               /* connection manager ID */
-    struct rdma_cm_id *listen_id;
-    bool connected;
-
-    struct ibv_context          *verbs;
-    struct rdma_event_channel   *channel;
-    struct ibv_qp *qp;                      /* queue pair */
-    struct ibv_comp_channel *recv_comp_channel;  /* recv completion channel */
-    struct ibv_comp_channel *send_comp_channel;  /* send completion channel */
-    struct ibv_pd *pd;                      /* protection domain */
-    struct ibv_cq *recv_cq;                 /* recvieve completion queue */
-    struct ibv_cq *send_cq;                 /* send completion queue */
-
-    /*
-     * If a previous write failed (perhaps because of a failed
-     * memory registration, then do not attempt any future work
-     * and remember the error state.
-     */
-    bool errored;
-    bool error_reported;
-    bool received_error;
-
-    /*
-     * Description of ram blocks used throughout the code.
-     */
-    RDMALocalBlocks local_ram_blocks;
-    RDMADestBlock  *dest_blocks;
-
-    /* Index of the next RAMBlock received during block registration */
-    unsigned int    next_src_index;
-
-    /*
-     * Migration on *destination* started.
-     * Then use coroutine yield function.
-     * Source runs in a thread, so we don't care.
-     */
-    int migration_started_on_destination;
-
-    int total_registrations;
-    int total_writes;
-
-    int unregister_current, unregister_next;
-    uint64_t unregistrations[RDMA_SIGNALED_SEND_MAX];
-
-    GHashTable *blockmap;
-
-    /* the RDMAContext for return path */
-    struct RDMAContext *return_path;
-    bool is_return_path;
-} RDMAContext;
-
-#define TYPE_QIO_CHANNEL_RDMA "qio-channel-rdma"
-OBJECT_DECLARE_SIMPLE_TYPE(QIOChannelRDMA, QIO_CHANNEL_RDMA)
-
-
-
-struct QIOChannelRDMA {
-    QIOChannel parent;
-    RDMAContext *rdmain;
-    RDMAContext *rdmaout;
-    QEMUFile *file;
-    bool blocking; /* XXX we don't actually honour this yet */
-};
-
-/*
- * Main structure for IB Send/Recv control messages.
- * This gets prepended at the beginning of every Send/Recv.
- */
-typedef struct QEMU_PACKED {
-    uint32_t len;     /* Total length of data portion */
-    uint32_t type;    /* which control command to perform */
-    uint32_t repeat;  /* number of commands in data portion of same type */
-    uint32_t padding;
-} RDMAControlHeader;
-
-static void control_to_network(RDMAControlHeader *control)
-{
-    control->type = htonl(control->type);
-    control->len = htonl(control->len);
-    control->repeat = htonl(control->repeat);
-}
-
-static void network_to_control(RDMAControlHeader *control)
-{
-    control->type = ntohl(control->type);
-    control->len = ntohl(control->len);
-    control->repeat = ntohl(control->repeat);
-}
-
-/*
- * Register a single Chunk.
- * Information sent by the source VM to inform the dest
- * to register an single chunk of memory before we can perform
- * the actual RDMA operation.
- */
-typedef struct QEMU_PACKED {
-    union QEMU_PACKED {
-        uint64_t current_addr;  /* offset into the ram_addr_t space */
-        uint64_t chunk;         /* chunk to lookup if unregistering */
-    } key;
-    uint32_t current_index; /* which ramblock the chunk belongs to */
-    uint32_t padding;
-    uint64_t chunks;            /* how many sequential chunks to register */
-} RDMARegister;
-
-static bool rdma_errored(RDMAContext *rdma)
-{
-    if (rdma->errored && !rdma->error_reported) {
-        error_report("RDMA is in an error state waiting migration"
-                     " to abort!");
-        rdma->error_reported = true;
-    }
-    return rdma->errored;
-}
-
-static void register_to_network(RDMAContext *rdma, RDMARegister *reg)
-{
-    RDMALocalBlock *local_block;
-    local_block  = &rdma->local_ram_blocks.block[reg->current_index];
-
-    if (local_block->is_ram_block) {
-        /*
-         * current_addr as passed in is an address in the local ram_addr_t
-         * space, we need to translate this for the destination
-         */
-        reg->key.current_addr -= local_block->offset;
-        reg->key.current_addr += rdma->dest_blocks[reg->current_index].offset;
-    }
-    reg->key.current_addr = htonll(reg->key.current_addr);
-    reg->current_index = htonl(reg->current_index);
-    reg->chunks = htonll(reg->chunks);
-}
-
-static void network_to_register(RDMARegister *reg)
-{
-    reg->key.current_addr = ntohll(reg->key.current_addr);
-    reg->current_index = ntohl(reg->current_index);
-    reg->chunks = ntohll(reg->chunks);
-}
-
-typedef struct QEMU_PACKED {
-    uint32_t value;     /* if zero, we will madvise() */
-    uint32_t block_idx; /* which ram block index */
-    uint64_t offset;    /* Address in remote ram_addr_t space */
-    uint64_t length;    /* length of the chunk */
-} RDMACompress;
-
-static void compress_to_network(RDMAContext *rdma, RDMACompress *comp)
-{
-    comp->value = htonl(comp->value);
-    /*
-     * comp->offset as passed in is an address in the local ram_addr_t
-     * space, we need to translate this for the destination
-     */
-    comp->offset -= rdma->local_ram_blocks.block[comp->block_idx].offset;
-    comp->offset += rdma->dest_blocks[comp->block_idx].offset;
-    comp->block_idx = htonl(comp->block_idx);
-    comp->offset = htonll(comp->offset);
-    comp->length = htonll(comp->length);
-}
-
-static void network_to_compress(RDMACompress *comp)
-{
-    comp->value = ntohl(comp->value);
-    comp->block_idx = ntohl(comp->block_idx);
-    comp->offset = ntohll(comp->offset);
-    comp->length = ntohll(comp->length);
-}
-
-/*
- * The result of the dest's memory registration produces an "rkey"
- * which the source VM must reference in order to perform
- * the RDMA operation.
- */
-typedef struct QEMU_PACKED {
-    uint32_t rkey;
-    uint32_t padding;
-    uint64_t host_addr;
-} RDMARegisterResult;
-
-static void result_to_network(RDMARegisterResult *result)
-{
-    result->rkey = htonl(result->rkey);
-    result->host_addr = htonll(result->host_addr);
-};
-
-static void network_to_result(RDMARegisterResult *result)
-{
-    result->rkey = ntohl(result->rkey);
-    result->host_addr = ntohll(result->host_addr);
-};
-
-static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head,
-                                   uint8_t *data, RDMAControlHeader *resp,
-                                   int *resp_idx,
-                                   int (*callback)(RDMAContext *rdma,
-                                                   Error **errp),
-                                   Error **errp);
-
-static inline uint64_t ram_chunk_index(const uint8_t *start,
-                                       const uint8_t *host)
-{
-    return ((uintptr_t) host - (uintptr_t) start) >> RDMA_REG_CHUNK_SHIFT;
-}
-
-static inline uint8_t *ram_chunk_start(const RDMALocalBlock *rdma_ram_block,
-                                       uint64_t i)
-{
-    return (uint8_t *)(uintptr_t)(rdma_ram_block->local_host_addr +
-                                  (i << RDMA_REG_CHUNK_SHIFT));
-}
-
-static inline uint8_t *ram_chunk_end(const RDMALocalBlock *rdma_ram_block,
-                                     uint64_t i)
-{
-    uint8_t *result = ram_chunk_start(rdma_ram_block, i) +
-                                         (1UL << RDMA_REG_CHUNK_SHIFT);
-
-    if (result > (rdma_ram_block->local_host_addr + rdma_ram_block->length)) {
-        result = rdma_ram_block->local_host_addr + rdma_ram_block->length;
-    }
-
-    return result;
-}
-
-static void rdma_add_block(RDMAContext *rdma, const char *block_name,
-                           void *host_addr,
-                           ram_addr_t block_offset, uint64_t length)
-{
-    RDMALocalBlocks *local = &rdma->local_ram_blocks;
-    RDMALocalBlock *block;
-    RDMALocalBlock *old = local->block;
-
-    local->block = g_new0(RDMALocalBlock, local->nb_blocks + 1);
-
-    if (local->nb_blocks) {
-        if (rdma->blockmap) {
-            for (int x = 0; x < local->nb_blocks; x++) {
-                g_hash_table_remove(rdma->blockmap,
-                                    (void *)(uintptr_t)old[x].offset);
-                g_hash_table_insert(rdma->blockmap,
-                                    (void *)(uintptr_t)old[x].offset,
-                                    &local->block[x]);
-            }
-        }
-        memcpy(local->block, old, sizeof(RDMALocalBlock) * local->nb_blocks);
-        g_free(old);
-    }
-
-    block = &local->block[local->nb_blocks];
-
-    block->block_name = g_strdup(block_name);
-    block->local_host_addr = host_addr;
-    block->offset = block_offset;
-    block->length = length;
-    block->index = local->nb_blocks;
-    block->src_index = ~0U; /* Filled in by the receipt of the block list */
-    block->nb_chunks = ram_chunk_index(host_addr, host_addr + length) + 1UL;
-    block->transit_bitmap = bitmap_new(block->nb_chunks);
-    bitmap_clear(block->transit_bitmap, 0, block->nb_chunks);
-    block->unregister_bitmap = bitmap_new(block->nb_chunks);
-    bitmap_clear(block->unregister_bitmap, 0, block->nb_chunks);
-    block->remote_keys = g_new0(uint32_t, block->nb_chunks);
-
-    block->is_ram_block = local->init ? false : true;
-
-    if (rdma->blockmap) {
-        g_hash_table_insert(rdma->blockmap, (void *)(uintptr_t)block_offset, block);
-    }
-
-    trace_rdma_add_block(block_name, local->nb_blocks,
-                         (uintptr_t) block->local_host_addr,
-                         block->offset, block->length,
-                         (uintptr_t) (block->local_host_addr + block->length),
-                         BITS_TO_LONGS(block->nb_chunks) *
-                             sizeof(unsigned long) * 8,
-                         block->nb_chunks);
-
-    local->nb_blocks++;
-}
-
-/*
- * Memory regions need to be registered with the device and queue pairs setup
- * in advanced before the migration starts. This tells us where the RAM blocks
- * are so that we can register them individually.
- */
-static int qemu_rdma_init_one_block(RAMBlock *rb, void *opaque)
-{
-    const char *block_name = qemu_ram_get_idstr(rb);
-    void *host_addr = qemu_ram_get_host_addr(rb);
-    ram_addr_t block_offset = qemu_ram_get_offset(rb);
-    ram_addr_t length = qemu_ram_get_used_length(rb);
-    rdma_add_block(opaque, block_name, host_addr, block_offset, length);
-    return 0;
-}
-
-/*
- * Identify the RAMBlocks and their quantity. They will be references to
- * identify chunk boundaries inside each RAMBlock and also be referenced
- * during dynamic page registration.
- */
-static void qemu_rdma_init_ram_blocks(RDMAContext *rdma)
-{
-    RDMALocalBlocks *local = &rdma->local_ram_blocks;
-    int ret;
-
-    assert(rdma->blockmap == NULL);
-    memset(local, 0, sizeof *local);
-    ret = foreach_not_ignored_block(qemu_rdma_init_one_block, rdma);
-    assert(!ret);
-    trace_qemu_rdma_init_ram_blocks(local->nb_blocks);
-    rdma->dest_blocks = g_new0(RDMADestBlock,
-                               rdma->local_ram_blocks.nb_blocks);
-    local->init = true;
-}
-
-/*
- * Note: If used outside of cleanup, the caller must ensure that the destination
- * block structures are also updated
- */
-static void rdma_delete_block(RDMAContext *rdma, RDMALocalBlock *block)
-{
-    RDMALocalBlocks *local = &rdma->local_ram_blocks;
-    RDMALocalBlock *old = local->block;
-
-    if (rdma->blockmap) {
-        g_hash_table_remove(rdma->blockmap, (void *)(uintptr_t)block->offset);
-    }
-    if (block->pmr) {
-        for (int j = 0; j < block->nb_chunks; j++) {
-            if (!block->pmr[j]) {
-                continue;
-            }
-            ibv_dereg_mr(block->pmr[j]);
-            rdma->total_registrations--;
-        }
-        g_free(block->pmr);
-        block->pmr = NULL;
-    }
-
-    if (block->mr) {
-        ibv_dereg_mr(block->mr);
-        rdma->total_registrations--;
-        block->mr = NULL;
-    }
-
-    g_free(block->transit_bitmap);
-    block->transit_bitmap = NULL;
-
-    g_free(block->unregister_bitmap);
-    block->unregister_bitmap = NULL;
-
-    g_free(block->remote_keys);
-    block->remote_keys = NULL;
-
-    g_free(block->block_name);
-    block->block_name = NULL;
-
-    if (rdma->blockmap) {
-        for (int x = 0; x < local->nb_blocks; x++) {
-            g_hash_table_remove(rdma->blockmap,
-                                (void *)(uintptr_t)old[x].offset);
-        }
-    }
-
-    if (local->nb_blocks > 1) {
-
-        local->block = g_new0(RDMALocalBlock, local->nb_blocks - 1);
-
-        if (block->index) {
-            memcpy(local->block, old, sizeof(RDMALocalBlock) * block->index);
-        }
-
-        if (block->index < (local->nb_blocks - 1)) {
-            memcpy(local->block + block->index, old + (block->index + 1),
-                sizeof(RDMALocalBlock) *
-                    (local->nb_blocks - (block->index + 1)));
-            for (int x = block->index; x < local->nb_blocks - 1; x++) {
-                local->block[x].index--;
-            }
-        }
-    } else {
-        assert(block == local->block);
-        local->block = NULL;
-    }
-
-    trace_rdma_delete_block(block, (uintptr_t)block->local_host_addr,
-                           block->offset, block->length,
-                            (uintptr_t)(block->local_host_addr + block->length),
-                           BITS_TO_LONGS(block->nb_chunks) *
-                               sizeof(unsigned long) * 8, block->nb_chunks);
-
-    g_free(old);
-
-    local->nb_blocks--;
-
-    if (local->nb_blocks && rdma->blockmap) {
-        for (int x = 0; x < local->nb_blocks; x++) {
-            g_hash_table_insert(rdma->blockmap,
-                                (void *)(uintptr_t)local->block[x].offset,
-                                &local->block[x]);
-        }
-    }
-}
-
-/*
- * Trace RDMA device open, with device details.
- */
-static void qemu_rdma_dump_id(const char *who, struct ibv_context *verbs)
-{
-    struct ibv_port_attr port;
-
-    if (ibv_query_port(verbs, 1, &port)) {
-        trace_qemu_rdma_dump_id_failed(who);
-        return;
-    }
-
-    trace_qemu_rdma_dump_id(who,
-                verbs->device->name,
-                verbs->device->dev_name,
-                verbs->device->dev_path,
-                verbs->device->ibdev_path,
-                port.link_layer,
-                port.link_layer == IBV_LINK_LAYER_INFINIBAND ? "Infiniband"
-                : port.link_layer == IBV_LINK_LAYER_ETHERNET ? "Ethernet"
-                : "Unknown");
-}
-
-/*
- * Trace RDMA gid addressing information.
- * Useful for understanding the RDMA device hierarchy in the kernel.
- */
-static void qemu_rdma_dump_gid(const char *who, struct rdma_cm_id *id)
-{
-    char sgid[33];
-    char dgid[33];
-    inet_ntop(AF_INET6, &id->route.addr.addr.ibaddr.sgid, sgid, sizeof sgid);
-    inet_ntop(AF_INET6, &id->route.addr.addr.ibaddr.dgid, dgid, sizeof dgid);
-    trace_qemu_rdma_dump_gid(who, sgid, dgid);
-}
-
-/*
- * As of now, IPv6 over RoCE / iWARP is not supported by linux.
- * We will try the next addrinfo struct, and fail if there are
- * no other valid addresses to bind against.
- *
- * If user is listening on '[::]', then we will not have a opened a device
- * yet and have no way of verifying if the device is RoCE or not.
- *
- * In this case, the source VM will throw an error for ALL types of
- * connections (both IPv4 and IPv6) if the destination machine does not have
- * a regular infiniband network available for use.
- *
- * The only way to guarantee that an error is thrown for broken kernels is
- * for the management software to choose a *specific* interface at bind time
- * and validate what time of hardware it is.
- *
- * Unfortunately, this puts the user in a fix:
- *
- *  If the source VM connects with an IPv4 address without knowing that the
- *  destination has bound to '[::]' the migration will unconditionally fail
- *  unless the management software is explicitly listening on the IPv4
- *  address while using a RoCE-based device.
- *
- *  If the source VM connects with an IPv6 address, then we're OK because we can
- *  throw an error on the source (and similarly on the destination).
- *
- *  But in mixed environments, this will be broken for a while until it is fixed
- *  inside linux.
- *
- * We do provide a *tiny* bit of help in this function: We can list all of the
- * devices in the system and check to see if all the devices are RoCE or
- * Infiniband.
- *
- * If we detect that we have a *pure* RoCE environment, then we can safely
- * thrown an error even if the management software has specified '[::]' as the
- * bind address.
- *
- * However, if there is are multiple hetergeneous devices, then we cannot make
- * this assumption and the user just has to be sure they know what they are
- * doing.
- *
- * Patches are being reviewed on linux-rdma.
- */
-static int qemu_rdma_broken_ipv6_kernel(struct ibv_context *verbs, Error **errp)
-{
-    /* This bug only exists in linux, to our knowledge. */
-#ifdef CONFIG_LINUX
-    struct ibv_port_attr port_attr;
-
-    /*
-     * Verbs are only NULL if management has bound to '[::]'.
-     *
-     * Let's iterate through all the devices and see if there any pure IB
-     * devices (non-ethernet).
-     *
-     * If not, then we can safely proceed with the migration.
-     * Otherwise, there are no guarantees until the bug is fixed in linux.
-     */
-    if (!verbs) {
-        int num_devices;
-        struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
-        bool roce_found = false;
-        bool ib_found = false;
-
-        for (int x = 0; x < num_devices; x++) {
-            verbs = ibv_open_device(dev_list[x]);
-            /*
-             * ibv_open_device() is not documented to set errno.  If
-             * it does, it's somebody else's doc bug.  If it doesn't,
-             * the use of errno below is wrong.
-             * TODO Find out whether ibv_open_device() sets errno.
-             */
-            if (!verbs) {
-                if (errno == EPERM) {
-                    continue;
-                } else {
-                    error_setg_errno(errp, errno,
-                                     "could not open RDMA device context");
-                    return -1;
-                }
-            }
-
-            if (ibv_query_port(verbs, 1, &port_attr)) {
-                ibv_close_device(verbs);
-                error_setg(errp,
-                           "RDMA ERROR: Could not query initial IB port");
-                return -1;
-            }
-
-            if (port_attr.link_layer == IBV_LINK_LAYER_INFINIBAND) {
-                ib_found = true;
-            } else if (port_attr.link_layer == IBV_LINK_LAYER_ETHERNET) {
-                roce_found = true;
-            }
-
-            ibv_close_device(verbs);
-
-        }
-
-        if (roce_found) {
-            if (ib_found) {
-                warn_report("migrations may fail:"
-                            " IPv6 over RoCE / iWARP in linux"
-                            " is broken. But since you appear to have a"
-                            " mixed RoCE / IB environment, be sure to only"
-                            " migrate over the IB fabric until the kernel "
-                            " fixes the bug.");
-            } else {
-                error_setg(errp, "RDMA ERROR: "
-                           "You only have RoCE / iWARP devices in your systems"
-                           " and your management software has specified '[::]'"
-                           ", but IPv6 over RoCE / iWARP is not supported in Linux.");
-                return -1;
-            }
-        }
-
-        return 0;
-    }
-
-    /*
-     * If we have a verbs context, that means that some other than '[::]' was
-     * used by the management software for binding. In which case we can
-     * actually warn the user about a potentially broken kernel.
-     */
-
-    /* IB ports start with 1, not 0 */
-    if (ibv_query_port(verbs, 1, &port_attr)) {
-        error_setg(errp, "RDMA ERROR: Could not query initial IB port");
-        return -1;
-    }
-
-    if (port_attr.link_layer == IBV_LINK_LAYER_ETHERNET) {
-        error_setg(errp, "RDMA ERROR: "
-                   "Linux kernel's RoCE / iWARP does not support IPv6 "
-                   "(but patches on linux-rdma in progress)");
-        return -1;
-    }
-
-#endif
-
-    return 0;
-}
-
-/*
- * Figure out which RDMA device corresponds to the requested IP hostname
- * Also create the initial connection manager identifiers for opening
- * the connection.
- */
-static int qemu_rdma_resolve_host(RDMAContext *rdma, Error **errp)
-{
-    Error *err = NULL;
-    int ret;
-    struct rdma_addrinfo *res;
-    char port_str[16];
-    struct rdma_cm_event *cm_event;
-    char ip[40] = "unknown";
-
-    if (rdma->host == NULL || !strcmp(rdma->host, "")) {
-        error_setg(errp, "RDMA ERROR: RDMA hostname has not been set");
-        return -1;
-    }
-
-    /* create CM channel */
-    rdma->channel = rdma_create_event_channel();
-    if (!rdma->channel) {
-        error_setg(errp, "RDMA ERROR: could not create CM channel");
-        return -1;
-    }
-
-    /* create CM id */
-    ret = rdma_create_id(rdma->channel, &rdma->cm_id, NULL, RDMA_PS_TCP);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: could not create channel id");
-        goto err_resolve_create_id;
-    }
-
-    snprintf(port_str, 16, "%d", rdma->port);
-    port_str[15] = '\0';
-
-    ret = rdma_getaddrinfo(rdma->host, port_str, NULL, &res);
-    if (ret) {
-        error_setg(errp, "RDMA ERROR: could not rdma_getaddrinfo address %s",
-                   rdma->host);
-        goto err_resolve_get_addr;
-    }
-
-    /* Try all addresses, saving the first error in @err */
-    for (struct rdma_addrinfo *e = res; e != NULL; e = e->ai_next) {
-        Error **local_errp = err ? NULL : &err;
-
-        inet_ntop(e->ai_family,
-            &((struct sockaddr_in *) e->ai_dst_addr)->sin_addr, ip, sizeof ip);
-        trace_qemu_rdma_resolve_host_trying(rdma->host, ip);
-
-        ret = rdma_resolve_addr(rdma->cm_id, NULL, e->ai_dst_addr,
-                RDMA_RESOLVE_TIMEOUT_MS);
-        if (ret >= 0) {
-            if (e->ai_family == AF_INET6) {
-                ret = qemu_rdma_broken_ipv6_kernel(rdma->cm_id->verbs,
-                                                   local_errp);
-                if (ret < 0) {
-                    continue;
-                }
-            }
-            error_free(err);
-            goto route;
-        }
-    }
-
-    rdma_freeaddrinfo(res);
-    if (err) {
-        error_propagate(errp, err);
-    } else {
-        error_setg(errp, "RDMA ERROR: could not resolve address %s",
-                   rdma->host);
-    }
-    goto err_resolve_get_addr;
-
-route:
-    rdma_freeaddrinfo(res);
-    qemu_rdma_dump_gid("source_resolve_addr", rdma->cm_id);
-
-    ret = rdma_get_cm_event(rdma->channel, &cm_event);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: could not perform event_addr_resolved");
-        goto err_resolve_get_addr;
-    }
-
-    if (cm_event->event != RDMA_CM_EVENT_ADDR_RESOLVED) {
-        error_setg(errp,
-                   "RDMA ERROR: result not equal to event_addr_resolved %s",
-                   rdma_event_str(cm_event->event));
-        rdma_ack_cm_event(cm_event);
-        goto err_resolve_get_addr;
-    }
-    rdma_ack_cm_event(cm_event);
-
-    /* resolve route */
-    ret = rdma_resolve_route(rdma->cm_id, RDMA_RESOLVE_TIMEOUT_MS);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: could not resolve rdma route");
-        goto err_resolve_get_addr;
-    }
-
-    ret = rdma_get_cm_event(rdma->channel, &cm_event);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: could not perform event_route_resolved");
-        goto err_resolve_get_addr;
-    }
-    if (cm_event->event != RDMA_CM_EVENT_ROUTE_RESOLVED) {
-        error_setg(errp, "RDMA ERROR: "
-                   "result not equal to event_route_resolved: %s",
-                   rdma_event_str(cm_event->event));
-        rdma_ack_cm_event(cm_event);
-        goto err_resolve_get_addr;
-    }
-    rdma_ack_cm_event(cm_event);
-    rdma->verbs = rdma->cm_id->verbs;
-    qemu_rdma_dump_id("source_resolve_host", rdma->cm_id->verbs);
-    qemu_rdma_dump_gid("source_resolve_host", rdma->cm_id);
-    return 0;
-
-err_resolve_get_addr:
-    rdma_destroy_id(rdma->cm_id);
-    rdma->cm_id = NULL;
-err_resolve_create_id:
-    rdma_destroy_event_channel(rdma->channel);
-    rdma->channel = NULL;
-    return -1;
-}
-
-/*
- * Create protection domain and completion queues
- */
-static int qemu_rdma_alloc_pd_cq(RDMAContext *rdma, Error **errp)
-{
-    /* allocate pd */
-    rdma->pd = ibv_alloc_pd(rdma->verbs);
-    if (!rdma->pd) {
-        error_setg(errp, "failed to allocate protection domain");
-        return -1;
-    }
-
-    /* create receive completion channel */
-    rdma->recv_comp_channel = ibv_create_comp_channel(rdma->verbs);
-    if (!rdma->recv_comp_channel) {
-        error_setg(errp, "failed to allocate receive completion channel");
-        goto err_alloc_pd_cq;
-    }
-
-    /*
-     * Completion queue can be filled by read work requests.
-     */
-    rdma->recv_cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
-                                  NULL, rdma->recv_comp_channel, 0);
-    if (!rdma->recv_cq) {
-        error_setg(errp, "failed to allocate receive completion queue");
-        goto err_alloc_pd_cq;
-    }
-
-    /* create send completion channel */
-    rdma->send_comp_channel = ibv_create_comp_channel(rdma->verbs);
-    if (!rdma->send_comp_channel) {
-        error_setg(errp, "failed to allocate send completion channel");
-        goto err_alloc_pd_cq;
-    }
-
-    rdma->send_cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
-                                  NULL, rdma->send_comp_channel, 0);
-    if (!rdma->send_cq) {
-        error_setg(errp, "failed to allocate send completion queue");
-        goto err_alloc_pd_cq;
-    }
-
-    return 0;
-
-err_alloc_pd_cq:
-    if (rdma->pd) {
-        ibv_dealloc_pd(rdma->pd);
-    }
-    if (rdma->recv_comp_channel) {
-        ibv_destroy_comp_channel(rdma->recv_comp_channel);
-    }
-    if (rdma->send_comp_channel) {
-        ibv_destroy_comp_channel(rdma->send_comp_channel);
-    }
-    if (rdma->recv_cq) {
-        ibv_destroy_cq(rdma->recv_cq);
-        rdma->recv_cq = NULL;
-    }
-    rdma->pd = NULL;
-    rdma->recv_comp_channel = NULL;
-    rdma->send_comp_channel = NULL;
-    return -1;
-
-}
-
-/*
- * Create queue pairs.
- */
-static int qemu_rdma_alloc_qp(RDMAContext *rdma)
-{
-    struct ibv_qp_init_attr attr = { 0 };
-
-    attr.cap.max_send_wr = RDMA_SIGNALED_SEND_MAX;
-    attr.cap.max_recv_wr = 3;
-    attr.cap.max_send_sge = 1;
-    attr.cap.max_recv_sge = 1;
-    attr.send_cq = rdma->send_cq;
-    attr.recv_cq = rdma->recv_cq;
-    attr.qp_type = IBV_QPT_RC;
-
-    if (rdma_create_qp(rdma->cm_id, rdma->pd, &attr) < 0) {
-        return -1;
-    }
-
-    rdma->qp = rdma->cm_id->qp;
-    return 0;
-}
-
-/* Check whether On-Demand Paging is supported by the RDMA device */
-static bool rdma_support_odp(struct ibv_context *dev)
-{
-    struct ibv_device_attr_ex attr = {0};
-
-    if (ibv_query_device_ex(dev, NULL, &attr)) {
-        return false;
-    }
-
-    if (attr.odp_caps.general_caps & IBV_ODP_SUPPORT) {
-        return true;
-    }
-
-    return false;
-}
-
-/*
- * ibv_advise_mr to avoid RNR NAK error as far as possible.
- * A responder mr registered with ODP will send RNR NAK back to
- * the requester in the face of a page fault.
- */
-static void qemu_rdma_advise_prefetch_mr(struct ibv_pd *pd, uint64_t addr,
-                                         uint32_t len,  uint32_t lkey,
-                                         const char *name, bool wr)
-{
-#ifdef HAVE_IBV_ADVISE_MR
-    int ret;
-    int advice = wr ? IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE :
-                 IBV_ADVISE_MR_ADVICE_PREFETCH;
-    struct ibv_sge sg_list = {.lkey = lkey, .addr = addr, .length = len};
-
-    ret = ibv_advise_mr(pd, advice,
-                        IBV_ADVISE_MR_FLAG_FLUSH, &sg_list, 1);
-    /* ignore the error */
-    trace_qemu_rdma_advise_mr(name, len, addr, strerror(ret));
-#endif
-}
-
-static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma, Error **errp)
-{
-    int i;
-    RDMALocalBlocks *local = &rdma->local_ram_blocks;
-
-    for (i = 0; i < local->nb_blocks; i++) {
-        int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE;
-
-        local->block[i].mr =
-            ibv_reg_mr(rdma->pd,
-                    local->block[i].local_host_addr,
-                    local->block[i].length, access
-                    );
-        /*
-         * ibv_reg_mr() is not documented to set errno.  If it does,
-         * it's somebody else's doc bug.  If it doesn't, the use of
-         * errno below is wrong.
-         * TODO Find out whether ibv_reg_mr() sets errno.
-         */
-        if (!local->block[i].mr &&
-            errno == ENOTSUP && rdma_support_odp(rdma->verbs)) {
-                access |= IBV_ACCESS_ON_DEMAND;
-                /* register ODP mr */
-                local->block[i].mr =
-                    ibv_reg_mr(rdma->pd,
-                               local->block[i].local_host_addr,
-                               local->block[i].length, access);
-                trace_qemu_rdma_register_odp_mr(local->block[i].block_name);
-
-                if (local->block[i].mr) {
-                    qemu_rdma_advise_prefetch_mr(rdma->pd,
-                                    (uintptr_t)local->block[i].local_host_addr,
-                                    local->block[i].length,
-                                    local->block[i].mr->lkey,
-                                    local->block[i].block_name,
-                                    true);
-                }
-        }
-
-        if (!local->block[i].mr) {
-            error_setg_errno(errp, errno,
-                             "Failed to register local dest ram block!");
-            goto err;
-        }
-        rdma->total_registrations++;
-    }
-
-    return 0;
-
-err:
-    for (i--; i >= 0; i--) {
-        ibv_dereg_mr(local->block[i].mr);
-        local->block[i].mr = NULL;
-        rdma->total_registrations--;
-    }
-
-    return -1;
-
-}
-
-/*
- * Find the ram block that corresponds to the page requested to be
- * transmitted by QEMU.
- *
- * Once the block is found, also identify which 'chunk' within that
- * block that the page belongs to.
- */
-static void qemu_rdma_search_ram_block(RDMAContext *rdma,
-                                       uintptr_t block_offset,
-                                       uint64_t offset,
-                                       uint64_t length,
-                                       uint64_t *block_index,
-                                       uint64_t *chunk_index)
-{
-    uint64_t current_addr = block_offset + offset;
-    RDMALocalBlock *block = g_hash_table_lookup(rdma->blockmap,
-                                                (void *) block_offset);
-    assert(block);
-    assert(current_addr >= block->offset);
-    assert((current_addr + length) <= (block->offset + block->length));
-
-    *block_index = block->index;
-    *chunk_index = ram_chunk_index(block->local_host_addr,
-                block->local_host_addr + (current_addr - block->offset));
-}
-
-/*
- * Register a chunk with IB. If the chunk was already registered
- * previously, then skip.
- *
- * Also return the keys associated with the registration needed
- * to perform the actual RDMA operation.
- */
-static int qemu_rdma_register_and_get_keys(RDMAContext *rdma,
-        RDMALocalBlock *block, uintptr_t host_addr,
-        uint32_t *lkey, uint32_t *rkey, int chunk,
-        uint8_t *chunk_start, uint8_t *chunk_end)
-{
-    if (block->mr) {
-        if (lkey) {
-            *lkey = block->mr->lkey;
-        }
-        if (rkey) {
-            *rkey = block->mr->rkey;
-        }
-        return 0;
-    }
-
-    /* allocate memory to store chunk MRs */
-    if (!block->pmr) {
-        block->pmr = g_new0(struct ibv_mr *, block->nb_chunks);
-    }
-
-    /*
-     * If 'rkey', then we're the destination, so grant access to the source.
-     *
-     * If 'lkey', then we're the source VM, so grant access only to ourselves.
-     */
-    if (!block->pmr[chunk]) {
-        uint64_t len = chunk_end - chunk_start;
-        int access = rkey ? IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE :
-                     0;
-
-        trace_qemu_rdma_register_and_get_keys(len, chunk_start);
-
-        block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access);
-        /*
-         * ibv_reg_mr() is not documented to set errno.  If it does,
-         * it's somebody else's doc bug.  If it doesn't, the use of
-         * errno below is wrong.
-         * TODO Find out whether ibv_reg_mr() sets errno.
-         */
-        if (!block->pmr[chunk] &&
-            errno == ENOTSUP && rdma_support_odp(rdma->verbs)) {
-            access |= IBV_ACCESS_ON_DEMAND;
-            /* register ODP mr */
-            block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access);
-            trace_qemu_rdma_register_odp_mr(block->block_name);
-
-            if (block->pmr[chunk]) {
-                qemu_rdma_advise_prefetch_mr(rdma->pd, (uintptr_t)chunk_start,
-                                            len, block->pmr[chunk]->lkey,
-                                            block->block_name, rkey);
-
-            }
-        }
-    }
-    if (!block->pmr[chunk]) {
-        return -1;
-    }
-    rdma->total_registrations++;
-
-    if (lkey) {
-        *lkey = block->pmr[chunk]->lkey;
-    }
-    if (rkey) {
-        *rkey = block->pmr[chunk]->rkey;
-    }
-    return 0;
-}
-
-/*
- * Register (at connection time) the memory used for control
- * channel messages.
- */
-static int qemu_rdma_reg_control(RDMAContext *rdma, int idx)
-{
-    rdma->wr_data[idx].control_mr = ibv_reg_mr(rdma->pd,
-            rdma->wr_data[idx].control, RDMA_CONTROL_MAX_BUFFER,
-            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
-    if (rdma->wr_data[idx].control_mr) {
-        rdma->total_registrations++;
-        return 0;
-    }
-    return -1;
-}
-
-/*
- * Perform a non-optimized memory unregistration after every transfer
- * for demonstration purposes, only if pin-all is not requested.
- *
- * Potential optimizations:
- * 1. Start a new thread to run this function continuously
-        - for bit clearing
-        - and for receipt of unregister messages
- * 2. Use an LRU.
- * 3. Use workload hints.
- */
-static int qemu_rdma_unregister_waiting(RDMAContext *rdma)
-{
-    Error *err = NULL;
-
-    while (rdma->unregistrations[rdma->unregister_current]) {
-        int ret;
-        uint64_t wr_id = rdma->unregistrations[rdma->unregister_current];
-        uint64_t chunk =
-            (wr_id & RDMA_WRID_CHUNK_MASK) >> RDMA_WRID_CHUNK_SHIFT;
-        uint64_t index =
-            (wr_id & RDMA_WRID_BLOCK_MASK) >> RDMA_WRID_BLOCK_SHIFT;
-        RDMALocalBlock *block =
-            &(rdma->local_ram_blocks.block[index]);
-        RDMARegister reg = { .current_index = index };
-        RDMAControlHeader resp = { .type = RDMA_CONTROL_UNREGISTER_FINISHED,
-                                 };
-        RDMAControlHeader head = { .len = sizeof(RDMARegister),
-                                   .type = RDMA_CONTROL_UNREGISTER_REQUEST,
-                                   .repeat = 1,
-                                 };
-
-        trace_qemu_rdma_unregister_waiting_proc(chunk,
-                                                rdma->unregister_current);
-
-        rdma->unregistrations[rdma->unregister_current] = 0;
-        rdma->unregister_current++;
-
-        if (rdma->unregister_current == RDMA_SIGNALED_SEND_MAX) {
-            rdma->unregister_current = 0;
-        }
-
-
-        /*
-         * Unregistration is speculative (because migration is single-threaded
-         * and we cannot break the protocol's infiniband message ordering).
-         * Thus, if the memory is currently being used for transmission,
-         * then abort the attempt to unregister and try again
-         * later the next time a completion is received for this memory.
-         */
-        clear_bit(chunk, block->unregister_bitmap);
-
-        if (test_bit(chunk, block->transit_bitmap)) {
-            trace_qemu_rdma_unregister_waiting_inflight(chunk);
-            continue;
-        }
-
-        trace_qemu_rdma_unregister_waiting_send(chunk);
-
-        ret = ibv_dereg_mr(block->pmr[chunk]);
-        block->pmr[chunk] = NULL;
-        block->remote_keys[chunk] = 0;
-
-        if (ret != 0) {
-            error_report("unregistration chunk failed: %s",
-                         strerror(ret));
-            return -1;
-        }
-        rdma->total_registrations--;
-
-        reg.key.chunk = chunk;
-        register_to_network(rdma, &reg);
-        ret = qemu_rdma_exchange_send(rdma, &head, (uint8_t *) &reg,
-                                      &resp, NULL, NULL, &err);
-        if (ret < 0) {
-            error_report_err(err);
-            return -1;
-        }
-
-        trace_qemu_rdma_unregister_waiting_complete(chunk);
-    }
-
-    return 0;
-}
-
-static uint64_t qemu_rdma_make_wrid(uint64_t wr_id, uint64_t index,
-                                         uint64_t chunk)
-{
-    uint64_t result = wr_id & RDMA_WRID_TYPE_MASK;
-
-    result |= (index << RDMA_WRID_BLOCK_SHIFT);
-    result |= (chunk << RDMA_WRID_CHUNK_SHIFT);
-
-    return result;
-}
-
-/*
- * Consult the connection manager to see if a work request
- * (of any kind) has completed.
- * Return the work request ID that completed.
- */
-static int qemu_rdma_poll(RDMAContext *rdma, struct ibv_cq *cq,
-                          uint64_t *wr_id_out, uint32_t *byte_len)
-{
-    int ret;
-    struct ibv_wc wc;
-    uint64_t wr_id;
-
-    ret = ibv_poll_cq(cq, 1, &wc);
-
-    if (!ret) {
-        *wr_id_out = RDMA_WRID_NONE;
-        return 0;
-    }
-
-    if (ret < 0) {
-        return -1;
-    }
-
-    wr_id = wc.wr_id & RDMA_WRID_TYPE_MASK;
-
-    if (wc.status != IBV_WC_SUCCESS) {
-        return -1;
-    }
-
-    if (rdma->control_ready_expected &&
-        (wr_id >= RDMA_WRID_RECV_CONTROL)) {
-        trace_qemu_rdma_poll_recv(wr_id - RDMA_WRID_RECV_CONTROL, wr_id,
-                                  rdma->nb_sent);
-        rdma->control_ready_expected = 0;
-    }
-
-    if (wr_id == RDMA_WRID_RDMA_WRITE) {
-        uint64_t chunk =
-            (wc.wr_id & RDMA_WRID_CHUNK_MASK) >> RDMA_WRID_CHUNK_SHIFT;
-        uint64_t index =
-            (wc.wr_id & RDMA_WRID_BLOCK_MASK) >> RDMA_WRID_BLOCK_SHIFT;
-        RDMALocalBlock *block = &(rdma->local_ram_blocks.block[index]);
-
-        trace_qemu_rdma_poll_write(wr_id, rdma->nb_sent,
-                                   index, chunk, block->local_host_addr,
-                                   (void *)(uintptr_t)block->remote_host_addr);
-
-        clear_bit(chunk, block->transit_bitmap);
-
-        if (rdma->nb_sent > 0) {
-            rdma->nb_sent--;
-        }
-    } else {
-        trace_qemu_rdma_poll_other(wr_id, rdma->nb_sent);
-    }
-
-    *wr_id_out = wc.wr_id;
-    if (byte_len) {
-        *byte_len = wc.byte_len;
-    }
-
-    return  0;
-}
-
-/* Wait for activity on the completion channel.
- * Returns 0 on success, non-zero on error.
- */
-static int qemu_rdma_wait_comp_channel(RDMAContext *rdma,
-                                       struct ibv_comp_channel *comp_channel)
-{
-    struct rdma_cm_event *cm_event;
-
-    /*
-     * Coroutine doesn't start until migration_fd_process_incoming()
-     * so don't yield unless we know we're running inside of a coroutine.
-     */
-    if (rdma->migration_started_on_destination &&
-        migration_incoming_get_current()->state == MIGRATION_STATUS_ACTIVE) {
-        yield_until_fd_readable(comp_channel->fd);
-    } else {
-        /* This is the source side, we're in a separate thread
-         * or destination prior to migration_fd_process_incoming()
-         * after postcopy, the destination also in a separate thread.
-         * we can't yield; so we have to poll the fd.
-         * But we need to be able to handle 'cancel' or an error
-         * without hanging forever.
-         */
-        while (!rdma->errored && !rdma->received_error) {
-            GPollFD pfds[2];
-            pfds[0].fd = comp_channel->fd;
-            pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
-            pfds[0].revents = 0;
-
-            pfds[1].fd = rdma->channel->fd;
-            pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
-            pfds[1].revents = 0;
-
-            /* 0.1s timeout, should be fine for a 'cancel' */
-            switch (qemu_poll_ns(pfds, 2, 100 * 1000 * 1000)) {
-            case 2:
-            case 1: /* fd active */
-                if (pfds[0].revents) {
-                    return 0;
-                }
-
-                if (pfds[1].revents) {
-                    if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
-                        return -1;
-                    }
-
-                    if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
-                        cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
-                        rdma_ack_cm_event(cm_event);
-                        return -1;
-                    }
-                    rdma_ack_cm_event(cm_event);
-                }
-                break;
-
-            case 0: /* Timeout, go around again */
-                break;
-
-            default: /* Error of some type -
-                      * I don't trust errno from qemu_poll_ns
-                     */
-                return -1;
-            }
-
-            if (migrate_get_current()->state == MIGRATION_STATUS_CANCELLING) {
-                /* Bail out and let the cancellation happen */
-                return -1;
-            }
-        }
-    }
-
-    if (rdma->received_error) {
-        return -1;
-    }
-    return -rdma->errored;
-}
-
-static struct ibv_comp_channel *to_channel(RDMAContext *rdma, uint64_t wrid)
-{
-    return wrid < RDMA_WRID_RECV_CONTROL ? rdma->send_comp_channel :
-           rdma->recv_comp_channel;
-}
-
-static struct ibv_cq *to_cq(RDMAContext *rdma, uint64_t wrid)
-{
-    return wrid < RDMA_WRID_RECV_CONTROL ? rdma->send_cq : rdma->recv_cq;
-}
-
-/*
- * Block until the next work request has completed.
- *
- * First poll to see if a work request has already completed,
- * otherwise block.
- *
- * If we encounter completed work requests for IDs other than
- * the one we're interested in, then that's generally an error.
- *
- * The only exception is actual RDMA Write completions. These
- * completions only need to be recorded, but do not actually
- * need further processing.
- */
-static int qemu_rdma_block_for_wrid(RDMAContext *rdma,
-                                    uint64_t wrid_requested,
-                                    uint32_t *byte_len)
-{
-    int num_cq_events = 0, ret;
-    struct ibv_cq *cq;
-    void *cq_ctx;
-    uint64_t wr_id = RDMA_WRID_NONE, wr_id_in;
-    struct ibv_comp_channel *ch = to_channel(rdma, wrid_requested);
-    struct ibv_cq *poll_cq = to_cq(rdma, wrid_requested);
-
-    if (ibv_req_notify_cq(poll_cq, 0)) {
-        return -1;
-    }
-    /* poll cq first */
-    while (wr_id != wrid_requested) {
-        ret = qemu_rdma_poll(rdma, poll_cq, &wr_id_in, byte_len);
-        if (ret < 0) {
-            return -1;
-        }
-
-        wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
-
-        if (wr_id == RDMA_WRID_NONE) {
-            break;
-        }
-        if (wr_id != wrid_requested) {
-            trace_qemu_rdma_block_for_wrid_miss(wrid_requested, wr_id);
-        }
-    }
-
-    if (wr_id == wrid_requested) {
-        return 0;
-    }
-
-    while (1) {
-        ret = qemu_rdma_wait_comp_channel(rdma, ch);
-        if (ret < 0) {
-            goto err_block_for_wrid;
-        }
-
-        ret = ibv_get_cq_event(ch, &cq, &cq_ctx);
-        if (ret < 0) {
-            goto err_block_for_wrid;
-        }
-
-        num_cq_events++;
-
-        if (ibv_req_notify_cq(cq, 0)) {
-            goto err_block_for_wrid;
-        }
-
-        while (wr_id != wrid_requested) {
-            ret = qemu_rdma_poll(rdma, poll_cq, &wr_id_in, byte_len);
-            if (ret < 0) {
-                goto err_block_for_wrid;
-            }
-
-            wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
-
-            if (wr_id == RDMA_WRID_NONE) {
-                break;
-            }
-            if (wr_id != wrid_requested) {
-                trace_qemu_rdma_block_for_wrid_miss(wrid_requested, wr_id);
-            }
-        }
-
-        if (wr_id == wrid_requested) {
-            goto success_block_for_wrid;
-        }
-    }
-
-success_block_for_wrid:
-    if (num_cq_events) {
-        ibv_ack_cq_events(cq, num_cq_events);
-    }
-    return 0;
-
-err_block_for_wrid:
-    if (num_cq_events) {
-        ibv_ack_cq_events(cq, num_cq_events);
-    }
-
-    rdma->errored = true;
-    return -1;
-}
-
-/*
- * Post a SEND message work request for the control channel
- * containing some data and block until the post completes.
- */
-static int qemu_rdma_post_send_control(RDMAContext *rdma, uint8_t *buf,
-                                       RDMAControlHeader *head,
-                                       Error **errp)
-{
-    int ret;
-    RDMAWorkRequestData *wr = &rdma->wr_data[RDMA_WRID_CONTROL];
-    struct ibv_send_wr *bad_wr;
-    struct ibv_sge sge = {
-                           .addr = (uintptr_t)(wr->control),
-                           .length = head->len + sizeof(RDMAControlHeader),
-                           .lkey = wr->control_mr->lkey,
-                         };
-    struct ibv_send_wr send_wr = {
-                                   .wr_id = RDMA_WRID_SEND_CONTROL,
-                                   .opcode = IBV_WR_SEND,
-                                   .send_flags = IBV_SEND_SIGNALED,
-                                   .sg_list = &sge,
-                                   .num_sge = 1,
-                                };
-
-    trace_qemu_rdma_post_send_control(control_desc(head->type));
-
-    /*
-     * We don't actually need to do a memcpy() in here if we used
-     * the "sge" properly, but since we're only sending control messages
-     * (not RAM in a performance-critical path), then it's OK for now.
-     *
-     * The copy makes the RDMAControlHeader simpler to manipulate
-     * for the time being.
-     */
-    assert(head->len <= RDMA_CONTROL_MAX_BUFFER - sizeof(*head));
-    memcpy(wr->control, head, sizeof(RDMAControlHeader));
-    control_to_network((void *) wr->control);
-
-    if (buf) {
-        memcpy(wr->control + sizeof(RDMAControlHeader), buf, head->len);
-    }
-
-
-    ret = ibv_post_send(rdma->qp, &send_wr, &bad_wr);
-
-    if (ret > 0) {
-        error_setg(errp, "Failed to use post IB SEND for control");
-        return -1;
-    }
-
-    ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL);
-    if (ret < 0) {
-        error_setg(errp, "rdma migration: send polling control error");
-        return -1;
-    }
-
-    return 0;
-}
-
-/*
- * Post a RECV work request in anticipation of some future receipt
- * of data on the control channel.
- */
-static int qemu_rdma_post_recv_control(RDMAContext *rdma, int idx,
-                                       Error **errp)
-{
-    struct ibv_recv_wr *bad_wr;
-    struct ibv_sge sge = {
-                            .addr = (uintptr_t)(rdma->wr_data[idx].control),
-                            .length = RDMA_CONTROL_MAX_BUFFER,
-                            .lkey = rdma->wr_data[idx].control_mr->lkey,
-                         };
-
-    struct ibv_recv_wr recv_wr = {
-                                    .wr_id = RDMA_WRID_RECV_CONTROL + idx,
-                                    .sg_list = &sge,
-                                    .num_sge = 1,
-                                 };
-
-
-    if (ibv_post_recv(rdma->qp, &recv_wr, &bad_wr)) {
-        error_setg(errp, "error posting control recv");
-        return -1;
-    }
-
-    return 0;
-}
-
-/*
- * Block and wait for a RECV control channel message to arrive.
- */
-static int qemu_rdma_exchange_get_response(RDMAContext *rdma,
-                RDMAControlHeader *head, uint32_t expecting, int idx,
-                Error **errp)
-{
-    uint32_t byte_len;
-    int ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RECV_CONTROL + idx,
-                                       &byte_len);
-
-    if (ret < 0) {
-        error_setg(errp, "rdma migration: recv polling control error!");
-        return -1;
-    }
-
-    network_to_control((void *) rdma->wr_data[idx].control);
-    memcpy(head, rdma->wr_data[idx].control, sizeof(RDMAControlHeader));
-
-    trace_qemu_rdma_exchange_get_response_start(control_desc(expecting));
-
-    if (expecting == RDMA_CONTROL_NONE) {
-        trace_qemu_rdma_exchange_get_response_none(control_desc(head->type),
-                                             head->type);
-    } else if (head->type != expecting || head->type == RDMA_CONTROL_ERROR) {
-        error_setg(errp, "Was expecting a %s (%d) control message"
-                ", but got: %s (%d), length: %d",
-                control_desc(expecting), expecting,
-                control_desc(head->type), head->type, head->len);
-        if (head->type == RDMA_CONTROL_ERROR) {
-            rdma->received_error = true;
-        }
-        return -1;
-    }
-    if (head->len > RDMA_CONTROL_MAX_BUFFER - sizeof(*head)) {
-        error_setg(errp, "too long length: %d", head->len);
-        return -1;
-    }
-    if (sizeof(*head) + head->len != byte_len) {
-        error_setg(errp, "Malformed length: %d byte_len %d",
-                   head->len, byte_len);
-        return -1;
-    }
-
-    return 0;
-}
-
-/*
- * When a RECV work request has completed, the work request's
- * buffer is pointed at the header.
- *
- * This will advance the pointer to the data portion
- * of the control message of the work request's buffer that
- * was populated after the work request finished.
- */
-static void qemu_rdma_move_header(RDMAContext *rdma, int idx,
-                                  RDMAControlHeader *head)
-{
-    rdma->wr_data[idx].control_len = head->len;
-    rdma->wr_data[idx].control_curr =
-        rdma->wr_data[idx].control + sizeof(RDMAControlHeader);
-}
-
-/*
- * This is an 'atomic' high-level operation to deliver a single, unified
- * control-channel message.
- *
- * Additionally, if the user is expecting some kind of reply to this message,
- * they can request a 'resp' response message be filled in by posting an
- * additional work request on behalf of the user and waiting for an additional
- * completion.
- *
- * The extra (optional) response is used during registration to save us from
- * having to perform an *additional* exchange of messages just to provide a
- * response, by instead piggy-backing on the acknowledgement.
- */
-static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head,
-                                   uint8_t *data, RDMAControlHeader *resp,
-                                   int *resp_idx,
-                                   int (*callback)(RDMAContext *rdma,
-                                                   Error **errp),
-                                   Error **errp)
-{
-    int ret;
-
-    /*
-     * Wait until the dest is ready before attempting to deliver the message
-     * by waiting for a READY message.
-     */
-    if (rdma->control_ready_expected) {
-        RDMAControlHeader resp_ignored;
-
-        ret = qemu_rdma_exchange_get_response(rdma, &resp_ignored,
-                                              RDMA_CONTROL_READY,
-                                              RDMA_WRID_READY, errp);
-        if (ret < 0) {
-            return -1;
-        }
-    }
-
-    /*
-     * If the user is expecting a response, post a WR in anticipation of it.
-     */
-    if (resp) {
-        ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_DATA, errp);
-        if (ret < 0) {
-            return -1;
-        }
-    }
-
-    /*
-     * Post a WR to replace the one we just consumed for the READY message.
-     */
-    ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_READY, errp);
-    if (ret < 0) {
-        return -1;
-    }
-
-    /*
-     * Deliver the control message that was requested.
-     */
-    ret = qemu_rdma_post_send_control(rdma, data, head, errp);
-
-    if (ret < 0) {
-        return -1;
-    }
-
-    /*
-     * If we're expecting a response, block and wait for it.
-     */
-    if (resp) {
-        if (callback) {
-            trace_qemu_rdma_exchange_send_issue_callback();
-            ret = callback(rdma, errp);
-            if (ret < 0) {
-                return -1;
-            }
-        }
-
-        trace_qemu_rdma_exchange_send_waiting(control_desc(resp->type));
-        ret = qemu_rdma_exchange_get_response(rdma, resp,
-                                              resp->type, RDMA_WRID_DATA,
-                                              errp);
-
-        if (ret < 0) {
-            return -1;
-        }
-
-        qemu_rdma_move_header(rdma, RDMA_WRID_DATA, resp);
-        if (resp_idx) {
-            *resp_idx = RDMA_WRID_DATA;
-        }
-        trace_qemu_rdma_exchange_send_received(control_desc(resp->type));
-    }
-
-    rdma->control_ready_expected = 1;
-
-    return 0;
-}
-
-/*
- * This is an 'atomic' high-level operation to receive a single, unified
- * control-channel message.
- */
-static int qemu_rdma_exchange_recv(RDMAContext *rdma, RDMAControlHeader *head,
-                                   uint32_t expecting, Error **errp)
-{
-    RDMAControlHeader ready = {
-                                .len = 0,
-                                .type = RDMA_CONTROL_READY,
-                                .repeat = 1,
-                              };
-    int ret;
-
-    /*
-     * Inform the source that we're ready to receive a message.
-     */
-    ret = qemu_rdma_post_send_control(rdma, NULL, &ready, errp);
-
-    if (ret < 0) {
-        return -1;
-    }
-
-    /*
-     * Block and wait for the message.
-     */
-    ret = qemu_rdma_exchange_get_response(rdma, head,
-                                          expecting, RDMA_WRID_READY, errp);
-
-    if (ret < 0) {
-        return -1;
-    }
-
-    qemu_rdma_move_header(rdma, RDMA_WRID_READY, head);
-
-    /*
-     * Post a new RECV work request to replace the one we just consumed.
-     */
-    ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_READY, errp);
-    if (ret < 0) {
-        return -1;
-    }
-
-    return 0;
-}
-
-/*
- * Write an actual chunk of memory using RDMA.
- *
- * If we're using dynamic registration on the dest-side, we have to
- * send a registration command first.
- */
-static int qemu_rdma_write_one(RDMAContext *rdma,
-                               int current_index, uint64_t current_addr,
-                               uint64_t length, Error **errp)
-{
-    struct ibv_sge sge;
-    struct ibv_send_wr send_wr = { 0 };
-    struct ibv_send_wr *bad_wr;
-    int reg_result_idx, ret, count = 0;
-    uint64_t chunk, chunks;
-    uint8_t *chunk_start, *chunk_end;
-    RDMALocalBlock *block = &(rdma->local_ram_blocks.block[current_index]);
-    RDMARegister reg;
-    RDMARegisterResult *reg_result;
-    RDMAControlHeader resp = { .type = RDMA_CONTROL_REGISTER_RESULT };
-    RDMAControlHeader head = { .len = sizeof(RDMARegister),
-                               .type = RDMA_CONTROL_REGISTER_REQUEST,
-                               .repeat = 1,
-                             };
-
-retry:
-    sge.addr = (uintptr_t)(block->local_host_addr +
-                            (current_addr - block->offset));
-    sge.length = length;
-
-    chunk = ram_chunk_index(block->local_host_addr,
-                            (uint8_t *)(uintptr_t)sge.addr);
-    chunk_start = ram_chunk_start(block, chunk);
-
-    if (block->is_ram_block) {
-        chunks = length / (1UL << RDMA_REG_CHUNK_SHIFT);
-
-        if (chunks && ((length % (1UL << RDMA_REG_CHUNK_SHIFT)) == 0)) {
-            chunks--;
-        }
-    } else {
-        chunks = block->length / (1UL << RDMA_REG_CHUNK_SHIFT);
-
-        if (chunks && ((block->length % (1UL << RDMA_REG_CHUNK_SHIFT)) == 0)) {
-            chunks--;
-        }
-    }
-
-    trace_qemu_rdma_write_one_top(chunks + 1,
-                                  (chunks + 1) *
-                                  (1UL << RDMA_REG_CHUNK_SHIFT) / 1024 / 1024);
-
-    chunk_end = ram_chunk_end(block, chunk + chunks);
-
-
-    while (test_bit(chunk, block->transit_bitmap)) {
-        (void)count;
-        trace_qemu_rdma_write_one_block(count++, current_index, chunk,
-                sge.addr, length, rdma->nb_sent, block->nb_chunks);
-
-        ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RDMA_WRITE, NULL);
-
-        if (ret < 0) {
-            error_setg(errp, "Failed to Wait for previous write to complete "
-                    "block %d chunk %" PRIu64
-                    " current %" PRIu64 " len %" PRIu64 " %d",
-                    current_index, chunk, sge.addr, length, rdma->nb_sent);
-            return -1;
-        }
-    }
-
-    if (!rdma->pin_all || !block->is_ram_block) {
-        if (!block->remote_keys[chunk]) {
-            /*
-             * This chunk has not yet been registered, so first check to see
-             * if the entire chunk is zero. If so, tell the other size to
-             * memset() + madvise() the entire chunk without RDMA.
-             */
-
-            if (buffer_is_zero((void *)(uintptr_t)sge.addr, length)) {
-                RDMACompress comp = {
-                                        .offset = current_addr,
-                                        .value = 0,
-                                        .block_idx = current_index,
-                                        .length = length,
-                                    };
-
-                head.len = sizeof(comp);
-                head.type = RDMA_CONTROL_COMPRESS;
-
-                trace_qemu_rdma_write_one_zero(chunk, sge.length,
-                                               current_index, current_addr);
-
-                compress_to_network(rdma, &comp);
-                ret = qemu_rdma_exchange_send(rdma, &head,
-                                (uint8_t *) &comp, NULL, NULL, NULL, errp);
-
-                if (ret < 0) {
-                    return -1;
-                }
-
-                /*
-                 * TODO: Here we are sending something, but we are not
-                 * accounting for anything transferred.  The following is wrong:
-                 *
-                 * stat64_add(&mig_stats.rdma_bytes, sge.length);
-                 *
-                 * because we are using some kind of compression.  I
-                 * would think that head.len would be the more similar
-                 * thing to a correct value.
-                 */
-                stat64_add(&mig_stats.zero_pages,
-                           sge.length / qemu_target_page_size());
-                return 1;
-            }
-
-            /*
-             * Otherwise, tell other side to register.
-             */
-            reg.current_index = current_index;
-            if (block->is_ram_block) {
-                reg.key.current_addr = current_addr;
-            } else {
-                reg.key.chunk = chunk;
-            }
-            reg.chunks = chunks;
-
-            trace_qemu_rdma_write_one_sendreg(chunk, sge.length, current_index,
-                                              current_addr);
-
-            register_to_network(rdma, &reg);
-            ret = qemu_rdma_exchange_send(rdma, &head, (uint8_t *) &reg,
-                                    &resp, &reg_result_idx, NULL, errp);
-            if (ret < 0) {
-                return -1;
-            }
-
-            /* try to overlap this single registration with the one we sent. */
-            if (qemu_rdma_register_and_get_keys(rdma, block, sge.addr,
-                                                &sge.lkey, NULL, chunk,
-                                                chunk_start, chunk_end)) {
-                error_setg(errp, "cannot get lkey");
-                return -1;
-            }
-
-            reg_result = (RDMARegisterResult *)
-                    rdma->wr_data[reg_result_idx].control_curr;
-
-            network_to_result(reg_result);
-
-            trace_qemu_rdma_write_one_recvregres(block->remote_keys[chunk],
-                                                 reg_result->rkey, chunk);
-
-            block->remote_keys[chunk] = reg_result->rkey;
-            block->remote_host_addr = reg_result->host_addr;
-        } else {
-            /* already registered before */
-            if (qemu_rdma_register_and_get_keys(rdma, block, sge.addr,
-                                                &sge.lkey, NULL, chunk,
-                                                chunk_start, chunk_end)) {
-                error_setg(errp, "cannot get lkey!");
-                return -1;
-            }
-        }
-
-        send_wr.wr.rdma.rkey = block->remote_keys[chunk];
-    } else {
-        send_wr.wr.rdma.rkey = block->remote_rkey;
-
-        if (qemu_rdma_register_and_get_keys(rdma, block, sge.addr,
-                                                     &sge.lkey, NULL, chunk,
-                                                     chunk_start, chunk_end)) {
-            error_setg(errp, "cannot get lkey!");
-            return -1;
-        }
-    }
-
-    /*
-     * Encode the ram block index and chunk within this wrid.
-     * We will use this information at the time of completion
-     * to figure out which bitmap to check against and then which
-     * chunk in the bitmap to look for.
-     */
-    send_wr.wr_id = qemu_rdma_make_wrid(RDMA_WRID_RDMA_WRITE,
-                                        current_index, chunk);
-
-    send_wr.opcode = IBV_WR_RDMA_WRITE;
-    send_wr.send_flags = IBV_SEND_SIGNALED;
-    send_wr.sg_list = &sge;
-    send_wr.num_sge = 1;
-    send_wr.wr.rdma.remote_addr = block->remote_host_addr +
-                                (current_addr - block->offset);
-
-    trace_qemu_rdma_write_one_post(chunk, sge.addr, send_wr.wr.rdma.remote_addr,
-                                   sge.length);
-
-    /*
-     * ibv_post_send() does not return negative error numbers,
-     * per the specification they are positive - no idea why.
-     */
-    ret = ibv_post_send(rdma->qp, &send_wr, &bad_wr);
-
-    if (ret == ENOMEM) {
-        trace_qemu_rdma_write_one_queue_full();
-        ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RDMA_WRITE, NULL);
-        if (ret < 0) {
-            error_setg(errp, "rdma migration: failed to make "
-                         "room in full send queue!");
-            return -1;
-        }
-
-        goto retry;
-
-    } else if (ret > 0) {
-        error_setg_errno(errp, ret,
-                         "rdma migration: post rdma write failed");
-        return -1;
-    }
-
-    set_bit(chunk, block->transit_bitmap);
-    stat64_add(&mig_stats.normal_pages, sge.length / qemu_target_page_size());
-    /*
-     * We are adding to transferred the amount of data written, but no
-     * overhead at all.  I will assume that RDMA is magicaly and don't
-     * need to transfer (at least) the addresses where it wants to
-     * write the pages.  Here it looks like it should be something
-     * like:
-     *     sizeof(send_wr) + sge.length
-     * but this being RDMA, who knows.
-     */
-    stat64_add(&mig_stats.rdma_bytes, sge.length);
-    ram_transferred_add(sge.length);
-    rdma->total_writes++;
-
-    return 0;
-}
-
-/*
- * Push out any unwritten RDMA operations.
- *
- * We support sending out multiple chunks at the same time.
- * Not all of them need to get signaled in the completion queue.
- */
-static int qemu_rdma_write_flush(RDMAContext *rdma, Error **errp)
-{
-    int ret;
-
-    if (!rdma->current_length) {
-        return 0;
-    }
-
-    ret = qemu_rdma_write_one(rdma, rdma->current_index, rdma->current_addr,
-                              rdma->current_length, errp);
-
-    if (ret < 0) {
-        return -1;
-    }
-
-    if (ret == 0) {
-        rdma->nb_sent++;
-        trace_qemu_rdma_write_flush(rdma->nb_sent);
-    }
-
-    rdma->current_length = 0;
-    rdma->current_addr = 0;
-
-    return 0;
-}
-
-static inline bool qemu_rdma_buffer_mergeable(RDMAContext *rdma,
-                    uint64_t offset, uint64_t len)
-{
-    RDMALocalBlock *block;
-    uint8_t *host_addr;
-    uint8_t *chunk_end;
-
-    if (rdma->current_index < 0) {
-        return false;
-    }
-
-    if (rdma->current_chunk < 0) {
-        return false;
-    }
-
-    block = &(rdma->local_ram_blocks.block[rdma->current_index]);
-    host_addr = block->local_host_addr + (offset - block->offset);
-    chunk_end = ram_chunk_end(block, rdma->current_chunk);
-
-    if (rdma->current_length == 0) {
-        return false;
-    }
-
-    /*
-     * Only merge into chunk sequentially.
-     */
-    if (offset != (rdma->current_addr + rdma->current_length)) {
-        return false;
-    }
-
-    if (offset < block->offset) {
-        return false;
-    }
-
-    if ((offset + len) > (block->offset + block->length)) {
-        return false;
-    }
-
-    if ((host_addr + len) > chunk_end) {
-        return false;
-    }
-
-    return true;
-}
-
-/*
- * We're not actually writing here, but doing three things:
- *
- * 1. Identify the chunk the buffer belongs to.
- * 2. If the chunk is full or the buffer doesn't belong to the current
- *    chunk, then start a new chunk and flush() the old chunk.
- * 3. To keep the hardware busy, we also group chunks into batches
- *    and only require that a batch gets acknowledged in the completion
- *    queue instead of each individual chunk.
- */
-static int qemu_rdma_write(RDMAContext *rdma,
-                           uint64_t block_offset, uint64_t offset,
-                           uint64_t len, Error **errp)
-{
-    uint64_t current_addr = block_offset + offset;
-    uint64_t index = rdma->current_index;
-    uint64_t chunk = rdma->current_chunk;
-
-    /* If we cannot merge it, we flush the current buffer first. */
-    if (!qemu_rdma_buffer_mergeable(rdma, current_addr, len)) {
-        if (qemu_rdma_write_flush(rdma, errp) < 0) {
-            return -1;
-        }
-        rdma->current_length = 0;
-        rdma->current_addr = current_addr;
-
-        qemu_rdma_search_ram_block(rdma, block_offset,
-                                   offset, len, &index, &chunk);
-        rdma->current_index = index;
-        rdma->current_chunk = chunk;
-    }
-
-    /* merge it */
-    rdma->current_length += len;
-
-    /* flush it if buffer is too large */
-    if (rdma->current_length >= RDMA_MERGE_MAX) {
-        return qemu_rdma_write_flush(rdma, errp);
-    }
-
-    return 0;
-}
-
-static void qemu_rdma_cleanup(RDMAContext *rdma)
-{
-    Error *err = NULL;
-
-    if (rdma->cm_id && rdma->connected) {
-        if ((rdma->errored ||
-             migrate_get_current()->state == MIGRATION_STATUS_CANCELLING) &&
-            !rdma->received_error) {
-            RDMAControlHeader head = { .len = 0,
-                                       .type = RDMA_CONTROL_ERROR,
-                                       .repeat = 1,
-                                     };
-            warn_report("Early error. Sending error.");
-            if (qemu_rdma_post_send_control(rdma, NULL, &head, &err) < 0) {
-                warn_report_err(err);
-            }
-        }
-
-        rdma_disconnect(rdma->cm_id);
-        trace_qemu_rdma_cleanup_disconnect();
-        rdma->connected = false;
-    }
-
-    if (rdma->channel) {
-        qemu_set_fd_handler(rdma->channel->fd, NULL, NULL, NULL);
-    }
-    g_free(rdma->dest_blocks);
-    rdma->dest_blocks = NULL;
-
-    for (int i = 0; i < RDMA_WRID_MAX; i++) {
-        if (rdma->wr_data[i].control_mr) {
-            rdma->total_registrations--;
-            ibv_dereg_mr(rdma->wr_data[i].control_mr);
-        }
-        rdma->wr_data[i].control_mr = NULL;
-    }
-
-    if (rdma->local_ram_blocks.block) {
-        while (rdma->local_ram_blocks.nb_blocks) {
-            rdma_delete_block(rdma, &rdma->local_ram_blocks.block[0]);
-        }
-    }
-
-    if (rdma->qp) {
-        rdma_destroy_qp(rdma->cm_id);
-        rdma->qp = NULL;
-    }
-    if (rdma->recv_cq) {
-        ibv_destroy_cq(rdma->recv_cq);
-        rdma->recv_cq = NULL;
-    }
-    if (rdma->send_cq) {
-        ibv_destroy_cq(rdma->send_cq);
-        rdma->send_cq = NULL;
-    }
-    if (rdma->recv_comp_channel) {
-        ibv_destroy_comp_channel(rdma->recv_comp_channel);
-        rdma->recv_comp_channel = NULL;
-    }
-    if (rdma->send_comp_channel) {
-        ibv_destroy_comp_channel(rdma->send_comp_channel);
-        rdma->send_comp_channel = NULL;
-    }
-    if (rdma->pd) {
-        ibv_dealloc_pd(rdma->pd);
-        rdma->pd = NULL;
-    }
-    if (rdma->cm_id) {
-        rdma_destroy_id(rdma->cm_id);
-        rdma->cm_id = NULL;
-    }
-
-    /* the destination side, listen_id and channel is shared */
-    if (rdma->listen_id) {
-        if (!rdma->is_return_path) {
-            rdma_destroy_id(rdma->listen_id);
-        }
-        rdma->listen_id = NULL;
-
-        if (rdma->channel) {
-            if (!rdma->is_return_path) {
-                rdma_destroy_event_channel(rdma->channel);
-            }
-            rdma->channel = NULL;
-        }
-    }
-
-    if (rdma->channel) {
-        rdma_destroy_event_channel(rdma->channel);
-        rdma->channel = NULL;
-    }
-    g_free(rdma->host);
-    rdma->host = NULL;
-}
-
-
-static int qemu_rdma_source_init(RDMAContext *rdma, bool pin_all, Error **errp)
-{
-    int ret;
-
-    /*
-     * Will be validated against destination's actual capabilities
-     * after the connect() completes.
-     */
-    rdma->pin_all = pin_all;
-
-    ret = qemu_rdma_resolve_host(rdma, errp);
-    if (ret < 0) {
-        goto err_rdma_source_init;
-    }
-
-    ret = qemu_rdma_alloc_pd_cq(rdma, errp);
-    if (ret < 0) {
-        goto err_rdma_source_init;
-    }
-
-    ret = qemu_rdma_alloc_qp(rdma);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: rdma migration: error allocating qp!");
-        goto err_rdma_source_init;
-    }
-
-    qemu_rdma_init_ram_blocks(rdma);
-
-    /* Build the hash that maps from offset to RAMBlock */
-    rdma->blockmap = g_hash_table_new(g_direct_hash, g_direct_equal);
-    for (int i = 0; i < rdma->local_ram_blocks.nb_blocks; i++) {
-        g_hash_table_insert(rdma->blockmap,
-                (void *)(uintptr_t)rdma->local_ram_blocks.block[i].offset,
-                &rdma->local_ram_blocks.block[i]);
-    }
-
-    for (int i = 0; i < RDMA_WRID_MAX; i++) {
-        ret = qemu_rdma_reg_control(rdma, i);
-        if (ret < 0) {
-            error_setg(errp, "RDMA ERROR: rdma migration: error "
-                       "registering %d control!", i);
-            goto err_rdma_source_init;
-        }
-    }
-
-    return 0;
-
-err_rdma_source_init:
-    qemu_rdma_cleanup(rdma);
-    return -1;
-}
-
-static int qemu_get_cm_event_timeout(RDMAContext *rdma,
-                                     struct rdma_cm_event **cm_event,
-                                     long msec, Error **errp)
-{
-    int ret;
-    struct pollfd poll_fd = {
-                                .fd = rdma->channel->fd,
-                                .events = POLLIN,
-                                .revents = 0
-                            };
-
-    do {
-        ret = poll(&poll_fd, 1, msec);
-    } while (ret < 0 && errno == EINTR);
-
-    if (ret == 0) {
-        error_setg(errp, "RDMA ERROR: poll cm event timeout");
-        return -1;
-    } else if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: failed to poll cm event, errno=%i",
-                   errno);
-        return -1;
-    } else if (poll_fd.revents & POLLIN) {
-        if (rdma_get_cm_event(rdma->channel, cm_event) < 0) {
-            error_setg(errp, "RDMA ERROR: failed to get cm event");
-            return -1;
-        }
-        return 0;
-    } else {
-        error_setg(errp, "RDMA ERROR: no POLLIN event, revent=%x",
-                   poll_fd.revents);
-        return -1;
-    }
-}
-
-static int qemu_rdma_connect(RDMAContext *rdma, bool return_path,
-                             Error **errp)
-{
-    RDMACapabilities cap = {
-                                .version = RDMA_CONTROL_VERSION_CURRENT,
-                                .flags = 0,
-                           };
-    struct rdma_conn_param conn_param = { .initiator_depth = 2,
-                                          .retry_count = 5,
-                                          .private_data = &cap,
-                                          .private_data_len = sizeof(cap),
-                                        };
-    struct rdma_cm_event *cm_event;
-    int ret;
-
-    /*
-     * Only negotiate the capability with destination if the user
-     * on the source first requested the capability.
-     */
-    if (rdma->pin_all) {
-        trace_qemu_rdma_connect_pin_all_requested();
-        cap.flags |= RDMA_CAPABILITY_PIN_ALL;
-    }
-
-    caps_to_network(&cap);
-
-    ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_READY, errp);
-    if (ret < 0) {
-        goto err_rdma_source_connect;
-    }
-
-    ret = rdma_connect(rdma->cm_id, &conn_param);
-    if (ret < 0) {
-        error_setg_errno(errp, errno,
-                         "RDMA ERROR: connecting to destination!");
-        goto err_rdma_source_connect;
-    }
-
-    if (return_path) {
-        ret = qemu_get_cm_event_timeout(rdma, &cm_event, 5000, errp);
-    } else {
-        ret = rdma_get_cm_event(rdma->channel, &cm_event);
-        if (ret < 0) {
-            error_setg_errno(errp, errno,
-                             "RDMA ERROR: failed to get cm event");
-        }
-    }
-    if (ret < 0) {
-        goto err_rdma_source_connect;
-    }
-
-    if (cm_event->event != RDMA_CM_EVENT_ESTABLISHED) {
-        error_setg(errp, "RDMA ERROR: connecting to destination!");
-        rdma_ack_cm_event(cm_event);
-        goto err_rdma_source_connect;
-    }
-    rdma->connected = true;
-
-    memcpy(&cap, cm_event->param.conn.private_data, sizeof(cap));
-    network_to_caps(&cap);
-
-    /*
-     * Verify that the *requested* capabilities are supported by the destination
-     * and disable them otherwise.
-     */
-    if (rdma->pin_all && !(cap.flags & RDMA_CAPABILITY_PIN_ALL)) {
-        warn_report("RDMA: Server cannot support pinning all memory. "
-                    "Will register memory dynamically.");
-        rdma->pin_all = false;
-    }
-
-    trace_qemu_rdma_connect_pin_all_outcome(rdma->pin_all);
-
-    rdma_ack_cm_event(cm_event);
-
-    rdma->control_ready_expected = 1;
-    rdma->nb_sent = 0;
-    return 0;
-
-err_rdma_source_connect:
-    qemu_rdma_cleanup(rdma);
-    return -1;
-}
-
-static int qemu_rdma_dest_init(RDMAContext *rdma, Error **errp)
-{
-    Error *err = NULL;
-    int ret;
-    struct rdma_cm_id *listen_id;
-    char ip[40] = "unknown";
-    struct rdma_addrinfo *res, *e;
-    char port_str[16];
-    int reuse = 1;
-
-    for (int i = 0; i < RDMA_WRID_MAX; i++) {
-        rdma->wr_data[i].control_len = 0;
-        rdma->wr_data[i].control_curr = NULL;
-    }
-
-    if (!rdma->host || !rdma->host[0]) {
-        error_setg(errp, "RDMA ERROR: RDMA host is not set!");
-        rdma->errored = true;
-        return -1;
-    }
-    /* create CM channel */
-    rdma->channel = rdma_create_event_channel();
-    if (!rdma->channel) {
-        error_setg(errp, "RDMA ERROR: could not create rdma event channel");
-        rdma->errored = true;
-        return -1;
-    }
-
-    /* create CM id */
-    ret = rdma_create_id(rdma->channel, &listen_id, NULL, RDMA_PS_TCP);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: could not create cm_id!");
-        goto err_dest_init_create_listen_id;
-    }
-
-    snprintf(port_str, 16, "%d", rdma->port);
-    port_str[15] = '\0';
-
-    ret = rdma_getaddrinfo(rdma->host, port_str, NULL, &res);
-    if (ret) {
-        error_setg(errp, "RDMA ERROR: could not rdma_getaddrinfo address %s",
-                   rdma->host);
-        goto err_dest_init_bind_addr;
-    }
-
-    ret = rdma_set_option(listen_id, RDMA_OPTION_ID, RDMA_OPTION_ID_REUSEADDR,
-                          &reuse, sizeof reuse);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: Error: could not set REUSEADDR option");
-        goto err_dest_init_bind_addr;
-    }
-
-    /* Try all addresses, saving the first error in @err */
-    for (e = res; e != NULL; e = e->ai_next) {
-        Error **local_errp = err ? NULL : &err;
-
-        inet_ntop(e->ai_family,
-            &((struct sockaddr_in *) e->ai_dst_addr)->sin_addr, ip, sizeof ip);
-        trace_qemu_rdma_dest_init_trying(rdma->host, ip);
-        ret = rdma_bind_addr(listen_id, e->ai_dst_addr);
-        if (ret < 0) {
-            continue;
-        }
-        if (e->ai_family == AF_INET6) {
-            ret = qemu_rdma_broken_ipv6_kernel(listen_id->verbs,
-                                               local_errp);
-            if (ret < 0) {
-                continue;
-            }
-        }
-        error_free(err);
-        break;
-    }
-
-    rdma_freeaddrinfo(res);
-    if (!e) {
-        if (err) {
-            error_propagate(errp, err);
-        } else {
-            error_setg(errp, "RDMA ERROR: Error: could not rdma_bind_addr!");
-        }
-        goto err_dest_init_bind_addr;
-    }
-
-    rdma->listen_id = listen_id;
-    qemu_rdma_dump_gid("dest_init", listen_id);
-    return 0;
-
-err_dest_init_bind_addr:
-    rdma_destroy_id(listen_id);
-err_dest_init_create_listen_id:
-    rdma_destroy_event_channel(rdma->channel);
-    rdma->channel = NULL;
-    rdma->errored = true;
-    return -1;
-
-}
-
-static void qemu_rdma_return_path_dest_init(RDMAContext *rdma_return_path,
-                                            RDMAContext *rdma)
-{
-    for (int i = 0; i < RDMA_WRID_MAX; i++) {
-        rdma_return_path->wr_data[i].control_len = 0;
-        rdma_return_path->wr_data[i].control_curr = NULL;
-    }
-
-    /*the CM channel and CM id is shared*/
-    rdma_return_path->channel = rdma->channel;
-    rdma_return_path->listen_id = rdma->listen_id;
-
-    rdma->return_path = rdma_return_path;
-    rdma_return_path->return_path = rdma;
-    rdma_return_path->is_return_path = true;
-}
-
-static RDMAContext *qemu_rdma_data_init(InetSocketAddress *saddr, Error **errp)
-{
-    RDMAContext *rdma = NULL;
-
-    rdma = g_new0(RDMAContext, 1);
-    rdma->current_index = -1;
-    rdma->current_chunk = -1;
-
-    rdma->host = g_strdup(saddr->host);
-    rdma->port = atoi(saddr->port);
-    return rdma;
-}
-
-/*
- * QEMUFile interface to the control channel.
- * SEND messages for control only.
- * VM's ram is handled with regular RDMA messages.
- */
-static ssize_t qio_channel_rdma_writev(QIOChannel *ioc,
-                                       const struct iovec *iov,
-                                       size_t niov,
-                                       int *fds,
-                                       size_t nfds,
-                                       int flags,
-                                       Error **errp)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    RDMAContext *rdma;
-    int ret;
-    ssize_t done = 0;
-    size_t len;
-
-    RCU_READ_LOCK_GUARD();
-    rdma = qatomic_rcu_read(&rioc->rdmaout);
-
-    if (!rdma) {
-        error_setg(errp, "RDMA control channel output is not set");
-        return -1;
-    }
-
-    if (rdma->errored) {
-        error_setg(errp,
-                   "RDMA is in an error state waiting migration to abort!");
-        return -1;
-    }
-
-    /*
-     * Push out any writes that
-     * we're queued up for VM's ram.
-     */
-    ret = qemu_rdma_write_flush(rdma, errp);
-    if (ret < 0) {
-        rdma->errored = true;
-        return -1;
-    }
-
-    for (int i = 0; i < niov; i++) {
-        size_t remaining = iov[i].iov_len;
-        uint8_t * data = (void *)iov[i].iov_base;
-        while (remaining) {
-            RDMAControlHeader head = {};
-
-            len = MIN(remaining, RDMA_SEND_INCREMENT);
-            remaining -= len;
-
-            head.len = len;
-            head.type = RDMA_CONTROL_QEMU_FILE;
-
-            ret = qemu_rdma_exchange_send(rdma, &head,
-                                          data, NULL, NULL, NULL, errp);
-
-            if (ret < 0) {
-                rdma->errored = true;
-                return -1;
-            }
-
-            data += len;
-            done += len;
-        }
-    }
-
-    return done;
-}
-
-static size_t qemu_rdma_fill(RDMAContext *rdma, uint8_t *buf,
-                             size_t size, int idx)
-{
-    size_t len = 0;
-
-    if (rdma->wr_data[idx].control_len) {
-        trace_qemu_rdma_fill(rdma->wr_data[idx].control_len, size);
-
-        len = MIN(size, rdma->wr_data[idx].control_len);
-        memcpy(buf, rdma->wr_data[idx].control_curr, len);
-        rdma->wr_data[idx].control_curr += len;
-        rdma->wr_data[idx].control_len -= len;
-    }
-
-    return len;
-}
-
-/*
- * QEMUFile interface to the control channel.
- * RDMA links don't use bytestreams, so we have to
- * return bytes to QEMUFile opportunistically.
- */
-static ssize_t qio_channel_rdma_readv(QIOChannel *ioc,
-                                      const struct iovec *iov,
-                                      size_t niov,
-                                      int **fds,
-                                      size_t *nfds,
-                                      int flags,
-                                      Error **errp)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    RDMAContext *rdma;
-    RDMAControlHeader head;
-    int ret;
-    ssize_t done = 0;
-    size_t len;
-
-    RCU_READ_LOCK_GUARD();
-    rdma = qatomic_rcu_read(&rioc->rdmain);
-
-    if (!rdma) {
-        error_setg(errp, "RDMA control channel input is not set");
-        return -1;
-    }
-
-    if (rdma->errored) {
-        error_setg(errp,
-                   "RDMA is in an error state waiting migration to abort!");
-        return -1;
-    }
-
-    for (int i = 0; i < niov; i++) {
-        size_t want = iov[i].iov_len;
-        uint8_t *data = (void *)iov[i].iov_base;
-
-        /*
-         * First, we hold on to the last SEND message we
-         * were given and dish out the bytes until we run
-         * out of bytes.
-         */
-        len = qemu_rdma_fill(rdma, data, want, 0);
-        done += len;
-        want -= len;
-        /* Got what we needed, so go to next iovec */
-        if (want == 0) {
-            continue;
-        }
-
-        /* If we got any data so far, then don't wait
-         * for more, just return what we have */
-        if (done > 0) {
-            break;
-        }
-
-
-        /* We've got nothing at all, so lets wait for
-         * more to arrive
-         */
-        ret = qemu_rdma_exchange_recv(rdma, &head, RDMA_CONTROL_QEMU_FILE,
-                                      errp);
-
-        if (ret < 0) {
-            rdma->errored = true;
-            return -1;
-        }
-
-        /*
-         * SEND was received with new bytes, now try again.
-         */
-        len = qemu_rdma_fill(rdma, data, want, 0);
-        done += len;
-        want -= len;
-
-        /* Still didn't get enough, so lets just return */
-        if (want) {
-            if (done == 0) {
-                return QIO_CHANNEL_ERR_BLOCK;
-            } else {
-                break;
-            }
-        }
-    }
-    return done;
-}
-
-/*
- * Block until all the outstanding chunks have been delivered by the hardware.
- */
-static int qemu_rdma_drain_cq(RDMAContext *rdma)
-{
-    Error *err = NULL;
-
-    if (qemu_rdma_write_flush(rdma, &err) < 0) {
-        error_report_err(err);
-        return -1;
-    }
-
-    while (rdma->nb_sent) {
-        if (qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RDMA_WRITE, NULL) < 0) {
-            error_report("rdma migration: complete polling error!");
-            return -1;
-        }
-    }
-
-    qemu_rdma_unregister_waiting(rdma);
-
-    return 0;
-}
-
-
-static int qio_channel_rdma_set_blocking(QIOChannel *ioc,
-                                         bool blocking,
-                                         Error **errp)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    /* XXX we should make readv/writev actually honour this :-) */
-    rioc->blocking = blocking;
-    return 0;
-}
-
-
-typedef struct QIOChannelRDMASource QIOChannelRDMASource;
-struct QIOChannelRDMASource {
-    GSource parent;
-    QIOChannelRDMA *rioc;
-    GIOCondition condition;
-};
-
-static gboolean
-qio_channel_rdma_source_prepare(GSource *source,
-                                gint *timeout)
-{
-    QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
-    RDMAContext *rdma;
-    GIOCondition cond = 0;
-    *timeout = -1;
-
-    RCU_READ_LOCK_GUARD();
-    if (rsource->condition == G_IO_IN) {
-        rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
-    } else {
-        rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
-    }
-
-    if (!rdma) {
-        error_report("RDMAContext is NULL when prepare Gsource");
-        return FALSE;
-    }
-
-    if (rdma->wr_data[0].control_len) {
-        cond |= G_IO_IN;
-    }
-    cond |= G_IO_OUT;
-
-    return cond & rsource->condition;
-}
-
-static gboolean
-qio_channel_rdma_source_check(GSource *source)
-{
-    QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
-    RDMAContext *rdma;
-    GIOCondition cond = 0;
-
-    RCU_READ_LOCK_GUARD();
-    if (rsource->condition == G_IO_IN) {
-        rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
-    } else {
-        rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
-    }
-
-    if (!rdma) {
-        error_report("RDMAContext is NULL when check Gsource");
-        return FALSE;
-    }
-
-    if (rdma->wr_data[0].control_len) {
-        cond |= G_IO_IN;
-    }
-    cond |= G_IO_OUT;
-
-    return cond & rsource->condition;
-}
-
-static gboolean
-qio_channel_rdma_source_dispatch(GSource *source,
-                                 GSourceFunc callback,
-                                 gpointer user_data)
-{
-    QIOChannelFunc func = (QIOChannelFunc)callback;
-    QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
-    RDMAContext *rdma;
-    GIOCondition cond = 0;
-
-    RCU_READ_LOCK_GUARD();
-    if (rsource->condition == G_IO_IN) {
-        rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
-    } else {
-        rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
-    }
-
-    if (!rdma) {
-        error_report("RDMAContext is NULL when dispatch Gsource");
-        return FALSE;
-    }
-
-    if (rdma->wr_data[0].control_len) {
-        cond |= G_IO_IN;
-    }
-    cond |= G_IO_OUT;
-
-    return (*func)(QIO_CHANNEL(rsource->rioc),
-                   (cond & rsource->condition),
-                   user_data);
-}
-
-static void
-qio_channel_rdma_source_finalize(GSource *source)
-{
-    QIOChannelRDMASource *ssource = (QIOChannelRDMASource *)source;
-
-    object_unref(OBJECT(ssource->rioc));
-}
-
-static GSourceFuncs qio_channel_rdma_source_funcs = {
-    qio_channel_rdma_source_prepare,
-    qio_channel_rdma_source_check,
-    qio_channel_rdma_source_dispatch,
-    qio_channel_rdma_source_finalize
-};
-
-static GSource *qio_channel_rdma_create_watch(QIOChannel *ioc,
-                                              GIOCondition condition)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    QIOChannelRDMASource *ssource;
-    GSource *source;
-
-    source = g_source_new(&qio_channel_rdma_source_funcs,
-                          sizeof(QIOChannelRDMASource));
-    ssource = (QIOChannelRDMASource *)source;
-
-    ssource->rioc = rioc;
-    object_ref(OBJECT(rioc));
-
-    ssource->condition = condition;
-
-    return source;
-}
-
-static void qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc,
-                                                AioContext *read_ctx,
-                                                IOHandler *io_read,
-                                                AioContext *write_ctx,
-                                                IOHandler *io_write,
-                                                void *opaque)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    if (io_read) {
-        aio_set_fd_handler(read_ctx, rioc->rdmain->recv_comp_channel->fd,
-                           io_read, io_write, NULL, NULL, opaque);
-        aio_set_fd_handler(read_ctx, rioc->rdmain->send_comp_channel->fd,
-                           io_read, io_write, NULL, NULL, opaque);
-    } else {
-        aio_set_fd_handler(write_ctx, rioc->rdmaout->recv_comp_channel->fd,
-                           io_read, io_write, NULL, NULL, opaque);
-        aio_set_fd_handler(write_ctx, rioc->rdmaout->send_comp_channel->fd,
-                           io_read, io_write, NULL, NULL, opaque);
-    }
-}
-
-struct rdma_close_rcu {
-    struct rcu_head rcu;
-    RDMAContext *rdmain;
-    RDMAContext *rdmaout;
-};
-
-/* callback from qio_channel_rdma_close via call_rcu */
-static void qio_channel_rdma_close_rcu(struct rdma_close_rcu *rcu)
-{
-    if (rcu->rdmain) {
-        qemu_rdma_cleanup(rcu->rdmain);
-    }
-
-    if (rcu->rdmaout) {
-        qemu_rdma_cleanup(rcu->rdmaout);
-    }
-
-    g_free(rcu->rdmain);
-    g_free(rcu->rdmaout);
-    g_free(rcu);
-}
-
-static int qio_channel_rdma_close(QIOChannel *ioc,
-                                  Error **errp)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    RDMAContext *rdmain, *rdmaout;
-    struct rdma_close_rcu *rcu = g_new(struct rdma_close_rcu, 1);
-
-    trace_qemu_rdma_close();
-
-    rdmain = rioc->rdmain;
-    if (rdmain) {
-        qatomic_rcu_set(&rioc->rdmain, NULL);
-    }
-
-    rdmaout = rioc->rdmaout;
-    if (rdmaout) {
-        qatomic_rcu_set(&rioc->rdmaout, NULL);
-    }
-
-    rcu->rdmain = rdmain;
-    rcu->rdmaout = rdmaout;
-    call_rcu(rcu, qio_channel_rdma_close_rcu, rcu);
-
-    return 0;
-}
-
-static int
-qio_channel_rdma_shutdown(QIOChannel *ioc,
-                            QIOChannelShutdown how,
-                            Error **errp)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    RDMAContext *rdmain, *rdmaout;
-
-    RCU_READ_LOCK_GUARD();
-
-    rdmain = qatomic_rcu_read(&rioc->rdmain);
-    rdmaout = qatomic_rcu_read(&rioc->rdmain);
-
-    switch (how) {
-    case QIO_CHANNEL_SHUTDOWN_READ:
-        if (rdmain) {
-            rdmain->errored = true;
-        }
-        break;
-    case QIO_CHANNEL_SHUTDOWN_WRITE:
-        if (rdmaout) {
-            rdmaout->errored = true;
-        }
-        break;
-    case QIO_CHANNEL_SHUTDOWN_BOTH:
-    default:
-        if (rdmain) {
-            rdmain->errored = true;
-        }
-        if (rdmaout) {
-            rdmaout->errored = true;
-        }
-        break;
-    }
-
-    return 0;
-}
-
-/*
- * Parameters:
- *    @offset == 0 :
- *        This means that 'block_offset' is a full virtual address that does not
- *        belong to a RAMBlock of the virtual machine and instead
- *        represents a private malloc'd memory area that the caller wishes to
- *        transfer.
- *
- *    @offset != 0 :
- *        Offset is an offset to be added to block_offset and used
- *        to also lookup the corresponding RAMBlock.
- *
- *    @size : Number of bytes to transfer
- *
- *    @pages_sent : User-specificed pointer to indicate how many pages were
- *                  sent. Usually, this will not be more than a few bytes of
- *                  the protocol because most transfers are sent asynchronously.
- */
-static int qemu_rdma_save_page(QEMUFile *f, ram_addr_t block_offset,
-                               ram_addr_t offset, size_t size)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(qemu_file_get_ioc(f));
-    Error *err = NULL;
-    RDMAContext *rdma;
-    int ret;
-
-    RCU_READ_LOCK_GUARD();
-    rdma = qatomic_rcu_read(&rioc->rdmaout);
-
-    if (!rdma) {
-        return -1;
-    }
-
-    if (rdma_errored(rdma)) {
-        return -1;
-    }
-
-    qemu_fflush(f);
-
-    /*
-     * Add this page to the current 'chunk'. If the chunk
-     * is full, or the page doesn't belong to the current chunk,
-     * an actual RDMA write will occur and a new chunk will be formed.
-     */
-    ret = qemu_rdma_write(rdma, block_offset, offset, size, &err);
-    if (ret < 0) {
-        error_report_err(err);
-        goto err;
-    }
-
-    /*
-     * Drain the Completion Queue if possible, but do not block,
-     * just poll.
-     *
-     * If nothing to poll, the end of the iteration will do this
-     * again to make sure we don't overflow the request queue.
-     */
-    while (1) {
-        uint64_t wr_id, wr_id_in;
-        ret = qemu_rdma_poll(rdma, rdma->recv_cq, &wr_id_in, NULL);
-
-        if (ret < 0) {
-            error_report("rdma migration: polling error");
-            goto err;
-        }
-
-        wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
-
-        if (wr_id == RDMA_WRID_NONE) {
-            break;
-        }
-    }
-
-    while (1) {
-        uint64_t wr_id, wr_id_in;
-        ret = qemu_rdma_poll(rdma, rdma->send_cq, &wr_id_in, NULL);
-
-        if (ret < 0) {
-            error_report("rdma migration: polling error");
-            goto err;
-        }
-
-        wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
-
-        if (wr_id == RDMA_WRID_NONE) {
-            break;
-        }
-    }
-
-    return RAM_SAVE_CONTROL_DELAYED;
-
-err:
-    rdma->errored = true;
-    return -1;
-}
-
-int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
-                           ram_addr_t offset, size_t size)
-{
-    if (!migrate_rdma() || migration_in_postcopy()) {
-        return RAM_SAVE_CONTROL_NOT_SUPP;
-    }
-
-    int ret = qemu_rdma_save_page(f, block_offset, offset, size);
-
-    if (ret != RAM_SAVE_CONTROL_DELAYED &&
-        ret != RAM_SAVE_CONTROL_NOT_SUPP) {
-        if (ret < 0) {
-            qemu_file_set_error(f, ret);
-        }
-    }
-    return ret;
-}
-
-static void rdma_accept_incoming_migration(void *opaque);
-
-static void rdma_cm_poll_handler(void *opaque)
-{
-    RDMAContext *rdma = opaque;
-    struct rdma_cm_event *cm_event;
-    MigrationIncomingState *mis = migration_incoming_get_current();
-
-    if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
-        error_report("get_cm_event failed %d", errno);
-        return;
-    }
-
-    if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
-        cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
-        if (!rdma->errored &&
-            migration_incoming_get_current()->state !=
-              MIGRATION_STATUS_COMPLETED) {
-            error_report("receive cm event, cm event is %d", cm_event->event);
-            rdma->errored = true;
-            if (rdma->return_path) {
-                rdma->return_path->errored = true;
-            }
-        }
-        rdma_ack_cm_event(cm_event);
-        if (mis->loadvm_co) {
-            qemu_coroutine_enter(mis->loadvm_co);
-        }
-        return;
-    }
-    rdma_ack_cm_event(cm_event);
-}
-
-static int qemu_rdma_accept(RDMAContext *rdma)
-{
-    Error *err = NULL;
-    RDMACapabilities cap;
-    struct rdma_conn_param conn_param = {
-                                            .responder_resources = 2,
-                                            .private_data = &cap,
-                                            .private_data_len = sizeof(cap),
-                                         };
-    RDMAContext *rdma_return_path = NULL;
-    g_autoptr(InetSocketAddress) isock = g_new0(InetSocketAddress, 1);
-    struct rdma_cm_event *cm_event;
-    struct ibv_context *verbs;
-    int ret;
-
-    ret = rdma_get_cm_event(rdma->channel, &cm_event);
-    if (ret < 0) {
-        goto err_rdma_dest_wait;
-    }
-
-    if (cm_event->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
-        rdma_ack_cm_event(cm_event);
-        goto err_rdma_dest_wait;
-    }
-
-    isock->host = g_strdup(rdma->host);
-    isock->port = g_strdup_printf("%d", rdma->port);
-
-    /*
-     * initialize the RDMAContext for return path for postcopy after first
-     * connection request reached.
-     */
-    if ((migrate_postcopy() || migrate_return_path())
-        && !rdma->is_return_path) {
-        rdma_return_path = qemu_rdma_data_init(isock, NULL);
-        if (rdma_return_path == NULL) {
-            rdma_ack_cm_event(cm_event);
-            goto err_rdma_dest_wait;
-        }
-
-        qemu_rdma_return_path_dest_init(rdma_return_path, rdma);
-    }
-
-    memcpy(&cap, cm_event->param.conn.private_data, sizeof(cap));
-
-    network_to_caps(&cap);
-
-    if (cap.version < 1 || cap.version > RDMA_CONTROL_VERSION_CURRENT) {
-        error_report("Unknown source RDMA version: %d, bailing...",
-                     cap.version);
-        rdma_ack_cm_event(cm_event);
-        goto err_rdma_dest_wait;
-    }
-
-    /*
-     * Respond with only the capabilities this version of QEMU knows about.
-     */
-    cap.flags &= known_capabilities;
-
-    /*
-     * Enable the ones that we do know about.
-     * Add other checks here as new ones are introduced.
-     */
-    if (cap.flags & RDMA_CAPABILITY_PIN_ALL) {
-        rdma->pin_all = true;
-    }
-
-    rdma->cm_id = cm_event->id;
-    verbs = cm_event->id->verbs;
-
-    rdma_ack_cm_event(cm_event);
-
-    trace_qemu_rdma_accept_pin_state(rdma->pin_all);
-
-    caps_to_network(&cap);
-
-    trace_qemu_rdma_accept_pin_verbsc(verbs);
-
-    if (!rdma->verbs) {
-        rdma->verbs = verbs;
-    } else if (rdma->verbs != verbs) {
-        error_report("ibv context not matching %p, %p!", rdma->verbs,
-                     verbs);
-        goto err_rdma_dest_wait;
-    }
-
-    qemu_rdma_dump_id("dest_init", verbs);
-
-    ret = qemu_rdma_alloc_pd_cq(rdma, &err);
-    if (ret < 0) {
-        error_report_err(err);
-        goto err_rdma_dest_wait;
-    }
-
-    ret = qemu_rdma_alloc_qp(rdma);
-    if (ret < 0) {
-        error_report("rdma migration: error allocating qp!");
-        goto err_rdma_dest_wait;
-    }
-
-    qemu_rdma_init_ram_blocks(rdma);
-
-    for (int i = 0; i < RDMA_WRID_MAX; i++) {
-        ret = qemu_rdma_reg_control(rdma, i);
-        if (ret < 0) {
-            error_report("rdma: error registering %d control", i);
-            goto err_rdma_dest_wait;
-        }
-    }
-
-    /* Accept the second connection request for return path */
-    if ((migrate_postcopy() || migrate_return_path())
-        && !rdma->is_return_path) {
-        qemu_set_fd_handler(rdma->channel->fd, rdma_accept_incoming_migration,
-                            NULL,
-                            (void *)(intptr_t)rdma->return_path);
-    } else {
-        qemu_set_fd_handler(rdma->channel->fd, rdma_cm_poll_handler,
-                            NULL, rdma);
-    }
-
-    ret = rdma_accept(rdma->cm_id, &conn_param);
-    if (ret < 0) {
-        error_report("rdma_accept failed");
-        goto err_rdma_dest_wait;
-    }
-
-    ret = rdma_get_cm_event(rdma->channel, &cm_event);
-    if (ret < 0) {
-        error_report("rdma_accept get_cm_event failed");
-        goto err_rdma_dest_wait;
-    }
-
-    if (cm_event->event != RDMA_CM_EVENT_ESTABLISHED) {
-        error_report("rdma_accept not event established");
-        rdma_ack_cm_event(cm_event);
-        goto err_rdma_dest_wait;
-    }
-
-    rdma_ack_cm_event(cm_event);
-    rdma->connected = true;
-
-    ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_READY, &err);
-    if (ret < 0) {
-        error_report_err(err);
-        goto err_rdma_dest_wait;
-    }
-
-    qemu_rdma_dump_gid("dest_connect", rdma->cm_id);
-
-    return 0;
-
-err_rdma_dest_wait:
-    rdma->errored = true;
-    qemu_rdma_cleanup(rdma);
-    g_free(rdma_return_path);
-    return -1;
-}
-
-static int dest_ram_sort_func(const void *a, const void *b)
-{
-    unsigned int a_index = ((const RDMALocalBlock *)a)->src_index;
-    unsigned int b_index = ((const RDMALocalBlock *)b)->src_index;
-
-    return (a_index < b_index) ? -1 : (a_index != b_index);
-}
-
-/*
- * During each iteration of the migration, we listen for instructions
- * by the source VM to perform dynamic page registrations before they
- * can perform RDMA operations.
- *
- * We respond with the 'rkey'.
- *
- * Keep doing this until the source tells us to stop.
- */
-int rdma_registration_handle(QEMUFile *f)
-{
-    RDMAControlHeader reg_resp = { .len = sizeof(RDMARegisterResult),
-                               .type = RDMA_CONTROL_REGISTER_RESULT,
-                               .repeat = 0,
-                             };
-    RDMAControlHeader unreg_resp = { .len = 0,
-                               .type = RDMA_CONTROL_UNREGISTER_FINISHED,
-                               .repeat = 0,
-                             };
-    RDMAControlHeader blocks = { .type = RDMA_CONTROL_RAM_BLOCKS_RESULT,
-                                 .repeat = 1 };
-    QIOChannelRDMA *rioc;
-    Error *err = NULL;
-    RDMAContext *rdma;
-    RDMALocalBlocks *local;
-    RDMAControlHeader head;
-    RDMARegister *reg, *registers;
-    RDMACompress *comp;
-    RDMARegisterResult *reg_result;
-    static RDMARegisterResult results[RDMA_CONTROL_MAX_COMMANDS_PER_MESSAGE];
-    RDMALocalBlock *block;
-    void *host_addr;
-    int ret;
-    int idx = 0;
-
-    if (!migrate_rdma()) {
-        return 0;
-    }
-
-    RCU_READ_LOCK_GUARD();
-    rioc = QIO_CHANNEL_RDMA(qemu_file_get_ioc(f));
-    rdma = qatomic_rcu_read(&rioc->rdmain);
-
-    if (!rdma) {
-        return -1;
-    }
-
-    if (rdma_errored(rdma)) {
-        return -1;
-    }
-
-    local = &rdma->local_ram_blocks;
-    do {
-        trace_rdma_registration_handle_wait();
-
-        ret = qemu_rdma_exchange_recv(rdma, &head, RDMA_CONTROL_NONE, &err);
-
-        if (ret < 0) {
-            error_report_err(err);
-            break;
-        }
-
-        if (head.repeat > RDMA_CONTROL_MAX_COMMANDS_PER_MESSAGE) {
-            error_report("rdma: Too many requests in this message (%d)."
-                            "Bailing.", head.repeat);
-            break;
-        }
-
-        switch (head.type) {
-        case RDMA_CONTROL_COMPRESS:
-            comp = (RDMACompress *) rdma->wr_data[idx].control_curr;
-            network_to_compress(comp);
-
-            trace_rdma_registration_handle_compress(comp->length,
-                                                    comp->block_idx,
-                                                    comp->offset);
-            if (comp->block_idx >= rdma->local_ram_blocks.nb_blocks) {
-                error_report("rdma: 'compress' bad block index %u (vs %d)",
-                             (unsigned int)comp->block_idx,
-                             rdma->local_ram_blocks.nb_blocks);
-                goto err;
-            }
-            block = &(rdma->local_ram_blocks.block[comp->block_idx]);
-
-            host_addr = block->local_host_addr +
-                            (comp->offset - block->offset);
-            if (comp->value) {
-                error_report("rdma: Zero page with non-zero (%d) value",
-                             comp->value);
-                goto err;
-            }
-            ram_handle_zero(host_addr, comp->length);
-            break;
-
-        case RDMA_CONTROL_REGISTER_FINISHED:
-            trace_rdma_registration_handle_finished();
-            return 0;
-
-        case RDMA_CONTROL_RAM_BLOCKS_REQUEST:
-            trace_rdma_registration_handle_ram_blocks();
-
-            /* Sort our local RAM Block list so it's the same as the source,
-             * we can do this since we've filled in a src_index in the list
-             * as we received the RAMBlock list earlier.
-             */
-            qsort(rdma->local_ram_blocks.block,
-                  rdma->local_ram_blocks.nb_blocks,
-                  sizeof(RDMALocalBlock), dest_ram_sort_func);
-            for (int i = 0; i < local->nb_blocks; i++) {
-                local->block[i].index = i;
-            }
-
-            if (rdma->pin_all) {
-                ret = qemu_rdma_reg_whole_ram_blocks(rdma, &err);
-                if (ret < 0) {
-                    error_report_err(err);
-                    goto err;
-                }
-            }
-
-            /*
-             * Dest uses this to prepare to transmit the RAMBlock descriptions
-             * to the source VM after connection setup.
-             * Both sides use the "remote" structure to communicate and update
-             * their "local" descriptions with what was sent.
-             */
-            for (int i = 0; i < local->nb_blocks; i++) {
-                rdma->dest_blocks[i].remote_host_addr =
-                    (uintptr_t)(local->block[i].local_host_addr);
-
-                if (rdma->pin_all) {
-                    rdma->dest_blocks[i].remote_rkey = local->block[i].mr->rkey;
-                }
-
-                rdma->dest_blocks[i].offset = local->block[i].offset;
-                rdma->dest_blocks[i].length = local->block[i].length;
-
-                dest_block_to_network(&rdma->dest_blocks[i]);
-                trace_rdma_registration_handle_ram_blocks_loop(
-                    local->block[i].block_name,
-                    local->block[i].offset,
-                    local->block[i].length,
-                    local->block[i].local_host_addr,
-                    local->block[i].src_index);
-            }
-
-            blocks.len = rdma->local_ram_blocks.nb_blocks
-                                                * sizeof(RDMADestBlock);
-
-
-            ret = qemu_rdma_post_send_control(rdma,
-                                    (uint8_t *) rdma->dest_blocks, &blocks,
-                                    &err);
-
-            if (ret < 0) {
-                error_report_err(err);
-                goto err;
-            }
-
-            break;
-        case RDMA_CONTROL_REGISTER_REQUEST:
-            trace_rdma_registration_handle_register(head.repeat);
-
-            reg_resp.repeat = head.repeat;
-            registers = (RDMARegister *) rdma->wr_data[idx].control_curr;
-
-            for (int count = 0; count < head.repeat; count++) {
-                uint64_t chunk;
-                uint8_t *chunk_start, *chunk_end;
-
-                reg = &registers[count];
-                network_to_register(reg);
-
-                reg_result = &results[count];
-
-                trace_rdma_registration_handle_register_loop(count,
-                         reg->current_index, reg->key.current_addr, reg->chunks);
-
-                if (reg->current_index >= rdma->local_ram_blocks.nb_blocks) {
-                    error_report("rdma: 'register' bad block index %u (vs %d)",
-                                 (unsigned int)reg->current_index,
-                                 rdma->local_ram_blocks.nb_blocks);
-                    goto err;
-                }
-                block = &(rdma->local_ram_blocks.block[reg->current_index]);
-                if (block->is_ram_block) {
-                    if (block->offset > reg->key.current_addr) {
-                        error_report("rdma: bad register address for block %s"
-                            " offset: %" PRIx64 " current_addr: %" PRIx64,
-                            block->block_name, block->offset,
-                            reg->key.current_addr);
-                        goto err;
-                    }
-                    host_addr = (block->local_host_addr +
-                                (reg->key.current_addr - block->offset));
-                    chunk = ram_chunk_index(block->local_host_addr,
-                                            (uint8_t *) host_addr);
-                } else {
-                    chunk = reg->key.chunk;
-                    host_addr = block->local_host_addr +
-                        (reg->key.chunk * (1UL << RDMA_REG_CHUNK_SHIFT));
-                    /* Check for particularly bad chunk value */
-                    if (host_addr < (void *)block->local_host_addr) {
-                        error_report("rdma: bad chunk for block %s"
-                            " chunk: %" PRIx64,
-                            block->block_name, reg->key.chunk);
-                        goto err;
-                    }
-                }
-                chunk_start = ram_chunk_start(block, chunk);
-                chunk_end = ram_chunk_end(block, chunk + reg->chunks);
-                /* avoid "-Waddress-of-packed-member" warning */
-                uint32_t tmp_rkey = 0;
-                if (qemu_rdma_register_and_get_keys(rdma, block,
-                            (uintptr_t)host_addr, NULL, &tmp_rkey,
-                            chunk, chunk_start, chunk_end)) {
-                    error_report("cannot get rkey");
-                    goto err;
-                }
-                reg_result->rkey = tmp_rkey;
-
-                reg_result->host_addr = (uintptr_t)block->local_host_addr;
-
-                trace_rdma_registration_handle_register_rkey(reg_result->rkey);
-
-                result_to_network(reg_result);
-            }
-
-            ret = qemu_rdma_post_send_control(rdma,
-                            (uint8_t *) results, &reg_resp, &err);
-
-            if (ret < 0) {
-                error_report_err(err);
-                goto err;
-            }
-            break;
-        case RDMA_CONTROL_UNREGISTER_REQUEST:
-            trace_rdma_registration_handle_unregister(head.repeat);
-            unreg_resp.repeat = head.repeat;
-            registers = (RDMARegister *) rdma->wr_data[idx].control_curr;
-
-            for (int count = 0; count < head.repeat; count++) {
-                reg = &registers[count];
-                network_to_register(reg);
-
-                trace_rdma_registration_handle_unregister_loop(count,
-                           reg->current_index, reg->key.chunk);
-
-                block = &(rdma->local_ram_blocks.block[reg->current_index]);
-
-                ret = ibv_dereg_mr(block->pmr[reg->key.chunk]);
-                block->pmr[reg->key.chunk] = NULL;
-
-                if (ret != 0) {
-                    error_report("rdma unregistration chunk failed: %s",
-                                 strerror(errno));
-                    goto err;
-                }
-
-                rdma->total_registrations--;
-
-                trace_rdma_registration_handle_unregister_success(reg->key.chunk);
-            }
-
-            ret = qemu_rdma_post_send_control(rdma, NULL, &unreg_resp, &err);
-
-            if (ret < 0) {
-                error_report_err(err);
-                goto err;
-            }
-            break;
-        case RDMA_CONTROL_REGISTER_RESULT:
-            error_report("Invalid RESULT message at dest.");
-            goto err;
-        default:
-            error_report("Unknown control message %s", control_desc(head.type));
-            goto err;
-        }
-    } while (1);
-
-err:
-    rdma->errored = true;
-    return -1;
-}
-
-/* Destination:
- * Called during the initial RAM load section which lists the
- * RAMBlocks by name.  This lets us know the order of the RAMBlocks on
- * the source.  We've already built our local RAMBlock list, but not
- * yet sent the list to the source.
- */
-int rdma_block_notification_handle(QEMUFile *f, const char *name)
-{
-    int curr;
-    int found = -1;
-
-    if (!migrate_rdma()) {
-        return 0;
-    }
-
-    RCU_READ_LOCK_GUARD();
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(qemu_file_get_ioc(f));
-    RDMAContext *rdma = qatomic_rcu_read(&rioc->rdmain);
-
-    if (!rdma) {
-        return -1;
-    }
-
-    /* Find the matching RAMBlock in our local list */
-    for (curr = 0; curr < rdma->local_ram_blocks.nb_blocks; curr++) {
-        if (!strcmp(rdma->local_ram_blocks.block[curr].block_name, name)) {
-            found = curr;
-            break;
-        }
-    }
-
-    if (found == -1) {
-        error_report("RAMBlock '%s' not found on destination", name);
-        return -1;
-    }
-
-    rdma->local_ram_blocks.block[curr].src_index = rdma->next_src_index;
-    trace_rdma_block_notification_handle(name, rdma->next_src_index);
-    rdma->next_src_index++;
-
-    return 0;
-}
-
-int rdma_registration_start(QEMUFile *f, uint64_t flags)
-{
-    if (!migrate_rdma() || migration_in_postcopy()) {
-        return 0;
-    }
-
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(qemu_file_get_ioc(f));
-    RCU_READ_LOCK_GUARD();
-    RDMAContext *rdma = qatomic_rcu_read(&rioc->rdmaout);
-    if (!rdma) {
-        return -1;
-    }
-
-    if (rdma_errored(rdma)) {
-        return -1;
-    }
-
-    trace_rdma_registration_start(flags);
-    qemu_put_be64(f, RAM_SAVE_FLAG_HOOK);
-    return qemu_fflush(f);
-}
-
-/*
- * Inform dest that dynamic registrations are done for now.
- * First, flush writes, if any.
- */
-int rdma_registration_stop(QEMUFile *f, uint64_t flags)
-{
-    QIOChannelRDMA *rioc;
-    Error *err = NULL;
-    RDMAContext *rdma;
-    RDMAControlHeader head = { .len = 0, .repeat = 1 };
-    int ret;
-
-    if (!migrate_rdma() || migration_in_postcopy()) {
-        return 0;
-    }
-
-    RCU_READ_LOCK_GUARD();
-    rioc = QIO_CHANNEL_RDMA(qemu_file_get_ioc(f));
-    rdma = qatomic_rcu_read(&rioc->rdmaout);
-    if (!rdma) {
-        return -1;
-    }
-
-    if (rdma_errored(rdma)) {
-        return -1;
-    }
-
-    qemu_fflush(f);
-    ret = qemu_rdma_drain_cq(rdma);
-
-    if (ret < 0) {
-        goto err;
-    }
-
-    if (flags == RAM_CONTROL_SETUP) {
-        RDMAControlHeader resp = {.type = RDMA_CONTROL_RAM_BLOCKS_RESULT };
-        RDMALocalBlocks *local = &rdma->local_ram_blocks;
-        int reg_result_idx, nb_dest_blocks;
-
-        head.type = RDMA_CONTROL_RAM_BLOCKS_REQUEST;
-        trace_rdma_registration_stop_ram();
-
-        /*
-         * Make sure that we parallelize the pinning on both sides.
-         * For very large guests, doing this serially takes a really
-         * long time, so we have to 'interleave' the pinning locally
-         * with the control messages by performing the pinning on this
-         * side before we receive the control response from the other
-         * side that the pinning has completed.
-         */
-        ret = qemu_rdma_exchange_send(rdma, &head, NULL, &resp,
-                    &reg_result_idx, rdma->pin_all ?
-                    qemu_rdma_reg_whole_ram_blocks : NULL,
-                    &err);
-        if (ret < 0) {
-            error_report_err(err);
-            return -1;
-        }
-
-        nb_dest_blocks = resp.len / sizeof(RDMADestBlock);
-
-        /*
-         * The protocol uses two different sets of rkeys (mutually exclusive):
-         * 1. One key to represent the virtual address of the entire ram block.
-         *    (dynamic chunk registration disabled - pin everything with one rkey.)
-         * 2. One to represent individual chunks within a ram block.
-         *    (dynamic chunk registration enabled - pin individual chunks.)
-         *
-         * Once the capability is successfully negotiated, the destination transmits
-         * the keys to use (or sends them later) including the virtual addresses
-         * and then propagates the remote ram block descriptions to his local copy.
-         */
-
-        if (local->nb_blocks != nb_dest_blocks) {
-            error_report("ram blocks mismatch (Number of blocks %d vs %d)",
-                         local->nb_blocks, nb_dest_blocks);
-            error_printf("Your QEMU command line parameters are probably "
-                         "not identical on both the source and destination.");
-            rdma->errored = true;
-            return -1;
-        }
-
-        qemu_rdma_move_header(rdma, reg_result_idx, &resp);
-        memcpy(rdma->dest_blocks,
-            rdma->wr_data[reg_result_idx].control_curr, resp.len);
-        for (int i = 0; i < nb_dest_blocks; i++) {
-            network_to_dest_block(&rdma->dest_blocks[i]);
-
-            /* We require that the blocks are in the same order */
-            if (rdma->dest_blocks[i].length != local->block[i].length) {
-                error_report("Block %s/%d has a different length %" PRIu64
-                             "vs %" PRIu64,
-                             local->block[i].block_name, i,
-                             local->block[i].length,
-                             rdma->dest_blocks[i].length);
-                rdma->errored = true;
-                return -1;
-            }
-            local->block[i].remote_host_addr =
-                    rdma->dest_blocks[i].remote_host_addr;
-            local->block[i].remote_rkey = rdma->dest_blocks[i].remote_rkey;
-        }
-    }
-
-    trace_rdma_registration_stop(flags);
-
-    head.type = RDMA_CONTROL_REGISTER_FINISHED;
-    ret = qemu_rdma_exchange_send(rdma, &head, NULL, NULL, NULL, NULL, &err);
-
-    if (ret < 0) {
-        error_report_err(err);
-        goto err;
-    }
-
-    return 0;
-err:
-    rdma->errored = true;
-    return -1;
-}
-
-static void qio_channel_rdma_finalize(Object *obj)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(obj);
-    if (rioc->rdmain) {
-        qemu_rdma_cleanup(rioc->rdmain);
-        g_free(rioc->rdmain);
-        rioc->rdmain = NULL;
-    }
-    if (rioc->rdmaout) {
-        qemu_rdma_cleanup(rioc->rdmaout);
-        g_free(rioc->rdmaout);
-        rioc->rdmaout = NULL;
-    }
-}
-
-static void qio_channel_rdma_class_init(ObjectClass *klass,
-                                        void *class_data G_GNUC_UNUSED)
-{
-    QIOChannelClass *ioc_klass = QIO_CHANNEL_CLASS(klass);
-
-    ioc_klass->io_writev = qio_channel_rdma_writev;
-    ioc_klass->io_readv = qio_channel_rdma_readv;
-    ioc_klass->io_set_blocking = qio_channel_rdma_set_blocking;
-    ioc_klass->io_close = qio_channel_rdma_close;
-    ioc_klass->io_create_watch = qio_channel_rdma_create_watch;
-    ioc_klass->io_set_aio_fd_handler = qio_channel_rdma_set_aio_fd_handler;
-    ioc_klass->io_shutdown = qio_channel_rdma_shutdown;
-}
-
-static const TypeInfo qio_channel_rdma_info = {
-    .parent = TYPE_QIO_CHANNEL,
-    .name = TYPE_QIO_CHANNEL_RDMA,
-    .instance_size = sizeof(QIOChannelRDMA),
-    .instance_finalize = qio_channel_rdma_finalize,
-    .class_init = qio_channel_rdma_class_init,
-};
-
-static void qio_channel_rdma_register_types(void)
-{
-    type_register_static(&qio_channel_rdma_info);
-}
-
-type_init(qio_channel_rdma_register_types);
-
-static QEMUFile *rdma_new_input(RDMAContext *rdma)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
-
-    rioc->file = qemu_file_new_input(QIO_CHANNEL(rioc));
-    rioc->rdmain = rdma;
-    rioc->rdmaout = rdma->return_path;
-
-    return rioc->file;
-}
-
-static QEMUFile *rdma_new_output(RDMAContext *rdma)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
-
-    rioc->file = qemu_file_new_output(QIO_CHANNEL(rioc));
-    rioc->rdmaout = rdma;
-    rioc->rdmain = rdma->return_path;
-
-    return rioc->file;
-}
-
-static void rdma_accept_incoming_migration(void *opaque)
-{
-    RDMAContext *rdma = opaque;
-    QEMUFile *f;
-
-    trace_qemu_rdma_accept_incoming_migration();
-    if (qemu_rdma_accept(rdma) < 0) {
-        error_report("RDMA ERROR: Migration initialization failed");
-        return;
-    }
-
-    trace_qemu_rdma_accept_incoming_migration_accepted();
-
-    if (rdma->is_return_path) {
-        return;
-    }
-
-    f = rdma_new_input(rdma);
-    if (f == NULL) {
-        error_report("RDMA ERROR: could not open RDMA for input");
-        qemu_rdma_cleanup(rdma);
-        return;
-    }
-
-    rdma->migration_started_on_destination = 1;
-    migration_fd_process_incoming(f);
-}
-
-void rdma_start_incoming_migration(InetSocketAddress *host_port,
-                                   Error **errp)
-{
-    MigrationState *s = migrate_get_current();
-    int ret;
-    RDMAContext *rdma;
-
-    trace_rdma_start_incoming_migration();
-
-    /* Avoid ram_block_discard_disable(), cannot change during migration. */
-    if (ram_block_discard_is_required()) {
-        error_setg(errp, "RDMA: cannot disable RAM discard");
-        return;
-    }
-
-    rdma = qemu_rdma_data_init(host_port, errp);
-    if (rdma == NULL) {
-        goto err;
-    }
-
-    ret = qemu_rdma_dest_init(rdma, errp);
-    if (ret < 0) {
-        goto err;
-    }
-
-    trace_rdma_start_incoming_migration_after_dest_init();
-
-    ret = rdma_listen(rdma->listen_id, 5);
-
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: listening on socket!");
-        goto cleanup_rdma;
-    }
-
-    trace_rdma_start_incoming_migration_after_rdma_listen();
-    s->rdma_migration = true;
-    qemu_set_fd_handler(rdma->channel->fd, rdma_accept_incoming_migration,
-                        NULL, (void *)(intptr_t)rdma);
-    return;
-
-cleanup_rdma:
-    qemu_rdma_cleanup(rdma);
-err:
-    if (rdma) {
-        g_free(rdma->host);
-    }
-    g_free(rdma);
-}
-
-void rdma_start_outgoing_migration(void *opaque,
-                            InetSocketAddress *host_port, Error **errp)
-{
-    MigrationState *s = opaque;
-    RDMAContext *rdma_return_path = NULL;
-    RDMAContext *rdma;
-    int ret;
-
-    /* Avoid ram_block_discard_disable(), cannot change during migration. */
-    if (ram_block_discard_is_required()) {
-        error_setg(errp, "RDMA: cannot disable RAM discard");
-        return;
-    }
-
-    rdma = qemu_rdma_data_init(host_port, errp);
-    if (rdma == NULL) {
-        goto err;
-    }
-
-    ret = qemu_rdma_source_init(rdma, migrate_rdma_pin_all(), errp);
-
-    if (ret < 0) {
-        goto err;
-    }
-
-    trace_rdma_start_outgoing_migration_after_rdma_source_init();
-    ret = qemu_rdma_connect(rdma, false, errp);
-
-    if (ret < 0) {
-        goto err;
-    }
-
-    /* RDMA postcopy need a separate queue pair for return path */
-    if (migrate_postcopy() || migrate_return_path()) {
-        rdma_return_path = qemu_rdma_data_init(host_port, errp);
-
-        if (rdma_return_path == NULL) {
-            goto return_path_err;
-        }
-
-        ret = qemu_rdma_source_init(rdma_return_path,
-                                    migrate_rdma_pin_all(), errp);
-
-        if (ret < 0) {
-            goto return_path_err;
-        }
-
-        ret = qemu_rdma_connect(rdma_return_path, true, errp);
-
-        if (ret < 0) {
-            goto return_path_err;
-        }
-
-        rdma->return_path = rdma_return_path;
-        rdma_return_path->return_path = rdma;
-        rdma_return_path->is_return_path = true;
-    }
-
-    trace_rdma_start_outgoing_migration_after_rdma_connect();
-
-    s->to_dst_file = rdma_new_output(rdma);
-    s->rdma_migration = true;
-    migrate_fd_connect(s, NULL);
-    return;
-return_path_err:
-    qemu_rdma_cleanup(rdma);
-err:
-    g_free(rdma);
-    g_free(rdma_return_path);
-}
diff --git a/migration/savevm.c b/migration/savevm.c
index 388d7af7cd..939d35d69e 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2970,7 +2970,7 @@ int qemu_loadvm_state(QEMUFile *f)
 
     /* We've got to be careful; if we don't read the data and just shut the fd
      * then the sender can error if we close while it's still sending.
-     * We also mustn't read data that isn't there; some transports (RDMA)
+     * We also mustn't read data that isn't there; some transports
      * will stall waiting for that data when the source has already closed.
      */
     if (ret == 0 && should_send_vmdesc()) {
diff --git a/meson_options.txt b/meson_options.txt
index b5c0bad9e7..79b69d4286 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -196,8 +196,6 @@ option('rbd', type : 'feature', value : 'auto',
        description: 'Ceph block device driver')
 option('opengl', type : 'feature', value : 'auto',
        description: 'OpenGL support')
-option('rdma', type : 'feature', value : 'auto',
-       description: 'Enable RDMA-based migration')
 option('gtk', type : 'feature', value : 'auto',
        description: 'GTK+ user interface')
 option('sdl', type : 'feature', value : 'auto',
diff --git a/migration/meson.build b/migration/meson.build
index 1eeb915ff6..e2cd92c01f 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -36,7 +36,6 @@ if get_option('replication').allowed()
   system_ss.add(files('colo-failover.c', 'colo.c'))
 endif
 
-system_ss.add(when: rdma, if_true: files('rdma.c'))
 if get_option('live_block_migration').allowed()
   system_ss.add(files('block.c'))
 endif
diff --git a/migration/trace-events b/migration/trace-events
index f0e1cb80c7..7db3a5194f 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -193,7 +193,7 @@ process_incoming_migration_co_postcopy_end_main(void) ""
 postcopy_preempt_enabled(bool value) "%d"
 
 # migration-stats
-migration_transferred_bytes(uint64_t qemu_file, uint64_t multifd, uint64_t rdma) "qemu_file %" PRIu64 " multifd %" PRIu64 " RDMA %" PRIu64
+migration_transferred_bytes(uint64_t qemu_file, uint64_t multifd) "qemu_file %" PRIu64 " multifd %" PRIu64
 
 # channel.c
 migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
@@ -204,72 +204,6 @@ migrate_state_too_big(void) ""
 migrate_global_state_post_load(const char *state) "loaded state: %s"
 migrate_global_state_pre_save(const char *state) "saved state: %s"
 
-# rdma.c
-qemu_rdma_accept_incoming_migration(void) ""
-qemu_rdma_accept_incoming_migration_accepted(void) ""
-qemu_rdma_accept_pin_state(bool pin) "%d"
-qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
-qemu_rdma_block_for_wrid_miss(uint64_t wcomp, uint64_t req) "A Wanted wrid %" PRIu64 " but got %" PRIu64
-qemu_rdma_cleanup_disconnect(void) ""
-qemu_rdma_close(void) ""
-qemu_rdma_connect_pin_all_requested(void) ""
-qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
-qemu_rdma_dest_init_trying(const char *host, const char *ip) "%s => %s"
-qemu_rdma_dump_id_failed(const char *who) "%s RDMA Device opened, but can't query port information"
-qemu_rdma_dump_id(const char *who, const char *name, const char *dev_name, const char *dev_path, const char *ibdev_path, int transport, const char *transport_name) "%s RDMA Device opened: kernel name %s uverbs device name %s, infiniband_verbs class device path %s, infiniband class device path %s, transport: (%d) %s"
-qemu_rdma_dump_gid(const char *who, const char *src, const char *dst) "%s Source GID: %s, Dest GID: %s"
-qemu_rdma_exchange_get_response_start(const char *desc) "CONTROL: %s receiving..."
-qemu_rdma_exchange_get_response_none(const char *desc, int type) "Surprise: got %s (%d)"
-qemu_rdma_exchange_send_issue_callback(void) ""
-qemu_rdma_exchange_send_waiting(const char *desc) "Waiting for response %s"
-qemu_rdma_exchange_send_received(const char *desc) "Response %s received."
-qemu_rdma_fill(size_t control_len, size_t size) "RDMA %zd of %zd bytes already in buffer"
-qemu_rdma_init_ram_blocks(int blocks) "Allocated %d local ram block structures"
-qemu_rdma_poll_recv(uint64_t comp, int64_t id, int sent) "completion %" PRIu64 " received (%" PRId64 ") left %d"
-qemu_rdma_poll_write(uint64_t comp, int left, uint64_t block, uint64_t chunk, void *local, void *remote) "completions %" PRIu64 " left %d, block %" PRIu64 ", chunk: %" PRIu64 " %p %p"
-qemu_rdma_poll_other(uint64_t comp, int left) "other completion %" PRIu64 " received left %d"
-qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.."
-qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" PRIu64 " bytes @ %p"
-qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand Paging memory region: %s"
-qemu_rdma_advise_mr(const char *name, uint32_t len, uint64_t addr, const char *res) "Try to advise block %s prefetch at %" PRIu32 "@0x%" PRIx64 ": %s"
-qemu_rdma_resolve_host_trying(const char *host, const char *ip) "Trying %s => %s"
-qemu_rdma_signal_unregister_append(uint64_t chunk, int pos) "Appending unregister chunk %" PRIu64 " at position %d"
-qemu_rdma_signal_unregister_already(uint64_t chunk) "Unregister chunk %" PRIu64 " already in queue"
-qemu_rdma_unregister_waiting_inflight(uint64_t chunk) "Cannot unregister inflight chunk: %" PRIu64
-qemu_rdma_unregister_waiting_proc(uint64_t chunk, int pos) "Processing unregister for chunk: %" PRIu64 " at position %d"
-qemu_rdma_unregister_waiting_send(uint64_t chunk) "Sending unregister for chunk: %" PRIu64
-qemu_rdma_unregister_waiting_complete(uint64_t chunk) "Unregister for chunk: %" PRIu64 " complete."
-qemu_rdma_write_flush(int sent) "sent total: %d"
-qemu_rdma_write_one_block(int count, int block, uint64_t chunk, uint64_t current, uint64_t len, int nb_sent, int nb_chunks) "(%d) Not clobbering: block: %d chunk %" PRIu64 " current %" PRIu64 " len %" PRIu64 " %d %d"
-qemu_rdma_write_one_post(uint64_t chunk, long addr, long remote, uint32_t len) "Posting chunk: %" PRIu64 ", addr: 0x%lx remote: 0x%lx, bytes %" PRIu32
-qemu_rdma_write_one_queue_full(void) ""
-qemu_rdma_write_one_recvregres(int mykey, int theirkey, uint64_t chunk) "Received registration result: my key: 0x%x their key 0x%x, chunk %" PRIu64
-qemu_rdma_write_one_sendreg(uint64_t chunk, int len, int index, int64_t offset) "Sending registration request chunk %" PRIu64 " for %d bytes, index: %d, offset: %" PRId64
-qemu_rdma_write_one_top(uint64_t chunks, uint64_t size) "Writing %" PRIu64 " chunks, (%" PRIu64 " MB)"
-qemu_rdma_write_one_zero(uint64_t chunk, int len, int index, int64_t offset) "Entire chunk is zero, sending compress: %" PRIu64 " for %d bytes, index: %d, offset: %" PRId64
-rdma_add_block(const char *block_name, int block, uint64_t addr, uint64_t offset, uint64_t len, uint64_t end, uint64_t bits, int chunks) "Added Block: '%s':%d, addr: %" PRIu64 ", offset: %" PRIu64 " length: %" PRIu64 " end: %" PRIu64 " bits %" PRIu64 " chunks %d"
-rdma_block_notification_handle(const char *name, int index) "%s at %d"
-rdma_delete_block(void *block, uint64_t addr, uint64_t offset, uint64_t len, uint64_t end, uint64_t bits, int chunks) "Deleted Block: %p, addr: %" PRIu64 ", offset: %" PRIu64 " length: %" PRIu64 " end: %" PRIu64 " bits %" PRIu64 " chunks %d"
-rdma_registration_handle_compress(int64_t length, int index, int64_t offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64
-rdma_registration_handle_finished(void) ""
-rdma_registration_handle_ram_blocks(void) ""
-rdma_registration_handle_ram_blocks_loop(const char *name, uint64_t offset, uint64_t length, void *local_host_addr, unsigned int src_index) "%s: @0x%" PRIx64 "/%" PRIu64 " host:@%p src_index: %u"
-rdma_registration_handle_register(int requests) "%d requests"
-rdma_registration_handle_register_loop(int req, int index, uint64_t addr, uint64_t chunks) "Registration request (%d): index %d, current_addr %" PRIu64 " chunks: %" PRIu64
-rdma_registration_handle_register_rkey(int rkey) "0x%x"
-rdma_registration_handle_unregister(int requests) "%d requests"
-rdma_registration_handle_unregister_loop(int count, int index, uint64_t chunk) "Unregistration request (%d): index %d, chunk %" PRIu64
-rdma_registration_handle_unregister_success(uint64_t chunk) "%" PRIu64
-rdma_registration_handle_wait(void) ""
-rdma_registration_start(uint64_t flags) "%" PRIu64
-rdma_registration_stop(uint64_t flags) "%" PRIu64
-rdma_registration_stop_ram(void) ""
-rdma_start_incoming_migration(void) ""
-rdma_start_incoming_migration_after_dest_init(void) ""
-rdma_start_incoming_migration_after_rdma_listen(void) ""
-rdma_start_outgoing_migration_after_rdma_connect(void) ""
-rdma_start_outgoing_migration_after_rdma_source_init(void) ""
-
 # postcopy-ram.c
 postcopy_discard_send_finish(const char *ramblock, int nwords, int ncmds) "%s mask words sent=%d in %d commands"
 postcopy_discard_send_range(const char *ramblock, unsigned long start, unsigned long length) "%s:%lx/%lx"
diff --git a/qemu-options.hx b/qemu-options.hx
index f7ef9b4e41..4f390c33ef 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4759,7 +4759,6 @@ ERST
 
 DEF("incoming", HAS_ARG, QEMU_OPTION_incoming, \
     "-incoming tcp:[host]:port[,to=maxport][,ipv4=on|off][,ipv6=on|off]\n" \
-    "-incoming rdma:host:port[,ipv4=on|off][,ipv6=on|off]\n" \
     "-incoming unix:socketpath\n" \
     "                prepare for incoming migration, listen on\n" \
     "                specified protocol and socket address\n" \
@@ -4773,8 +4772,6 @@ DEF("incoming", HAS_ARG, QEMU_OPTION_incoming, \
     QEMU_ARCH_ALL)
 SRST
 ``-incoming tcp:[host]:port[,to=maxport][,ipv4=on|off][,ipv6=on|off]``
-  \ 
-``-incoming rdma:host:port[,ipv4=on|off][,ipv6=on|off]``
     Prepare for incoming migration, listen on a given tcp port.
 
 ``-incoming unix:socketpath``
diff --git a/scripts/ci/org.centos/stream/8/build-environment.yml b/scripts/ci/org.centos/stream/8/build-environment.yml
index 1ead77e2cb..a366bb185b 100644
--- a/scripts/ci/org.centos/stream/8/build-environment.yml
+++ b/scripts/ci/org.centos/stream/8/build-environment.yml
@@ -68,7 +68,6 @@
           - pixman-devel
           - python38
           - python3-sphinx
-          - rdma-core-devel
           - redhat-rpm-config
           - snappy-devel
           - spice-glib-devel
diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure b/scripts/ci/org.centos/stream/8/x86_64/configure
index 868db665f6..5dead834fb 100755
--- a/scripts/ci/org.centos/stream/8/x86_64/configure
+++ b/scripts/ci/org.centos/stream/8/x86_64/configure
@@ -103,7 +103,6 @@
 --disable-qed \
 --disable-qom-cast-debug \
 --disable-rbd \
---disable-rdma \
 --disable-replication \
 --disable-rng-none \
 --disable-safe-stack \
@@ -175,7 +174,6 @@
 --enable-opengl \
 --enable-pie \
 --enable-rbd \
---enable-rdma \
 --enable-seccomp \
 --enable-snappy \
 --enable-smartcard \
diff --git a/scripts/ci/setup/build-environment.yml b/scripts/ci/setup/build-environment.yml
index f344d1a850..0359b1c023 100644
--- a/scripts/ci/setup/build-environment.yml
+++ b/scripts/ci/setup/build-environment.yml
@@ -81,8 +81,6 @@
           - libglusterfs-dev
           - libgnutls28-dev
           - libgtk-3-dev
-          - libibumad-dev
-          - libibverbs-dev
           - libiscsi-dev
           - libjemalloc-dev
           - libjpeg-turbo8-dev
@@ -99,7 +97,6 @@
           - libpng-dev
           - libpulse-dev
           - librbd-dev
-          - librdmacm-dev
           - libsasl2-dev
           - libsdl2-dev
           - libsdl2-image-dev
@@ -236,7 +233,6 @@
           - pixman-devel
           - python38
           - python3-sphinx
-          - rdma-core-devel
           - redhat-rpm-config
           - snappy-devel
           - spice-glib-devel
diff --git a/scripts/coverity-scan/run-coverity-scan b/scripts/coverity-scan/run-coverity-scan
index 43cf770f5e..3dd14c3cc4 100755
--- a/scripts/coverity-scan/run-coverity-scan
+++ b/scripts/coverity-scan/run-coverity-scan
@@ -426,7 +426,7 @@ echo "Configuring..."
     --enable-libusb --enable-usb-redir \
     --enable-libiscsi --enable-libnfs --enable-seccomp \
     --enable-tpm --enable-libssh --enable-lzo --enable-snappy --enable-bzip2 \
-    --enable-numa --enable-rdma --enable-smartcard --enable-virglrenderer \
+    --enable-numa --enable-smartcard --enable-virglrenderer \
     --enable-mpath --enable-glusterfs \
     --enable-virtfs --enable-zstd
 
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index 5ace33f167..52c34598ba 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -167,7 +167,6 @@ meson_options_help() {
   printf "%s\n" '  qed             qed image format support'
   printf "%s\n" '  qga-vss         build QGA VSS support (broken with MinGW)'
   printf "%s\n" '  rbd             Ceph block device driver'
-  printf "%s\n" '  rdma            Enable RDMA-based migration'
   printf "%s\n" '  replication     replication support'
   printf "%s\n" '  rutabaga-gfx    rutabaga_gfx support'
   printf "%s\n" '  sdl             SDL user interface'
@@ -442,8 +441,6 @@ _meson_option_parse() {
     --disable-qom-cast-debug) printf "%s" -Dqom_cast_debug=false ;;
     --enable-rbd) printf "%s" -Drbd=enabled ;;
     --disable-rbd) printf "%s" -Drbd=disabled ;;
-    --enable-rdma) printf "%s" -Drdma=enabled ;;
-    --disable-rdma) printf "%s" -Drdma=disabled ;;
     --enable-relocatable) printf "%s" -Drelocatable=true ;;
     --disable-relocatable) printf "%s" -Drelocatable=false ;;
     --enable-replication) printf "%s" -Dreplication=enabled ;;
diff --git a/tests/lcitool/projects/qemu.yml b/tests/lcitool/projects/qemu.yml
index 149b15de57..511e48a5ec 100644
--- a/tests/lcitool/projects/qemu.yml
+++ b/tests/lcitool/projects/qemu.yml
@@ -48,8 +48,6 @@ packages:
  - libfdt
  - libffi
  - libgcrypt
- - libibumad
- - libibverbs
  - libiscsi
  - libjemalloc
  - libjpeg
@@ -58,7 +56,6 @@ packages:
  - libpmem
  - libpng
  - librbd
- - librdmacm
  - libseccomp
  - libselinux
  - libslirp
diff --git a/tests/migration/guestperf/engine.py b/tests/migration/guestperf/engine.py
index 608d7270f6..a704419082 100644
--- a/tests/migration/guestperf/engine.py
+++ b/tests/migration/guestperf/engine.py
@@ -41,7 +41,7 @@ def __init__(self, binary, dst_host, kernel, initrd, transport="tcp",
         self._dst_host = dst_host # Hostname of target host
         self._kernel = kernel # Path to kernel image
         self._initrd = initrd # Path to stress initrd
-        self._transport = transport # 'unix' or 'tcp' or 'rdma'
+        self._transport = transport # 'unix' or 'tcp'
         self._sleep = sleep
         self._verbose = verbose
         self._debug = debug
@@ -427,8 +427,6 @@ def run(self, hardware, scenario, result_dir=os.getcwd()):
 
         if self._transport == "tcp":
             uri = "tcp:%s:9000" % self._dst_host
-        elif self._transport == "rdma":
-            uri = "rdma:%s:9000" % self._dst_host
         elif self._transport == "unix":
             if self._dst_host != "localhost":
                 raise Exception("Running use unix migration transport for non-local host")
-- 
2.41.0




* [PATCH-for-9.1 v2 3/3] block/gluster: Remove RDMA protocol handling
  2024-03-28 13:02 [PATCH-for-9.1 v2 0/3] rdma: Remove RDMA subsystem and pvrdma device Philippe Mathieu-Daudé
  2024-03-28 13:02 ` [PATCH-for-9.1 v2 1/3] hw/rdma: Remove pvrdma device and rdmacm-mux helper Philippe Mathieu-Daudé
  2024-03-28 13:02 ` [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling Philippe Mathieu-Daudé
@ 2024-03-28 13:02 ` Philippe Mathieu-Daudé
  2024-03-28 17:54   ` Thomas Huth
  2024-03-29  9:17 ` [PATCH-for-9.1 v2 0/3] rdma: Remove RDMA subsystem and pvrdma device Michael S. Tsirkin
  2024-04-03  9:37 ` Philippe Mathieu-Daudé
  4 siblings, 1 reply; 52+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-03-28 13:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: Yuval Shaia, Kevin Wolf, Prasanna Kumar Kalever, Fabiano Rosas,
	Cornelia Huck, Michael Roth, Li Zhijian, Prasanna Kumar Kalever,
	Peter Xu, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Philippe Mathieu-Daudé

GlusterFS+RDMA was deprecated 8 years ago in commit
0552ff2465 ("block/gluster: deprecate rdma support"):

  gluster volfile server fetch happens through unix and/or tcp,
  it doesn't support volfile fetch over rdma. The rdma code may
  actually mislead, so to make sure things do not break, for now
  we fallback to tcp when requested for rdma, with a warning.

  If you are wondering how this worked all these days, its the
  gluster libgfapi code which handles anything other than unix
  transport as socket/tcp, sad but true.

Besides, the whole RDMA subsystem was deprecated in commit
e9a54265f5 ("hw/rdma: Deprecate the pvrdma device and the rdma
subsystem") released in v8.2.

Cc: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
---
 docs/system/device-url-syntax.rst.inc  |  4 +--
 docs/system/qemu-block-drivers.rst.inc |  1 -
 block/gluster.c                        | 39 --------------------------
 3 files changed, 2 insertions(+), 42 deletions(-)

diff --git a/docs/system/device-url-syntax.rst.inc b/docs/system/device-url-syntax.rst.inc
index 7dbc525fa8..43b5c2596b 100644
--- a/docs/system/device-url-syntax.rst.inc
+++ b/docs/system/device-url-syntax.rst.inc
@@ -87,8 +87,8 @@ These are specified using a special URL syntax.
 
 ``GlusterFS``
    GlusterFS is a user space distributed file system. QEMU supports the
-   use of GlusterFS volumes for hosting VM disk images using TCP, Unix
-   Domain Sockets and RDMA transport protocols.
+   use of GlusterFS volumes for hosting VM disk images using TCP and Unix
+   Domain Sockets transport protocols.
 
    Syntax for specifying a VM disk image on GlusterFS volume is
 
diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
index 105cb9679c..384e95ba76 100644
--- a/docs/system/qemu-block-drivers.rst.inc
+++ b/docs/system/qemu-block-drivers.rst.inc
@@ -737,7 +737,6 @@ Examples
   |qemu_system| -drive file=gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img
   |qemu_system| -drive file=gluster+tcp://server.domain.com:24007/testvol/dir/a.img
   |qemu_system| -drive file=gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket
-  |qemu_system| -drive file=gluster+rdma://1.2.3.4:24007/testvol/a.img
   |qemu_system| -drive file=gluster://1.2.3.4/testvol/a.img,file.debug=9,file.logfile=/var/log/qemu-gluster.log
   |qemu_system| 'json:{"driver":"qcow2",
                            "file":{"driver":"gluster",
diff --git a/block/gluster.c b/block/gluster.c
index cc74af06dc..4253c8db5e 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -371,9 +371,6 @@ static int qemu_gluster_parse_uri(BlockdevOptionsGluster *gconf,
     } else if (!strcmp(uri->scheme, "gluster+unix")) {
         gsconf->type = SOCKET_ADDRESS_TYPE_UNIX;
         is_unix = true;
-    } else if (!strcmp(uri->scheme, "gluster+rdma")) {
-        gsconf->type = SOCKET_ADDRESS_TYPE_INET;
-        warn_report("rdma feature is not supported, falling back to tcp");
     } else {
         ret = -EINVAL;
         goto out;
@@ -1638,44 +1635,8 @@ static BlockDriver bdrv_gluster_unix = {
     .strong_runtime_opts          = gluster_strong_open_opts,
 };
 
-/* rdma is deprecated (actually never supported for volfile fetch).
- * Let's maintain it for the protocol compatibility, to make sure things
- * won't break immediately. For now, gluster+rdma will fall back to gluster+tcp
- * protocol with a warning.
- * TODO: remove gluster+rdma interface support
- */
-static BlockDriver bdrv_gluster_rdma = {
-    .format_name                  = "gluster",
-    .protocol_name                = "gluster+rdma",
-    .instance_size                = sizeof(BDRVGlusterState),
-    .bdrv_file_open               = qemu_gluster_open,
-    .bdrv_reopen_prepare          = qemu_gluster_reopen_prepare,
-    .bdrv_reopen_commit           = qemu_gluster_reopen_commit,
-    .bdrv_reopen_abort            = qemu_gluster_reopen_abort,
-    .bdrv_close                   = qemu_gluster_close,
-    .bdrv_co_create               = qemu_gluster_co_create,
-    .bdrv_co_create_opts          = qemu_gluster_co_create_opts,
-    .bdrv_co_getlength            = qemu_gluster_co_getlength,
-    .bdrv_co_get_allocated_file_size = qemu_gluster_co_get_allocated_file_size,
-    .bdrv_co_truncate             = qemu_gluster_co_truncate,
-    .bdrv_co_readv                = qemu_gluster_co_readv,
-    .bdrv_co_writev               = qemu_gluster_co_writev,
-    .bdrv_co_flush_to_disk        = qemu_gluster_co_flush_to_disk,
-#ifdef CONFIG_GLUSTERFS_DISCARD
-    .bdrv_co_pdiscard             = qemu_gluster_co_pdiscard,
-#endif
-#ifdef CONFIG_GLUSTERFS_ZEROFILL
-    .bdrv_co_pwrite_zeroes        = qemu_gluster_co_pwrite_zeroes,
-#endif
-    .bdrv_co_block_status         = qemu_gluster_co_block_status,
-    .bdrv_refresh_limits          = qemu_gluster_refresh_limits,
-    .create_opts                  = &qemu_gluster_create_opts,
-    .strong_runtime_opts          = gluster_strong_open_opts,
-};
-
 static void bdrv_gluster_init(void)
 {
-    bdrv_register(&bdrv_gluster_rdma);
     bdrv_register(&bdrv_gluster_unix);
     bdrv_register(&bdrv_gluster_tcp);
     bdrv_register(&bdrv_gluster);
-- 
2.41.0




* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-03-28 13:02 ` [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling Philippe Mathieu-Daudé
@ 2024-03-28 14:18   ` Fabiano Rosas
  2024-03-28 15:01     ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Fabiano Rosas @ 2024-03-28 14:18 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-devel
  Cc: Yuval Shaia, Kevin Wolf, Prasanna Kumar Kalever, Cornelia Huck,
	Michael Roth, Li Zhijian, Prasanna Kumar Kalever, Peter Xu,
	integration, Paolo Bonzini, qemu-block, Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Philippe Mathieu-Daudé,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	Peter Maydell

Philippe Mathieu-Daudé <philmd@linaro.org> writes:

> The whole RDMA subsystem was deprecated in commit e9a54265f5
> ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
> released in v8.2.
>
> Remove:
>  - RDMA handling from migration
>  - dependencies on libibumad, libibverbs and librdmacm
>
> Keep the RAM_SAVE_FLAG_HOOK definition since it might appear
> in old migration streams.
>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Li Zhijian <lizhijian@fujitsu.com>
> Acked-by: Fabiano Rosas <farosas@suse.de>
> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>

Just to be clear, because people raised the point in the last version,
the first link in the deprecation commit points to a thread consisting
entirely of rdma migration patches. I don't see any ambiguity on whether
the deprecation was intended to include migration. There's even an ack
from Juan.

So on the basis of not reverting the previous maintainer's decision, my
Ack stands here.

We also had pretty obvious bugs ([1], [2]) in the past that would have
been caught if we had any kind of testing for the feature, so I can't
even say this thing works currently.

@Peter Xu, @Li Zhijian, what are your thoughts on this?

1- https://lore.kernel.org/r/20230920090412.726725-1-lizhijian@fujitsu.com
2- https://lore.kernel.org/r/CAHEcVy7HXSwn4Ow_Kog+Q+TN6f_kMeiCHevz1qGM-fbxBPp1hQ@mail.gmail.com




* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-03-28 14:18   ` Fabiano Rosas
@ 2024-03-28 15:01     ` Peter Xu
  2024-03-28 15:22       ` Thomas Huth
  2024-03-29  1:53       ` Zhijian Li (Fujitsu) via
  0 siblings, 2 replies; 52+ messages in thread
From: Peter Xu @ 2024-03-28 15:01 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Philippe Mathieu-Daudé,
	qemu-devel, Yuval Shaia, Kevin Wolf, Prasanna Kumar Kalever,
	Cornelia Huck, Michael Roth, Li Zhijian, Prasanna Kumar Kalever,
	integration, Paolo Bonzini, qemu-block, Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	Peter Maydell, Yu Zhang

On Thu, Mar 28, 2024 at 11:18:04AM -0300, Fabiano Rosas wrote:
> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
> 
> > The whole RDMA subsystem was deprecated in commit e9a54265f5
> > ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
> > released in v8.2.
> >
> > Remove:
> >  - RDMA handling from migration
> >  - dependencies on libibumad, libibverbs and librdmacm
> >
> > Keep the RAM_SAVE_FLAG_HOOK definition since it might appear
> > in old migration streams.
> >
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Li Zhijian <lizhijian@fujitsu.com>
> > Acked-by: Fabiano Rosas <farosas@suse.de>
> > Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> 
> Just to be clear, because people raised the point in the last version,
> the first link in the deprecation commit points to a thread consisting
> entirely of rdma migration patches. I don't see any ambiguity on whether
> the deprecation was intended to include migration. There's even an ack
> from Juan.

Yes I remember that's the plan.

> 
> So on the basis of not reverting the previous maintainer's decision, my
> Ack stands here.
> 
> We also had pretty obvious bugs ([1], [2]) in the past that would have
> been caught if we had any kind of testing for the feature, so I can't
> even say this thing works currently.
> 
> @Peter Xu, @Li Zhijian, what are your thoughts on this?

Generally I definitely agree with such a removal sooner or later, as that's
how deprecation works, and even after Juan's left I'm not aware of any
other new RDMA users.  Personally, I'd slightly prefer postponing it one
more release, which might help our downstream maintenance a bit; however,
I assume that's not a blocker either, as I think we can also manage it.

IMHO it's more important to know whether there are still users and whether
they would still like to see it around. One thing I also notice is that
e9a54265f533f hasn't yet received acks from the RDMA users we are aware of,
even if they're rare. According to [2], it could be that such users rely
only on the release versions of QEMU, which is when they saw things break.

So I'm copying Yu too (while Zhijian is already in the loop), just in case
someone would like to stand up and speak.

Thanks,

> 
> 1- https://lore.kernel.org/r/20230920090412.726725-1-lizhijian@fujitsu.com
> 2- https://lore.kernel.org/r/CAHEcVy7HXSwn4Ow_Kog+Q+TN6f_kMeiCHevz1qGM-fbxBPp1hQ@mail.gmail.com
> 

-- 
Peter Xu




* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-03-28 15:01     ` Peter Xu
@ 2024-03-28 15:22       ` Thomas Huth
  2024-03-28 19:04         ` Peter Xu
  2024-03-29  1:53       ` Zhijian Li (Fujitsu) via
  1 sibling, 1 reply; 52+ messages in thread
From: Thomas Huth @ 2024-03-28 15:22 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Philippe Mathieu-Daudé,
	qemu-devel, Yuval Shaia, Kevin Wolf, Prasanna Kumar Kalever,
	Cornelia Huck, Michael Roth, Li Zhijian, Prasanna Kumar Kalever,
	integration, Paolo Bonzini, qemu-block, Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Eric Blake, Song Gao,
	Marc-André Lureau, Markus Armbruster, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, Peter Maydell,
	Yu Zhang

On 28/03/2024 16.01, Peter Xu wrote:
> On Thu, Mar 28, 2024 at 11:18:04AM -0300, Fabiano Rosas wrote:
>> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>>
>>> The whole RDMA subsystem was deprecated in commit e9a54265f5
>>> ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
>>> released in v8.2.
>>>
>>> Remove:
>>>   - RDMA handling from migration
>>>   - dependencies on libibumad, libibverbs and librdmacm
>>>
>>> Keep the RAM_SAVE_FLAG_HOOK definition since it might appear
>>> in old migration streams.
>>>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Li Zhijian <lizhijian@fujitsu.com>
>>> Acked-by: Fabiano Rosas <farosas@suse.de>
>>> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
>>
>> Just to be clear, because people raised the point in the last version,
>> the first link in the deprecation commit points to a thread consisting
>> entirely of rdma migration patches. I don't see any ambiguity on whether
>> the deprecation was intended to include migration. There's even an ack
>> from Juan.
> 
> Yes I remember that's the plan.
> 
>>
>> So on the basis of not reverting the previous maintainer's decision, my
>> Ack stands here.
>>
>> We also had pretty obvious bugs ([1], [2]) in the past that would have
>> been caught if we had any kind of testing for the feature, so I can't
>> even say this thing works currently.
>>
>> @Peter Xu, @Li Zhijian, what are your thoughts on this?
> 
> Generally I definitely agree with such a removal sooner or later, as that's
> how deprecation works, and even after Juan's left I'm not aware of any
> other new RDMA users.  Personally, I'd slightly prefer postponing it one
> more release which might help a bit of our downstream maintenance, however
> I assume that's not a blocker either, as I think we can also manage it.
> 
> IMHO it's more important to know whether there are still users and whether
> they would still like to see it around.

Since e9a54265f5 was not very clear about rdma migration code, should we 
maybe rather add a separate deprecation note for the migration part, and add 
a proper warning message to the migration code in case someone tries to use 
it there, and then only remove the rdma migration code after two more releases?
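
For illustration only, a minimal standalone sketch of the kind of "warn when someone actually uses it" check described above. It is not QEMU code: the helper names and the message text are hypothetical, and it assumes the RDMA transport is selected with a URI starting with "rdma:". Inside QEMU one would presumably use the existing error-reporting helpers (for example warn_report_once()) rather than fprintf().

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: true if the migration URI selects the RDMA transport
 * (assumes the "rdma:" URI prefix). */
static bool uri_is_rdma(const char *uri)
{
    return strncmp(uri, "rdma:", 5) == 0;
}

/*
 * Hypothetical check, meant to be called where a migration URI is parsed.
 * Prints the deprecation warning at most once per process and returns true
 * only when this call printed it.
 */
static bool warn_if_rdma_deprecated(const char *uri, FILE *out)
{
    static bool warned;

    if (!uri_is_rdma(uri) || warned) {
        return false;
    }
    warned = true;
    fprintf(out, "warning: RDMA migration is deprecated and may be removed "
                 "in a future release\n");
    return true;
}
```

Warning once per process rather than on every call keeps logs readable when a management layer retries a migration, while still reaching users who never read the deprecation docs.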

  Thomas




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 1/3] hw/rdma: Remove pvrdma device and rdmacm-mux helper
  2024-03-28 13:02 ` [PATCH-for-9.1 v2 1/3] hw/rdma: Remove pvrdma device and rdmacm-mux helper Philippe Mathieu-Daudé
@ 2024-03-28 17:51   ` Thomas Huth
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Huth @ 2024-03-28 17:51 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-devel
  Cc: Yuval Shaia, Kevin Wolf, Prasanna Kumar Kalever, Fabiano Rosas,
	Cornelia Huck, Michael Roth, Li Zhijian, Prasanna Kumar Kalever,
	Peter Xu, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Eric Blake,
	Marcel Apfelbaum, Song Gao, Dr. David Alan Gilbert,
	Eduardo Habkost, Yanan Wang, Marc-André Lureau,
	Markus Armbruster, Alex Bennée, Wainer dos Santos Moschetta,
	Beraldo Leal

On 28/03/2024 14.02, Philippe Mathieu-Daudé wrote:
> The whole RDMA subsystem was deprecated in commit e9a54265f5
> ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
> released in v8.2.
> 
> Remove:
>   - PVRDMA device
>   - generated vmw_pvrdma/ directory from linux-headers
>   - rdmacm-mux tool from contrib/
> 
> Cc: Yuval Shaia <yuval.shaia.ml@gmail.com>
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> ---

Reviewed-by: Thomas Huth <thuth@redhat.com>



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 3/3] block/gluster: Remove RDMA protocol handling
  2024-03-28 13:02 ` [PATCH-for-9.1 v2 3/3] block/gluster: " Philippe Mathieu-Daudé
@ 2024-03-28 17:54   ` Thomas Huth
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Huth @ 2024-03-28 17:54 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-devel
  Cc: Yuval Shaia, Kevin Wolf, Prasanna Kumar Kalever, Fabiano Rosas,
	Cornelia Huck, Michael Roth, Li Zhijian, Prasanna Kumar Kalever,
	Peter Xu, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Eric Blake

On 28/03/2024 14.02, Philippe Mathieu-Daudé wrote:
> GlusterFS+RDMA has been deprecated 8 years ago in commit
> 0552ff2465 ("block/gluster: deprecate rdma support"):
> 
>    gluster volfile server fetch happens through unix and/or tcp,
>    it doesn't support volfile fetch over rdma. The rdma code may
>    actually mislead, so to make sure things do not break, for now
>    we fallback to tcp when requested for rdma, with a warning.
> 
>    If you are wondering how this worked all these days, its the
>    gluster libgfapi code which handles anything other than unix
>    transport as socket/tcp, sad but true.
> 
> Besides, the whole RDMA subsystem was deprecated in commit
> e9a54265f5 ("hw/rdma: Deprecate the pvrdma device and the rdma
> subsystem") released in v8.2.
> 
> Cc: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> ---
>   docs/system/device-url-syntax.rst.inc  |  4 +--
>   docs/system/qemu-block-drivers.rst.inc |  1 -
>   block/gluster.c                        | 39 --------------------------
>   3 files changed, 2 insertions(+), 42 deletions(-)

Reviewed-by: Thomas Huth <thuth@redhat.com>



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-03-28 15:22       ` Thomas Huth
@ 2024-03-28 19:04         ` Peter Xu
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2024-03-28 19:04 UTC (permalink / raw)
  To: Thomas Huth
  Cc: Fabiano Rosas, Philippe Mathieu-Daudé,
	qemu-devel, Yuval Shaia, Kevin Wolf, Prasanna Kumar Kalever,
	Cornelia Huck, Michael Roth, Li Zhijian, Prasanna Kumar Kalever,
	integration, Paolo Bonzini, qemu-block, Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Eric Blake, Song Gao,
	Marc-André Lureau, Markus Armbruster, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, Peter Maydell,
	Yu Zhang

On Thu, Mar 28, 2024 at 04:22:27PM +0100, Thomas Huth wrote:
> Since e9a54265f5 was not very clear about rdma migration code, should we
> maybe rather add a separate deprecation note for the migration part, and add
> a proper warning message to the migration code in case someone tries to use
> it there, and then only remove the rdma migration code after two more
> releases?

Definitely a valid option to me.

So far RDMA isn't covered by tests (the same is true of COLO, and I wonder
what our position on COLO should be in this case), so unfortunately we don't
even know when it will break, just like before.

From other activities that I can see when new code comes, maintaining RDMA
code should be fairly manageable so far (and whoever writes new RDMA
code in those two releases will also need to take on the maintainer's
role). We did it for those years, and we can keep that up for two more
releases. Hopefully such warnings can ring a louder alarm for the current
users, so that people can either stick with old binaries, or invest
developer/test resources in the community.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-03-28 15:01     ` Peter Xu
  2024-03-28 15:22       ` Thomas Huth
@ 2024-03-29  1:53       ` Zhijian Li (Fujitsu) via
  2024-03-29 10:28         ` Philippe Mathieu-Daudé
  1 sibling, 1 reply; 52+ messages in thread
From: Zhijian Li (Fujitsu) via @ 2024-03-29  1:53 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: Philippe Mathieu-Daudé,
	qemu-devel, Yuval Shaia, Kevin Wolf, Prasanna Kumar Kalever,
	Cornelia Huck, Michael Roth, Prasanna Kumar Kalever, integration,
	Paolo Bonzini, qemu-block, Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	Peter Maydell, Yu Zhang



On 28/03/2024 23:01, Peter Xu wrote:
> On Thu, Mar 28, 2024 at 11:18:04AM -0300, Fabiano Rosas wrote:
>> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>>
>>> The whole RDMA subsystem was deprecated in commit e9a54265f5
>>> ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
>>> released in v8.2.
>>>
>>> Remove:
>>>   - RDMA handling from migration
>>>   - dependencies on libibumad, libibverbs and librdmacm
>>>
>>> Keep the RAM_SAVE_FLAG_HOOK definition since it might appears
>>> in old migration streams.
>>>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Li Zhijian <lizhijian@fujitsu.com>
>>> Acked-by: Fabiano Rosas <farosas@suse.de>
>>> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
>>
>> Just to be clear, because people raised the point in the last version,
>> the first link in the deprecation commit links to a thread comprising
>> entirely of rdma migration patches. I don't see any ambiguity on whether
>> the deprecation was intended to include migration. There's even an ack
>> from Juan.
> 
> Yes I remember that's the plan.
> 
>>
>> So on the basis of not reverting the previous maintainer's decision, my
>> Ack stands here.
>>
>> We also had pretty obvious bugs ([1], [2]) in the past that would have
>> been caught if we had any kind of testing for the feature, so I can't
>> even say this thing works currently.
>>
>> @Peter Xu, @Li Zhijian, what are your thoughts on this?
> 
> Generally I definitely agree with such a removal sooner or later, as that's
> how deprecation works, and even after Juan's left I'm not aware of any
> other new RDMA users.  Personally, I'd slightly prefer postponing it one
> more release which might help a bit of our downstream maintenance, however
> I assume that's not a blocker either, as I think we can also manage it.
> 
> IMHO it's more important to know whether there are still users and whether
> they would still like to see it around. That's also one thing I notice that
> e9a54265f533f didn't yet get acks from RDMA users that we are aware, even
> if they're rare. According to [2] it could be that such user may only rely
> on the release versions of QEMU when it broke things.
> 
> So I'm copying Yu too (while Zhijian is already in the loop), just in case
> someone would like to stand up and speak.


I admit RDMA migration lacked testing (unit/CI tests), which led to a few
obvious bugs being noticed too late.
However, I was a bit surprised when I saw the removal of the RDMA migration. As far
as I was aware, this feature had not been marked as deprecated (at least there is no
prompt to the end user).


> IMHO it's more important to know whether there are still users and whether
> they would still like to see it around.

Agree.
I didn't immediately express my opinion on v1 because I'm also consulting our
customers about this feature going forward.

Personally, I agree with Peter's idea that "I'd slightly prefer postponing it one
more release which might help a bit of our downstream maintenance".

Thanks
Zhijian

> 
> Thanks,
> 
>>
>> 1- https://lore.kernel.org/r/20230920090412.726725-1-lizhijian@fujitsu.com
>> 2- https://lore.kernel.org/r/CAHEcVy7HXSwn4Ow_Kog+Q+TN6f_kMeiCHevz1qGM-fbxBPp1hQ@mail.gmail.com
>>
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 0/3] rdma: Remove RDMA subsystem and pvrdma device
  2024-03-28 13:02 [PATCH-for-9.1 v2 0/3] rdma: Remove RDMA subsystem and pvrdma device Philippe Mathieu-Daudé
                   ` (2 preceding siblings ...)
  2024-03-28 13:02 ` [PATCH-for-9.1 v2 3/3] block/gluster: " Philippe Mathieu-Daudé
@ 2024-03-29  9:17 ` Michael S. Tsirkin
  2024-04-03  9:37 ` Philippe Mathieu-Daudé
  4 siblings, 0 replies; 52+ messages in thread
From: Michael S. Tsirkin @ 2024-03-29  9:17 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: qemu-devel, Yuval Shaia, Kevin Wolf, Prasanna Kumar Kalever,
	Fabiano Rosas, Cornelia Huck, Michael Roth, Li Zhijian,
	Prasanna Kumar Kalever, Peter Xu, integration, Paolo Bonzini,
	qemu-block, Daniel P. Berrangé,
	devel, Hanna Reitz, Thomas Huth, Eric Blake

On Thu, Mar 28, 2024 at 02:02:52PM +0100, Philippe Mathieu-Daudé wrote:
> Since v1:
> - split in 3 (Thomas)
> - justify gluster removal


Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

> Philippe Mathieu-Daudé (3):
>   hw/rdma: Remove pvrdma device and rdmacm-mux helper
>   migration: Remove RDMA protocol handling
>   block/gluster: Remove RDMA protocol handling
> 
>  MAINTAINERS                                   |   17 -
>  docs/about/deprecated.rst                     |    9 -
>  docs/about/removed-features.rst               |    4 +
>  docs/devel/migration/main.rst                 |    6 -
>  docs/pvrdma.txt                               |  345 --
>  docs/rdma.txt                                 |  420 --
>  docs/system/device-url-syntax.rst.inc         |    4 +-
>  docs/system/loongarch/virt.rst                |    2 +-
>  docs/system/qemu-block-drivers.rst.inc        |    1 -
>  meson.build                                   |   59 -
>  qapi/machine.json                             |   17 -
>  qapi/migration.json                           |   31 +-
>  qapi/qapi-schema.json                         |    1 -
>  qapi/rdma.json                                |   38 -
>  contrib/rdmacm-mux/rdmacm-mux.h               |   61 -
>  hw/rdma/rdma_backend.h                        |  129 -
>  hw/rdma/rdma_backend_defs.h                   |   76 -
>  hw/rdma/rdma_rm.h                             |   97 -
>  hw/rdma/rdma_rm_defs.h                        |  146 -
>  hw/rdma/rdma_utils.h                          |   63 -
>  hw/rdma/trace.h                               |    1 -
>  hw/rdma/vmw/pvrdma.h                          |  144 -
>  hw/rdma/vmw/pvrdma_dev_ring.h                 |   46 -
>  hw/rdma/vmw/pvrdma_qp_ops.h                   |   28 -
>  hw/rdma/vmw/trace.h                           |    1 -
>  include/hw/rdma/rdma.h                        |   37 -
>  include/monitor/hmp.h                         |    1 -
>  .../infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h |  685 ---
>  .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.h   |  348 --
>  .../standard-headers/rdma/vmw_pvrdma-abi.h    |  310 --
>  migration/migration-stats.h                   |    6 +-
>  migration/migration.h                         |    9 -
>  migration/options.h                           |    2 -
>  migration/rdma.h                              |   69 -
>  block/gluster.c                               |   39 -
>  contrib/rdmacm-mux/main.c                     |  831 ----
>  hw/core/machine-qmp-cmds.c                    |   32 -
>  hw/rdma/rdma.c                                |   30 -
>  hw/rdma/rdma_backend.c                        | 1401 ------
>  hw/rdma/rdma_rm.c                             |  812 ----
>  hw/rdma/rdma_utils.c                          |  126 -
>  hw/rdma/vmw/pvrdma_cmd.c                      |  815 ----
>  hw/rdma/vmw/pvrdma_dev_ring.c                 |  141 -
>  hw/rdma/vmw/pvrdma_main.c                     |  735 ---
>  hw/rdma/vmw/pvrdma_qp_ops.c                   |  298 --
>  migration/migration-stats.c                   |    5 +-
>  migration/migration.c                         |   31 -
>  migration/options.c                           |   16 -
>  migration/qemu-file.c                         |    1 -
>  migration/ram.c                               |   86 +-
>  migration/rdma.c                              | 4184 -----------------
>  migration/savevm.c                            |    2 +-
>  monitor/qmp-cmds.c                            |    1 -
>  Kconfig.host                                  |    3 -
>  contrib/rdmacm-mux/meson.build                |    7 -
>  hmp-commands-info.hx                          |   13 -
>  hw/Kconfig                                    |    1 -
>  hw/meson.build                                |    1 -
>  hw/rdma/Kconfig                               |    3 -
>  hw/rdma/meson.build                           |   12 -
>  hw/rdma/trace-events                          |   31 -
>  hw/rdma/vmw/trace-events                      |   17 -
>  meson_options.txt                             |    4 -
>  migration/meson.build                         |    1 -
>  migration/trace-events                        |   68 +-
>  qapi/meson.build                              |    1 -
>  qemu-options.hx                               |    6 -
>  .../org.centos/stream/8/build-environment.yml |    1 -
>  .../ci/org.centos/stream/8/x86_64/configure   |    3 -
>  scripts/ci/setup/build-environment.yml        |    4 -
>  scripts/coverity-scan/run-coverity-scan       |    2 +-
>  scripts/meson-buildoptions.sh                 |    6 -
>  scripts/update-linux-headers.sh               |   27 -
>  tests/lcitool/projects/qemu.yml               |    3 -
>  tests/migration/guestperf/engine.py           |    4 +-
>  75 files changed, 20 insertions(+), 12997 deletions(-)
>  delete mode 100644 docs/pvrdma.txt
>  delete mode 100644 docs/rdma.txt
>  delete mode 100644 qapi/rdma.json
>  delete mode 100644 contrib/rdmacm-mux/rdmacm-mux.h
>  delete mode 100644 hw/rdma/rdma_backend.h
>  delete mode 100644 hw/rdma/rdma_backend_defs.h
>  delete mode 100644 hw/rdma/rdma_rm.h
>  delete mode 100644 hw/rdma/rdma_rm_defs.h
>  delete mode 100644 hw/rdma/rdma_utils.h
>  delete mode 100644 hw/rdma/trace.h
>  delete mode 100644 hw/rdma/vmw/pvrdma.h
>  delete mode 100644 hw/rdma/vmw/pvrdma_dev_ring.h
>  delete mode 100644 hw/rdma/vmw/pvrdma_qp_ops.h
>  delete mode 100644 hw/rdma/vmw/trace.h
>  delete mode 100644 include/hw/rdma/rdma.h
>  delete mode 100644 include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h
>  delete mode 100644 include/standard-headers/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h
>  delete mode 100644 include/standard-headers/rdma/vmw_pvrdma-abi.h
>  delete mode 100644 migration/rdma.h
>  delete mode 100644 contrib/rdmacm-mux/main.c
>  delete mode 100644 hw/rdma/rdma.c
>  delete mode 100644 hw/rdma/rdma_backend.c
>  delete mode 100644 hw/rdma/rdma_rm.c
>  delete mode 100644 hw/rdma/rdma_utils.c
>  delete mode 100644 hw/rdma/vmw/pvrdma_cmd.c
>  delete mode 100644 hw/rdma/vmw/pvrdma_dev_ring.c
>  delete mode 100644 hw/rdma/vmw/pvrdma_main.c
>  delete mode 100644 hw/rdma/vmw/pvrdma_qp_ops.c
>  delete mode 100644 migration/rdma.c
>  delete mode 100644 contrib/rdmacm-mux/meson.build
>  delete mode 100644 hw/rdma/Kconfig
>  delete mode 100644 hw/rdma/meson.build
>  delete mode 100644 hw/rdma/trace-events
>  delete mode 100644 hw/rdma/vmw/trace-events
> 
> -- 
> 2.41.0



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-03-29  1:53       ` Zhijian Li (Fujitsu) via
@ 2024-03-29 10:28         ` Philippe Mathieu-Daudé
  2024-03-29 19:44           ` Daniel P. Berrangé
  2024-04-01  7:55           ` Zhijian Li (Fujitsu) via
  0 siblings, 2 replies; 52+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-03-29 10:28 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu), Peter Xu, Fabiano Rosas
  Cc: qemu-devel, Yuval Shaia, Kevin Wolf, Prasanna Kumar Kalever,
	Cornelia Huck, Michael Roth, Prasanna Kumar Kalever, integration,
	Paolo Bonzini, qemu-block, Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	Peter Maydell, Yu Zhang

Hi Zhijian,

On 29/3/24 02:53, Zhijian Li (Fujitsu) wrote:
> 
> 
> On 28/03/2024 23:01, Peter Xu wrote:
>> On Thu, Mar 28, 2024 at 11:18:04AM -0300, Fabiano Rosas wrote:
>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>>>
>>>> The whole RDMA subsystem was deprecated in commit e9a54265f5
>>>> ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
>>>> released in v8.2.
>>>>
>>>> Remove:
>>>>    - RDMA handling from migration
>>>>    - dependencies on libibumad, libibverbs and librdmacm
>>>>
>>>> Keep the RAM_SAVE_FLAG_HOOK definition since it might appears
>>>> in old migration streams.
>>>>
>>>> Cc: Peter Xu <peterx@redhat.com>
>>>> Cc: Li Zhijian <lizhijian@fujitsu.com>
>>>> Acked-by: Fabiano Rosas <farosas@suse.de>
>>>> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
>>>
>>> Just to be clear, because people raised the point in the last version,
>>> the first link in the deprecation commit links to a thread comprising
>>> entirely of rdma migration patches. I don't see any ambiguity on whether
>>> the deprecation was intended to include migration. There's even an ack
>>> from Juan.
>>
>> Yes I remember that's the plan.
>>
>>>
>>> So on the basis of not reverting the previous maintainer's decision, my
>>> Ack stands here.
>>>
>>> We also had pretty obvious bugs ([1], [2]) in the past that would have
>>> been caught if we had any kind of testing for the feature, so I can't
>>> even say this thing works currently.
>>>
>>> @Peter Xu, @Li Zhijian, what are your thoughts on this?
>>
>> Generally I definitely agree with such a removal sooner or later, as that's
>> how deprecation works, and even after Juan's left I'm not aware of any
>> other new RDMA users.  Personally, I'd slightly prefer postponing it one
>> more release which might help a bit of our downstream maintenance, however
>> I assume that's not a blocker either, as I think we can also manage it.
>>
>> IMHO it's more important to know whether there are still users and whether
>> they would still like to see it around. That's also one thing I notice that
>> e9a54265f533f didn't yet get acks from RDMA users that we are aware, even
>> if they're rare. According to [2] it could be that such user may only rely
>> on the release versions of QEMU when it broke things.
>>
>> So I'm copying Yu too (while Zhijian is already in the loop), just in case
>> someone would like to stand up and speak.
> 
> 
> I admit RDMA migration was lack of testing(unit/CI test), which led to the a few
> obvious bugs being noticed too late.
> However I was a bit surprised when I saw the removal of the RDMA migration. I wasn't
> aware that this feature has not been marked as deprecated(at least there is no
> prompt to end-user).
> 
> 
>> IMHO it's more important to know whether there are still users and whether
>> they would still like to see it around.
> 
> Agree.
> I didn't immediately express my opinion in V1 because I'm also consulting our
> customers for this feature in the future.
> 
> Personally, I agree with Perter's idea that "I'd slightly prefer postponing it one
> more release which might help a bit of our downstream maintenance"

Do you mind posting a deprecation patch to clarify the situation?

Thanks,

Phil.

> 
> Thanks
> Zhijian
> 
>>
>> Thanks,
>>
>>>
>>> 1- https://lore.kernel.org/r/20230920090412.726725-1-lizhijian@fujitsu.com
>>> 2- https://lore.kernel.org/r/CAHEcVy7HXSwn4Ow_Kog+Q+TN6f_kMeiCHevz1qGM-fbxBPp1hQ@mail.gmail.com
>>>



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-03-29 10:28         ` Philippe Mathieu-Daudé
@ 2024-03-29 19:44           ` Daniel P. Berrangé
  2024-04-01  7:55           ` Zhijian Li (Fujitsu) via
  1 sibling, 0 replies; 52+ messages in thread
From: Daniel P. Berrangé @ 2024-03-29 19:44 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Zhijian Li (Fujitsu),
	Peter Xu, Fabiano Rosas, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	Peter Maydell, Yu Zhang

On Fri, Mar 29, 2024 at 11:28:54AM +0100, Philippe Mathieu-Daudé wrote:
> Hi Zhijian,
> 
> On 29/3/24 02:53, Zhijian Li (Fujitsu) wrote:
> > 
> > 
> > On 28/03/2024 23:01, Peter Xu wrote:
> > > On Thu, Mar 28, 2024 at 11:18:04AM -0300, Fabiano Rosas wrote:
> > > > Philippe Mathieu-Daudé <philmd@linaro.org> writes:
> > > > 
> > > > > The whole RDMA subsystem was deprecated in commit e9a54265f5
> > > > > ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
> > > > > released in v8.2.
> > > > > 
> > > > > Remove:
> > > > >    - RDMA handling from migration
> > > > >    - dependencies on libibumad, libibverbs and librdmacm
> > > > > 
> > > > > Keep the RAM_SAVE_FLAG_HOOK definition since it might appears
> > > > > in old migration streams.
> > > > > 
> > > > > Cc: Peter Xu <peterx@redhat.com>
> > > > > Cc: Li Zhijian <lizhijian@fujitsu.com>
> > > > > Acked-by: Fabiano Rosas <farosas@suse.de>
> > > > > Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> > > > 
> > > > Just to be clear, because people raised the point in the last version,
> > > > the first link in the deprecation commit links to a thread comprising
> > > > entirely of rdma migration patches. I don't see any ambiguity on whether
> > > > the deprecation was intended to include migration. There's even an ack
> > > > from Juan.
> > > 
> > > Yes I remember that's the plan.
> > > 
> > > > 
> > > > So on the basis of not reverting the previous maintainer's decision, my
> > > > Ack stands here.
> > > > 
> > > > We also had pretty obvious bugs ([1], [2]) in the past that would have
> > > > been caught if we had any kind of testing for the feature, so I can't
> > > > even say this thing works currently.
> > > > 
> > > > @Peter Xu, @Li Zhijian, what are your thoughts on this?
> > > 
> > > Generally I definitely agree with such a removal sooner or later, as that's
> > > how deprecation works, and even after Juan's left I'm not aware of any
> > > other new RDMA users.  Personally, I'd slightly prefer postponing it one
> > > more release which might help a bit of our downstream maintenance, however
> > > I assume that's not a blocker either, as I think we can also manage it.
> > > 
> > > IMHO it's more important to know whether there are still users and whether
> > > they would still like to see it around. That's also one thing I notice that
> > > e9a54265f533f didn't yet get acks from RDMA users that we are aware, even
> > > if they're rare. According to [2] it could be that such user may only rely
> > > on the release versions of QEMU when it broke things.
> > > 
> > > So I'm copying Yu too (while Zhijian is already in the loop), just in case
> > > someone would like to stand up and speak.
> > 
> > 
> > I admit RDMA migration was lack of testing(unit/CI test), which led to the a few
> > obvious bugs being noticed too late.
> > However I was a bit surprised when I saw the removal of the RDMA migration. I wasn't
> > aware that this feature has not been marked as deprecated(at least there is no
> > prompt to end-user).
> > 
> > 
> > > IMHO it's more important to know whether there are still users and whether
> > > they would still like to see it around.
> > 
> > Agree.
> > I didn't immediately express my opinion in V1 because I'm also consulting our
> > customers for this feature in the future.
> > 
> > Personally, I agree with Perter's idea that "I'd slightly prefer postponing it one
> > more release which might help a bit of our downstream maintenance"
> 
> Do you mind posting a deprecation patch to clarify the situation?

The key thing the first deprecation patch missed was that it failed
to issue a warning message when RDMA migration was actually used.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-03-29 10:28         ` Philippe Mathieu-Daudé
  2024-03-29 19:44           ` Daniel P. Berrangé
@ 2024-04-01  7:55           ` Zhijian Li (Fujitsu) via
  2024-04-01 21:26             ` Yu Zhang
  1 sibling, 1 reply; 52+ messages in thread
From: Zhijian Li (Fujitsu) via @ 2024-04-01  7:55 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, Peter Xu, Fabiano Rosas
  Cc: qemu-devel, Yuval Shaia, Kevin Wolf, Prasanna Kumar Kalever,
	Cornelia Huck, Michael Roth, Prasanna Kumar Kalever, integration,
	Paolo Bonzini, qemu-block, Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	Peter Maydell, Yu Zhang

Phil,

on 3/29/2024 6:28 PM, Philippe Mathieu-Daudé wrote:
>>
>>
>>> IMHO it's more important to know whether there are still users and 
>>> whether
>>> they would still like to see it around.
>>
>> Agree.
>> I didn't immediately express my opinion in V1 because I'm also 
>> consulting our
>> customers for this feature in the future.
>>
>> Personally, I agree with Perter's idea that "I'd slightly prefer 
>> postponing it one
>> more release which might help a bit of our downstream maintenance"
>
> Do you mind posting a deprecation patch to clarify the situation?
>

No problem, I just posted a deprecation patch; please take a look.
https://lore.kernel.org/qemu-devel/20240401035947.3310834-1-lizhijian@fujitsu.com/T/#u

Thanks
Zhijian

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-01  7:55           ` Zhijian Li (Fujitsu) via
@ 2024-04-01 21:26             ` Yu Zhang
  2024-04-02 21:23               ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Yu Zhang @ 2024-04-01 21:26 UTC (permalink / raw)
  To: Peter Xu, Zhijian Li (Fujitsu), Jinpu Wang, Elmar Gerdes
  Cc: qemu-devel, Yuval Shaia, Kevin Wolf, Prasanna Kumar Kalever,
	Cornelia Huck, Michael Roth, Prasanna Kumar Kalever, integration,
	Paolo Bonzini, qemu-block, Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal

Hello Peter and Zhijian,

Thank you so much for letting me know about this. I'm also a bit surprised at
the plan for deprecating the RDMA migration subsystem.

> IMHO it's more important to know whether there are still users and whether
> they would still like to see it around.

> I admit RDMA migration was lack of testing(unit/CI test), which led to the a few
> obvious bugs being noticed too late.

Yes, we are a user of this subsystem. I was unaware of the lack of test coverage
for this part. As soon as 8.2 was released, I saw that many of the migration test
cases failed and came to realize that there might be a bug between 8.1 and 8.2, but
was unable to confirm and report it quickly to you.

The maintenance of this part could be too costly or difficult from your point of view.

My concern is that this plan will force a few QEMU users (not sure how many) like us
either to stick to RDMA migration by using an increasingly older version of QEMU,
or to abandon the RDMA migration we currently use.

Best regards,
Yu Zhang

On Mon, Apr 1, 2024 at 9:56 AM Zhijian Li (Fujitsu)
<lizhijian@fujitsu.com> wrote:
>
> Phil,
>
> on 3/29/2024 6:28 PM, Philippe Mathieu-Daudé wrote:
> >>
> >>
> >>> IMHO it's more important to know whether there are still users and
> >>> whether
> >>> they would still like to see it around.
> >>
> >> Agree.
> >> I didn't immediately express my opinion in V1 because I'm also
> >> consulting our
> >> customers for this feature in the future.
> >>
> >> Personally, I agree with Perter's idea that "I'd slightly prefer
> >> postponing it one
> >> more release which might help a bit of our downstream maintenance"
> >
> > Do you mind posting a deprecation patch to clarify the situation?
> >
>
> No problem, i just posted a deprecation patch, please take a look.
> https://lore.kernel.org/qemu-devel/20240401035947.3310834-1-lizhijian@fujitsu.com/T/#u
>
> Thanks
> Zhijian


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-01 21:26             ` Yu Zhang
@ 2024-04-02 21:23               ` Peter Xu
  2024-04-08 14:07                 ` Jinpu Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2024-04-02 21:23 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Zhijian Li (Fujitsu),
	Jinpu Wang, Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal

On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> Hello Peter und Zhjian,
> 
> Thank you so much for letting me know about this. I'm also a bit surprised at
> the plan for deprecating the RDMA migration subsystem.

It's not too late. Since it looks like we do have users who were not yet notified
of this, we'll redo the deprecation procedure; even if removal ends up being the
final plan, it will happen two releases after this.

> 
> > IMHO it's more important to know whether there are still users and whether
> > they would still like to see it around.
> 
> > I admit RDMA migration was lack of testing(unit/CI test), which led to the a few
> > obvious bugs being noticed too late.
> 
> Yes, we are a user of this subsystem. I was unaware of the lack of test coverage
> for this part. As soon as 8.2 was released, I saw that many of the
> migration test
> cases failed and came to realize that there might be a bug between 8.1
> and 8.2, but
> was unable to confirm and report it quickly to you.
> 
> The maintenance of this part could be too costly or difficult from
> your point of view.

It may or may not be too costly; it's just that we need real users of RDMA
taking some care of it.  Having it broken for more than one release is
definitely a sign of a lack of users.  It suggests to the community that we
should consider dropping some features so that we can make the best use of
community resources for the things that may have a broader audience.

One major thing missing is an RDMA tester to guard all merges against
breaking the RDMA paths, hopefully in CI.  It should not rely on RDMA
hardware but just sanity check that the migration+rdma code runs fine.  RDMA
taught us this lesson, so we're requesting CI coverage for all other new
features that will be merged, at least for the migration subsystem; in the
future we plan not to merge anything that is not covered by CI unless
extremely necessary.

For sure CI is not the only missing part, but I'd say we should start with
it, then someone should also take care of the code even if only in
maintenance mode (no new feature to add on top).

> 
> My concern is, this plan will forces a few QEMU users (not sure how
> many) like us
> either to stick to the RDMA migration by using an increasingly older
> version of QEMU,
> or to abandon the currently used RDMA migration.

RDMA doesn't get new features anyway.  If there's a specific use case for
RDMA migration, would it work if such a scenario uses the old binary?  Is it
possible to switch to the TCP protocol with some good NICs?

To the best of our knowledge, RDMA users are rare; please let us know if
you are aware of such users.  IIUC the major reason why RDMA stopped being
the trend is that networks are not what they were ten years ago.  I don't
have good knowledge of RDMA or networking at all, but my understanding is
that it's pretty easy to find a modern NIC that outperforms RDMA, so it may
make little sense to maintain multiple protocols, considering that the RDMA
migration code is so special that it has the most custom code compared to
other protocols.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 0/3] rdma: Remove RDMA subsystem and pvrdma device
  2024-03-28 13:02 [PATCH-for-9.1 v2 0/3] rdma: Remove RDMA subsystem and pvrdma device Philippe Mathieu-Daudé
                   ` (3 preceding siblings ...)
  2024-03-29  9:17 ` [PATCH-for-9.1 v2 0/3] rdma: Remove RDMA subsystem and pvrdma device Michael S. Tsirkin
@ 2024-04-03  9:37 ` Philippe Mathieu-Daudé
  4 siblings, 0 replies; 52+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-04-03  9:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Yuval Shaia, Kevin Wolf, Prasanna Kumar Kalever, Fabiano Rosas,
	Cornelia Huck, Michael Roth, Li Zhijian, Prasanna Kumar Kalever,
	Peter Xu, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake

On 28/3/24 14:02, Philippe Mathieu-Daudé wrote:
> Since v1:
> - split in 3 (Thomas)
> - justify gluster removal
> 
> Philippe Mathieu-Daudé (3):
>    hw/rdma: Remove pvrdma device and rdmacm-mux helper
>    migration: Remove RDMA protocol handling
>    block/gluster: Remove RDMA protocol handling

Patch 2 superseded by 
https://lore.kernel.org/qemu-devel/20240401035947.3310834-1-lizhijian@fujitsu.com/,
queuing patches 1 and 3 for 9.1, thanks.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-02 21:23               ` Peter Xu
@ 2024-04-08 14:07                 ` Jinpu Wang
  2024-04-08 16:18                   ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Jinpu Wang @ 2024-04-08 14:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: Yu Zhang, Zhijian Li (Fujitsu),
	Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	arei.gonglei, pannengyuan

Hi Peter,

On Tue, Apr 2, 2024 at 11:24 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > Hello Peter und Zhjian,
> >
> > Thank you so much for letting me know about this. I'm also a bit surprised at
> > the plan for deprecating the RDMA migration subsystem.
>
> It's not too late, since it looks like we do have users not yet notified
> from this, we'll redo the deprecation procedure even if it'll be the final
> plan, and it'll be 2 releases after this.
>
> >
> > > IMHO it's more important to know whether there are still users and whether
> > > they would still like to see it around.
> >
> > > I admit RDMA migration was lack of testing(unit/CI test), which led to the a few
> > > obvious bugs being noticed too late.
> >
> > Yes, we are a user of this subsystem. I was unaware of the lack of test coverage
> > for this part. As soon as 8.2 was released, I saw that many of the
> > migration test
> > cases failed and came to realize that there might be a bug between 8.1
> > and 8.2, but
> > was unable to confirm and report it quickly to you.
> >
> > The maintenance of this part could be too costly or difficult from
> > your point of view.
>
> It may or may not be too costly, it's just that we need real users of RDMA
> taking some care of it.  Having it broken easily for >1 releases definitely
> is a sign of lack of users.  It is an implication to the community that we
> should consider dropping some features so that we can get the best use of
> the community resources for the things that may have a broader audience.
>
> One thing majorly missing is a RDMA tester to guard all the merges to not
> break RDMA paths, hopefully in CI.  That should not rely on RDMA hardwares
> but just to sanity check the migration+rdma code running all fine.  RDMA
> taught us the lesson so we're requesting CI coverage for all other new
> features that will be merged at least for migration subsystem, so that we
> plan to not merge anything that is not covered by CI unless extremely
> necessary in the future.
>
> For sure CI is not the only missing part, but I'd say we should start with
> it, then someone should also take care of the code even if only in
> maintenance mode (no new feature to add on top).
>
> >
> > My concern is, this plan will forces a few QEMU users (not sure how
> > many) like us
> > either to stick to the RDMA migration by using an increasingly older
> > version of QEMU,
> > or to abandon the currently used RDMA migration.
>
> RDMA doesn't get new features anyway, if there's specific use case for RDMA
> migrations, would it work if such a scenario uses the old binary?  Is it
> possible to switch to the TCP protocol with some good NICs?
We have used RDMA migration with HCAs from Nvidia for years; our
experience is that RDMA migration works better than TCP (over IPoIB).

Switching back to TCP would bring back the old problems which were
solved by RDMA migration.

>
> Per our best knowledge, RDMA users are rare, and please let anyone know if
> you are aware of such users.  IIUC the major reason why RDMA stopped being
> the trend is because the network is not like ten years ago; I don't think I
> have good knowledge in RDMA at all nor network, but my understanding is
> it's pretty easy to fetch modern NIC to outperform RDMAs, then it may make
> little sense to maintain multiple protocols, considering RDMA migration
> code is so special so that it has the most custom code comparing to other
> protocols.
+cc some guys from Huawei.

I'm surprised RDMA users are rare; I guess maybe many are just
working with a different code base.
>
> Thanks,
>
> --
> Peter Xu

Thx!
Jinpu Wang
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-08 14:07                 ` Jinpu Wang
@ 2024-04-08 16:18                   ` Peter Xu
  2024-04-09  7:32                     ` Jinpu Wang
  2024-04-09  9:00                     ` Markus Armbruster
  0 siblings, 2 replies; 52+ messages in thread
From: Peter Xu @ 2024-04-08 16:18 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Yu Zhang, Zhijian Li (Fujitsu),
	Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	arei.gonglei, pannengyuan

On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> Hi Peter,

Jinpu,

Thanks for joining the discussion.

> 
> On Tue, Apr 2, 2024 at 11:24 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > Hello Peter und Zhjian,
> > >
> > > Thank you so much for letting me know about this. I'm also a bit surprised at
> > > the plan for deprecating the RDMA migration subsystem.
> >
> > It's not too late, since it looks like we do have users not yet notified
> > from this, we'll redo the deprecation procedure even if it'll be the final
> > plan, and it'll be 2 releases after this.
> >
> > >
> > > > IMHO it's more important to know whether there are still users and whether
> > > > they would still like to see it around.
> > >
> > > > I admit RDMA migration was lack of testing(unit/CI test), which led to the a few
> > > > obvious bugs being noticed too late.
> > >
> > > Yes, we are a user of this subsystem. I was unaware of the lack of test coverage
> > > for this part. As soon as 8.2 was released, I saw that many of the
> > > migration test
> > > cases failed and came to realize that there might be a bug between 8.1
> > > and 8.2, but
> > > was unable to confirm and report it quickly to you.
> > >
> > > The maintenance of this part could be too costly or difficult from
> > > your point of view.
> >
> > It may or may not be too costly, it's just that we need real users of RDMA
> > taking some care of it.  Having it broken easily for >1 releases definitely
> > is a sign of lack of users.  It is an implication to the community that we
> > should consider dropping some features so that we can get the best use of
> > the community resources for the things that may have a broader audience.
> >
> > One thing majorly missing is a RDMA tester to guard all the merges to not
> > break RDMA paths, hopefully in CI.  That should not rely on RDMA hardwares
> > but just to sanity check the migration+rdma code running all fine.  RDMA
> > taught us the lesson so we're requesting CI coverage for all other new
> > features that will be merged at least for migration subsystem, so that we
> > plan to not merge anything that is not covered by CI unless extremely
> > necessary in the future.
> >
> > For sure CI is not the only missing part, but I'd say we should start with
> > it, then someone should also take care of the code even if only in
> > maintenance mode (no new feature to add on top).
> >
> > >
> > > My concern is, this plan will forces a few QEMU users (not sure how
> > > many) like us
> > > either to stick to the RDMA migration by using an increasingly older
> > > version of QEMU,
> > > or to abandon the currently used RDMA migration.
> >
> > RDMA doesn't get new features anyway, if there's specific use case for RDMA
> > migrations, would it work if such a scenario uses the old binary?  Is it
> > possible to switch to the TCP protocol with some good NICs?
> We have used rdma migration with HCA from Nvidia for years, our
> experience is RDMA migration works better than tcp (over ipoib).

Please bear with me, as I know little about RDMA.

I've actually been pretty confused (for a long time..) about why we need
to operate with rdma contexts when ipoib seems to provide all the tcp
layers.  I mean, can it work with the current "tcp:" protocol with ipoib
even if there's rdma/ib hardware underneath?  Is it because of performance
improvements that we must use a separate path compared to the generic
"tcp:" protocol here?
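
From the user's side, the two paths being compared are selected only by the
scheme of the migration URI.  A minimal sketch, assuming a hypothetical
destination host name "dst-host" and port 4444; the helper name
migrate_command is made up for illustration, and it only builds the QMP
"migrate" command text without talking to QEMU:

```python
import json


def migrate_command(transport: str, host: str, port: int) -> str:
    """Build the JSON text of a QMP 'migrate' command.

    'transport' is the URI scheme: "tcp" for the generic socket path
    (which also covers TCP over IPoIB), "rdma" for the dedicated RDMA path.
    """
    if transport not in ("tcp", "rdma"):
        raise ValueError(f"unsupported transport: {transport!r}")
    uri = f"{transport}:{host}:{port}"
    return json.dumps({"execute": "migrate", "arguments": {"uri": uri}})


# Hypothetical destination; the only difference is the URI scheme.
print(migrate_command("tcp", "dst-host", 4444))
print(migrate_command("rdma", "dst-host", 4444))
```

With ipoib the "tcp:" form already works on IB hardware; the question is
whether the separate "rdma:" path is still needed on top of that.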

> 
> Switching back to TCP will lead us to the old problems which was
> solved by RDMA migration.

Can you elaborate on the problems, and why tcp won't work in this case?  They
may not be directly relevant to the issue we're discussing, but I'm happy
to learn more.

Which NICs were you testing with before?  Were the tests carried out with
modern ones (50Gbps-200Gbps NICs), or were they done when such hardware
was not yet common?

Per my recent knowledge of the new Intel hardware, at least the models that
support QPL, it's easy to achieve 50Gbps+ on a single core.

https://lore.kernel.org/r/PH7PR11MB5941A91AC1E514BCC32896A6A3342@PH7PR11MB5941.namprd11.prod.outlook.com

Quote from Yuan:

  Yes, I use iperf3 to check the bandwidth for one core, the bandwith is 60Gbps.
  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
  [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
  [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 Mbytes

  And in the live migration test, a multifd thread's CPU utilization is almost 100%

It boils down to what old problems were there with tcp first, though.

> 
> >
> > Per our best knowledge, RDMA users are rare, and please let anyone know if
> > you are aware of such users.  IIUC the major reason why RDMA stopped being
> > the trend is because the network is not like ten years ago; I don't think I
> > have good knowledge in RDMA at all nor network, but my understanding is
> > it's pretty easy to fetch modern NIC to outperform RDMAs, then it may make
> > little sense to maintain multiple protocols, considering RDMA migration
> > code is so special so that it has the most custom code comparing to other
> > protocols.
> +cc some guys from Huawei.
> 
> I'm surprised RDMA users are rare,  I guess maybe many are just
> working with different code base.

Yes, please cc whoever might be interested (or surprised.. :) to know this,
and let's be open to all possibilities.

I don't think it makes sense to deprecate a feature that has a lot of users
without a good reason.  However, there's always the resource limitation
issue we're facing, so it's still possible that this gets deprecated if
nobody is working on it in our upstream branch.  Say, if people use private
branches anyway to support rdma without collaborating upstream, then keeping
such a feature upstream may not make much sense either, unless there's some
way to collaborate.  We'll see.

It seems there can still be people joining this discussion.  I'll hold off
a bit on merging this patch to leave enough of a window for anyone to chime in.

Thanks,

> >
> > Thanks,
> >
> > --
> > Peter Xu
> 
> Thx!
> Jinpu Wang
> >
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-08 16:18                   ` Peter Xu
@ 2024-04-09  7:32                     ` Jinpu Wang
  2024-04-09 19:46                       ` Peter Xu
  2024-04-09  9:00                     ` Markus Armbruster
  1 sibling, 1 reply; 52+ messages in thread
From: Jinpu Wang @ 2024-04-09  7:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: Yu Zhang, Zhijian Li (Fujitsu),
	Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	arei.gonglei, pannengyuan

Hi Peter,

On Mon, Apr 8, 2024 at 6:18 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> > Hi Peter,
>
> Jinpu,
>
> Thanks for joining the discussion.
>
> >
> > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > > Hello Peter und Zhjian,
> > > >
> > > > Thank you so much for letting me know about this. I'm also a bit surprised at
> > > > the plan for deprecating the RDMA migration subsystem.
> > >
> > > It's not too late, since it looks like we do have users not yet notified
> > > from this, we'll redo the deprecation procedure even if it'll be the final
> > > plan, and it'll be 2 releases after this.
> > >
> > > >
> > > > > IMHO it's more important to know whether there are still users and whether
> > > > > they would still like to see it around.
> > > >
> > > > > I admit RDMA migration was lack of testing(unit/CI test), which led to the a few
> > > > > obvious bugs being noticed too late.
> > > >
> > > > Yes, we are a user of this subsystem. I was unaware of the lack of test coverage
> > > > for this part. As soon as 8.2 was released, I saw that many of the
> > > > migration test
> > > > cases failed and came to realize that there might be a bug between 8.1
> > > > and 8.2, but
> > > > was unable to confirm and report it quickly to you.
> > > >
> > > > The maintenance of this part could be too costly or difficult from
> > > > your point of view.
> > >
> > > It may or may not be too costly, it's just that we need real users of RDMA
> > > taking some care of it.  Having it broken easily for >1 releases definitely
> > > is a sign of lack of users.  It is an implication to the community that we
> > > should consider dropping some features so that we can get the best use of
> > > the community resources for the things that may have a broader audience.
> > >
> > > One thing majorly missing is a RDMA tester to guard all the merges to not
> > > break RDMA paths, hopefully in CI.  That should not rely on RDMA hardwares
> > > but just to sanity check the migration+rdma code running all fine.  RDMA
> > > taught us the lesson so we're requesting CI coverage for all other new
> > > features that will be merged at least for migration subsystem, so that we
> > > plan to not merge anything that is not covered by CI unless extremely
> > > necessary in the future.
> > >
> > > For sure CI is not the only missing part, but I'd say we should start with
> > > it, then someone should also take care of the code even if only in
> > > maintenance mode (no new feature to add on top).
> > >
> > > >
> > > > My concern is, this plan will forces a few QEMU users (not sure how
> > > > many) like us
> > > > either to stick to the RDMA migration by using an increasingly older
> > > > version of QEMU,
> > > > or to abandon the currently used RDMA migration.
> > >
> > > RDMA doesn't get new features anyway, if there's specific use case for RDMA
> > > migrations, would it work if such a scenario uses the old binary?  Is it
> > > possible to switch to the TCP protocol with some good NICs?
> > We have used rdma migration with HCA from Nvidia for years, our
> > experience is RDMA migration works better than tcp (over ipoib).
>
> Please bare with me, as I know little on rdma stuff.
>
> I'm actually pretty confused (and since a long time ago..) on why we need
> to operation with rdma contexts when ipoib seems to provide all the tcp
> layers.  I meant, can it work with the current "tcp:" protocol with ipoib
> even if there's rdma/ib hardwares underneath?  Is it because of performance
> improvements so that we must use a separate path comparing to generic
> "tcp:" protocol here?
Using the rdma protocol with ib verbs, we can leverage the full benefit of
RDMA by talking directly to the NIC, which bypasses the kernel overhead and
gives lower cpu utilization and better performance.

IPoIB is more for compatibility with applications using tcp, and it
can't get the full benefit of RDMA.  When you have mixed generations of IB
devices, there are performance issues with IPoIB: we've seen a 40G HCA
reach only 2 Gb/s on IPoIB, while raw RDMA reaches full line
speed.

I just ran a simple iperf3 test via ipoib and ib_send_bw on the same hosts:

iperf 3.9
Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4 07:19:34 UTC 2024 x86_64
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Time: Tue, 09 Apr 2024 06:55:02 GMT
Accepted connection from 2a02:247f:401:4:2:0:b:3, port 41130
      Cookie: cer2hexlldrowclq6izh7gbg5toviffqbcwt
      TCP MSS: 0 (default)
[  5] local 2a02:247f:401:4:2:0:a:3 port 5201 connected to 2a02:247f:401:4:2:0:b:3 port 41136
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test, tos 0
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.80 GBytes  15.5 Gbits/sec
[  5]   1.00-2.00   sec  1.85 GBytes  15.9 Gbits/sec
[  5]   2.00-3.00   sec  1.88 GBytes  16.2 Gbits/sec
[  5]   3.00-4.00   sec  1.87 GBytes  16.1 Gbits/sec
[  5]   4.00-5.00   sec  1.88 GBytes  16.2 Gbits/sec
[  5]   5.00-6.00   sec  1.93 GBytes  16.6 Gbits/sec
[  5]   6.00-7.00   sec  2.00 GBytes  17.2 Gbits/sec
[  5]   7.00-8.00   sec  1.93 GBytes  16.6 Gbits/sec
[  5]   8.00-9.00   sec  1.86 GBytes  16.0 Gbits/sec
[  5]   9.00-10.00  sec  1.95 GBytes  16.8 Gbits/sec
[  5]  10.00-10.04  sec  85.2 MBytes  17.3 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate
[  5] (sender statistics not available)
[  5]   0.00-10.04  sec  19.0 GBytes  16.3 Gbits/sec                  receiver
rcv_tcp_congestion cubic
iperf 3.9
Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4 07:19:34 UTC 2024 x86_64
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
1 jwang@ps404a-3.stg:~$ sudo ib_send_bw -F -a

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF Device         : mlx5_0
 Number of qps   : 1 Transport type : IB
 Connection type : RC Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 RX depth        : 512
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x24 QPN 0x0174 PSN 0x300138
 remote address: LID 0x17 QPN 0x004a PSN 0xc54d6f
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 2          1000           0.00               6.46                 3.385977
 4          1000           0.00               10.38                2.721894
 8          1000           0.00               25.69                3.367830
 16         1000           0.00               41.46                2.716859
 32         1000           0.00               102.98               3.374577
 64         1000           0.00               206.12               3.377053
 128        1000           0.00               405.03               3.318007
 256        1000           0.00               821.52               3.364939
 512        1000           0.00               2150.78              4.404803
 1024       1000           0.00               4288.13              4.391044
 2048       1000           0.00               8518.25              4.361346
 4096       1000           0.00               11440.77             2.928836
 8192       1000           0.00               11526.45             1.475385
 16384      1000           0.00               11526.06             0.737668
 32768      1000           0.00               11524.86             0.368795
 65536      1000           0.00               11331.84             0.181309
 131072     1000           0.00               11524.75             0.092198
 262144     1000           0.00               11525.82             0.046103
 524288     1000           0.00               11524.70             0.023049
 1048576    1000           0.00               11510.84             0.011511
 2097152    1000           0.00               11524.58             0.005762
 4194304    1000           0.00               11514.26             0.002879
 8388608    1000           0.00               11511.01             0.001439
---------------------------------------------------------------------------------------

You can see that with ipoib it reaches 16 Gb/s using TCP (1 stream,
131072-byte blocks), while
with RDMA at 4k+ message sizes it reaches about 100 Gb/s.
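
To put the two tools' numbers in the same unit, here is a small sketch of the
conversion.  It assumes that iperf3's "GBytes" means 2^30 bytes and that
ib_send_bw's "MB/sec" means 2^20 bytes per second; if the latter is really
10^6 bytes, the RDMA figure comes out near 92 Gb/s rather than 97 Gb/s.  The
function names are made up for illustration:

```python
MIB = 2**20  # assumed size of ib_send_bw's "MB"
GIB = 2**30  # assumed size of iperf3's "GBytes"


def iperf_gbit_per_sec(gib_transferred: float, seconds: float) -> float:
    """Average bitrate in Gbit/s (10^9 bits) from an iperf3 summary line."""
    return gib_transferred * GIB * 8 / seconds / 1e9


def ib_send_bw_gbit_per_sec(mib_per_sec: float) -> float:
    """Convert an ib_send_bw 'BW average[MB/sec]' value to Gbit/s."""
    return mib_per_sec * MIB * 8 / 1e9


# Values taken from the outputs above.
ipoib = iperf_gbit_per_sec(19.0, 10.04)    # iperf3 receiver summary
rdma = ib_send_bw_gbit_per_sec(11524.75)   # 131072-byte messages

print(f"IPoIB/TCP: {ipoib:.1f} Gbit/s")
print(f"RDMA send: {rdma:.1f} Gbit/s")
print(f"ratio:     {rdma / ipoib:.1f}x")
```

Under these assumptions the gap is roughly a factor of six on the same
pair of hosts.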


>
> >
> > Switching back to TCP will lead us to the old problems which was
> > solved by RDMA migration.
>
> Can you elaborate the problems, and why tcp won't work in this case?  They
> may not be directly relevant to the issue we're discussing, but I'm happy
> to learn more.
>
> What is the NICs you were testing before?  Did the test carry out with
> things like modern ones (50Gbps-200Gbps NICs), or the test was done when
> these hardwares are not common?
We use Mellanox/NVidia IB HCAs from 40 Gb/s to 200 Gb/s, of mixed
generations, across the globe.
>
> Per my recent knowledge on the new Intel hardwares, at least the ones that
> support QPL, it's easy to achieve single core 50Gbps+.
In good cases, I've also seen 50 Gbps+ on Mellanox HCAs.
>
> https://lore.kernel.org/r/PH7PR11MB5941A91AC1E514BCC32896A6A3342@PH7PR11MB5941.namprd11.prod.outlook.com
>
> Quote from Yuan:
>
>   Yes, I use iperf3 to check the bandwidth for one core, the bandwith is 60Gbps.
>   [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
>   [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
>   [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 Mbytes
>
>   And in the live migration test, a multifd thread's CPU utilization is almost 100%
>
> It boils down to what old problems were there with tcp first, though.
Yeah, this is the key reason we use RDMA (low cpu utilization and
better performance).
>
> >
> > >
> > > Per our best knowledge, RDMA users are rare, and please let anyone know if
> > > you are aware of such users.  IIUC the major reason why RDMA stopped being
> > > the trend is because the network is not like ten years ago; I don't think I
> > > have good knowledge in RDMA at all nor network, but my understanding is
> > > it's pretty easy to fetch modern NIC to outperform RDMAs, then it may make
> > > little sense to maintain multiple protocols, considering RDMA migration
> > > code is so special so that it has the most custom code comparing to other
> > > protocols.
> > +cc some guys from Huawei.
> >
> > I'm surprised RDMA users are rare,  I guess maybe many are just
> > working with different code base.
>
> Yes, please cc whoever might be interested (or surprised.. :) to know this,
> and let's be open to all possibilities.
>
> I don't think it makes sense if there're a lot of users of a feature then
> we deprecate that without a good reason.  However there's always the
> resource limitation issue we're facing, so it could still have the
> possibility that this gets deprecated if nobody is working on our upstream
> branch. Say, if people use private branches anyway to support rdma without
> collaborating upstream, keeping such feature upstream then may not make
> much sense either, unless there's some way to collaborate.  We'll see.

Is there a document/link about the unit tests/CI for migration?  Why
are those tests missing?
Is it hard or very special to set up an environment for that?  Maybe we
can help in this regard.
>
> It seems there can still be people joining this discussion.  I'll hold off
> a bit on merging this patch to provide enough window for anyone to chim in.

Thx for the discussion and understanding.


Jinpu Wang
>
> Thanks,
>
> > >
> > > Thanks,
> > >
> > > --
> > > Peter Xu
> >
> > Thx!
> > Jinpu Wang
> > >
> >
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-08 16:18                   ` Peter Xu
  2024-04-09  7:32                     ` Jinpu Wang
@ 2024-04-09  9:00                     ` Markus Armbruster
  1 sibling, 0 replies; 52+ messages in thread
From: Markus Armbruster @ 2024-04-09  9:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jinpu Wang, Yu Zhang, Zhijian Li (Fujitsu),
	Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, arei.gonglei,
	pannengyuan

Peter Xu <peterx@redhat.com> writes:

> On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
>> Hi Peter,
>
> Jinpu,
>
> Thanks for joining the discussion.
>
>> 
>> On Tue, Apr 2, 2024 at 11:24 PM Peter Xu <peterx@redhat.com> wrote:
>> >
>> > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
>> > > Hello Peter und Zhjian,
>> > >
>> > > Thank you so much for letting me know about this. I'm also a bit surprised at
>> > > the plan for deprecating the RDMA migration subsystem.
>> >
>> > It's not too late, since it looks like we do have users not yet notified
>> > from this, we'll redo the deprecation procedure even if it'll be the final
>> > plan, and it'll be 2 releases after this.

[...]

>> > Per our best knowledge, RDMA users are rare, and please let anyone know if
>> > you are aware of such users.  IIUC the major reason why RDMA stopped being
>> > the trend is because the network is not like ten years ago; I don't think I
>> > have good knowledge in RDMA at all nor network, but my understanding is
>> > it's pretty easy to fetch modern NIC to outperform RDMAs, then it may make
>> > little sense to maintain multiple protocols, considering RDMA migration
>> > code is so special so that it has the most custom code comparing to other
>> > protocols.
>> +cc some guys from Huawei.
>> 
>> I'm surprised RDMA users are rare,  I guess maybe many are just
>> working with different code base.
>
> Yes, please cc whoever might be interested (or surprised.. :) to know this,
> and let's be open to all possibilities.
>
> I don't think it makes sense if there're a lot of users of a feature then
> we deprecate that without a good reason.  However there's always the
> resource limitation issue we're facing, so it could still have the
> possibility that this gets deprecated if nobody is working on our upstream
> branch. Say, if people use private branches anyway to support rdma without
> collaborating upstream, keeping such feature upstream then may not make
> much sense either, unless there's some way to collaborate.  We'll see.
>
> It seems there can still be people joining this discussion.  I'll hold off
> a bit on merging this patch to provide enough window for anyone to chim in.

Users are not enough.  Only maintainers are.

At some point, people cared enough about RDMA in QEMU to contribute the
code.  That's why we have the code.

To keep the code, we need people who care enough about RDMA in QEMU to
maintain it.  Without such people, the case for keeping it remains
dangerously weak, and no amount of talk or even benchmarks can change
that.




* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-09  7:32                     ` Jinpu Wang
@ 2024-04-09 19:46                       ` Peter Xu
  2024-04-10  2:28                         ` Zhijian Li (Fujitsu) via
  2024-04-11 14:42                         ` Jinpu Wang
  0 siblings, 2 replies; 52+ messages in thread
From: Peter Xu @ 2024-04-09 19:46 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Yu Zhang, Zhijian Li (Fujitsu),
	Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	arei.gonglei, pannengyuan

On Tue, Apr 09, 2024 at 09:32:46AM +0200, Jinpu Wang wrote:
> Hi Peter,
> 
> On Mon, Apr 8, 2024 at 6:18 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> > > Hi Peter,
> >
> > Jinpu,
> >
> > Thanks for joining the discussion.
> >
> > >
> > > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > > > Hello Peter und Zhjian,
> > > > >
> > > > > Thank you so much for letting me know about this. I'm also a bit surprised at
> > > > > the plan for deprecating the RDMA migration subsystem.
> > > >
> > > > It's not too late, since it looks like we do have users not yet notified
> > > > from this, we'll redo the deprecation procedure even if it'll be the final
> > > > plan, and it'll be 2 releases after this.
> > > >
> > > > >
> > > > > > IMHO it's more important to know whether there are still users and whether
> > > > > > they would still like to see it around.
> > > > >
> > > > > > I admit RDMA migration was lack of testing(unit/CI test), which led to the a few
> > > > > > obvious bugs being noticed too late.
> > > > >
> > > > > Yes, we are a user of this subsystem. I was unaware of the lack of test coverage
> > > > > for this part. As soon as 8.2 was released, I saw that many of the
> > > > > migration test
> > > > > cases failed and came to realize that there might be a bug between 8.1
> > > > > and 8.2, but
> > > > > was unable to confirm and report it quickly to you.
> > > > >
> > > > > The maintenance of this part could be too costly or difficult from
> > > > > your point of view.
> > > >
> > > > It may or may not be too costly, it's just that we need real users of RDMA
> > > > taking some care of it.  Having it broken easily for >1 releases definitely
> > > > is a sign of lack of users.  It is an implication to the community that we
> > > > should consider dropping some features so that we can get the best use of
> > > > the community resources for the things that may have a broader audience.
> > > >
> > > > One thing majorly missing is a RDMA tester to guard all the merges to not
> > > > break RDMA paths, hopefully in CI.  That should not rely on RDMA hardwares
> > > > but just to sanity check the migration+rdma code running all fine.  RDMA
> > > > taught us the lesson so we're requesting CI coverage for all other new
> > > > features that will be merged at least for migration subsystem, so that we
> > > > plan to not merge anything that is not covered by CI unless extremely
> > > > necessary in the future.
> > > >
> > > > For sure CI is not the only missing part, but I'd say we should start with
> > > > it, then someone should also take care of the code even if only in
> > > > maintenance mode (no new feature to add on top).
> > > >
> > > > >
> > > > > My concern is, this plan will forces a few QEMU users (not sure how
> > > > > many) like us
> > > > > either to stick to the RDMA migration by using an increasingly older
> > > > > version of QEMU,
> > > > > or to abandon the currently used RDMA migration.
> > > >
> > > > RDMA doesn't get new features anyway, if there's specific use case for RDMA
> > > > migrations, would it work if such a scenario uses the old binary?  Is it
> > > > possible to switch to the TCP protocol with some good NICs?
> > > We have used rdma migration with HCA from Nvidia for years, our
> > > experience is RDMA migration works better than tcp (over ipoib).
> >
> > Please bare with me, as I know little on rdma stuff.
> >
> > I'm actually pretty confused (and since a long time ago..) on why we need
> > to operation with rdma contexts when ipoib seems to provide all the tcp
> > layers.  I meant, can it work with the current "tcp:" protocol with ipoib
> > even if there's rdma/ib hardwares underneath?  Is it because of performance
> > improvements so that we must use a separate path comparing to generic
> > "tcp:" protocol here?
> using rdma protocol with ib verbs , we can leverage the full benefit of RDMA by
> talking directly to NIC which bypasses the kernel overhead, less cpu
> utilization and better performance.
> 
> While IPoIB is more for compatibility to  applications using tcp, but
> can't get full benefit of RDMA.  When you have mix generation of IB
> devices, there are performance issue on IPoIB, we've seen 40G HCA can
> only reach 2 Gb/s on IPoIB, but with raw RDMA can reach full line
> speed.
> 
> I just run a simple iperf3 test via ipoib and ib_send_bw on same hosts:
> 
> iperf 3.9
> Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4
> 07:19:34 UTC 2024 x86_64
> -----------------------------------------------------------
> Server listening on 5201
> -----------------------------------------------------------
> Time: Tue, 09 Apr 2024 06:55:02 GMT
> Accepted connection from 2a02:247f:401:4:2:0:b:3, port 41130
>       Cookie: cer2hexlldrowclq6izh7gbg5toviffqbcwt
>       TCP MSS: 0 (default)
> [  5] local 2a02:247f:401:4:2:0:a:3 port 5201 connected to
> 2a02:247f:401:4:2:0:b:3 port 41136
> Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting
> 0 seconds, 10 second test, tos 0
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  1.80 GBytes  15.5 Gbits/sec
> [  5]   1.00-2.00   sec  1.85 GBytes  15.9 Gbits/sec
> [  5]   2.00-3.00   sec  1.88 GBytes  16.2 Gbits/sec
> [  5]   3.00-4.00   sec  1.87 GBytes  16.1 Gbits/sec
> [  5]   4.00-5.00   sec  1.88 GBytes  16.2 Gbits/sec
> [  5]   5.00-6.00   sec  1.93 GBytes  16.6 Gbits/sec
> [  5]   6.00-7.00   sec  2.00 GBytes  17.2 Gbits/sec
> [  5]   7.00-8.00   sec  1.93 GBytes  16.6 Gbits/sec
> [  5]   8.00-9.00   sec  1.86 GBytes  16.0 Gbits/sec
> [  5]   9.00-10.00  sec  1.95 GBytes  16.8 Gbits/sec
> [  5]  10.00-10.04  sec  85.2 MBytes  17.3 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> Test Complete. Summary Results:
> [ ID] Interval           Transfer     Bitrate
> [  5] (sender statistics not available)
> [  5]   0.00-10.04  sec  19.0 GBytes  16.3 Gbits/sec                  receiver
> rcv_tcp_congestion cubic
> iperf 3.9
> Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4
> 07:19:34 UTC 2024 x86_64
> -----------------------------------------------------------
> Server listening on 5201
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> 1 jwang@ps404a-3.stg:~$ sudo ib_send_bw -F -a
> 
> ************************************
> * Waiting for client to connect... *
> ************************************
> ---------------------------------------------------------------------------------------
>                     Send BW Test
>  Dual-port       : OFF Device         : mlx5_0
>  Number of qps   : 1 Transport type : IB
>  Connection type : RC Using SRQ      : OFF
>  PCIe relax order: ON
>  ibv_wr* API     : ON
>  RX depth        : 512
>  CQ Moderation   : 100
>  Mtu             : 4096[B]
>  Link type       : IB
>  Max inline data : 0[B]
>  rdma_cm QPs : OFF
>  Data ex. method : Ethernet
> ---------------------------------------------------------------------------------------
>  local address: LID 0x24 QPN 0x0174 PSN 0x300138
>  remote address: LID 0x17 QPN 0x004a PSN 0xc54d6f
> ---------------------------------------------------------------------------------------
>  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
>  2          1000             0.00               6.46       3.385977
>  4          1000             0.00               10.38     2.721894
>  8          1000             0.00               25.69     3.367830
>  16         1000             0.00               41.46     2.716859
>  32         1000             0.00               102.98    3.374577
>  64         1000             0.00               206.12    3.377053
>  128        1000             0.00               405.03    3.318007
>  256        1000             0.00               821.52    3.364939
>  512        1000             0.00               2150.78    4.404803
>  1024       1000             0.00               4288.13    4.391044
>  2048       1000             0.00               8518.25    4.361346
>  4096       1000             0.00               11440.77    2.928836
>  8192       1000             0.00               11526.45    1.475385
>  16384      1000             0.00               11526.06    0.737668
>  32768      1000             0.00               11524.86    0.368795
>  65536      1000             0.00               11331.84    0.181309
>  131072     1000             0.00               11524.75    0.092198
>  262144     1000             0.00               11525.82    0.046103
>  524288     1000             0.00               11524.70    0.023049
>  1048576    1000             0.00               11510.84    0.011511
>  2097152    1000             0.00               11524.58    0.005762
>  4194304    1000             0.00               11514.26    0.002879
>  8388608    1000             0.00               11511.01    0.001439
> ---------------------------------------------------------------------------------------
> 
> you can see with ipoib, it reaches 16 Gb/s using TCP, 1 streams,
> 131072 byte blocks
> with RDMA at 4k+ message size it reaches 100 Gb/s

I get it now, thank you!
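
For anyone comparing the two quoted results: iperf3 reports Gbits/sec
while ib_send_bw reports MB/sec, so the numbers need converting before
they can be put side by side.  A rough sketch of that arithmetic follows.
It assumes the 131072-byte row (11524.75 MB/sec) is representative, and it
shows both readings of "MB" (2^20 bytes and 10^6 bytes) because I am not
certain which one the tool uses here.  The function and constant names are
made up for illustration.

```python
# Figures copied from the quoted iperf3 and ib_send_bw output above.
IPOIB_TCP_GBITS = 16.3            # iperf3 receiver summary, Gbits/sec
RDMA_SEND_MB_PER_SEC = 11524.75   # ib_send_bw "BW average" at 131072 bytes


def mb_per_sec_to_gbits(mb_per_sec, bytes_per_mb=2**20):
    """Convert a MB/sec figure to Gbit/s (10^9 bits per second)."""
    return mb_per_sec * bytes_per_mb * 8 / 1e9


# Assumption: "MB" may mean 2^20 bytes or 10^6 bytes, so compute both.
rdma_binary = mb_per_sec_to_gbits(RDMA_SEND_MB_PER_SEC)
rdma_decimal = mb_per_sec_to_gbits(RDMA_SEND_MB_PER_SEC, 10**6)

print(f"RDMA send: {rdma_decimal:.1f} to {rdma_binary:.1f} Gbit/s")
print(
    "ratio to IPoIB TCP: "
    f"{rdma_decimal / IPOIB_TCP_GBITS:.1f}x to "
    f"{rdma_binary / IPOIB_TCP_GBITS:.1f}x"
)
```

Either reading puts the RDMA send bandwidth in the 90s of Gbit/s, several
times the single-stream IPoIB TCP result, which is consistent with the
round "100 Gb/s" figure quoted above.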

> 
> 
> >
> > >
> > > Switching back to TCP will lead us to the old problems which was
> > > solved by RDMA migration.
> >
> > Can you elaborate the problems, and why tcp won't work in this case?  They
> > may not be directly relevant to the issue we're discussing, but I'm happy
> > to learn more.
> >
> > What is the NICs you were testing before?  Did the test carry out with
> > things like modern ones (50Gbps-200Gbps NICs), or the test was done when
> > these hardwares are not common?
> We use Mellanox/NVidia IB HCA from 40 Gb/s to 200 Gb/s mixed
> generation across globe.
> >
> > Per my recent knowledge on the new Intel hardwares, at least the ones that
> > support QPL, it's easy to achieve single core 50Gbps+.
> In good case, I've also seen 50 Gbps + on Mellanox HCA.

I see.  Have you compared the HCAs vs. modern NICs?  As I said, going by
their specs NICs can now achieve similar performance; I am not sure how
they perform in real life, but it may be worth trying.  I have only tried a
100G NIC, and I remember hitting 70+ Gbps with multifd migration at peak
bandwidth.  Have you tried that before?

Note that I didn't want to compare the performance of the two here and
find a winner.  The issue we're facing now is that RDMA migration mostly
has its own path all over the place, while the rest of the protocols
(socket, fd, file, etc.) all share a common one.

Then, _if_ modern NICs can perform similarly to RDMA, I don't yet see a
good reason to keep it.  It could be that technology has simply improved,
so that we can do as well with less code.  Dropping unused code is good
news for helping QEMU evolve.

Some details on how RDMA complicates migration:

  (1) RDMA is the only protocol that doesn't yet support QIOChannel, while
      migration now uses QIOChannels almost everywhere, e.g. in multifd.
      That means it won't easily support anything new built on QIOChannels.

  (2) RDMA is the only protocol that is mostly hard-coded everywhere in the
      RAM migration code, polluting the core logic with much more code to
      support this one protocol.

For (1), see the migrate_fd_connect() call in
rdma_start_outgoing_migration(), while the rest of the protocols all go via
migration_channel_connect().

For (2), see all the "rdma_*" functions in migration/ram.c, which I don't
think are common for a protocol: most of the other protocols don't need
such hard-coded hooks.  migration/rdma.c has 4000+ LOC for this, while, to
make a not-so-fair comparison, migration/fd.c has <100 LOC.

Then, we found we don't even know who's using it.

I hope I explained why people started this idea, and also why I think that
makes sense at least to me.

> >
> > https://lore.kernel.org/r/PH7PR11MB5941A91AC1E514BCC32896A6A3342@PH7PR11MB5941.namprd11.prod.outlook.com
> >
> > Quote from Yuan:
> >
> >   Yes, I use iperf3 to check the bandwidth for one core, the bandwith is 60Gbps.
> >   [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> >   [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> >   [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 Mbytes
> >
> >   And in the live migration test, a multifd thread's CPU utilization is almost 100%
> >
> > It boils down to what old problems were there with tcp first, though.
> Yeah, this is the key reason we use RDMA. (low cpu ulitization and
> better performance)
> >
> > >
> > > >
> > > > Per our best knowledge, RDMA users are rare, and please let anyone know if
> > > > you are aware of such users.  IIUC the major reason why RDMA stopped being
> > > > the trend is because the network is not like ten years ago; I don't think I
> > > > have good knowledge in RDMA at all nor network, but my understanding is
> > > > it's pretty easy to fetch modern NIC to outperform RDMAs, then it may make
> > > > little sense to maintain multiple protocols, considering RDMA migration
> > > > code is so special so that it has the most custom code comparing to other
> > > > protocols.
> > > +cc some guys from Huawei.
> > >
> > > I'm surprised RDMA users are rare,  I guess maybe many are just
> > > working with different code base.
> >
> > Yes, please cc whoever might be interested (or surprised.. :) to know this,
> > and let's be open to all possibilities.
> >
> > I don't think it makes sense if there're a lot of users of a feature then
> > we deprecate that without a good reason.  However there's always the
> > resource limitation issue we're facing, so it could still have the
> > possibility that this gets deprecated if nobody is working on our upstream
> > branch. Say, if people use private branches anyway to support rdma without
> > collaborating upstream, keeping such feature upstream then may not make
> > much sense either, unless there's some way to collaborate.  We'll see.
> 
> Is there document/link about the unittest/CI for migration tests, Why
> are those tests missing?
> Is it hard or very special to set up an environment for that? maybe we
> can help in this regards.

See tests/qtest/migration-test.c.  We put most of our migration tests
there and that's covered in CI.

I think one major issue is that CI systems don't normally have RDMA
devices.  Can an RDMA migration test be carried out without real hardware?

> >
> > It seems there can still be people joining this discussion.  I'll hold off
> > a bit on merging this patch to provide enough window for anyone to chim in.
> 
> Thx for discussion and understanding.

Thanks for all the input so far.  It can help us take a wiser and clearer
step no matter which way we choose.

-- 
Peter Xu




* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-09 19:46                       ` Peter Xu
@ 2024-04-10  2:28                         ` Zhijian Li (Fujitsu) via
  2024-04-10 13:49                           ` Peter Xu
  2024-04-11 14:42                         ` Jinpu Wang
  1 sibling, 1 reply; 52+ messages in thread
From: Zhijian Li (Fujitsu) via @ 2024-04-10  2:28 UTC (permalink / raw)
  To: Peter Xu, Jinpu Wang
  Cc: Yu Zhang, Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	arei.gonglei, pannengyuan



on 4/10/2024 3:46 AM, Peter Xu wrote:

>> Is there document/link about the unittest/CI for migration tests, Why
>> are those tests missing?
>> Is it hard or very special to set up an environment for that? maybe we
>> can help in this regards.
> See tests/qtest/migration-test.c.  We put most of our migration tests
> there and that's covered in CI.
>
> I think one major issue is CI systems don't normally have rdma devices.
> Can rdma migration test be carried out without a real hardware?

Yeah, RXE, a.k.a. Soft-RoCE, is able to emulate RDMA, for example:

$ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host

We then get a new RDMA interface "rxe_eth0", which is able to do QEMU RDMA
migration.

Also, the loopback (lo) device can be used to emulate an RDMA interface,
"rxe_lo".  However, when I tried (years ago) to do RDMA migration over this
interface (rdma:127.0.0.1:3333), something went wrong, so I gave up on
enabling the RDMA migration qtest at that time.



Thanks
Zhijian



     

>
>>> It seems there can still be people joining this discussion.  I'll hold off
>>> a bit on merging this patch to provide enough window for anyone to chim in.
>> Thx for discussion and understanding.
> Thanks for all these inputs so far.  These can help us make a wiser and
> clearer step no matter which way we choose.


* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-10  2:28                         ` Zhijian Li (Fujitsu) via
@ 2024-04-10 13:49                           ` Peter Xu
  2024-04-11 14:20                             ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2024-04-10 13:49 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu)
  Cc: Jinpu Wang, Yu Zhang, Elmar Gerdes, qemu-devel, Yuval Shaia,
	Kevin Wolf, Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	arei.gonglei, pannengyuan

On Wed, Apr 10, 2024 at 02:28:59AM +0000, Zhijian Li (Fujitsu) via wrote:
> 
> 
> on 4/10/2024 3:46 AM, Peter Xu wrote:
> 
> >> Is there document/link about the unittest/CI for migration tests, Why
> >> are those tests missing?
> >> Is it hard or very special to set up an environment for that? maybe we
> >> can help in this regards.
> > See tests/qtest/migration-test.c.  We put most of our migration tests
> > there and that's covered in CI.
> >
> > I think one major issue is CI systems don't normally have rdma devices.
> > Can rdma migration test be carried out without a real hardware?
> 
> Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> then we can get a new RDMA interface "rxe_eth0".
> This new RDMA interface is able to do the QEMU RDMA migration.
> 
> Also, the loopback(lo) device is able to emulate the RDMA interface 
> "rxe_lo", however when
> I tried(years ago) to do RDMA migration over this 
> interface(rdma:127.0.0.1:3333) , it got something wrong.
> So i gave up enabling the RDMA migration qtest at that time.

Thanks, Zhijian.

I'm not sure adding an emu-link for rdma is doable for CI systems, though.
Maybe someone more familiar with how CI works can chime in.
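
Whatever the CI answer turns out to be, one way a test could cope is to
probe for an RDMA device and skip itself when none is present.  Below is a
minimal sketch in Python, only to illustrate the idea; it is not how
tests/qtest/migration-test.c works today.  It assumes that RDMA devices,
including rxe ones, show up as entries under /sys/class/infiniband, and the
helper name rdma_devices() is hypothetical.

```python
import os


def rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Return the sorted names of RDMA devices exposed in sysfs, or []."""
    try:
        return sorted(os.listdir(sysfs_root))
    except OSError:
        # The directory is missing or unreadable, e.g. when no RDMA
        # driver (real hardware or rxe) is loaded on this host.
        return []


if __name__ == "__main__":
    devs = rdma_devices()
    if devs:
        print("RDMA devices found:", ", ".join(devs))
    else:
        print("no RDMA device found, skipping the rdma migration test")
```

On a host where "rdma link add rxe_eth0 type rxe netdev eth0" has been run,
I would expect "rxe_eth0" to be listed; on a typical CI runner the list
would be empty and the test would be skipped rather than fail.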

-- 
Peter Xu




* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-10 13:49                           ` Peter Xu
@ 2024-04-11 14:20                             ` Peter Xu
  2024-04-11 16:36                               ` Yu Zhang
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2024-04-11 14:20 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu)
  Cc: Jinpu Wang, Yu Zhang, Elmar Gerdes, qemu-devel, Yuval Shaia,
	Kevin Wolf, Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	arei.gonglei, pannengyuan, Fabiano Rosas, Peter Maydell,
	Philippe Mathieu-Daudé

On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
> On Wed, Apr 10, 2024 at 02:28:59AM +0000, Zhijian Li (Fujitsu) via wrote:
> > 
> > 
> > on 4/10/2024 3:46 AM, Peter Xu wrote:
> > 
> > >> Is there document/link about the unittest/CI for migration tests, Why
> > >> are those tests missing?
> > >> Is it hard or very special to set up an environment for that? maybe we
> > >> can help in this regards.
> > > See tests/qtest/migration-test.c.  We put most of our migration tests
> > > there and that's covered in CI.
> > >
> > > I think one major issue is CI systems don't normally have rdma devices.
> > > Can rdma migration test be carried out without a real hardware?
> > 
> > Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> > $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> > then we can get a new RDMA interface "rxe_eth0".
> > This new RDMA interface is able to do the QEMU RDMA migration.
> > 
> > Also, the loopback(lo) device is able to emulate the RDMA interface 
> > "rxe_lo", however when
> > I tried(years ago) to do RDMA migration over this 
> > interface(rdma:127.0.0.1:3333) , it got something wrong.
> > So i gave up enabling the RDMA migration qtest at that time.
> 
> Thanks, Zhijian.
> 
> I'm not sure adding an emu-link for rdma is doable for CI systems, though.
> Maybe someone more familiar with how CI works can chim in.

Some people got dropped from the cc list for an unknown reason; I'm adding
them back (Fabiano, Peter Maydell, Phil).  Let's make sure nobody is
dropped by accident.

I'll try to summarize what is still missing.  I think these would be
greatly helpful if we don't want to deprecate rdma migration:

  1) Either a CI test covering at least the major RDMA paths, or at least
     periodic testing for each QEMU release, will be needed.

  2) Some performance comparisons between modern RDMA and NIC devices are
     welcome.  The current understanding is that modern NICs can perform
     similarly to RDMA, in which case it's debatable why we still maintain
     so much rdma-specific code.

  3) This one need not be solid patchsets yet, but some plan to improve the
     RDMA migration code so that it is no longer almost isolated from the
     rest of the protocols.

  4) Someone to look after this code for real.

For 2) and 3) more info is here:

https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n

Here 4) can be the most important, as Markus pointed out.  We just haven't
gotten there yet in the discussion, but maybe Markus is right that we
should talk about that first.

Thanks,

-- 
Peter Xu




* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-09 19:46                       ` Peter Xu
  2024-04-10  2:28                         ` Zhijian Li (Fujitsu) via
@ 2024-04-11 14:42                         ` Jinpu Wang
  1 sibling, 0 replies; 52+ messages in thread
From: Jinpu Wang @ 2024-04-11 14:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: Yu Zhang, Zhijian Li (Fujitsu),
	Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	arei.gonglei, pannengyuan

Hi Peter,

On Tue, Apr 9, 2024 at 9:47 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Apr 09, 2024 at 09:32:46AM +0200, Jinpu Wang wrote:
> > Hi Peter,
> >
> > On Mon, Apr 8, 2024 at 6:18 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> > > > Hi Peter,
> > >
> > > Jinpu,
> > >
> > > Thanks for joining the discussion.
> > >
> > > >
> > > > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > > > > Hello Peter und Zhjian,
> > > > > >
> > > > > > Thank you so much for letting me know about this. I'm also a bit surprised at
> > > > > > the plan for deprecating the RDMA migration subsystem.
> > > > >
> > > > > It's not too late, since it looks like we do have users not yet notified
> > > > > from this, we'll redo the deprecation procedure even if it'll be the final
> > > > > plan, and it'll be 2 releases after this.
> > > > >
> > > > > >
> > > > > > > IMHO it's more important to know whether there are still users and whether
> > > > > > > they would still like to see it around.
> > > > > >
> > > > > > > I admit RDMA migration was lack of testing(unit/CI test), which led to the a few
> > > > > > > obvious bugs being noticed too late.
> > > > > >
> > > > > > Yes, we are a user of this subsystem. I was unaware of the lack of test coverage
> > > > > > for this part. As soon as 8.2 was released, I saw that many of the
> > > > > > migration test
> > > > > > cases failed and came to realize that there might be a bug between 8.1
> > > > > > and 8.2, but
> > > > > > was unable to confirm and report it quickly to you.
> > > > > >
> > > > > > The maintenance of this part could be too costly or difficult from
> > > > > > your point of view.
> > > > >
> > > > > It may or may not be too costly, it's just that we need real users of RDMA
> > > > > taking some care of it.  Having it broken easily for >1 releases definitely
> > > > > is a sign of lack of users.  It is an implication to the community that we
> > > > > should consider dropping some features so that we can get the best use of
> > > > > the community resources for the things that may have a broader audience.
> > > > >
> > > > > One thing majorly missing is a RDMA tester to guard all the merges to not
> > > > > break RDMA paths, hopefully in CI.  That should not rely on RDMA hardwares
> > > > > but just to sanity check the migration+rdma code running all fine.  RDMA
> > > > > taught us the lesson so we're requesting CI coverage for all other new
> > > > > features that will be merged at least for migration subsystem, so that we
> > > > > plan to not merge anything that is not covered by CI unless extremely
> > > > > necessary in the future.
> > > > >
> > > > > For sure CI is not the only missing part, but I'd say we should start with
> > > > > it, then someone should also take care of the code even if only in
> > > > > maintenance mode (no new feature to add on top).
> > > > >
> > > > > >
> > > > > > My concern is, this plan will forces a few QEMU users (not sure how
> > > > > > many) like us
> > > > > > either to stick to the RDMA migration by using an increasingly older
> > > > > > version of QEMU,
> > > > > > or to abandon the currently used RDMA migration.
> > > > >
> > > > > RDMA doesn't get new features anyway, if there's specific use case for RDMA
> > > > > migrations, would it work if such a scenario uses the old binary?  Is it
> > > > > possible to switch to the TCP protocol with some good NICs?
> > > > We have used rdma migration with HCA from Nvidia for years, our
> > > > experience is RDMA migration works better than tcp (over ipoib).
> > >
> > > Please bare with me, as I know little on rdma stuff.
> > >
> > > I'm actually pretty confused (and since a long time ago..) on why we need
> > > to operation with rdma contexts when ipoib seems to provide all the tcp
> > > layers.  I meant, can it work with the current "tcp:" protocol with ipoib
> > > even if there's rdma/ib hardwares underneath?  Is it because of performance
> > > improvements so that we must use a separate path comparing to generic
> > > "tcp:" protocol here?
> > using rdma protocol with ib verbs , we can leverage the full benefit of RDMA by
> > talking directly to NIC which bypasses the kernel overhead, less cpu
> > utilization and better performance.
> >
> > While IPoIB is more for compatibility to  applications using tcp, but
> > can't get full benefit of RDMA.  When you have mix generation of IB
> > devices, there are performance issue on IPoIB, we've seen 40G HCA can
> > only reach 2 Gb/s on IPoIB, but with raw RDMA can reach full line
> > speed.
> >
> > I just run a simple iperf3 test via ipoib and ib_send_bw on same hosts:
> >
> > iperf 3.9
> > Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4
> > 07:19:34 UTC 2024 x86_64
> > -----------------------------------------------------------
> > Server listening on 5201
> > -----------------------------------------------------------
> > Time: Tue, 09 Apr 2024 06:55:02 GMT
> > Accepted connection from 2a02:247f:401:4:2:0:b:3, port 41130
> >       Cookie: cer2hexlldrowclq6izh7gbg5toviffqbcwt
> >       TCP MSS: 0 (default)
> > [  5] local 2a02:247f:401:4:2:0:a:3 port 5201 connected to
> > 2a02:247f:401:4:2:0:b:3 port 41136
> > Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting
> > 0 seconds, 10 second test, tos 0
> > [ ID] Interval           Transfer     Bitrate
> > [  5]   0.00-1.00   sec  1.80 GBytes  15.5 Gbits/sec
> > [  5]   1.00-2.00   sec  1.85 GBytes  15.9 Gbits/sec
> > [  5]   2.00-3.00   sec  1.88 GBytes  16.2 Gbits/sec
> > [  5]   3.00-4.00   sec  1.87 GBytes  16.1 Gbits/sec
> > [  5]   4.00-5.00   sec  1.88 GBytes  16.2 Gbits/sec
> > [  5]   5.00-6.00   sec  1.93 GBytes  16.6 Gbits/sec
> > [  5]   6.00-7.00   sec  2.00 GBytes  17.2 Gbits/sec
> > [  5]   7.00-8.00   sec  1.93 GBytes  16.6 Gbits/sec
> > [  5]   8.00-9.00   sec  1.86 GBytes  16.0 Gbits/sec
> > [  5]   9.00-10.00  sec  1.95 GBytes  16.8 Gbits/sec
> > [  5]  10.00-10.04  sec  85.2 MBytes  17.3 Gbits/sec
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > Test Complete. Summary Results:
> > [ ID] Interval           Transfer     Bitrate
> > [  5] (sender statistics not available)
> > [  5]   0.00-10.04  sec  19.0 GBytes  16.3 Gbits/sec                  receiver
> > rcv_tcp_congestion cubic
> > iperf 3.9
> > Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4
> > 07:19:34 UTC 2024 x86_64
> > -----------------------------------------------------------
> > Server listening on 5201
> > -----------------------------------------------------------
> > ^Ciperf3: interrupt - the server has terminated
> > 1 jwang@ps404a-3.stg:~$ sudo ib_send_bw -F -a
> >
> > ************************************
> > * Waiting for client to connect... *
> > ************************************
> > ---------------------------------------------------------------------------------------
> >                     Send BW Test
> >  Dual-port       : OFF Device         : mlx5_0
> >  Number of qps   : 1 Transport type : IB
> >  Connection type : RC Using SRQ      : OFF
> >  PCIe relax order: ON
> >  ibv_wr* API     : ON
> >  RX depth        : 512
> >  CQ Moderation   : 100
> >  Mtu             : 4096[B]
> >  Link type       : IB
> >  Max inline data : 0[B]
> >  rdma_cm QPs : OFF
> >  Data ex. method : Ethernet
> > ---------------------------------------------------------------------------------------
> >  local address: LID 0x24 QPN 0x0174 PSN 0x300138
> >  remote address: LID 0x17 QPN 0x004a PSN 0xc54d6f
> > ---------------------------------------------------------------------------------------
> >  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
> >  2          1000             0.00               6.46                 3.385977
> >  4          1000             0.00               10.38                2.721894
> >  8          1000             0.00               25.69                3.367830
> >  16         1000             0.00               41.46                2.716859
> >  32         1000             0.00               102.98               3.374577
> >  64         1000             0.00               206.12               3.377053
> >  128        1000             0.00               405.03               3.318007
> >  256        1000             0.00               821.52               3.364939
> >  512        1000             0.00               2150.78              4.404803
> >  1024       1000             0.00               4288.13              4.391044
> >  2048       1000             0.00               8518.25              4.361346
> >  4096       1000             0.00               11440.77             2.928836
> >  8192       1000             0.00               11526.45             1.475385
> >  16384      1000             0.00               11526.06             0.737668
> >  32768      1000             0.00               11524.86             0.368795
> >  65536      1000             0.00               11331.84             0.181309
> >  131072     1000             0.00               11524.75             0.092198
> >  262144     1000             0.00               11525.82             0.046103
> >  524288     1000             0.00               11524.70             0.023049
> >  1048576    1000             0.00               11510.84             0.011511
> >  2097152    1000             0.00               11524.58             0.005762
> >  4194304    1000             0.00               11514.26             0.002879
> >  8388608    1000             0.00               11511.01             0.001439
> > ---------------------------------------------------------------------------------------
> >
> > you can see with ipoib, it reaches 16 Gb/s using TCP, 1 streams,
> > 131072 byte blocks
> > with RDMA at 4k+ message size it reaches 100 Gb/s
>
> I get it now, thank you!
>
> >
> >
> > >
> > > >
> > > > Switching back to TCP will lead us to the old problems which was
> > > > solved by RDMA migration.
> > >
> > > Can you elaborate the problems, and why tcp won't work in this case?  They
> > > may not be directly relevant to the issue we're discussing, but I'm happy
> > > to learn more.
> > >
> > > What is the NICs you were testing before?  Did the test carry out with
> > > things like modern ones (50Gbps-200Gbps NICs), or the test was done when
> > > these hardwares are not common?
> > We use Mellanox/NVidia IB HCA from 40 Gb/s to 200 Gb/s mixed
> > generation across globe.
> > >
> > > Per my recent knowledge on the new Intel hardwares, at least the ones that
> > > support QPL, it's easy to achieve single core 50Gbps+.
> > In good case, I've also seen 50 Gbps + on Mellanox HCA.
>
> I see. Have you compared the HCAs v.s. the modern NICs?  Now NICs can
> achieve similar performance from their spec as I said; I am not sure how
> they perform in real life, but maybe worth trying.  I only tried 100G nic
> and I rem I can hit 70+Gbps with multifd migrations at peak bandwidth.
> Have you tried that before?
Yes, I recently tried a 100G Ethernet NIC, but only with iperf, not yet
with QEMU migration. iperf can reach 90 Gbps with multiple streams.
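
To put the IPoIB and RDMA figures quoted above on one scale, here is a
rough sketch of the unit conversion. It assumes the "MB/sec" column of
ib_send_bw means MiB/s (2**20 bytes); the 8192-byte row supports that,
since message rate times message size reproduces the reported bandwidth.
The function names are made up for illustration only.

```python
MIB = 2 ** 20  # assumption: ib_send_bw "MB" means 2**20 bytes


def mib_per_s_to_gbps(mib_per_s):
    """Convert MiB/s to decimal Gbit/s, the unit iperf3 prints."""
    return mib_per_s * MIB * 8 / 1e9


def bw_from_msg_rate(mpps, msg_bytes):
    """Bandwidth in MiB/s implied by a message rate (Mpps) and message size."""
    return mpps * 1e6 * msg_bytes / MIB


# Values copied from the 8192-byte row of the ib_send_bw output above.
rdma_gbps = mib_per_s_to_gbps(11526.45)
implied_mib = bw_from_msg_rate(1.475385, 8192)

# Receiver-side summary of the single-stream iperf3 run over IPoIB.
ipoib_gbps = 16.3

print(f"RDMA send BW : {rdma_gbps:.1f} Gbit/s")
print(f"cross-check  : {implied_mib:.2f} MiB/s")
print(f"RDMA / IPoIB : {rdma_gbps / ipoib_gbps:.1f}x")
```

Under that assumption the RDMA result comes out at roughly 97 Gbit/s,
which is where the "100 Gb/s" figure comes from, against about 16 Gbit/s
for a single TCP stream over IPoIB on the same HCA.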
>
> Note that here I didn't want to compare the performance between the two and
> find a winner.  The issue we're facing now is we have the RDMA migration
> now mostly having its own path all over the place, while the rest protocols
> (socket, fd, file, etc.) all share the rest.
>
> Then, _if_ modern NICs can work similarly v.s. rdma, I don't yet see a good
> reason to keep it.  It could be that technology just improved so we can use
> less code to do as good.  It's a good news to help QEMU evolve by dropping
> unused code.
>
> For some details there on the rdma complications for migration:
>
>   (1) RDMA is the only protocol that doesn't yet support QIOChannel, while
>       migration uses QIOChannels mostly everywhere now.. e.g. in multifd,
>       it means it won't easily support any new things using QIOChannels.
>
>   (2) RDMA is the only protocol that mostly hard-coded everywhere in the
>       RAM migrations, polluting the core logic with much more code
>       internally to support this protocol.
>
> For (1), see migrate_fd_connect() from rdma_start_outgoing_migration().
> While the rest protocols all go via migration_channel_connect().
>
> For (2), see all the "rdma_*" functions in migration/ram.c, where I don't
> think it's common to a protocol - most of the rest protocols don't need
> those hard-coded stuff.  migration/rdma.c has 4000+ LOC for these stuff,
> while to do a not-so-fair comparison, migration/fd.c only has <100 LOC.
>
> Then, we found we don't even know who's using it.
>
> I hope I explained why people started this idea, and also why I think that
> makes sense at least to me.
Yes, I can understand that RDMA migration has become more of a burden for
upstream maintainers.
>
> > >
> > > https://lore.kernel.org/r/PH7PR11MB5941A91AC1E514BCC32896A6A3342@PH7PR11MB5941.namprd11.prod.outlook.com
> > >
> > > Quote from Yuan:
> > >
> > >   Yes, I use iperf3 to check the bandwidth for one core, the bandwidth is 60Gbps.
> > >   [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > >   [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> > >   [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 Mbytes
> > >
> > >   And in the live migration test, a multifd thread's CPU utilization is almost 100%
> > >
> > > It boils down to what old problems were there with tcp first, though.
> > > Yeah, this is the key reason we use RDMA. (low cpu utilization and
> > better performance)
> > >
> > > >
> > > > >
> > > > > Per our best knowledge, RDMA users are rare, and please let anyone know if
> > > > > you are aware of such users.  IIUC the major reason why RDMA stopped being
> > > > > the trend is because the network is not like ten years ago; I don't think I
> > > > > have good knowledge in RDMA at all nor network, but my understanding is
> > > > > it's pretty easy to fetch modern NIC to outperform RDMAs, then it may make
> > > > > little sense to maintain multiple protocols, considering RDMA migration
> > > > > code is so special so that it has the most custom code comparing to other
> > > > > protocols.
> > > > +cc some guys from Huawei.
> > > >
> > > > I'm surprised RDMA users are rare,  I guess maybe many are just
> > > > working with different code base.
> > >
> > > Yes, please cc whoever might be interested (or surprised.. :) to know this,
> > > and let's be open to all possibilities.
> > >
> > > I don't think it makes sense if there're a lot of users of a feature then
> > > we deprecate that without a good reason.  However there's always the
> > > resource limitation issue we're facing, so it could still have the
> > > possibility that this gets deprecated if nobody is working on our upstream
> > > branch. Say, if people use private branches anyway to support rdma without
> > > collaborating upstream, keeping such feature upstream then may not make
> > > much sense either, unless there's some way to collaborate.  We'll see.
> >
> > Is there document/link about the unittest/CI for migration tests, Why
> > are those tests missing?
> > Is it hard or very special to set up an environment for that? maybe we
> > can help in this regards.
>
> See tests/qtest/migration-test.c.  We put most of our migration tests
> there and that's covered in CI.
Yu is looking into that to see if we can run the CI on our side.
>
> I think one major issue is CI systems don't normally have rdma devices.
> Can rdma migration test be carried out without a real hardware?
As Zhijian mentioned, we can use SoftRoCE (rxe).
>
> > >
> > > It seems there can still be people joining this discussion.  I'll hold off
> > > a bit on merging this patch to provide enough window for anyone to chime in.
> >
> > Thx for discussion and understanding.
>
> Thanks for all these inputs so far.  These can help us make a wiser and
> clearer step no matter which way we choose.
>
> --
> Peter Xu
>
Thx!


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-11 14:20                             ` Peter Xu
@ 2024-04-11 16:36                               ` Yu Zhang
  2024-04-12 14:04                                 ` Peter Xu
  2024-04-29 13:08                                 ` Michael Galaxy
  0 siblings, 2 replies; 52+ messages in thread
From: Yu Zhang @ 2024-04-11 16:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: Zhijian Li (Fujitsu),
	Jinpu Wang, Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	arei.gonglei, pannengyuan

> 1) Either a CI test covering at least the major RDMA paths, or at least
>     periodically tests for each QEMU release will be needed.
We use a batch of regression test cases for the stack, which also covers
QEMU. I have run these tests for most of the QEMU releases planned as
candidates for rollout.

The migration test needs a pair of (either physical or virtual) servers with
an InfiniBand network, which makes it difficult to do on a single server.
Nested VMs could be a possible approach, for which we may need a virtual
InfiniBand network. Is SoftRoCE [1] an option? I will try it and let you know.

[1]  https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce
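
As a starting point for scripting such a setup, here is a minimal sketch
that only builds the command line Zhijian posted (quoted further down)
and the migration URI form he mentioned; it does not run anything. The
helper names and the "rxe_" prefix default are assumptions for
illustration, and whether loopback works is still open, as Zhijian noted.

```python
import shlex


def rxe_link_add_cmd(netdev, prefix="rxe_"):
    """Build the iproute2 command that layers an rxe device on top of netdev."""
    if not netdev or any(c.isspace() for c in netdev):
        raise ValueError(f"invalid netdev name: {netdev!r}")
    return ["rdma", "link", "add", f"{prefix}{netdev}",
            "type", "rxe", "netdev", netdev]


def rdma_migration_uri(host, port):
    """Format a migration URI of the rdma:host:port form used in this thread."""
    return f"rdma:{host}:{port}"


cmd = rxe_link_add_cmd("eth0")
uri = rdma_migration_uri("127.0.0.1", 3333)

# Print what would be run (root is needed, hence the sudo prefix).
print("sudo " + shlex.join(cmd))
print(uri)
```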

Thanks and best regards!

On Thu, Apr 11, 2024 at 4:20 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
> > On Wed, Apr 10, 2024 at 02:28:59AM +0000, Zhijian Li (Fujitsu) via wrote:
> > >
> > >
> > > on 4/10/2024 3:46 AM, Peter Xu wrote:
> > >
> > > >> Is there document/link about the unittest/CI for migration tests, Why
> > > >> are those tests missing?
> > > >> Is it hard or very special to set up an environment for that? maybe we
> > > >> can help in this regards.
> > > > See tests/qtest/migration-test.c.  We put most of our migration tests
> > > > there and that's covered in CI.
> > > >
> > > > I think one major issue is CI systems don't normally have rdma devices.
> > > > Can rdma migration test be carried out without a real hardware?
> > >
> > > Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> > > $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> > > then we can get a new RDMA interface "rxe_eth0".
> > > This new RDMA interface is able to do the QEMU RDMA migration.
> > >
> > > Also, the loopback(lo) device is able to emulate the RDMA interface
> > > "rxe_lo", however when
> > > I tried(years ago) to do RDMA migration over this
> > > interface(rdma:127.0.0.1:3333) , it got something wrong.
> > > So i gave up enabling the RDMA migration qtest at that time.
> >
> > Thanks, Zhijian.
> >
> > I'm not sure adding an emu-link for rdma is doable for CI systems, though.
> > > Maybe someone more familiar with how CI works can chime in.
>
> Some people got dropped on the cc list for unknown reason, I'm adding them
> back (Fabiano, Peter Maydell, Phil).  Let's make sure nobody is dropped by
> accident.
>
> I'll try to summarize what is still missing, and I think these will be
> greatly helpful if we don't want to deprecate rdma migration:
>
>   1) Either a CI test covering at least the major RDMA paths, or at least
>      periodically tests for each QEMU release will be needed.
>
>   2) Some performance tests between modern RDMA and NIC devices are
>      welcomed.  The current knowledge is modern NIC can work similarly to
>      RDMA in performance, then it's debatable why we still maintain so much
>      rdma specific code.
>
> >   3) No need to be solid patchsets for this one, but some plan to improve
>      RDMA migration code so that it is not almost isolated from the rest
>      protocols.
>
>   4) Someone to look after this code for real.
>
> For 2) and 3) more info is here:
>
> https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n
>
> Here 4) can be the most important as Markus pointed out.  We just didn't
> get there yet on the discussions, but maybe Markus is right that we should
> talk that first.
>
> Thanks,
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-11 16:36                               ` Yu Zhang
@ 2024-04-12 14:04                                 ` Peter Xu
  2024-04-29 13:08                                 ` Michael Galaxy
  1 sibling, 0 replies; 52+ messages in thread
From: Peter Xu @ 2024-04-12 14:04 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Zhijian Li (Fujitsu),
	Jinpu Wang, Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	arei.gonglei, pannengyuan

Yu,

On Thu, Apr 11, 2024 at 06:36:54PM +0200, Yu Zhang wrote:
> > 1) Either a CI test covering at least the major RDMA paths, or at least
> >     periodically tests for each QEMU release will be needed.
> We use a batch of regression test cases for the stack, which covers the
> test for QEMU. I did such test for most of the QEMU releases planned as
> candidates for rollout.

The least I can think of is a few tests per release.  That is definitely
too few if a single release can already break things.

> 
> The migration test needs a pair of (either physical or virtual) servers with
> InfiniBand network, which makes it difficult to do on a single server. The
> nested VM could be a possible approach, for which we may need virtual
> InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you know.
> 
> [1]  https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce

Does it require a kernel driver?  The fewer host kernel / hardware / etc.
dependencies, the better.

I am wondering whether there can be a library doing everything in
userspace, translating RDMA into e.g. socket messages (so maybe ultimately
that's something like IP->rdma->IP.. just to cover the "rdma" procedures),
then that'll work for CI reliably.
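
Short of such a library, a test that depends on an RDMA device (real or
rxe) would presumably have to skip itself on hosts that have none. A
minimal sketch of that probe, assuming the usual Linux sysfs location
/sys/class/infiniband; the function names are hypothetical, and the demo
runs against a fake directory tree rather than the real sysfs:

```python
import os
import tempfile


def list_rdma_devices(sysfs_dir="/sys/class/infiniband"):
    """Return the RDMA device names found under sysfs_dir, or [] if none."""
    try:
        return sorted(os.listdir(sysfs_dir))
    except (FileNotFoundError, NotADirectoryError):
        return []


def should_skip_rdma_test(sysfs_dir="/sys/class/infiniband"):
    """True when no RDMA device is visible, so the test should be skipped."""
    return not list_rdma_devices(sysfs_dir)


# Demonstrate against a fake sysfs tree: first empty, then with one device.
with tempfile.TemporaryDirectory() as fake:
    empty = should_skip_rdma_test(fake)
    os.mkdir(os.path.join(fake, "rxe_eth0"))
    present = should_skip_rdma_test(fake)

print("skip when empty:", empty)
print("skip with rxe_eth0:", present)
```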

Please also see my full list, though, especially entry 4).  Thanks already
for looking for solutions on the tests, but I don't want to waste your time
only to find that tests are not enough even once they are ready.  I think we
need people who understand this code well enough, have dedicated time, and
will look after it.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-11 16:36                               ` Yu Zhang
  2024-04-12 14:04                                 ` Peter Xu
@ 2024-04-29 13:08                                 ` Michael Galaxy
  2024-04-29 14:56                                   ` Peter Xu
  1 sibling, 1 reply; 52+ messages in thread
From: Michael Galaxy @ 2024-04-29 13:08 UTC (permalink / raw)
  To: Yu Zhang, Peter Xu
  Cc: Zhijian Li (Fujitsu),
	Jinpu Wang, Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	arei.gonglei, pannengyuan

Hi All (and Peter),

My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
(highly irregular for a male) and yes, that's my real last name:
https://www.linkedin.com/in/mrgalaxy/

I'm the original author of the RDMA implementation. I've been discussing 
with Yu Zhang for a little bit about potentially handing over 
maintainership of the codebase to his team.

I simply have zero access to RoCE or InfiniBand hardware at all,
unfortunately, so I've never been able to run tests or use what I wrote
at work, and as all of you know, if you don't have a way to test
something, then you can't maintain it.

Yu Zhang put a (very kind) proposal forward to me to ask the community 
if they feel comfortable training his team to maintain the codebase (and 
run tests) while they learn about it.

If you don't mind, I'd like to let him send over his (very detailed)
proposal.

- Michael

On 4/11/24 11:36, Yu Zhang wrote:
>> 1) Either a CI test covering at least the major RDMA paths, or at least
>>      periodically tests for each QEMU release will be needed.
> We use a batch of regression test cases for the stack, which covers the
> test for QEMU. I did such test for most of the QEMU releases planned as
> candidates for rollout.
>
> The migration test needs a pair of (either physical or virtual) servers with
> InfiniBand network, which makes it difficult to do on a single server. The
> nested VM could be a possible approach, for which we may need virtual
> InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you know.
>
> [1]  https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce
>
> Thanks and best regards!
>
> On Thu, Apr 11, 2024 at 4:20 PM Peter Xu <peterx@redhat.com> wrote:
>> On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
>>> On Wed, Apr 10, 2024 at 02:28:59AM +0000, Zhijian Li (Fujitsu) via wrote:
>>>>
>>>> on 4/10/2024 3:46 AM, Peter Xu wrote:
>>>>
>>>>>> Is there document/link about the unittest/CI for migration tests, Why
>>>>>> are those tests missing?
>>>>>> Is it hard or very special to set up an environment for that? maybe we
>>>>>> can help in this regards.
>>>>> See tests/qtest/migration-test.c.  We put most of our migration tests
>>>>> there and that's covered in CI.
>>>>>
>>>>> I think one major issue is CI systems don't normally have rdma devices.
>>>>> Can rdma migration test be carried out without a real hardware?
>>>> Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
>>>> $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
>>>> then we can get a new RDMA interface "rxe_eth0".
>>>> This new RDMA interface is able to do the QEMU RDMA migration.
>>>>
>>>> Also, the loopback(lo) device is able to emulate the RDMA interface
>>>> "rxe_lo", however when
>>>> I tried(years ago) to do RDMA migration over this
>>>> interface(rdma:127.0.0.1:3333) , it got something wrong.
>>>> So i gave up enabling the RDMA migration qtest at that time.
>>> Thanks, Zhijian.
>>>
>>> I'm not sure adding an emu-link for rdma is doable for CI systems, though.
>>> Maybe someone more familiar with how CI works can chime in.
>> Some people got dropped on the cc list for unknown reason, I'm adding them
>> back (Fabiano, Peter Maydell, Phil).  Let's make sure nobody is dropped by
>> accident.
>>
>> I'll try to summarize what is still missing, and I think these will be
>> greatly helpful if we don't want to deprecate rdma migration:
>>
>>    1) Either a CI test covering at least the major RDMA paths, or at least
>>       periodically tests for each QEMU release will be needed.
>>
>>    2) Some performance tests between modern RDMA and NIC devices are
>>       welcomed.  The current knowledge is modern NIC can work similarly to
>>       RDMA in performance, then it's debatable why we still maintain so much
>>       rdma specific code.
>>
>>    3) No need to be solid patchsets for this one, but some plan to improve
>>       RDMA migration code so that it is not almost isolated from the rest
>>       protocols.
>>
>>    4) Someone to look after this code for real.
>>
>> For 2) and 3) more info is here:
>>
>> https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n
>>
>> Here 4) can be the most important as Markus pointed out.  We just didn't
>> get there yet on the discussions, but maybe Markus is right that we should
>> talk that first.
>>
>> Thanks,
>>
>> --
>> Peter Xu
>>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-29 13:08                                 ` Michael Galaxy
@ 2024-04-29 14:56                                   ` Peter Xu
  2024-04-29 20:45                                     ` Yu Zhang
  2024-04-30  7:15                                     ` Markus Armbruster
  0 siblings, 2 replies; 52+ messages in thread
From: Peter Xu @ 2024-04-29 14:56 UTC (permalink / raw)
  To: Michael Galaxy
  Cc: Yu Zhang, Zhijian Li (Fujitsu),
	Jinpu Wang, Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	arei.gonglei, pannengyuan

On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> Hi All (and Peter),

Hi, Michael,

> 
> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
> (highly irregular for a male) and yes, that's my real last name:
> https://www.linkedin.com/in/mrgalaxy/)
> 
> I'm the original author of the RDMA implementation. I've been discussing
> with Yu Zhang for a little bit about potentially handing over maintainership
> of the codebase to his team.
> 
> I simply have zero access to RoCE or Infiniband hardware at all,
> unfortunately. so I've never been able to run tests or use what I wrote at
> work, and as all of you know, if you don't have a way to test something,
> then you can't maintain it.
> 
> Yu Zhang put a (very kind) proposal forward to me to ask the community if
> they feel comfortable training his team to maintain the codebase (and run
> tests) while they learn about it.

The "while learning" part is fine, at least to me.  IMHO taking "ownership"
of the code, or say, taking over the responsibility, may or may not require
100% mastery of the code base first.  There should still be some fundamental
confidence to work on the code as a starting point, though; then it's about
a serious use case to back this up, and careful testing while getting more
familiar with it.

> 
> If you don't mind, I'd like to let him send over his (very detailed)
> proposal,

Yes please, it's exactly the time to share the plan.  The hope is that we
reach a consensus before or around the middle of this release (9.1).
Normally QEMU has a 3~4 month window for each release and the 9.1 schedule
is not yet out, but I think it means we make a decision before or around
the middle of June.

Thanks,

> 
> - Michael
> 
> On 4/11/24 11:36, Yu Zhang wrote:
> > > 1) Either a CI test covering at least the major RDMA paths, or at least
> > >      periodically tests for each QEMU release will be needed.
> > We use a batch of regression test cases for the stack, which covers the
> > test for QEMU. I did such test for most of the QEMU releases planned as
> > candidates for rollout.
> > 
> > The migration test needs a pair of (either physical or virtual) servers with
> > InfiniBand network, which makes it difficult to do on a single server. The
> > nested VM could be a possible approach, for which we may need virtual
> > InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you know.
> > 
> > [1]  https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce
> > 
> > Thanks and best regards!
> > 
> > On Thu, Apr 11, 2024 at 4:20 PM Peter Xu <peterx@redhat.com> wrote:
> > > On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
> > > > On Wed, Apr 10, 2024 at 02:28:59AM +0000, Zhijian Li (Fujitsu) via wrote:
> > > > > 
> > > > > on 4/10/2024 3:46 AM, Peter Xu wrote:
> > > > > 
> > > > > > > Is there document/link about the unittest/CI for migration tests, Why
> > > > > > > are those tests missing?
> > > > > > > Is it hard or very special to set up an environment for that? maybe we
> > > > > > > can help in this regards.
> > > > > > See tests/qtest/migration-test.c.  We put most of our migration tests
> > > > > > there and that's covered in CI.
> > > > > > 
> > > > > > I think one major issue is CI systems don't normally have rdma devices.
> > > > > > Can rdma migration test be carried out without a real hardware?
> > > > > Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> > > > > $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> > > > > then we can get a new RDMA interface "rxe_eth0".
> > > > > This new RDMA interface is able to do the QEMU RDMA migration.
> > > > > 
> > > > > Also, the loopback(lo) device is able to emulate the RDMA interface
> > > > > "rxe_lo", however when
> > > > > I tried(years ago) to do RDMA migration over this
> > > > > interface(rdma:127.0.0.1:3333) , it got something wrong.
> > > > > So i gave up enabling the RDMA migration qtest at that time.
> > > > Thanks, Zhijian.
> > > > 
> > > > I'm not sure adding an emu-link for rdma is doable for CI systems, though.
> > > > Maybe someone more familiar with how CI works can chime in.
> > > Some people got dropped on the cc list for unknown reason, I'm adding them
> > > back (Fabiano, Peter Maydell, Phil).  Let's make sure nobody is dropped by
> > > accident.
> > > 
> > > I'll try to summarize what is still missing, and I think these will be
> > > greatly helpful if we don't want to deprecate rdma migration:
> > > 
> > >    1) Either a CI test covering at least the major RDMA paths, or at least
> > >       periodically tests for each QEMU release will be needed.
> > > 
> > >    2) Some performance tests between modern RDMA and NIC devices are
> > >       welcomed.  The current knowledge is modern NIC can work similarly to
> > >       RDMA in performance, then it's debatable why we still maintain so much
> > >       rdma specific code.
> > > 
> > >    3) No need to be solid patchsets for this one, but some plan to improve
> > >       RDMA migration code so that it is not almost isolated from the rest
> > >       protocols.
> > > 
> > >    4) Someone to look after this code for real.
> > > 
> > > For 2) and 3) more info is here:
> > > 
> > > https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n
> > > 
> > > Here 4) can be the most important as Markus pointed out.  We just didn't
> > > get there yet on the discussions, but maybe Markus is right that we should
> > > talk that first.
> > > 
> > > Thanks,
> > > 
> > > --
> > > Peter Xu
> > > 
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-29 14:56                                   ` Peter Xu
@ 2024-04-29 20:45                                     ` Yu Zhang
  2024-04-29 20:56                                       ` Michael Galaxy
  2024-04-30  7:15                                     ` Markus Armbruster
  1 sibling, 1 reply; 52+ messages in thread
From: Yu Zhang @ 2024-04-29 20:45 UTC (permalink / raw)
  To: Michael Galaxy, Peter Xu
  Cc: Zhijian Li (Fujitsu),
	Jinpu Wang, Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	arei.gonglei, pannengyuan

[-- Attachment #1: Type: text/plain, Size: 7071 bytes --]

Hello Michael and Peter,

We are very grateful for your quick and kind replies about our plan to take
over the maintenance of your code. This message presents our plan as a basis
for working together.
If we are able to obtain the maintainer's role, our plan is to:

1. Create the necessary unit-test cases and get them integrated into
the current QEMU GitLab-CI pipeline
2. Review and test the code changes by other developers to ensure that
nothing is broken in the changed code before being merged by the
community
3. Based on our current practice and application scenario, look for
possible improvements when necessary

Besides that, a patch is attached to announce this change in the community.

With your generous support, we hope that the development community
will make a positive decision for us.

Kind regards,
Yu Zhang@ IONOS Cloud

On Mon, Apr 29, 2024 at 4:57 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> > Hi All (and Peter),
>
> Hi, Michael,
>
> >
> > My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
> > (highly irregular for a male) and yes, that's my real last name:
> > https://www.linkedin.com/in/mrgalaxy/)
> >
> > I'm the original author of the RDMA implementation. I've been discussing
> > with Yu Zhang for a little bit about potentially handing over maintainership
> > of the codebase to his team.
> >
> > I simply have zero access to RoCE or Infiniband hardware at all,
> > unfortunately. so I've never been able to run tests or use what I wrote at
> > work, and as all of you know, if you don't have a way to test something,
> > then you can't maintain it.
> >
> > Yu Zhang put a (very kind) proposal forward to me to ask the community if
> > they feel comfortable training his team to maintain the codebase (and run
> > tests) while they learn about it.
>
> The "while learning" part is fine at least to me.  IMHO the "ownership" to
> the code, or say, taking over the responsibility, may or may not need 100%
> mastering the code base first.  There should still be some fundamental
> confidence to work on the code though as a starting point, then it's about
> serious use case to back this up, and careful testings while getting more
> familiar with it.
>
> >
> > If you don't mind, I'd like to let him send over his (very detailed)
> > proposal,
>
> Yes please, it's exactly the time to share the plan.  The hope is we try to
> reach a consensus before or around the middle of this release (9.1).
> Normally QEMU has a 3~4 months window for each release and 9.1 schedule is
> not yet out, but I think it means we make a decision before or around
> middle of June.
>
> Thanks,
>
> >
> > - Michael
> >
> > On 4/11/24 11:36, Yu Zhang wrote:
> > > > 1) Either a CI test covering at least the major RDMA paths, or at least
> > > >      periodically tests for each QEMU release will be needed.
> > > We use a batch of regression test cases for the stack, which covers the
> > > test for QEMU. I did such test for most of the QEMU releases planned as
> > > candidates for rollout.
> > >
> > > The migration test needs a pair of (either physical or virtual) servers with
> > > InfiniBand network, which makes it difficult to do on a single server. The
> > > nested VM could be a possible approach, for which we may need virtual
> > > InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you know.
> > >
> > > [1]  https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce
> > >
> > > Thanks and best regards!
> > >
> > > On Thu, Apr 11, 2024 at 4:20 PM Peter Xu <peterx@redhat.com> wrote:
> > > > On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
> > > > > On Wed, Apr 10, 2024 at 02:28:59AM +0000, Zhijian Li (Fujitsu) via wrote:
> > > > > >
> > > > > > on 4/10/2024 3:46 AM, Peter Xu wrote:
> > > > > >
> > > > > > > > Is there document/link about the unittest/CI for migration tests, Why
> > > > > > > > are those tests missing?
> > > > > > > > Is it hard or very special to set up an environment for that? maybe we
> > > > > > > > can help in this regards.
> > > > > > > See tests/qtest/migration-test.c.  We put most of our migration tests
> > > > > > > there and that's covered in CI.
> > > > > > >
> > > > > > > I think one major issue is CI systems don't normally have rdma devices.
> > > > > > > Can rdma migration test be carried out without a real hardware?
> > > > > > Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> > > > > > $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> > > > > > then we can get a new RDMA interface "rxe_eth0".
> > > > > > This new RDMA interface is able to do the QEMU RDMA migration.
> > > > > >
> > > > > > Also, the loopback(lo) device is able to emulate the RDMA interface
> > > > > > "rxe_lo", however when
> > > > > > I tried(years ago) to do RDMA migration over this
> > > > > > interface(rdma:127.0.0.1:3333) , it got something wrong.
> > > > > > So i gave up enabling the RDMA migration qtest at that time.
> > > > > Thanks, Zhijian.
> > > > >
> > > > > I'm not sure adding an emu-link for rdma is doable for CI systems, though.
> > > > > Maybe someone more familiar with how CI works can chime in.
> > > > Some people got dropped on the cc list for unknown reason, I'm adding them
> > > > back (Fabiano, Peter Maydell, Phil).  Let's make sure nobody is dropped by
> > > > accident.
> > > >
> > > > I'll try to summarize what is still missing, and I think these will be
> > > > greatly helpful if we don't want to deprecate rdma migration:
> > > >
> > > >    1) Either a CI test covering at least the major RDMA paths, or at least
> > > >       periodic tests for each QEMU release will be needed.
> > > >
> > > >    2) Some performance tests between modern RDMA and NIC devices are
> > > >       welcomed.  The current knowledge is modern NIC can work similarly to
> > > >       RDMA in performance, then it's debatable why we still maintain so much
> > > >       rdma specific code.
> > > >
> > > >    3) No need to be solid patchsets for this one, but some plan to improve
> > > >       RDMA migration code so that it is not almost isolated from the rest
> > > >       protocols.
> > > >
> > > >    4) Someone to look after this code for real.
> > > >
> > > > For 2) and 3) more info is here:
> > > >
> > > > https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n
> > > >
> > > > Here 4) can be the most important as Markus pointed out.  We just didn't
> > > > get there yet on the discussions, but maybe Markus is right that we should
> > > > talk that first.
> > > >
> > > > Thanks,
> > > >
> > > > --
> > > > Peter Xu
> > > >
> >
>
> --
> Peter Xu
>

[-- Attachment #2: 0001-MAINTAINERS-Update-the-maintainers-and-reviewers-for.patch --]
[-- Type: text/x-patch, Size: 1976 bytes --]

From 40dea392f9ca606c2a0c53999d662670eb08b2d8 Mon Sep 17 00:00:00 2001
From: Yu Zhang <yu.zhang@ionos.com>
Date: Mon, 29 Apr 2024 15:31:53 +0200
Subject: [PATCH] MAINTAINERS: Update the maintainers and reviewers for RDMA
 migration

As stated in the links [1][2] below, the QEMU development community is
currently having some difficulties maintaining the RDMA migration subsystem
due to a lack of resources (maintainers, test cases, test environment etc.)
and is considering deprecating it.

According to our user experience over the past two years, we observed that
RDMA is capable of providing higher migration speed and lower performance
impact to a running VM, which can significantly improve the end-user's
experience during the VM live migration. We believe that RDMA still plays
a key role for the QoS and can't yet be replaced by TCP/IP for VM migration
at the moment.

With the consent and support of Michael Galaxy, who developed this
feature for QEMU, we would like to take over the maintainer's role and
create the necessary resources to maintain it further for the community.

Jinpu Wang is the upstream maintainer of RNBD/RTRS. He is experienced in
RDMA programming, and Yu Zhang maintains the downstream QEMU for IONOS
cloud in production.

[1] https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg00001.html
[2] https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg00228.html

Signed-off-by: Yu Zhang <yu.zhang@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
---
 MAINTAINERS | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 302b6fd00c..54d32dff94 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3413,7 +3413,10 @@ F: util/userfaultfd.c
 X: migration/rdma*
 
 RDMA Migration
+M: Yu Zhang <yu.zhang@ionos.com>
+M: Jack Wang <jack.wang@ionos.com>
 R: Li Zhijian <lizhijian@fujitsu.com>
+R: Michael Galaxy <mgalaxy@akamai.com>
 R: Peter Xu <peterx@redhat.com>
 S: Odd Fixes
 F: migration/rdma*
-- 
2.25.1



* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-29 20:45                                     ` Yu Zhang
@ 2024-04-29 20:56                                       ` Michael Galaxy
  0 siblings, 0 replies; 52+ messages in thread
From: Michael Galaxy @ 2024-04-29 20:56 UTC (permalink / raw)
  To: Yu Zhang, Peter Xu
  Cc: Zhijian Li (Fujitsu),
	Jinpu Wang, Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Markus Armbruster,
	Alex Bennée, Wainer dos Santos Moschetta, Beraldo Leal,
	arei.gonglei, pannengyuan

Reviewed-by: Michael Galaxy <mgalaxy@akamai.com>

Thanks Yu Zhang and Peter.

- Michael

On 4/29/24 15:45, Yu Zhang wrote:
> Hello Michael and Peter,
>
> We are very grateful for your quick and kind reply about our plan to take
> over the maintenance of your code. This message presents our plan for
> working together.
> If we are able to obtain the maintainer's role, our plan is:
>
> 1. Create the necessary unit-test cases and get them integrated into
> the current QEMU GitLab-CI pipeline
> 2. Review and test the code changes by other developers to ensure that
> nothing is broken in the changed code before being merged by the
> community
> 3. Based on our current practice and application scenario, look for
> possible improvements when necessary
>
> Besides that, a patch is attached to announce this change in the community.
>
> With your generous support, we hope that the development community
> will make a positive decision for us.
>
> Kind regards,
> Yu Zhang@ IONOS Cloud
>
> On Mon, Apr 29, 2024 at 4:57 PM Peter Xu <peterx@redhat.com> wrote:
>> On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
>>> Hi All (and Peter),
>> Hi, Michael,
>>
>>> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
>>> (highly irregular for a male) and yes, that's my real last name:
>>> https://www.linkedin.com/in/mrgalaxy/)
>>>
>>> I'm the original author of the RDMA implementation. I've been discussing
>>> with Yu Zhang for a little bit about potentially handing over maintainership
>>> of the codebase to his team.
>>>
>>> I simply have zero access to RoCE or Infiniband hardware at all,
>>> unfortunately. so I've never been able to run tests or use what I wrote at
>>> work, and as all of you know, if you don't have a way to test something,
>>> then you can't maintain it.
>>>
>>> Yu Zhang put a (very kind) proposal forward to me to ask the community if
>>> they feel comfortable training his team to maintain the codebase (and run
>>> tests) while they learn about it.
>> The "while learning" part is fine at least to me.  IMHO the "ownership" to
>> the code, or say, taking over the responsibility, may or may not need 100%
>> mastering the code base first.  There should still be some fundamental
>> confidence to work on the code though as a starting point, then it's about
>> serious use case to back this up, and careful testings while getting more
>> familiar with it.
>>
>>> If you don't mind, I'd like to let him send over his (very detailed)
>>> proposal,
>> Yes please, it's exactly the time to share the plan.  The hope is we try to
>> reach a consensus before or around the middle of this release (9.1).
>> Normally QEMU has a 3~4 months window for each release and 9.1 schedule is
>> not yet out, but I think it means we make a decision before or around
>> middle of June.
>>
>> Thanks,
>>
>>> - Michael
>>>
>>> On 4/11/24 11:36, Yu Zhang wrote:
>>>>> 1) Either a CI test covering at least the major RDMA paths, or at least
>>>>>       periodic tests for each QEMU release will be needed.
>>>> We use a batch of regression test cases for the stack, which covers the
>>>> test for QEMU. I did such test for most of the QEMU releases planned as
>>>> candidates for rollout.
>>>>
>>>> The migration test needs a pair of (either physical or virtual) servers with
>>>> InfiniBand network, which makes it difficult to do on a single server. The
>>>> nested VM could be a possible approach, for which we may need virtual
>>>> InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you know.
>>>>
>>>> [1]  https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce
>>>>
>>>> Thanks and best regards!
>>>>
>>>> On Thu, Apr 11, 2024 at 4:20 PM Peter Xu <peterx@redhat.com> wrote:
>>>>> On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
>>>>>> On Wed, Apr 10, 2024 at 02:28:59AM +0000, Zhijian Li (Fujitsu) via wrote:
>>>>>>> on 4/10/2024 3:46 AM, Peter Xu wrote:
>>>>>>>
>>>>>>>>> Is there document/link about the unittest/CI for migration tests, Why
>>>>>>>>> are those tests missing?
>>>>>>>>> Is it hard or very special to set up an environment for that? maybe we
>>>>>>>>> can help in this regards.
>>>>>>>> See tests/qtest/migration-test.c.  We put most of our migration tests
>>>>>>>> there and that's covered in CI.
>>>>>>>>
>>>>>>>> I think one major issue is CI systems don't normally have rdma devices.
>>>>>>>> Can rdma migration test be carried out without a real hardware?
>>>>>>> Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
>>>>>>> $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
>>>>>>> then we can get a new RDMA interface "rxe_eth0".
>>>>>>> This new RDMA interface is able to do the QEMU RDMA migration.
>>>>>>>
>>>>>>> Also, the loopback(lo) device is able to emulate the RDMA interface
>>>>>>> "rxe_lo", however when
>>>>>>> I tried(years ago) to do RDMA migration over this
>>>>>>> interface(rdma:127.0.0.1:3333) , it got something wrong.
>>>>>>> So i gave up enabling the RDMA migration qtest at that time.
>>>>>> Thanks, Zhijian.
>>>>>>
>>>>>> I'm not sure adding an emu-link for rdma is doable for CI systems, though.
>>>>>> Maybe someone more familiar with how CI works can chime in.
>>>>> Some people got dropped on the cc list for unknown reason, I'm adding them
>>>>> back (Fabiano, Peter Maydell, Phil).  Let's make sure nobody is dropped by
>>>>> accident.
>>>>>
>>>>> I'll try to summarize what is still missing, and I think these will be
>>>>> greatly helpful if we don't want to deprecate rdma migration:
>>>>>
>>>>>     1) Either a CI test covering at least the major RDMA paths, or at least
>>>>>        periodic tests for each QEMU release will be needed.
>>>>>
>>>>>     2) Some performance tests between modern RDMA and NIC devices are
>>>>>        welcomed.  The current knowledge is modern NIC can work similarly to
>>>>>        RDMA in performance, then it's debatable why we still maintain so much
>>>>>        rdma specific code.
>>>>>
>>>>>     3) No need to be solid patchsets for this one, but some plan to improve
>>>>>        RDMA migration code so that it is not almost isolated from the rest
>>>>>        protocols.
>>>>>
>>>>>     4) Someone to look after this code for real.
>>>>>
>>>>> For 2) and 3) more info is here:
>>>>>
>>>>> https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n
>>>>>
>>>>> Here 4) can be the most important as Markus pointed out.  We just didn't
>>>>> get there yet on the discussions, but maybe Markus is right that we should
>>>>> talk that first.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --
>>>>> Peter Xu
>>>>>
>> --
>> Peter Xu
>>



* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-29 14:56                                   ` Peter Xu
  2024-04-29 20:45                                     ` Yu Zhang
@ 2024-04-30  7:15                                     ` Markus Armbruster
  2024-04-30  8:00                                       ` Daniel P. Berrangé
  1 sibling, 1 reply; 52+ messages in thread
From: Markus Armbruster @ 2024-04-30  7:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Michael Galaxy, Yu Zhang, Zhijian Li (Fujitsu),
	Jinpu Wang, Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	Daniel P. Berrangé,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, arei.gonglei,
	pannengyuan

Peter Xu <peterx@redhat.com> writes:

> On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
>> Hi All (and Peter),
>
> Hi, Michael,
>
>> 
>> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
>> (highly irregular for a male) and yes, that's my real last name:
>> https://www.linkedin.com/in/mrgalaxy/)
>> 
>> I'm the original author of the RDMA implementation. I've been discussing
>> with Yu Zhang for a little bit about potentially handing over maintainership
>> of the codebase to his team.
>> 
>> I simply have zero access to RoCE or Infiniband hardware at all,
>> unfortunately. so I've never been able to run tests or use what I wrote at
>> work, and as all of you know, if you don't have a way to test something,
>> then you can't maintain it.
>> 
>> Yu Zhang put a (very kind) proposal forward to me to ask the community if
>> they feel comfortable training his team to maintain the codebase (and run
>> tests) while they learn about it.
>
> The "while learning" part is fine at least to me.  IMHO the "ownership" to
> the code, or say, taking over the responsibility, may or may not need 100%
> mastering the code base first.  There should still be some fundamental
> confidence to work on the code though as a starting point, then it's about
> serious use case to back this up, and careful testings while getting more
> familiar with it.

How much experience we expect of maintainers depends on the subsystem
and other circumstances.  The hard requirement isn't experience, it's
trust.  See the recent attack on xz.

I do not mean to express any doubts whatsoever on Yu Zhang's integrity!
I'm merely reminding y'all what's at stake.

[...]




* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-30  7:15                                     ` Markus Armbruster
@ 2024-04-30  8:00                                       ` Daniel P. Berrangé
  2024-05-01 15:31                                         ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Daniel P. Berrangé @ 2024-04-30  8:00 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Peter Xu, Michael Galaxy, Yu Zhang, Zhijian Li (Fujitsu),
	Jinpu Wang, Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, arei.gonglei,
	pannengyuan

On Tue, Apr 30, 2024 at 09:15:03AM +0200, Markus Armbruster wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> >> Hi All (and Peter),
> >
> > Hi, Michael,
> >
> >> 
> >> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
> >> (highly irregular for a male) and yes, that's my real last name:
> >> https://www.linkedin.com/in/mrgalaxy/)
> >> 
> >> I'm the original author of the RDMA implementation. I've been discussing
> >> with Yu Zhang for a little bit about potentially handing over maintainership
> >> of the codebase to his team.
> >> 
> >> I simply have zero access to RoCE or Infiniband hardware at all,
> >> unfortunately. so I've never been able to run tests or use what I wrote at
> >> work, and as all of you know, if you don't have a way to test something,
> >> then you can't maintain it.
> >> 
> >> Yu Zhang put a (very kind) proposal forward to me to ask the community if
> >> they feel comfortable training his team to maintain the codebase (and run
> >> tests) while they learn about it.
> >
> > The "while learning" part is fine at least to me.  IMHO the "ownership" to
> > the code, or say, taking over the responsibility, may or may not need 100%
> > mastering the code base first.  There should still be some fundamental
> > confidence to work on the code though as a starting point, then it's about
> > serious use case to back this up, and careful testings while getting more
> > familiar with it.
> 
> How much experience we expect of maintainers depends on the subsystem
> and other circumstances.  The hard requirement isn't experience, it's
> trust.  See the recent attack on xz.
> 
> I do not mean to express any doubts whatsoever on Yu Zhang's integrity!
> I'm merely reminding y'all what's at stake.

I think we shouldn't overly obsess[1] about 'xz', because the overwhelmingly
common scenario is that volunteer maintainers are honest people. QEMU is
in a massively better peer review situation. With xz there was basically no
oversight of the new maintainer. With QEMU, we have oversight from 1000's
of people on the list, a huge pool of general maintainers, the specific
migration maintainers, and the release manager merging code.

With a lack of historical experience with QEMU maintainership, I'd suggest
that new RDMA volunteers would start by adding themselves to the "MAINTAINERS"
file with only the 'Reviewer' classification. The main migration maintainers
would still handle pull requests, but wait for an R-b from one of the RDMA
volunteers. After some period of time the RDMA folks could graduate to full
maintainer status if the migration maintainers needed to reduce their load.
I suspect that might prove unnecessary though, given RDMA isn't an area of
code with a high turnover of patches.

With regards,
Daniel

[1] If we do want to obsess about something bad though, we should
    look at our handling of binary blobs in the repo and tarballs.
    ie the firmware binaries that all get built in an arbitrary
    environment of their respective maintainer. If we need firmware
    blobs in tree, we should strive to come up with a reproducible
    build environment that gives us byte-for-byte identical results,
    so the blobs can be verified. This is rather a tangent from this
    thread though :)
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-04-30  8:00                                       ` Daniel P. Berrangé
@ 2024-05-01 15:31                                         ` Peter Xu
  2024-05-01 15:59                                           ` Daniel P. Berrangé
  2024-05-06  2:06                                           ` Gonglei (Arei) via
  0 siblings, 2 replies; 52+ messages in thread
From: Peter Xu @ 2024-05-01 15:31 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Markus Armbruster, Michael Galaxy, Yu Zhang, Zhijian Li (Fujitsu),
	Jinpu Wang, Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, arei.gonglei,
	pannengyuan

On Tue, Apr 30, 2024 at 09:00:49AM +0100, Daniel P. Berrangé wrote:
> On Tue, Apr 30, 2024 at 09:15:03AM +0200, Markus Armbruster wrote:
> > Peter Xu <peterx@redhat.com> writes:
> > 
> > > On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> > >> Hi All (and Peter),
> > >
> > > Hi, Michael,
> > >
> > >> 
> > >> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
> > >> (highly irregular for a male) and yes, that's my real last name:
> > >> https://www.linkedin.com/in/mrgalaxy/)
> > >> 
> > >> I'm the original author of the RDMA implementation. I've been discussing
> > >> with Yu Zhang for a little bit about potentially handing over maintainership
> > >> of the codebase to his team.
> > >> 
> > >> I simply have zero access to RoCE or Infiniband hardware at all,
> > >> unfortunately. so I've never been able to run tests or use what I wrote at
> > >> work, and as all of you know, if you don't have a way to test something,
> > >> then you can't maintain it.
> > >> 
> > >> Yu Zhang put a (very kind) proposal forward to me to ask the community if
> > >> they feel comfortable training his team to maintain the codebase (and run
> > >> tests) while they learn about it.
> > >
> > > The "while learning" part is fine at least to me.  IMHO the "ownership" to
> > > the code, or say, taking over the responsibility, may or may not need 100%
> > > mastering the code base first.  There should still be some fundamental
> > > confidence to work on the code though as a starting point, then it's about
> > > serious use case to back this up, and careful testings while getting more
> > > familiar with it.
> > 
> > How much experience we expect of maintainers depends on the subsystem
> > and other circumstances.  The hard requirement isn't experience, it's
> > trust.  See the recent attack on xz.
> > 
> > I do not mean to express any doubts whatsoever on Yu Zhang's integrity!
> > I'm merely reminding y'all what's at stake.
> 
> I think we shouldn't overly obsess[1] about 'xz', because the overwhelmingly
> common scenario is that volunteer maintainers are honest people. QEMU is
> in a massively better peer review situation. With xz there was basically no
> oversight of the new maintainer. With QEMU, we have oversight from 1000's
> of people on the list, a huge pool of general maintainers, the specific
> migration maintainers, and the release manager merging code.
> 
> With a lack of historical experience with QEMU maintainership, I'd suggest
> that new RDMA volunteers would start by adding themselves to the "MAINTAINERS"
> file with only the 'Reviewer' classification. The main migration maintainers
> would still handle pull requests, but wait for an R-b from one of the RDMA
> volunteers. After some period of time the RDMA folks could graduate to full
> maintainer status if the migration maintainers needed to reduce their load.
> I suspect that might prove unnecessary though, given RDMA isn't an area of
> code with a high turnover of patches.

Right, and we can do that as a start; it also follows our normal rule of
starting as Reviewers before maintaining something.  I even considered
Zhijian to be the previous rdma go-to person / maintainer no matter what
role he used to have in the MAINTAINERS file.

Here IMHO it's more about whether any company would like to stand up and
provide help, without yet tying that to being able to send pull requests in
the near future or even the longer term.

What I worry about more is whether we really want to keep rdma in qemu,
and that's also why I was trying to request some serious performance
measurements comparing rdma vs. NICs.  And here when I say "we" I mean
both the QEMU community and any company that will support keeping rdma
around.

The problem is: if NICs are now fast enough to perform at least as well as
rdma, and if they have a lower overall maintenance cost, does it mean that
rdma migration will only be used by those who already have it in their
products and want to keep it?  In that case we should simply ask new users
to stick with tcp, and the number of rdma users should only shrink, not
grow.

It also seems destined that most new migration features will not support
rdma: see how many old features we are dropping in migration now (which
rdma _might_ still leverage, but maybe not), and how much of what we add is
multifd-related and will probably not apply to rdma at all.  So in general
what I am worried about is a lose-lose situation, where the company might
find it easier to either stick with an old qemu (depending on whether it
needs new features other than RDMA), or carry RDMA downstream only and
rebase periodically.

So even if we want to keep RDMA around, I hope we can use this chance to at
least get a clear picture of when we should still suggest that new users
use RDMA (with the reasons behind it).  Or maybe we simply shouldn't suggest
RDMA to any new user at all (because at the very least they'll lose many new
migration features).

Thanks,

-- 
Peter Xu




* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-05-01 15:31                                         ` Peter Xu
@ 2024-05-01 15:59                                           ` Daniel P. Berrangé
  2024-05-01 16:16                                             ` Peter Xu
  2024-05-03  6:40                                             ` Jinpu Wang
  2024-05-06  2:06                                           ` Gonglei (Arei) via
  1 sibling, 2 replies; 52+ messages in thread
From: Daniel P. Berrangé @ 2024-05-01 15:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: Markus Armbruster, Michael Galaxy, Yu Zhang, Zhijian Li (Fujitsu),
	Jinpu Wang, Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, arei.gonglei,
	pannengyuan

On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> What I worry about more is whether we really want to keep rdma in qemu,
> and that's also why I was trying to request some serious performance
> measurements comparing rdma vs. NICs.  And here when I say "we" I mean
> both the QEMU community and any company that will support keeping rdma
> around.
> 
> The problem is: if NICs are now fast enough to perform at least as well as
> rdma, and if they have a lower overall maintenance cost, does it mean that
> rdma migration will only be used by those who already have it in their
> products and want to keep it?  In that case we should simply ask new users
> to stick with tcp, and the number of rdma users should only shrink, not
> grow.
> 
> It also seems destined that most new migration features will not support
> rdma: see how many old features we are dropping in migration now (which
> rdma _might_ still leverage, but maybe not), and how much of what we add is
> multifd-related and will probably not apply to rdma at all.  So in general
> what I am worried about is a lose-lose situation, where the company might
> find it easier to either stick with an old qemu (depending on whether it
> needs new features other than RDMA), or carry RDMA downstream only and
> rebase periodically.

I don't know much about the origins of RDMA support in QEMU and why
this particular design was taken. It is indeed a huge maint burden to
have a completely different code flow for RDMA with 4000+ lines of
custom protocol signalling which is barely understandable.

I would note that /usr/include/rdma/rsocket.h provides a higher level
API that is a 1-1 match of the normal kernel 'sockets' API. If we had
leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
type could almost[1] trivially have supported RDMA. There would have
been almost no RDMA code required in the migration subsystem, and all
the modern features like compression, multifd, post-copy, etc would
"just work".

I guess the 'rsocket.h' shim may well limit some of the possible
performance gains, but it might still have been a better tradeoff
to have not quite so good peak performance, but with massively
less maint burden.
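
To make the "1-1 match" with the sockets API concrete, here is a minimal
sketch in C. It is not QEMU code and makes several assumptions: the S()
macro, the USE_RSOCKETS switch and the loopback_roundtrip() helper are
hypothetical names invented for this illustration. The only real API it
relies on is that librdmacm's <rdma/rsocket.h> declares rsocket(), rbind(),
rlisten(), rgetsockname(), rconnect(), raccept(), rsend(), rrecv() and
rclose() with the same argument lists as their plain counterparts.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical switch: with -DUSE_RSOCKETS (and -lrdmacm) every call below
 * becomes its r-prefixed rsockets twin; otherwise plain sockets are used. */
#ifdef USE_RSOCKETS
#include <rdma/rsocket.h>
#define S(fn) r##fn
#else
#define S(fn) fn
#endif

/* Send msg over a loopback stream connection and read it back into out.
 * Returns the number of bytes received, or -1 on error.
 * Assumption: connecting before accepting in a single thread relies on the
 * TCP listen backlog; with rsockets the two ends would likely have to run
 * in separate threads or processes. */
static ssize_t loopback_roundtrip(const char *msg, char *out, size_t outlen)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    int lfd = -1, cfd = -1, sfd = -1;
    ssize_t n = -1;

    assert(msg != NULL && out != NULL);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                  /* let the stack pick a free port */

    /* Listening side. */
    lfd = S(socket)(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0) goto done;
    if (S(bind)(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto done;
    if (S(listen)(lfd, 1) < 0) goto done;
    /* Learn which port was actually assigned. */
    if (S(getsockname)(lfd, (struct sockaddr *)&addr, &alen) < 0) goto done;

    /* Connecting side, then accept the pending connection. */
    cfd = S(socket)(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0) goto done;
    if (S(connect)(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto done;
    sfd = S(accept)(lfd, NULL, NULL);
    if (sfd < 0) goto done;

    /* One short message in one direction. */
    if (S(send)(cfd, msg, strlen(msg), 0) != (ssize_t)strlen(msg)) goto done;
    n = S(recv)(sfd, out, outlen, 0);

done:
    if (sfd >= 0) S(close)(sfd);
    if (cfd >= 0) S(close)(cfd);
    if (lfd >= 0) S(close)(lfd);
    return n;
}
```

The point of the sketch is only that the call sites do not change between
the two transports. It says nothing about performance, and it sidesteps the
poll() integration issue from [1] by using blocking calls throughout.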

With regards,
Daniel

[1] "almost" trivially, because the poll() integration for rsockets
    requires a bit more magic sauce since rsockets FDs are not
    really FDs from the kernel's POV. Still, QIOChannel likely can
    abstract that problem.
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-05-01 15:59                                           ` Daniel P. Berrangé
@ 2024-05-01 16:16                                             ` Peter Xu
  2024-05-02 13:22                                               ` Michael Galaxy
  2024-05-03  6:40                                             ` Jinpu Wang
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Xu @ 2024-05-01 16:16 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Markus Armbruster, Michael Galaxy, Yu Zhang, Zhijian Li (Fujitsu),
	Jinpu Wang, Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, arei.gonglei,
	pannengyuan

On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > What I worry about more is whether we really want to keep rdma in qemu,
> > and that's also why I was trying to request some serious performance
> > measurements comparing rdma vs. NICs.  And here when I say "we" I mean
> > both the QEMU community and any company that will support keeping rdma
> > around.
> > 
> > The problem is if NICs now are fast enough to perform at least equally
> > against rdma, and if it has a lower cost of overall maintenance, does it
> > mean that rdma migration will only be used by whoever wants to keep them in
> > the products and existed already?  In that case we should simply ask new
> > users to stick with tcp, and rdma users should only drop but not increase.
> > 
> > It seems also destined that most new migration features will not support
> > rdma: see how much we drop old features in migration now (which rdma
> > _might_ still leverage, but maybe not), and how much we add mostly multifd
> > relevant which will probably not apply to rdma at all.  So in general what
> > I am worrying is a both-loss condition, if the company might be easier to
> > either stick with an old qemu (depending on whether other new features are
> > requested to be used besides RDMA alone), or do periodic rebase with RDMA
> > downstream only.
> 
> I don't know much about the origins of RDMA support in QEMU and why
> this particular design was chosen. It is indeed a huge maint burden to
> have a completely different code flow for RDMA with 4000+ lines of
> custom protocol signalling which is barely understandable.
> 
> I would note that /usr/include/rdma/rsocket.h provides a higher level
> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> type could almost[1] trivially have supported RDMA. There would have
> been almost no RDMA code required in the migration subsystem, and all
> the modern features like compression, multifd, post-copy, etc would
> "just work".
> 
> I guess the 'rsocket.h' shim may well limit some of the possible
> performance gains, but it might still have been a better tradeoff
> to have not quite so good peak performance, but with massively
> less maint burden.

My understanding so far is that RDMA is solely about performance and
nothing else, so the question is whether existing rdma users would want
to do that if it runs slower.

Jinpu mentioned the explicit use of ib verbs, but I am mostly just
quoting that term as I don't really know the details:

https://lore.kernel.org/qemu-devel/CAMGffEm2TWJxOPcNQTQ1Sjytf5395dBzTCMYiKRqfxDzJwSN6A@mail.gmail.com/

So I'm not sure whether that applies here too, in that a QIOChannel
wrapper may not allow direct access to those ib verbs.

Thanks,

> 
> With regards,
> Daniel
> 
> [1] "almost" trivially, because the poll() integration for rsockets
>     requires a bit more magic sauce since rsockets FDs are not
>     really FDs from the kernel's POV. Still, QIOChannel likely can
>     abstract that problem.
> 

-- 
Peter Xu




* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-05-01 16:16                                             ` Peter Xu
@ 2024-05-02 13:22                                               ` Michael Galaxy
  2024-05-02 13:30                                                 ` Jinpu Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Michael Galaxy @ 2024-05-02 13:22 UTC (permalink / raw)
  To: Peter Xu, Daniel P. Berrangé
  Cc: Markus Armbruster, Yu Zhang, Zhijian Li (Fujitsu),
	Jinpu Wang, Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, arei.gonglei,
	pannengyuan

Yu Zhang / Jinpu,

Any possibility (at your leisure, and within the disclosure rules of
your company, IONOS) that you could share some of your performance
information to educate the group?

NICs have indeed changed, but not everybody has 100ge mellanox cards at 
their disposal. Some people don't.

- Michael




* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-05-02 13:22                                               ` Michael Galaxy
@ 2024-05-02 13:30                                                 ` Jinpu Wang
  2024-05-02 16:19                                                   ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Jinpu Wang @ 2024-05-02 13:30 UTC (permalink / raw)
  To: Michael Galaxy
  Cc: Peter Xu, Daniel P. Berrangé,
	Markus Armbruster, Yu Zhang, Zhijian Li (Fujitsu),
	Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, arei.gonglei,
	pannengyuan

Hi Michael, Hi Peter,


On Thu, May 2, 2024 at 3:23 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
>
> Yu Zhang / Jinpu,
>
> Any possibility (at your leisure, and within the disclosure rules of
> your company, IONOS) that you could share some of your performance
> information to educate the group?
>
> NICs have indeed changed, but not everybody has 100ge mellanox cards at
> their disposal. Some people don't.
Our staging environment is a 100 Gb/s IB environment.
We will have a new setup with Ethernet (RoCE) in the coming months, and
we will run some performance comparisons once that environment is ready.

>
> - Michael

Thx!
Jinpu



* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-05-02 13:30                                                 ` Jinpu Wang
@ 2024-05-02 16:19                                                   ` Peter Xu
  2024-05-02 17:10                                                     ` Jinpu Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2024-05-02 16:19 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Michael Galaxy, Daniel P. Berrangé,
	Markus Armbruster, Yu Zhang, Zhijian Li (Fujitsu),
	Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, arei.gonglei,
	pannengyuan

On Thu, May 02, 2024 at 03:30:58PM +0200, Jinpu Wang wrote:
> Hi Michael, Hi Peter,
> 
> 
> On Thu, May 2, 2024 at 3:23 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
> >
> > Yu Zhang / Jinpu,
> >
> > Any possibility (at your leisure, and within the disclosure rules of
> > your company, IONOS) that you could share some of your performance
> > information to educate the group?
> >
> > NICs have indeed changed, but not everybody has 100ge mellanox cards at
> > their disposal. Some people don't.
> Our staging environment is a 100 Gb/s IB environment.
> We will have a new setup with Ethernet (RoCE) in the coming months, and
> we will run some performance comparisons once that environment is ready.

Thanks both.  Please keep us posted.

Just to double check: we're comparing "tcp:" vs. "rdma:", and RoCE is not
involved, am I right?

The other note is that the comparison needs to be with multifd enabled for
the "tcp:" case.  I'd suggest we start with 8 threads if it's 100Gbps.

I think I can still find some 100Gbps or even 200Gbps nics around our labs
without having to wait for months.  If you want, I can try to see how we can
test together.  And btw I don't think we need a cluster; IIUC we simply
need two hosts with a 100G nic on each side?  IOW, it seems to me we only
need two cards for the experiments, systems that can drive the cards, and a
cable supporting 100G?
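
As a rough illustration of the comparison described above, the sketch
below builds the QMP commands for the two source-side runs: "tcp:" with
multifd enabled and 8 channels, and plain "rdma:".  The helper names, the
destination address 192.0.2.10 and the port 4444 are placeholders made up
for this example.  The QMP commands themselves (migrate-set-capabilities,
migrate-set-parameters with multifd-channels, migrate) are existing ones,
but the exact sequence should be read as an assumption, not a tested recipe.

```python
import json


def qmp(command, arguments=None):
    """Build one QMP command as a JSON-serialisable dict."""
    msg = {"execute": command}
    if arguments is not None:
        msg["arguments"] = arguments
    return msg


def migration_commands(transport, host, port, multifd_channels=0):
    """Return the source-side QMP commands for one migration run.

    transport is the URI scheme ("tcp" or "rdma"); multifd is only
    configured when multifd_channels is non-zero.
    """
    cmds = []
    if multifd_channels:
        cmds.append(qmp("migrate-set-capabilities", {
            "capabilities": [{"capability": "multifd", "state": True}],
        }))
        cmds.append(qmp("migrate-set-parameters", {
            "multifd-channels": multifd_channels,
        }))
    cmds.append(qmp("migrate", {"uri": f"{transport}:{host}:{port}"}))
    return cmds


# Placeholder destination (hypothetical address and port).
tcp_run = migration_commands("tcp", "192.0.2.10", 4444, multifd_channels=8)
rdma_run = migration_commands("rdma", "192.0.2.10", 4444)

if __name__ == "__main__":
    for name, run in (("tcp + multifd", tcp_run), ("rdma", rdma_run)):
        print(f"# {name}")
        for cmd in run:
            print(json.dumps(cmd))
```

Each printed line could be sent to the source QEMU's QMP monitor after the
usual qmp_capabilities handshake.  The destination would need multifd
enabled as well before the tcp run, which this sketch does not cover.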

-- 
Peter Xu




* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-05-02 16:19                                                   ` Peter Xu
@ 2024-05-02 17:10                                                     ` Jinpu Wang
  0 siblings, 0 replies; 52+ messages in thread
From: Jinpu Wang @ 2024-05-02 17:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: Michael Galaxy, Daniel P. Berrangé,
	Markus Armbruster, Yu Zhang, Zhijian Li (Fujitsu),
	Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, arei.gonglei,
	pannengyuan

Hi Peter

On Thu, May 2, 2024 at 6:20 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, May 02, 2024 at 03:30:58PM +0200, Jinpu Wang wrote:
> > Hi Michael, Hi Peter,
> >
> >
> > On Thu, May 2, 2024 at 3:23 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
> > >
> > > Yu Zhang / Jinpu,
> > >
> > > Any possibility (at your leisure, and within the disclosure rules of
> > > your company, IONOS) that you could share some of your performance
> > > information to educate the group?
> > >
> > > NICs have indeed changed, but not everybody has 100ge mellanox cards at
> > > their disposal. Some people don't.
> > Our staging environment is a 100 Gb/s IB environment.
> > We will have a new setup with Ethernet (RoCE) in the coming months, and
> > we will run some performance comparisons once that environment is ready.
>
> Thanks both.  Please keep us posted.
>
> Just to double check: we're comparing "tcp:" vs. "rdma:", and RoCE is not
> involved, am I right?
Kind of. Our new hardware is RDMA capable, and we can configure it to run
with either the "rdma" or the "tcp" transport, so it is a more direct
comparison. When running the "rdma" transport, RoCE is involved, e.g. the
rdma-core/ibverbs/rdmacm/vendor verbs drivers are used.
>
> The other note is that the comparison needs to be with multifd enabled for
> the "tcp:" case.  I'd suggest we start with 8 threads if it's 100Gbps.
>
> I think I can still find some 100Gbps or even 200Gbps nics around our labs
> without having to wait for months.  If you want, I can try to see how we can
> test together.  And btw I don't think we need a cluster; IIUC we simply
> need two hosts with a 100G nic on each side?  IOW, it seems to me we only
> need two cards for the experiments, systems that can drive the cards, and a
> cable supporting 100G?

Yes, the simple setup can be just two directly connected hosts. This reminds
me, I may also be able to find a test setup with a 100G nic in our lab; I
will keep you posted.

Regards!



* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-05-01 15:59                                           ` Daniel P. Berrangé
  2024-05-01 16:16                                             ` Peter Xu
@ 2024-05-03  6:40                                             ` Jinpu Wang
  2024-05-03 14:33                                               ` Peter Xu
  1 sibling, 1 reply; 52+ messages in thread
From: Jinpu Wang @ 2024-05-03  6:40 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Peter Xu, Markus Armbruster, Michael Galaxy, Yu Zhang,
	Zhijian Li (Fujitsu),
	Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, arei.gonglei,
	pannengyuan

Hi Daniel,

On Wed, May 1, 2024 at 6:00 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > What worries me more is whether we really want to keep rdma in qemu,
> > which is also why I was asking for some serious performance
> > measurements comparing rdma vs. nics.  When I say "we" here, I mean
> > both the QEMU community and any company that will support keeping
> > rdma around.
> >
> > The problem is: if NICs are now fast enough to perform at least as well
> > as rdma, and if that comes with a lower overall maintenance cost, does
> > it mean that rdma migration will only be used by those who already have
> > it in their products and want to keep it?  In that case we should simply
> > ask new users to stick with tcp, and the number of rdma users should
> > only shrink, not grow.
> >
> > It also seems inevitable that most new migration features will not
> > support rdma: see how many old migration features we are dropping now
> > (which rdma _might_ still leverage, but maybe not), and how much of what
> > we add is multifd related and will probably not apply to rdma at all.
> > So in general what I worry about is a lose-lose situation, where the
> > company might find it easier to either stick with an old qemu
> > (depending on whether new features other than RDMA are needed), or
> > periodically rebase with RDMA kept downstream only.
>
> I don't know much about the origins of RDMA support in QEMU and why
> this particular design was chosen. It is indeed a huge maint burden to
> have a completely different code flow for RDMA with 4000+ lines of
> custom protocol signalling which is barely understandable.
>
> I would note that /usr/include/rdma/rsocket.h provides a higher level
> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> type could almost[1] trivially have supported RDMA. There would have
> been almost no RDMA code required in the migration subsystem, and all
> the modern features like compression, multifd, post-copy, etc would
> "just work".
I guess at the time rsocket was less mature and less performant than
using uverbs directly.



>
> I guess the 'rsocket.h' shim may well limit some of the possible
> performance gains, but it might still have been a better tradeoff
> to have not quite so good peak performance, but with massively
> less maint burden.
I had a brief look at the rsocket changelog, and there seem to have been
improvements over time, so it might be worth revisiting this. Due to the
socket abstraction we can't use some features such as ODP, and it won't
be a small or easy task.
> With regards,
> Daniel
Thanks for the suggestion.
>
> [1] "almost" trivially, because the poll() integration for rsockets
>     requires a bit more magic sauce since rsockets FDs are not
>     really FDs from the kernel's POV. Still, QIOChannel likely can
>     abstract that problem.
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>



* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-05-03  6:40                                             ` Jinpu Wang
@ 2024-05-03 14:33                                               ` Peter Xu
  2024-05-06 10:08                                                 ` Jinpu Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2024-05-03 14:33 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Daniel P. Berrangé,
	Markus Armbruster, Michael Galaxy, Yu Zhang, Zhijian Li (Fujitsu),
	Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, arei.gonglei,
	pannengyuan

On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote:
> I had a brief look at the rsocket changelog, and there seem to have been
> improvements over time, so it might be worth revisiting this. Due to the
> socket abstraction we can't use some features such as ODP, and it won't
> be a small or easy task.

It'll be good to know first whether Dan's suggestion would work, before
rewriting everything.  Not sure whether some perf tests of the rsocket
APIs, even without QEMU involved, could help (or looking for test data
that supports or invalidates such a conversion).

Thanks,

-- 
Peter Xu




* RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-05-01 15:31                                         ` Peter Xu
  2024-05-01 15:59                                           ` Daniel P. Berrangé
@ 2024-05-06  2:06                                           ` Gonglei (Arei) via
  2024-05-06 15:18                                             ` Peter Xu
  1 sibling, 1 reply; 52+ messages in thread
From: Gonglei (Arei) via @ 2024-05-06  2:06 UTC (permalink / raw)
  To: Peter Xu, Daniel P. Berrangé
  Cc: Markus Armbruster, Michael Galaxy, Yu Zhang, Zhijian Li (Fujitsu),
	Jinpu Wang, Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, Pannengyuan,
	Xiexiangyou

Hi, Peter

RDMA features high bandwidth, low latency (in a non-blocking lossless network), and direct remote
memory access that bypasses the CPU (as you know, CPU resources are expensive for cloud vendors,
which is one of the reasons why we introduced offload cards), none of which TCP offers.

In some scenarios where fast live migration is needed (extremely short interruption duration and migration
duration), it is very useful. To this end, we have also developed RDMA support for multifd.

Regards,
-Gonglei

> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Wednesday, May 1, 2024 11:31 PM
> To: Daniel P. Berrangé <berrange@redhat.com>
> Cc: Markus Armbruster <armbru@redhat.com>; Michael Galaxy
> <mgalaxy@akamai.com>; Yu Zhang <yu.zhang@ionos.com>; Zhijian Li (Fujitsu)
> <lizhijian@fujitsu.com>; Jinpu Wang <jinpu.wang@ionos.com>; Elmar Gerdes
> <elmar.gerdes@ionos.com>; qemu-devel@nongnu.org; Yuval Shaia
> <yuval.shaia.ml@gmail.com>; Kevin Wolf <kwolf@redhat.com>; Prasanna
> Kumar Kalever <prasanna.kalever@redhat.com>; Cornelia Huck
> <cohuck@redhat.com>; Michael Roth <michael.roth@amd.com>; Prasanna
> Kumar Kalever <prasanna4324@gmail.com>; integration@gluster.org; Paolo
> Bonzini <pbonzini@redhat.com>; qemu-block@nongnu.org;
> devel@lists.libvirt.org; Hanna Reitz <hreitz@redhat.com>; Michael S. Tsirkin
> <mst@redhat.com>; Thomas Huth <thuth@redhat.com>; Eric Blake
> <eblake@redhat.com>; Song Gao <gaosong@loongson.cn>; Marc-André
> Lureau <marcandre.lureau@redhat.com>; Alex Bennée
> <alex.bennee@linaro.org>; Wainer dos Santos Moschetta
> <wainersm@redhat.com>; Beraldo Leal <bleal@redhat.com>; Gonglei (Arei)
> <arei.gonglei@huawei.com>; Pannengyuan <pannengyuan@huawei.com>
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Tue, Apr 30, 2024 at 09:00:49AM +0100, Daniel P. Berrangé wrote:
> > On Tue, Apr 30, 2024 at 09:15:03AM +0200, Markus Armbruster wrote:
> > > Peter Xu <peterx@redhat.com> writes:
> > >
> > > > On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> > > >> Hi All (and Peter),
> > > >
> > > > Hi, Michael,
> > > >
> > > >>
> > > >> My name is Michael Galaxy (formerly Hines). Yes, I changed my
> > > >> last name (highly irregular for a male) and yes, that's my real last name:
> > > >> https://www.linkedin.com/in/mrgalaxy/)
> > > >>
> > > >> I'm the original author of the RDMA implementation. I've been
> > > >> discussing with Yu Zhang for a little bit about potentially
> > > >> handing over maintainership of the codebase to his team.
> > > >>
> > > >> I simply have zero access to RoCE or Infiniband hardware at all,
> > > >> unfortunately. so I've never been able to run tests or use what I
> > > >> wrote at work, and as all of you know, if you don't have a way to
> > > >> test something, then you can't maintain it.
> > > >>
> > > >> Yu Zhang put a (very kind) proposal forward to me to ask the
> > > >> community if they feel comfortable training his team to maintain
> > > >> the codebase (and run
> > > >> tests) while they learn about it.
> > > >
> > > > The "while learning" part is fine at least to me.  IMHO the
> > > > "ownership" to the code, or say, taking over the responsibility,
> > > > may or may not need 100% mastering the code base first.  There
> > > > should still be some fundamental confidence to work on the code
> > > > though as a starting point, then it's about serious use case to
> > > > back this up, and careful testings while getting more familiar with it.
> > >
> > > How much experience we expect of maintainers depends on the
> > > subsystem and other circumstances.  The hard requirement isn't
> > > experience, it's trust.  See the recent attack on xz.
> > >
> > > I do not mean to express any doubts whatsoever on Yu Zhang's integrity!
> > > I'm merely reminding y'all what's at stake.
> >
> > I think we shouldn't overly obsess[1] about 'xz', because the
> > overwhelmingly common scenario is that volunteer maintainers are
> > honest people. QEMU is in a massively better peer review situation.
> > With xz there was basically no oversight of the new maintainer. With
> > QEMU, we have oversight from 1000's of people on the list, a huge pool
> > of general maintainers, the specific migration maintainers, and the
> > release manager merging code.
> >
> > With a lack of historical experience with QEMU maintainership, I'd
> > suggest that new RDMA volunteers would start by adding themselves to
> > the "MAINTAINERS" file with only the 'Reviewer' classification. The
> > main migration maintainers would still handle pull requests, but wait
> > for a R-b from one of the RDMA volunteers. After some period of time
> > the RDMA folks could graduate to full maintainer status if the
> > migration maintainers needed to reduce their load. I suspect that
> > might prove unnecessary though, given RDMA isn't an area of code with
> > a high turnover of patches.
> 
> Right, and we can do that as a start, it also follows our normal rules of starting
> from Reviewers to maintain something.  I even considered Zhijian to be the
> previous rdma goto guy / maintainer no matter what role he used to have in
> the MAINTAINERS file.
> 
> Here IMHO it's more about whether any company would like to stand up and
> provide help, without yet binding that to be able to send pull requests in the
> near future or even longer term.
> 
> What I worry more is whether this is really what we want to keep rdma in
> qemu, and that's also why I was trying to request for some serious
> performance measurements comparing rdma v.s. nics.  And here when I said
> "we" I mean both QEMU community and any company that will support
> keeping rdma around.
> 
> The problem is if NICs now are fast enough to perform at least equally against
> rdma, and if it has a lower cost of overall maintenance, does it mean that rdma
> migration will only be used by whoever wants to keep them in the products and
> existed already?  In that case we should simply ask new users to stick with tcp,
> and rdma users should only drop but not increase.
> 
> It seems also destined that most new migration features will not support
> rdma: see how much we drop old features in migration now (which rdma
> _might_ still leverage, but maybe not), and how much we add mostly multifd
> relevant which will probably not apply to rdma at all.  So in general what I am
> worrying is a both-loss condition, if the company might be easier to either stick
> with an old qemu (depending on whether other new features are requested to
> be used besides RDMA alone), or do periodic rebase with RDMA downstream
> only.
> 
> So even if we want to keep RDMA around I hope with this chance we can at
> least have clear picture on when we should still suggest any new user to use
> RDMA (with the reasons behind).  Or we simply shouldn't suggest any new
> user to use RDMA at all (because at least it'll lose many new migration
> features).
> 
> Thanks,
> 
> --
> Peter Xu



* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-05-03 14:33                                               ` Peter Xu
@ 2024-05-06 10:08                                                 ` Jinpu Wang
  2024-05-06 15:28                                                   ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Jinpu Wang @ 2024-05-06 10:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrangé,
	Markus Armbruster, Michael Galaxy, Yu Zhang, Zhijian Li (Fujitsu),
	Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, arei.gonglei,
	pannengyuan

Hi Peter, hi Daniel,

On Fri, May 3, 2024 at 4:33 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote:
> > I had a brief look at the rsocket changelog, and there seem to have been
> > improvements over time, so it might be worth revisiting this. Due to the
> > socket abstraction we can't use some features such as ODP, and it won't
> > be a small or easy task.
>
> It'll be good to know first whether Dan's suggestion would work, before
> rewriting everything.  Not sure whether some perf tests of the rsocket
> APIs, even without QEMU involved, could help (or looking for test data
> that supports or invalidates such a conversion).
>
I did a quick test with iperf in a 100 G environment and a 40 G
environment; in summary, rsocket works pretty well.

iperf tests between 2 hosts with 40 G (IB): first a few tests with
different numbers of threads on top of the ipoib interface, then with
rsocket preloaded on top of the same ipoib interface.
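
The two kinds of runs differ only in the LD_PRELOAD environment
variable. A minimal Python sketch of how the client invocations could be
generated, not the script that was actually used: `iperf_client` is a
hypothetical helper, the library path is the one that appears in the
commands below and is distro-specific, and the server side is assumed to
be started with the same preload (not shown in this mail).

```python
import os

# Path copied from the commands in this mail; it differs between distros.
RSPRELOAD = "/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so"


def iperf_client(server, port=52000, streams=1, use_rsocket=False, base_env=None):
    """Return (argv, env) for one iperf client run.

    Hypothetical helper for illustration; not part of iperf or rdma-core.
    """
    argv = ["iperf", "-p", str(port), "-c", server]
    if streams > 1:
        # -P sets the number of parallel client streams.
        argv += ["-P", str(streams)]
    env = dict(os.environ if base_env is None else base_env)
    if use_rsocket:
        # The preload library intercepts socket calls and maps them to rsocket.
        env["LD_PRELOAD"] = RSPRELOAD
    return argv, env


if __name__ == "__main__":
    # Print the command lines only; nothing is executed here.
    for use_rsocket in (False, True):
        for streams in (1, 2, 4, 8, 16):
            argv, env = iperf_client("10.43.3.145", streams=streams,
                                     use_rsocket=use_rsocket)
            prefix = f"LD_PRELOAD={RSPRELOAD} " if use_rsocket else ""
            print(prefix + " ".join(argv))
```

Passing the returned pair to `subprocess.run(argv, env=env)` would
execute a run; the sketch only prints the command lines so it can be
checked without IB hardware.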

jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145
------------------------------------------------------------
Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  3] local 10.43.3.146 port 55602 connected with 10.43.3.145 port 52000
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0001 sec  2.85 GBytes  2.44 Gbits/sec
jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 2
------------------------------------------------------------
Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  4] local 10.43.3.146 port 39640 connected with 10.43.3.145 port 52000
[  3] local 10.43.3.146 port 39626 connected with 10.43.3.145 port 52000
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0012 sec  2.85 GBytes  2.45 Gbits/sec
[  4] 0.0000-10.0026 sec  2.86 GBytes  2.45 Gbits/sec
[SUM] 0.0000-10.0026 sec  5.71 GBytes  4.90 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.281/0.300/0.318/0.318 ms (tot/err) = 2/0
jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 4
------------------------------------------------------------
Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  4] local 10.43.3.146 port 46956 connected with 10.43.3.145 port 52000
[  6] local 10.43.3.146 port 46978 connected with 10.43.3.145 port 52000
[  3] local 10.43.3.146 port 46944 connected with 10.43.3.145 port 52000
[  5] local 10.43.3.146 port 46962 connected with 10.43.3.145 port 52000
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0017 sec  2.85 GBytes  2.45 Gbits/sec
[  4] 0.0000-10.0015 sec  2.85 GBytes  2.45 Gbits/sec
[  5] 0.0000-10.0026 sec  2.85 GBytes  2.45 Gbits/sec
[  6] 0.0000-10.0005 sec  2.85 GBytes  2.45 Gbits/sec
[SUM] 0.0000-10.0005 sec  11.4 GBytes  9.80 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.274/0.312/0.360/0.212 ms (tot/err) = 4/0
jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 8
------------------------------------------------------------
Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  7] local 10.43.3.146 port 35062 connected with 10.43.3.145 port 52000
[  6] local 10.43.3.146 port 35058 connected with 10.43.3.145 port 52000
[  8] local 10.43.3.146 port 35066 connected with 10.43.3.145 port 52000
[  9] local 10.43.3.146 port 35074 connected with 10.43.3.145 port 52000
[  3] local 10.43.3.146 port 35038 connected with 10.43.3.145 port 52000
[ 12] local 10.43.3.146 port 35088 connected with 10.43.3.145 port 52000
[  5] local 10.43.3.146 port 35048 connected with 10.43.3.145 port 52000
[  4] local 10.43.3.146 port 35050 connected with 10.43.3.145 port 52000
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.0005 sec  2.85 GBytes  2.44 Gbits/sec
[  8] 0.0000-10.0011 sec  2.85 GBytes  2.45 Gbits/sec
[  5] 0.0000-10.0000 sec  2.85 GBytes  2.45 Gbits/sec
[ 12] 0.0000-10.0021 sec  2.85 GBytes  2.44 Gbits/sec
[  3] 0.0000-10.0003 sec  2.85 GBytes  2.44 Gbits/sec
[  7] 0.0000-10.0065 sec  2.50 GBytes  2.14 Gbits/sec
[  9] 0.0000-10.0077 sec  2.52 GBytes  2.16 Gbits/sec
[  6] 0.0000-10.0003 sec  2.85 GBytes  2.44 Gbits/sec
[SUM] 0.0000-10.0003 sec  22.1 GBytes  19.0 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.096/0.226/0.339/0.109 ms (tot/err) = 8/0
jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 16
[  3] local 10.43.3.146 port 49540 connected with 10.43.3.145 port 52000
------------------------------------------------------------
Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  6] local 10.43.3.146 port 49554 connected with 10.43.3.145 port 52000
[  8] local 10.43.3.146 port 49584 connected with 10.43.3.145 port 52000
[  5] local 10.43.3.146 port 49552 connected with 10.43.3.145 port 52000
[ 20] local 10.43.3.146 port 49626 connected with 10.43.3.145 port 52000
[  4] local 10.43.3.146 port 49606 connected with 10.43.3.145 port 52000
[  9] local 10.43.3.146 port 49596 connected with 10.43.3.145 port 52000
[ 10] local 10.43.3.146 port 49604 connected with 10.43.3.145 port 52000
[ 26] local 10.43.3.146 port 49678 connected with 10.43.3.145 port 52000
[  7] local 10.43.3.146 port 49556 connected with 10.43.3.145 port 52000
[ 25] local 10.43.3.146 port 49662 connected with 10.43.3.145 port 52000
[ 22] local 10.43.3.146 port 49636 connected with 10.43.3.145 port 52000
[ 11] local 10.43.3.146 port 49612 connected with 10.43.3.145 port 52000
[ 13] local 10.43.3.146 port 49618 connected with 10.43.3.145 port 52000
[ 23] local 10.43.3.146 port 49646 connected with 10.43.3.145 port 52000
[ 15] local 10.43.3.146 port 49688 connected with 10.43.3.145 port 52000
[ ID] Interval       Transfer     Bandwidth
[ 11] 0.0000-10.0024 sec  2.28 GBytes  1.95 Gbits/sec
[ 23] 0.0000-10.0022 sec  2.28 GBytes  1.95 Gbits/sec
[ 20] 0.0000-10.0010 sec  2.28 GBytes  1.95 Gbits/sec
[  8] 0.0000-10.0032 sec  2.28 GBytes  1.95 Gbits/sec
[ 26] 0.0000-10.0038 sec  2.28 GBytes  1.95 Gbits/sec
[ 10] 0.0000-10.0002 sec  2.28 GBytes  1.95 Gbits/sec
[  7] 0.0000-10.0033 sec  2.28 GBytes  1.95 Gbits/sec
[ 15] 0.0000-10.0015 sec  2.27 GBytes  1.95 Gbits/sec
[  4] 0.0000-10.0028 sec  2.28 GBytes  1.95 Gbits/sec
[  6] 0.0000-10.0012 sec  2.28 GBytes  1.96 Gbits/sec
[ 13] 0.0000-10.0030 sec  2.28 GBytes  1.95 Gbits/sec
[ 25] 0.0000-10.0051 sec  2.28 GBytes  1.95 Gbits/sec
[  5] 0.0000-10.0001 sec  2.28 GBytes  1.96 Gbits/sec
[  9] 0.0000-10.0017 sec  2.28 GBytes  1.95 Gbits/sec
[ 22] 0.0000-10.0008 sec  2.27 GBytes  1.95 Gbits/sec
[  3] 0.0000-10.0033 sec  2.28 GBytes  1.95 Gbits/sec
[SUM] 0.0000-10.0034 sec  36.4 GBytes  31.3 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.105/0.217/0.401/0.093 ms (tot/err) = 16/0
jwang@ps401a-914.nst:~$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p 52000 -c 10.43.3.145 -P 16
------------------------------------------------------------
Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  3] local 10.43.3.146 port 48902 connected with 10.43.3.145 port 52000
[  5] local 10.43.3.146 port 52777 connected with 10.43.3.145 port 52000
[  9] local 10.43.3.146 port 42911 connected with 10.43.3.145 port 52000
[ 11] local 10.43.3.146 port 56354 connected with 10.43.3.145 port 52000
[ 15] local 10.43.3.146 port 43325 connected with 10.43.3.145 port 52000
[  6] local 10.43.3.146 port 37041 connected with 10.43.3.145 port 52000
[  7] local 10.43.3.146 port 58828 connected with 10.43.3.145 port 52000
[ 17] local 10.43.3.146 port 48858 connected with 10.43.3.145 port 52000
[ 13] local 10.43.3.146 port 49256 connected with 10.43.3.145 port 52000
[ 16] local 10.43.3.146 port 35652 connected with 10.43.3.145 port 52000
[  8] local 10.43.3.146 port 48567 connected with 10.43.3.145 port 52000
[ 18] local 10.43.3.146 port 47394 connected with 10.43.3.145 port 52000
[ 19] local 10.43.3.146 port 48065 connected with 10.43.3.145 port 52000
[ 10] local 10.43.3.146 port 39788 connected with 10.43.3.145 port 52000
[  4] local 10.43.3.146 port 46818 connected with 10.43.3.145 port 52000
[ 14] local 10.43.3.146 port 57174 connected with 10.43.3.145 port 52000
[ ID] Interval       Transfer     Bandwidth
[ 14] 0.0000-10.0002 sec  2.30 GBytes  1.98 Gbits/sec
[  6] 0.0000-10.0004 sec  2.31 GBytes  1.98 Gbits/sec
[  5] 0.0000-10.0002 sec  2.31 GBytes  1.98 Gbits/sec
[  8] 0.0000-10.0001 sec  2.31 GBytes  1.98 Gbits/sec
[ 11] 0.0000-10.0003 sec  2.31 GBytes  1.98 Gbits/sec
[ 18] 0.0000-10.0003 sec  2.31 GBytes  1.98 Gbits/sec
[  3] 0.0000-10.0004 sec  2.31 GBytes  1.98 Gbits/sec
[  4] 0.0000-10.0005 sec  2.30 GBytes  1.98 Gbits/sec
[ 17] 0.0000-10.0004 sec  2.31 GBytes  1.98 Gbits/sec
[ 15] 0.0000-10.0005 sec  2.31 GBytes  1.98 Gbits/sec
[ 19] 0.0000-10.0001 sec  2.30 GBytes  1.98 Gbits/sec
[  7] 0.0000-10.0004 sec  2.31 GBytes  1.98 Gbits/sec
[ 13] 0.0000-10.0005 sec  2.31 GBytes  1.98 Gbits/sec
[ 10] 0.0000-10.0003 sec  2.30 GBytes  1.98 Gbits/sec
[  9] 0.0000-10.0000 sec  2.31 GBytes  1.98 Gbits/sec
[ 16] 0.0000-10.0002 sec  2.31 GBytes  1.98 Gbits/sec
[SUM] 0.0000-10.0003 sec  36.9 GBytes  31.7 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
88.398/101.706/114.726/24.755 ms (tot/err) = 16/0
jwang@ps401a-914.nst:~$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p 52000 -c 10.43.3.145 -P 1
------------------------------------------------------------
Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  3] local 10.43.3.146 port 49168 connected with 10.43.3.145 port 52000
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0000 sec  34.3 GBytes  29.5 Gbits/sec
jwang@ps401a-914.nst:~$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p 52000 -c 10.43.3.145 -P 2
------------------------------------------------------------
Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  3] local 10.43.3.146 port 42096 connected with 10.43.3.145 port 52000
[  4] local 10.43.3.146 port 58667 connected with 10.43.3.145 port 52000
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.0001 sec  18.4 GBytes  15.8 Gbits/sec
[  3] 0.0000-10.0000 sec  18.5 GBytes  15.9 Gbits/sec
[SUM] 0.0000-10.0001 sec  36.9 GBytes  31.7 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
38.155/38.997/39.839/39.839 ms (tot/err) = 2/0
jwang@ps401a-914.nst:~$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p 52000 -c 10.43.3.145 -P 4
------------------------------------------------------------
Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  3] local 10.43.3.146 port 36100 connected with 10.43.3.145 port 52000
[  5] local 10.43.3.146 port 55108 connected with 10.43.3.145 port 52000
[  6] local 10.43.3.146 port 41039 connected with 10.43.3.145 port 52000
[  7] local 10.43.3.146 port 34868 connected with 10.43.3.145 port 52000
[ ID] Interval       Transfer     Bandwidth
[  7] 0.0000-10.0000 sec  9.22 GBytes  7.92 Gbits/sec
[  5] 0.0000-10.0000 sec  9.22 GBytes  7.92 Gbits/sec
[  3] 0.0000-10.0000 sec  9.22 GBytes  7.92 Gbits/sec
[  6] 0.0000-10.0001 sec  9.22 GBytes  7.92 Gbits/sec
[SUM] 0.0000-10.0001 sec  36.9 GBytes  31.7 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
51.401/53.644/56.015/30.487 ms (tot/err) = 4/0

You can see that rsocket reaches ~ 30 Gb/s with a single stream, while
ipoib reaches only ~ 2.5 Gb/s (a 12x difference); ipoib scales with more
threads up to ~ 32 Gb/s, which is the link limit.
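
The 12x figure and the ipoib scaling can be rechecked from the totals in
the output above. A small Python sketch; the dictionaries are just the
single-stream and [SUM] values copied by hand, and `speedup` and
`per_stream` are illustrative names.

```python
# Aggregate bandwidth in Gbits/sec on the 40 G IB link, copied from the
# iperf output above and keyed by the number of parallel streams (-P).
# rsocket was not run with -P 8 on this link.
ipoib = {1: 2.44, 2: 4.90, 4: 9.80, 8: 19.0, 16: 31.3}
rsocket = {1: 29.5, 2: 31.7, 4: 31.7, 16: 31.7}


def speedup(streams):
    """rsocket aggregate bandwidth relative to ipoib for the same -P."""
    return rsocket[streams] / ipoib[streams]


def per_stream(results):
    """Average bandwidth per stream for each stream count."""
    return {n: total / n for n, total in results.items()}


if __name__ == "__main__":
    for n in sorted(rsocket):
        print(f"-P {n:2d}: ipoib {ipoib[n]:5.2f}  rsocket {rsocket[n]:5.2f}  "
              f"ratio {speedup(n):.1f}x")
    # ipoib stays near 2.45 Gbits/sec per stream up to -P 4 and drops
    # below 2 Gbits/sec per stream at -P 16, where the link is saturated.
    for n, bw in sorted(per_stream(ipoib).items()):
        print(f"ipoib -P {n:2d}: {bw:.2f} Gbits/sec per stream")
```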

In the 100 G environment, rsocket also outperforms ipoib, see below:


jwang@ps404a-59.stg:~$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p 52000 -c 10.43.48.58
------------------------------------------------------------
Client connecting to 10.43.48.58, TCP port 52000
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  3] local 10.43.48.59 port 40588 connected with 10.43.48.58 port 52000
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0000 sec  80.7 GBytes  69.4 Gbits/sec
jwang@ps404a-59.stg:~$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p 52000 -c 10.43.48.58 -P 2
------------------------------------------------------------
Client connecting to 10.43.48.58, TCP port 52000
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  3] local 10.43.48.59 port 41813 connected with 10.43.48.58 port 52000
[  5] local 10.43.48.59 port 60638 connected with 10.43.48.58 port 52000
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.0000 sec  48.9 GBytes  42.0 Gbits/sec
[  3] 0.0000-10.0000 sec  49.8 GBytes  42.8 Gbits/sec
[SUM] 0.0000-10.0000 sec  98.7 GBytes  84.8 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
6.962/7.764/8.567/8.567 ms (tot/err) = 2/0
jwang@ps404a-59.stg:~$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p 52000 -c 10.43.48.58 -P 4
[  6] local 10.43.48.59 port 58086 connected with 10.43.48.58 port 52000
[  3] local 10.43.48.59 port 49335 connected with 10.43.48.58 port 52000
------------------------------------------------------------
Client connecting to 10.43.48.58, TCP port 52000
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  4] local 10.43.48.59 port 44593 connected with 10.43.48.58 port 52000
[  5] local 10.43.48.59 port 60464 connected with 10.43.48.58 port 52000
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.0000 sec  28.0 GBytes  24.0 Gbits/sec
[  4] 0.0000-10.0000 sec  28.0 GBytes  24.0 Gbits/sec
[  3] 0.0000-10.0000 sec  28.0 GBytes  24.1 Gbits/sec
[  6] 0.0000-10.0000 sec  28.0 GBytes  24.1 Gbits/sec
[SUM] 0.0000-10.0001 sec   112 GBytes  96.3 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
7.344/9.619/12.199/5.271 ms (tot/err) = 4/0
jwang@ps404a-59.stg:~$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p 52000 -c 10.43.48.58 -P 8
[  3] local 10.43.48.59 port 43020 connected with 10.43.48.58 port 52000
[  7] local 10.43.48.59 port 59720 connected with 10.43.48.58 port 52000
[  4] local 10.43.48.59 port 52547 connected with 10.43.48.58 port 52000
[  8] local 10.43.48.59 port 41712 connected with 10.43.48.58 port 52000
[ 10] local 10.43.48.59 port 53126 connected with 10.43.48.58 port 52000
------------------------------------------------------------
Client connecting to 10.43.48.58, TCP port 52000
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  6] local 10.43.48.59 port 60311 connected with 10.43.48.58 port 52000
[  5] local 10.43.48.59 port 44103 connected with 10.43.48.58 port 52000
[  9] local 10.43.48.59 port 49007 connected with 10.43.48.58 port 52000
[ ID] Interval       Transfer     Bandwidth
[  9] 0.0000-10.0001 sec  14.0 GBytes  12.0 Gbits/sec
[  8] 0.0000-10.0000 sec  14.0 GBytes  12.0 Gbits/sec
[  4] 0.0000-10.0001 sec  14.0 GBytes  12.0 Gbits/sec
[  6] 0.0000-10.0000 sec  14.0 GBytes  12.0 Gbits/sec
[ 10] 0.0000-10.0000 sec  14.0 GBytes  12.0 Gbits/sec
[  7] 0.0000-10.0000 sec  14.0 GBytes  12.0 Gbits/sec
[  5] 0.0000-10.0001 sec  14.0 GBytes  12.0 Gbits/sec
[  3] 0.0000-10.0001 sec  14.0 GBytes  12.0 Gbits/sec
[SUM] 0.0000-10.0001 sec   112 GBytes  96.3 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
6.942/12.361/18.109/4.872 ms (tot/err) = 8/0
jwang@ps404a-59.stg:~$ iperf -p 52000 -c 10.43.48.58 -P 8
------------------------------------------------------------
Client connecting to 10.43.48.58, TCP port 52000
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  4] local 10.43.48.59 port 58176 connected with 10.43.48.58 port 52000
[  5] local 10.43.48.59 port 58180 connected with 10.43.48.58 port 52000
[  3] local 10.43.48.59 port 58178 connected with 10.43.48.58 port 52000
[ 10] local 10.43.48.59 port 58226 connected with 10.43.48.58 port 52000
[ 11] local 10.43.48.59 port 58228 connected with 10.43.48.58 port 52000
[  9] local 10.43.48.59 port 58212 connected with 10.43.48.58 port 52000
[  7] local 10.43.48.59 port 58194 connected with 10.43.48.58 port 52000
[  8] local 10.43.48.59 port 58198 connected with 10.43.48.58 port 52000
[ ID] Interval       Transfer     Bandwidth
[  9] 0.0000-10.0005 sec  15.8 GBytes  13.5 Gbits/sec
[  4] 0.0000-10.0002 sec  15.8 GBytes  13.6 Gbits/sec
[  3] 0.0000-10.0000 sec  15.8 GBytes  13.6 Gbits/sec
[  5] 0.0000-10.0002 sec  15.8 GBytes  13.6 Gbits/sec
[  8] 0.0000-10.0005 sec  7.89 GBytes  6.78 Gbits/sec
[ 10] 0.0000-10.0000 sec  15.8 GBytes  13.6 Gbits/sec
[ 11] 0.0000-10.0014 sec  7.94 GBytes  6.82 Gbits/sec
[  7] 0.0000-10.0009 sec  15.8 GBytes  13.6 Gbits/sec
[SUM] 0.0000-10.0009 sec   111 GBytes  95.1 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.234/0.325/0.406/0.155 ms (tot/err) = 8/0
jwang@ps404a-59.stg:~$ iperf -p 52000 -c 10.43.48.58 -P 4
[  3] local 10.43.48.59 port 42548 connected with 10.43.48.58 port 52000
------------------------------------------------------------
Client connecting to 10.43.48.58, TCP port 52000
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  4] local 10.43.48.59 port 42558 connected with 10.43.48.58 port 52000
[  5] local 10.43.48.59 port 42560 connected with 10.43.48.58 port 52000
[  6] local 10.43.48.59 port 42562 connected with 10.43.48.58 port 52000
[ ID] Interval       Transfer     Bandwidth
[  6] 0.0000-10.0000 sec  27.8 GBytes  23.9 Gbits/sec
[  5] 0.0000-10.0001 sec  27.3 GBytes  23.5 Gbits/sec
[  3] 0.0000-10.0001 sec  27.8 GBytes  23.9 Gbits/sec
[  4] 0.0000-10.0001 sec  27.8 GBytes  23.9 Gbits/sec
[SUM] 0.0000-10.0001 sec   111 GBytes  95.1 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.295/0.340/0.390/0.201 ms (tot/err) = 4/0
jwang@ps404a-59.stg:~$ iperf -p 52000 -c 10.43.48.58 -P 2
------------------------------------------------------------
Client connecting to 10.43.48.58, TCP port 52000
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  4] local 10.43.48.59 port 44194 connected with 10.43.48.58 port 52000
[  3] local 10.43.48.59 port 44186 connected with 10.43.48.58 port 52000
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0000 sec  48.3 GBytes  41.5 Gbits/sec
[  4] 0.0000-10.0000 sec  41.3 GBytes  35.5 Gbits/sec
[SUM] 0.0000-10.0000 sec  89.7 GBytes  77.0 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.227/0.233/0.240/0.240 ms (tot/err) = 2/0
jwang@ps404a-59.stg:~$ pbkvm list
 VM  State  PID  Cores  Mem  VNC  Migration
--------------------------------------------

Total: 0 VMs, Running: 0
jwang@ps404a-59.stg:~$ iperf -p 52000 -c 10.43.48.58 -P 1
------------------------------------------------------------
Client connecting to 10.43.48.58, TCP port 52000
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  3] local 10.43.48.59 port 40364 connected with 10.43.48.58 port 52000
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0000 sec  51.2 GBytes  44.0 Gbits/sec
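
For a side-by-side view of the 100 G runs above, the bandwidth column
can be pulled out of each result line. A Python sketch under the
assumption that the lines keep the format shown in this mail; `gbits`
and `ratios` are illustrative names, and the strings are the
single-stream or [SUM] line of each run, copied verbatim.

```python
import re

# Matches the trailing bandwidth column of an iperf result line.
_BW = re.compile(r"(\d+(?:\.\d+)?)\s+Gbits/sec\s*$")


def gbits(line):
    """Return the bandwidth in Gbits/sec from one iperf line, or None."""
    m = _BW.search(line)
    return float(m.group(1)) if m else None


# Result line of each 100 G run, keyed by the -P value.
rsocket = {
    1: "[  3] 0.0000-10.0000 sec  80.7 GBytes  69.4 Gbits/sec",
    2: "[SUM] 0.0000-10.0000 sec  98.7 GBytes  84.8 Gbits/sec",
    4: "[SUM] 0.0000-10.0001 sec   112 GBytes  96.3 Gbits/sec",
    8: "[SUM] 0.0000-10.0001 sec   112 GBytes  96.3 Gbits/sec",
}
ipoib = {
    1: "[  3] 0.0000-10.0000 sec  51.2 GBytes  44.0 Gbits/sec",
    2: "[SUM] 0.0000-10.0000 sec  89.7 GBytes  77.0 Gbits/sec",
    4: "[SUM] 0.0000-10.0001 sec   111 GBytes  95.1 Gbits/sec",
    8: "[SUM] 0.0000-10.0009 sec   111 GBytes  95.1 Gbits/sec",
}


def ratios():
    """rsocket bandwidth divided by ipoib bandwidth per stream count."""
    return {n: gbits(rsocket[n]) / gbits(ipoib[n]) for n in rsocket}


if __name__ == "__main__":
    for n, r in sorted(ratios().items()):
        print(f"-P {n}: rsocket {gbits(rsocket[n]):5.1f}  "
              f"ipoib {gbits(ipoib[n]):5.1f}  ratio {r:.2f}x")
```

The gap is largest with a single stream and nearly disappears from
-P 4 onwards, where both variants are close to the link limit.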

Thanks!


> Thanks,
>
> --
> Peter Xu
>



* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-05-06  2:06                                           ` Gonglei (Arei) via
@ 2024-05-06 15:18                                             ` Peter Xu
  2024-05-07  1:50                                               ` Gonglei (Arei) via
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2024-05-06 15:18 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: Daniel P. Berrangé,
	Markus Armbruster, Michael Galaxy, Yu Zhang, Zhijian Li (Fujitsu),
	Jinpu Wang, Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, Pannengyuan,
	Xiexiangyou

On Mon, May 06, 2024 at 02:06:28AM +0000, Gonglei (Arei) wrote:
> Hi, Peter

Hey, Lei,

Happy to see you around again after years.

> RDMA features high bandwidth, low latency (in non-blocking lossless
> network), and direct remote memory access by bypassing the CPU (As you
> know, CPU resources are expensive for cloud vendors, which is one of the
> reasons why we introduced offload cards.), which TCP does not have.

Isn't using offload cards just another cost, vs. preparing more CPU resources?

> In some scenarios where fast live migration is needed (extremely short
> interruption duration and migration duration), RDMA is very useful. To this
> end, we have also developed RDMA support for multifd.

Will any of you upstream that work?  I'm curious how intrusive it would be
to add it to multifd; if it can keep to only 5 exported functions, like
what rdma.h does right now, that'll be pretty nice.  We also want to make
sure it works with arbitrarily sized loads and buffers, e.g. vfio is
considering adding IO loads to multifd channels too.

One thing to note is that the question here is not purely a performance
comparison between rdma and nics.  It's about helping us decide whether to
drop rdma; iow, even if rdma performs well, the community still has the
right to drop it if nobody can actively work on and maintain it.  It's just
that if nics can perform as well, that's one more reason to drop it, unless
companies can help to provide good support and work together.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-05-06 10:08                                                 ` Jinpu Wang
@ 2024-05-06 15:28                                                   ` Peter Xu
  2024-05-07  4:52                                                     ` Jinpu Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2024-05-06 15:28 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Daniel P. Berrangé,
	Markus Armbruster, Michael Galaxy, Yu Zhang, Zhijian Li (Fujitsu),
	Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, arei.gonglei,
	pannengyuan

On Mon, May 06, 2024 at 12:08:43PM +0200, Jinpu Wang wrote:
> Hi Peter, hi Daniel,

Hi, Jinpu,

Thanks for sharing these test results.  Sounds like great news.

What's your plan next?  Would it then be worthwhile / possible to move QEMU
in that direction?  Would that greatly simplify the rdma code, as Dan
mentioned?

Thanks,

-- 
Peter Xu




* RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-05-06 15:18                                             ` Peter Xu
@ 2024-05-07  1:50                                               ` Gonglei (Arei) via
  2024-05-07 16:28                                                 ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Gonglei (Arei) via @ 2024-05-07  1:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrangé,
	Markus Armbruster, Michael Galaxy, Yu Zhang, Zhijian Li (Fujitsu),
	Jinpu Wang, Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, Pannengyuan,
	Xiexiangyou, zhengchuan

Hello,

> On Mon, May 06, 2024 at 02:06:28AM +0000, Gonglei (Arei) wrote:
> > Hi, Peter
> 
> Hey, Lei,
> 
> Happy to see you around again after years.
> 
Haha, me too.

> > RDMA features high bandwidth, low latency (in non-blocking lossless
> > network), and direct remote memory access by bypassing the CPU (As you
> > know, CPU resources are expensive for cloud vendors, which is one of
> > the reasons why we introduced offload cards.), which TCP does not have.
> 
> Isn't using offload cards just another cost, vs. preparing more CPU resources?
> 
A converged software and hardware offload architecture is the way to go for
all cloud vendors (considering the comprehensive benefits in terms of
performance, cost, security, and innovation speed); it's not just a matter
of adding a DPU card as one more resource.

> > In some scenarios where fast live migration is needed (extremely short
> > interruption duration and migration duration), RDMA is very useful. To
> > this end, we have also developed RDMA support for multifd.
> 
> Will any of you upstream that work?  I'm curious how intrusive it would be
> to add it to multifd; if it can keep to only 5 exported functions, like
> what rdma.h does right now, that'll be pretty nice.  We also want to make
> sure it works with arbitrarily sized loads and buffers, e.g. vfio is
> considering adding IO loads to multifd channels too.
> 

In fact, we sent the patchset to the community in 2021. Please see:
https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/


> One thing to note is that the question here is not purely a performance
> comparison between rdma and nics.  It's about helping us decide whether to
> drop rdma; iow, even if rdma performs well, the community still has the
> right to drop it if nobody can actively work on and maintain it.  It's just
> that if nics can perform as well, that's one more reason to drop it, unless
> companies can help to provide good support and work together.
> 

We are happy to provide the necessary review and maintenance work for RDMA
if the community needs it.

CC'ing Chuan Zheng.


Regards,
-Gonglei



* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-05-06 15:28                                                   ` Peter Xu
@ 2024-05-07  4:52                                                     ` Jinpu Wang
  2024-05-08 10:06                                                       ` Daniel P. Berrangé
  0 siblings, 1 reply; 52+ messages in thread
From: Jinpu Wang @ 2024-05-07  4:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrangé,
	Markus Armbruster, Michael Galaxy, Yu Zhang, Zhijian Li (Fujitsu),
	Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, arei.gonglei,
	pannengyuan

Hi Peter, hi Daniel,
On Mon, May 6, 2024 at 5:29 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, May 06, 2024 at 12:08:43PM +0200, Jinpu Wang wrote:
> > Hi Peter, hi Daniel,
>
> Hi, Jinpu,
>
> Thanks for sharing these test results.  Sounds like great news.
>
> What's your plan next?  Would it then be worthwhile / possible to move QEMU
> in that direction?  Would that greatly simplify the rdma code, as Dan
> mentioned?
I'm not very familiar with QEMU migration yet. From the test results, I
think it's a possible direction; we just need to base it on at least a
fairly recent release, like rdma-core v33 with proper 'fork' support.

Maybe Dan or you could give more detail about what you have in mind
for using rsocket as a replacement for the future.
We will also look into the implementation details in the meantime.

Thx!
J

>
> Thanks,
>
> >
> > On Fri, May 3, 2024 at 4:33 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote:
> > > > I had a brief check in the rsocket changelog, there seems some
> > > > improvement over time,
> > > >  might be worth revisiting this. due to socket abstraction, we can't
> > > > use some feature like
> > > >  ODP, it won't be a small and easy task.
> > >
> > > It'll be good to know whether Dan's suggestion would work first, without
> > > rewritting everything yet so far.  Not sure whether some perf test could
> > > help with the rsocket APIs even without QEMU's involvements (or looking for
> > > test data supporting / invalidate such conversions).
> > >
> > I did a quick test with iperf on 100 G environment and 40 G
> > environment, in summary rsocket works pretty well.
> >
> > iperf tests between 2 hosts with 40 G (IB),
> > first  a few test with different num. of threads on top of ipoib
> > interface, later with preload rsocket on top of same ipoib interface.
> >
> > jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145
> > ------------------------------------------------------------
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  165 KByte (default)
> > ------------------------------------------------------------
> > [  3] local 10.43.3.146 port 55602 connected with 10.43.3.145 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  3] 0.0000-10.0001 sec  2.85 GBytes  2.44 Gbits/sec
> > jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 2
> > ------------------------------------------------------------
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  165 KByte (default)
> > ------------------------------------------------------------
> > [  4] local 10.43.3.146 port 39640 connected with 10.43.3.145 port 52000
> > [  3] local 10.43.3.146 port 39626 connected with 10.43.3.145 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  3] 0.0000-10.0012 sec  2.85 GBytes  2.45 Gbits/sec
> > [  4] 0.0000-10.0026 sec  2.86 GBytes  2.45 Gbits/sec
> > [SUM] 0.0000-10.0026 sec  5.71 GBytes  4.90 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 0.281/0.300/0.318/0.318 ms (tot/err) = 2/0
> > jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 4
> > ------------------------------------------------------------
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  165 KByte (default)
> > ------------------------------------------------------------
> > [  4] local 10.43.3.146 port 46956 connected with 10.43.3.145 port 52000
> > [  6] local 10.43.3.146 port 46978 connected with 10.43.3.145 port 52000
> > [  3] local 10.43.3.146 port 46944 connected with 10.43.3.145 port 52000
> > [  5] local 10.43.3.146 port 46962 connected with 10.43.3.145 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  3] 0.0000-10.0017 sec  2.85 GBytes  2.45 Gbits/sec
> > [  4] 0.0000-10.0015 sec  2.85 GBytes  2.45 Gbits/sec
> > [  5] 0.0000-10.0026 sec  2.85 GBytes  2.45 Gbits/sec
> > [  6] 0.0000-10.0005 sec  2.85 GBytes  2.45 Gbits/sec
> > [SUM] 0.0000-10.0005 sec  11.4 GBytes  9.80 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 0.274/0.312/0.360/0.212 ms (tot/err) = 4/0
> > jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 8
> > ------------------------------------------------------------
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  165 KByte (default)
> > ------------------------------------------------------------
> > [  7] local 10.43.3.146 port 35062 connected with 10.43.3.145 port 52000
> > [  6] local 10.43.3.146 port 35058 connected with 10.43.3.145 port 52000
> > [  8] local 10.43.3.146 port 35066 connected with 10.43.3.145 port 52000
> > [  9] local 10.43.3.146 port 35074 connected with 10.43.3.145 port 52000
> > [  3] local 10.43.3.146 port 35038 connected with 10.43.3.145 port 52000
> > [ 12] local 10.43.3.146 port 35088 connected with 10.43.3.145 port 52000
> > [  5] local 10.43.3.146 port 35048 connected with 10.43.3.145 port 52000
> > [  4] local 10.43.3.146 port 35050 connected with 10.43.3.145 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  4] 0.0000-10.0005 sec  2.85 GBytes  2.44 Gbits/sec
> > [  8] 0.0000-10.0011 sec  2.85 GBytes  2.45 Gbits/sec
> > [  5] 0.0000-10.0000 sec  2.85 GBytes  2.45 Gbits/sec
> > [ 12] 0.0000-10.0021 sec  2.85 GBytes  2.44 Gbits/sec
> > [  3] 0.0000-10.0003 sec  2.85 GBytes  2.44 Gbits/sec
> > [  7] 0.0000-10.0065 sec  2.50 GBytes  2.14 Gbits/sec
> > [  9] 0.0000-10.0077 sec  2.52 GBytes  2.16 Gbits/sec
> > [  6] 0.0000-10.0003 sec  2.85 GBytes  2.44 Gbits/sec
> > [SUM] 0.0000-10.0003 sec  22.1 GBytes  19.0 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 0.096/0.226/0.339/0.109 ms (tot/err) = 8/0
> > jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 16
> > [  3] local 10.43.3.146 port 49540 connected with 10.43.3.145 port 52000
> > ------------------------------------------------------------
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  165 KByte (default)
> > ------------------------------------------------------------
> > [  6] local 10.43.3.146 port 49554 connected with 10.43.3.145 port 52000
> > [  8] local 10.43.3.146 port 49584 connected with 10.43.3.145 port 52000
> > [  5] local 10.43.3.146 port 49552 connected with 10.43.3.145 port 52000
> > [ 20] local 10.43.3.146 port 49626 connected with 10.43.3.145 port 52000
> > [  4] local 10.43.3.146 port 49606 connected with 10.43.3.145 port 52000
> > [  9] local 10.43.3.146 port 49596 connected with 10.43.3.145 port 52000
> > [ 10] local 10.43.3.146 port 49604 connected with 10.43.3.145 port 52000
> > [ 26] local 10.43.3.146 port 49678 connected with 10.43.3.145 port 52000
> > [  7] local 10.43.3.146 port 49556 connected with 10.43.3.145 port 52000
> > [ 25] local 10.43.3.146 port 49662 connected with 10.43.3.145 port 52000
> > [ 22] local 10.43.3.146 port 49636 connected with 10.43.3.145 port 52000
> > [ 11] local 10.43.3.146 port 49612 connected with 10.43.3.145 port 52000
> > [ 13] local 10.43.3.146 port 49618 connected with 10.43.3.145 port 52000
> > [ 23] local 10.43.3.146 port 49646 connected with 10.43.3.145 port 52000
> > [ 15] local 10.43.3.146 port 49688 connected with 10.43.3.145 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [ 11] 0.0000-10.0024 sec  2.28 GBytes  1.95 Gbits/sec
> > [ 23] 0.0000-10.0022 sec  2.28 GBytes  1.95 Gbits/sec
> > [ 20] 0.0000-10.0010 sec  2.28 GBytes  1.95 Gbits/sec
> > [  8] 0.0000-10.0032 sec  2.28 GBytes  1.95 Gbits/sec
> > [ 26] 0.0000-10.0038 sec  2.28 GBytes  1.95 Gbits/sec
> > [ 10] 0.0000-10.0002 sec  2.28 GBytes  1.95 Gbits/sec
> > [  7] 0.0000-10.0033 sec  2.28 GBytes  1.95 Gbits/sec
> > [ 15] 0.0000-10.0015 sec  2.27 GBytes  1.95 Gbits/sec
> > [  4] 0.0000-10.0028 sec  2.28 GBytes  1.95 Gbits/sec
> > [  6] 0.0000-10.0012 sec  2.28 GBytes  1.96 Gbits/sec
> > [ 13] 0.0000-10.0030 sec  2.28 GBytes  1.95 Gbits/sec
> > [ 25] 0.0000-10.0051 sec  2.28 GBytes  1.95 Gbits/sec
> > [  5] 0.0000-10.0001 sec  2.28 GBytes  1.96 Gbits/sec
> > [  9] 0.0000-10.0017 sec  2.28 GBytes  1.95 Gbits/sec
> > [ 22] 0.0000-10.0008 sec  2.27 GBytes  1.95 Gbits/sec
> > [  3] 0.0000-10.0033 sec  2.28 GBytes  1.95 Gbits/sec
> > [SUM] 0.0000-10.0034 sec  36.4 GBytes  31.3 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 0.105/0.217/0.401/0.093 ms (tot/err) = 16/0
> > jwang@ps401a-914.nst:~$
> > LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
> > 52000 -c 10.43.3.145 -P 16
> > ------------------------------------------------------------
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  128 KByte (default)
> > ------------------------------------------------------------
> > [  3] local 10.43.3.146 port 48902 connected with 10.43.3.145 port 52000
> > [  5] local 10.43.3.146 port 52777 connected with 10.43.3.145 port 52000
> > [  9] local 10.43.3.146 port 42911 connected with 10.43.3.145 port 52000
> > [ 11] local 10.43.3.146 port 56354 connected with 10.43.3.145 port 52000
> > [ 15] local 10.43.3.146 port 43325 connected with 10.43.3.145 port 52000
> > [  6] local 10.43.3.146 port 37041 connected with 10.43.3.145 port 52000
> > [  7] local 10.43.3.146 port 58828 connected with 10.43.3.145 port 52000
> > [ 17] local 10.43.3.146 port 48858 connected with 10.43.3.145 port 52000
> > [ 13] local 10.43.3.146 port 49256 connected with 10.43.3.145 port 52000
> > [ 16] local 10.43.3.146 port 35652 connected with 10.43.3.145 port 52000
> > [  8] local 10.43.3.146 port 48567 connected with 10.43.3.145 port 52000
> > [ 18] local 10.43.3.146 port 47394 connected with 10.43.3.145 port 52000
> > [ 19] local 10.43.3.146 port 48065 connected with 10.43.3.145 port 52000
> > [ 10] local 10.43.3.146 port 39788 connected with 10.43.3.145 port 52000
> > [  4] local 10.43.3.146 port 46818 connected with 10.43.3.145 port 52000
> > [ 14] local 10.43.3.146 port 57174 connected with 10.43.3.145 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [ 14] 0.0000-10.0002 sec  2.30 GBytes  1.98 Gbits/sec
> > [  6] 0.0000-10.0004 sec  2.31 GBytes  1.98 Gbits/sec
> > [  5] 0.0000-10.0002 sec  2.31 GBytes  1.98 Gbits/sec
> > [  8] 0.0000-10.0001 sec  2.31 GBytes  1.98 Gbits/sec
> > [ 11] 0.0000-10.0003 sec  2.31 GBytes  1.98 Gbits/sec
> > [ 18] 0.0000-10.0003 sec  2.31 GBytes  1.98 Gbits/sec
> > [  3] 0.0000-10.0004 sec  2.31 GBytes  1.98 Gbits/sec
> > [  4] 0.0000-10.0005 sec  2.30 GBytes  1.98 Gbits/sec
> > [ 17] 0.0000-10.0004 sec  2.31 GBytes  1.98 Gbits/sec
> > [ 15] 0.0000-10.0005 sec  2.31 GBytes  1.98 Gbits/sec
> > [ 19] 0.0000-10.0001 sec  2.30 GBytes  1.98 Gbits/sec
> > [  7] 0.0000-10.0004 sec  2.31 GBytes  1.98 Gbits/sec
> > [ 13] 0.0000-10.0005 sec  2.31 GBytes  1.98 Gbits/sec
> > [ 10] 0.0000-10.0003 sec  2.30 GBytes  1.98 Gbits/sec
> > [  9] 0.0000-10.0000 sec  2.31 GBytes  1.98 Gbits/sec
> > [ 16] 0.0000-10.0002 sec  2.31 GBytes  1.98 Gbits/sec
> > [SUM] 0.0000-10.0003 sec  36.9 GBytes  31.7 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 88.398/101.706/114.726/24.755 ms (tot/err) = 16/0
> > jwang@ps401a-914.nst:~$
> > LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
> > 52000 -c 10.43.3.145 -P 1
> > ------------------------------------------------------------
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  128 KByte (default)
> > ------------------------------------------------------------
> > [  3] local 10.43.3.146 port 49168 connected with 10.43.3.145 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  3] 0.0000-10.0000 sec  34.3 GBytes  29.5 Gbits/sec
> > jwang@ps401a-914.nst:~$
> > LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
> > 52000 -c 10.43.3.145 -P 2
> > ------------------------------------------------------------
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  128 KByte (default)
> > ------------------------------------------------------------
> > [  3] local 10.43.3.146 port 42096 connected with 10.43.3.145 port 52000
> > [  4] local 10.43.3.146 port 58667 connected with 10.43.3.145 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  4] 0.0000-10.0001 sec  18.4 GBytes  15.8 Gbits/sec
> > [  3] 0.0000-10.0000 sec  18.5 GBytes  15.9 Gbits/sec
> > [SUM] 0.0000-10.0001 sec  36.9 GBytes  31.7 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 38.155/38.997/39.839/39.839 ms (tot/err) = 2/0
> > jwang@ps401a-914.nst:~$
> > LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
> > 52000 -c 10.43.3.145 -P 4
> > ------------------------------------------------------------
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  128 KByte (default)
> > ------------------------------------------------------------
> > [  3] local 10.43.3.146 port 36100 connected with 10.43.3.145 port 52000
> > [  5] local 10.43.3.146 port 55108 connected with 10.43.3.145 port 52000
> > [  6] local 10.43.3.146 port 41039 connected with 10.43.3.145 port 52000
> > [  7] local 10.43.3.146 port 34868 connected with 10.43.3.145 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  7] 0.0000-10.0000 sec  9.22 GBytes  7.92 Gbits/sec
> > [  5] 0.0000-10.0000 sec  9.22 GBytes  7.92 Gbits/sec
> > [  3] 0.0000-10.0000 sec  9.22 GBytes  7.92 Gbits/sec
> > [  6] 0.0000-10.0001 sec  9.22 GBytes  7.92 Gbits/sec
> > [SUM] 0.0000-10.0001 sec  36.9 GBytes  31.7 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 51.401/53.644/56.015/30.487 ms (tot/err) = 4/0
> >
> > You can see that rsocket reaches ~30 Gb/s with a single stream, while
> > ipoib reaches only ~2.5 Gb/s (a 12x difference); ipoib scales with more
> > threads up to ~32 Gb/s, which is the link limit.
> >
> > In the 100 G env, rsocket also outperforms ipoib, see below:
> >
> >
> > jwang@ps404a-59.stg:~$
> > LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
> > 52000 -c 10.43.48.58
> > ------------------------------------------------------------
> > Client connecting to 10.43.48.58, TCP port 52000
> > TCP window size:  128 KByte (default)
> > ------------------------------------------------------------
> > [  3] local 10.43.48.59 port 40588 connected with 10.43.48.58 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  3] 0.0000-10.0000 sec  80.7 GBytes  69.4 Gbits/sec
> > jwang@ps404a-59.stg:~$
> > LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
> > 52000 -c 10.43.48.58 -P 2
> > ------------------------------------------------------------
> > Client connecting to 10.43.48.58, TCP port 52000
> > TCP window size:  128 KByte (default)
> > ------------------------------------------------------------
> > [  3] local 10.43.48.59 port 41813 connected with 10.43.48.58 port 52000
> > [  5] local 10.43.48.59 port 60638 connected with 10.43.48.58 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  5] 0.0000-10.0000 sec  48.9 GBytes  42.0 Gbits/sec
> > [  3] 0.0000-10.0000 sec  49.8 GBytes  42.8 Gbits/sec
> > [SUM] 0.0000-10.0000 sec  98.7 GBytes  84.8 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 6.962/7.764/8.567/8.567 ms (tot/err) = 2/0
> > jwang@ps404a-59.stg:~$
> > LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
> > 52000 -c 10.43.48.58 -P 4
> > [  6] local 10.43.48.59 port 58086 connected with 10.43.48.58 port 52000
> > [  3] local 10.43.48.59 port 49335 connected with 10.43.48.58 port 52000
> > ------------------------------------------------------------
> > Client connecting to 10.43.48.58, TCP port 52000
> > TCP window size:  128 KByte (default)
> > ------------------------------------------------------------
> > [  4] local 10.43.48.59 port 44593 connected with 10.43.48.58 port 52000
> > [  5] local 10.43.48.59 port 60464 connected with 10.43.48.58 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  5] 0.0000-10.0000 sec  28.0 GBytes  24.0 Gbits/sec
> > [  4] 0.0000-10.0000 sec  28.0 GBytes  24.0 Gbits/sec
> > [  3] 0.0000-10.0000 sec  28.0 GBytes  24.1 Gbits/sec
> > [  6] 0.0000-10.0000 sec  28.0 GBytes  24.1 Gbits/sec
> > [SUM] 0.0000-10.0001 sec   112 GBytes  96.3 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 7.344/9.619/12.199/5.271 ms (tot/err) = 4/0
> > jwang@ps404a-59.stg:~$
> > LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
> > 52000 -c 10.43.48.58 -P 8
> > [  3] local 10.43.48.59 port 43020 connected with 10.43.48.58 port 52000
> > [  7] local 10.43.48.59 port 59720 connected with 10.43.48.58 port 52000
> > [  4] local 10.43.48.59 port 52547 connected with 10.43.48.58 port 52000
> > [  8] local 10.43.48.59 port 41712 connected with 10.43.48.58 port 52000
> > [ 10] local 10.43.48.59 port 53126 connected with 10.43.48.58 port 52000
> > ------------------------------------------------------------
> > Client connecting to 10.43.48.58, TCP port 52000
> > TCP window size:  128 KByte (default)
> > ------------------------------------------------------------
> > [  6] local 10.43.48.59 port 60311 connected with 10.43.48.58 port 52000
> > [  5] local 10.43.48.59 port 44103 connected with 10.43.48.58 port 52000
> > [  9] local 10.43.48.59 port 49007 connected with 10.43.48.58 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  9] 0.0000-10.0001 sec  14.0 GBytes  12.0 Gbits/sec
> > [  8] 0.0000-10.0000 sec  14.0 GBytes  12.0 Gbits/sec
> > [  4] 0.0000-10.0001 sec  14.0 GBytes  12.0 Gbits/sec
> > [  6] 0.0000-10.0000 sec  14.0 GBytes  12.0 Gbits/sec
> > [ 10] 0.0000-10.0000 sec  14.0 GBytes  12.0 Gbits/sec
> > [  7] 0.0000-10.0000 sec  14.0 GBytes  12.0 Gbits/sec
> > [  5] 0.0000-10.0001 sec  14.0 GBytes  12.0 Gbits/sec
> > [  3] 0.0000-10.0001 sec  14.0 GBytes  12.0 Gbits/sec
> > [SUM] 0.0000-10.0001 sec   112 GBytes  96.3 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 6.942/12.361/18.109/4.872 ms (tot/err) = 8/0
> > jwang@ps404a-59.stg:~$ iperf -p 52000 -c 10.43.48.58 -P 8
> > ------------------------------------------------------------
> > Client connecting to 10.43.48.58, TCP port 52000
> > TCP window size:  165 KByte (default)
> > ------------------------------------------------------------
> > [  4] local 10.43.48.59 port 58176 connected with 10.43.48.58 port 52000
> > [  5] local 10.43.48.59 port 58180 connected with 10.43.48.58 port 52000
> > [  3] local 10.43.48.59 port 58178 connected with 10.43.48.58 port 52000
> > [ 10] local 10.43.48.59 port 58226 connected with 10.43.48.58 port 52000
> > [ 11] local 10.43.48.59 port 58228 connected with 10.43.48.58 port 52000
> > [  9] local 10.43.48.59 port 58212 connected with 10.43.48.58 port 52000
> > [  7] local 10.43.48.59 port 58194 connected with 10.43.48.58 port 52000
> > [  8] local 10.43.48.59 port 58198 connected with 10.43.48.58 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  9] 0.0000-10.0005 sec  15.8 GBytes  13.5 Gbits/sec
> > [  4] 0.0000-10.0002 sec  15.8 GBytes  13.6 Gbits/sec
> > [  3] 0.0000-10.0000 sec  15.8 GBytes  13.6 Gbits/sec
> > [  5] 0.0000-10.0002 sec  15.8 GBytes  13.6 Gbits/sec
> > [  8] 0.0000-10.0005 sec  7.89 GBytes  6.78 Gbits/sec
> > [ 10] 0.0000-10.0000 sec  15.8 GBytes  13.6 Gbits/sec
> > [ 11] 0.0000-10.0014 sec  7.94 GBytes  6.82 Gbits/sec
> > [  7] 0.0000-10.0009 sec  15.8 GBytes  13.6 Gbits/sec
> > [SUM] 0.0000-10.0009 sec   111 GBytes  95.1 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 0.234/0.325/0.406/0.155 ms (tot/err) = 8/0
> > jwang@ps404a-59.stg:~$ iperf -p 52000 -c 10.43.48.58 -P 4
> > [  3] local 10.43.48.59 port 42548 connected with 10.43.48.58 port 52000
> > ------------------------------------------------------------
> > Client connecting to 10.43.48.58, TCP port 52000
> > TCP window size:  165 KByte (default)
> > ------------------------------------------------------------
> > [  4] local 10.43.48.59 port 42558 connected with 10.43.48.58 port 52000
> > [  5] local 10.43.48.59 port 42560 connected with 10.43.48.58 port 52000
> > [  6] local 10.43.48.59 port 42562 connected with 10.43.48.58 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  6] 0.0000-10.0000 sec  27.8 GBytes  23.9 Gbits/sec
> > [  5] 0.0000-10.0001 sec  27.3 GBytes  23.5 Gbits/sec
> > [  3] 0.0000-10.0001 sec  27.8 GBytes  23.9 Gbits/sec
> > [  4] 0.0000-10.0001 sec  27.8 GBytes  23.9 Gbits/sec
> > [SUM] 0.0000-10.0001 sec   111 GBytes  95.1 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 0.295/0.340/0.390/0.201 ms (tot/err) = 4/0
> > jwang@ps404a-59.stg:~$ iperf -p 52000 -c 10.43.48.58 -P 2
> > ------------------------------------------------------------
> > Client connecting to 10.43.48.58, TCP port 52000
> > TCP window size:  165 KByte (default)
> > ------------------------------------------------------------
> > [  4] local 10.43.48.59 port 44194 connected with 10.43.48.58 port 52000
> > [  3] local 10.43.48.59 port 44186 connected with 10.43.48.58 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  3] 0.0000-10.0000 sec  48.3 GBytes  41.5 Gbits/sec
> > [  4] 0.0000-10.0000 sec  41.3 GBytes  35.5 Gbits/sec
> > [SUM] 0.0000-10.0000 sec  89.7 GBytes  77.0 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 0.227/0.233/0.240/0.240 ms (tot/err) = 2/0
> > jwang@ps404a-59.stg:~$ pbkvm list
> >  VM  State  PID  Cores  Mem  VNC  Migration
> > --------------------------------------------
> >
> > Total: 0 VMs, Running: 0
> > jwang@ps404a-59.stg:~$ iperf -p 52000 -c 10.43.48.58 -P 1
> > ------------------------------------------------------------
> > Client connecting to 10.43.48.58, TCP port 52000
> > TCP window size:  165 KByte (default)
> > ------------------------------------------------------------
> > [  3] local 10.43.48.59 port 40364 connected with 10.43.48.58 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  3] 0.0000-10.0000 sec  51.2 GBytes  44.0 Gbits/sec
> >
> > Thanks!
> >
> >
> > > Thanks,
> > >
> > > --
> > > Peter Xu
> > >
> >
>
> --
> Peter Xu
>
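
The ratios behind the quoted summary can be cross-checked from the iperf
figures above.  The following is only a small arithmetic sketch: the numbers
are copied from the quoted output, the names RESULTS and speedup() are made up
for illustration, and the runs without LD_PRELOAD are assumed to be plain TCP
over the ipoib interface, as the quoted text describes.

```python
# Bandwidth figures in Gbit/s, copied from the quoted iperf output.
# "<transport>_<n>" is the single-stream value (n = 1) or the [SUM] line
# for n parallel streams.
RESULTS = {
    "40G": {
        "ipoib_1": 2.44, "rsocket_1": 29.5,
        "ipoib_16": 31.3, "rsocket_16": 31.7,
    },
    "100G": {
        "ipoib_1": 44.0, "rsocket_1": 69.4,
        "ipoib_8": 95.1, "rsocket_8": 96.3,
    },
}


def speedup(env, streams):
    """Ratio of rsocket to ipoib throughput for one environment."""
    row = RESULTS[env]
    return row[f"rsocket_{streams}"] / row[f"ipoib_{streams}"]


for env, streams in (("40G", 1), ("40G", 16), ("100G", 1), ("100G", 8)):
    print(f"{env} -P {streams}: rsocket/ipoib = {speedup(env, streams):.2f}")
```

The single-stream ratio on the 40 G setup comes out at roughly 12, matching
the "12 X" remark, while with enough parallel streams both transports sit
near the link limit.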



* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-05-07  1:50                                               ` Gonglei (Arei) via
@ 2024-05-07 16:28                                                 ` Peter Xu
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2024-05-07 16:28 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: Daniel P. Berrangé,
	Markus Armbruster, Michael Galaxy, Yu Zhang, Zhijian Li (Fujitsu),
	Jinpu Wang, Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, Pannengyuan,
	Xiexiangyou, zhengchuan

On Tue, May 07, 2024 at 01:50:43AM +0000, Gonglei (Arei) wrote:
> Hello,
> 
> > -----Original Message-----
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Monday, May 6, 2024 11:18 PM
> > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > Cc: Daniel P. Berrangé <berrange@redhat.com>; Markus Armbruster
> > <armbru@redhat.com>; Michael Galaxy <mgalaxy@akamai.com>; Yu Zhang
> > <yu.zhang@ionos.com>; Zhijian Li (Fujitsu) <lizhijian@fujitsu.com>; Jinpu Wang
> > <jinpu.wang@ionos.com>; Elmar Gerdes <elmar.gerdes@ionos.com>;
> > qemu-devel@nongnu.org; Yuval Shaia <yuval.shaia.ml@gmail.com>; Kevin Wolf
> > <kwolf@redhat.com>; Prasanna Kumar Kalever
> > <prasanna.kalever@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> > Michael Roth <michael.roth@amd.com>; Prasanna Kumar Kalever
> > <prasanna4324@gmail.com>; integration@gluster.org; Paolo Bonzini
> > <pbonzini@redhat.com>; qemu-block@nongnu.org; devel@lists.libvirt.org;
> > Hanna Reitz <hreitz@redhat.com>; Michael S. Tsirkin <mst@redhat.com>;
> > Thomas Huth <thuth@redhat.com>; Eric Blake <eblake@redhat.com>; Song
> > Gao <gaosong@loongson.cn>; Marc-André Lureau
> > <marcandre.lureau@redhat.com>; Alex Bennée <alex.bennee@linaro.org>;
> > Wainer dos Santos Moschetta <wainersm@redhat.com>; Beraldo Leal
> > <bleal@redhat.com>; Pannengyuan <pannengyuan@huawei.com>;
> > Xiexiangyou <xiexiangyou@huawei.com>
> > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> > 
> > On Mon, May 06, 2024 at 02:06:28AM +0000, Gonglei (Arei) wrote:
> > > Hi, Peter
> > 
> > Hey, Lei,
> > 
> > Happy to see you around again after years.
> > 
> Haha, me too.
> 
> > > RDMA features high bandwidth, low latency (in non-blocking lossless
> > > network), and direct remote memory access by bypassing the CPU (As you
> > > know, CPU resources are expensive for cloud vendors, which is one of
> > > the reasons why we introduced offload cards.), which TCP does not have.
> > 
> > It's another cost to use offload cards, v.s. preparing more cpu resources?
> > 
> Software and hardware offload converged architecture is the way to go for all cloud vendors 
> (Including comprehensive benefits in terms of performance, cost, security, and innovation speed), 
> it's not just a matter of adding the resource of a DPU card.
> 
> > > RDMA is very useful in scenarios where fast live migration is needed
> > > (extremely short interruption duration and migration duration). To this
> > > end, we have also developed RDMA support for multifd.
> > 
> > Will any of you upstream that work?  I'm curious how intrusive would it be
> > when adding it to multifd, if it can keep only 5 exported functions like what
> > rdma.h does right now it'll be pretty nice.  We also want to make sure it works
> > with arbitrary sized loads and buffers, e.g. vfio is considering to add IO loads to
> > multifd channels too.
> > 
> 
> In fact, we sent the patchset to the community in 2021. Pls see:
> https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/

I wasn't aware of that before.

Multifd changed quite a bit in the 9.0 release, so that series may no longer
apply.  One thing to mention: please look at Dan's comment on the possible
use of rsocket.h:

https://lore.kernel.org/all/ZjJm6rcqS5EhoKgK@redhat.com/

And Jinpu helped by providing initial test results for the library:

https://lore.kernel.org/qemu-devel/CAMGffEk8wiKNQmoUYxcaTHGtiEm2dwoCF_W7T0vMcD-i30tUkA@mail.gmail.com/

It looks like we have a chance to apply that in QEMU.

> 
> 
> > One thing to note is that the question here is not only a pure performance
> > comparison between rdma and nics.  It's about helping us make a decision
> > on whether to drop rdma; iow, even if rdma performs well, the community still
> > has the right to drop it if nobody can actively work on and maintain it.
> > It's just that if nics can perform as well, that's one more reason to drop
> > it, unless companies can help to provide good support and work together.
> > 
> 
> We are happy to provide the necessary review and maintenance work for RDMA
> if the community needs it.
> 
> CC'ing Chuan Zheng.

I'm not sure whether your team and Jinpu's team would like to work together
and provide a final solution for rdma over multifd.  It could be much simpler
than the original 2021 proposal if the rsocket API works out.

Thanks,

-- 
Peter Xu




* Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
  2024-05-07  4:52                                                     ` Jinpu Wang
@ 2024-05-08 10:06                                                       ` Daniel P. Berrangé
  0 siblings, 0 replies; 52+ messages in thread
From: Daniel P. Berrangé @ 2024-05-08 10:06 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Peter Xu, Markus Armbruster, Michael Galaxy, Yu Zhang,
	Zhijian Li (Fujitsu),
	Elmar Gerdes, qemu-devel, Yuval Shaia, Kevin Wolf,
	Prasanna Kumar Kalever, Cornelia Huck, Michael Roth,
	Prasanna Kumar Kalever, integration, Paolo Bonzini, qemu-block,
	devel, Hanna Reitz, Michael S. Tsirkin, Thomas Huth, Eric Blake,
	Song Gao, Marc-André Lureau, Alex Bennée,
	Wainer dos Santos Moschetta, Beraldo Leal, arei.gonglei,
	pannengyuan

On Tue, May 07, 2024 at 06:52:50AM +0200, Jinpu Wang wrote:
> Hi Peter, hi Daniel,
> On Mon, May 6, 2024 at 5:29 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, May 06, 2024 at 12:08:43PM +0200, Jinpu Wang wrote:
> > > Hi Peter, hi Daniel,
> >
> > Hi, Jinpu,
> >
> > Thanks for sharing these test results.  Sounds like great news.
> >
> > What's your plan next?  Would it then be worthwhile / possible to move QEMU
> > in that direction?  Would that greatly simplify the rdma code, as Dan
> > mentioned?
> I'm not very familiar with QEMU migration yet.  From the test results,
> I think it's a possible direction; we would just need to base it on
> at least a fairly recent release such as rdma-core v33, with proper
> 'fork' support.
>
> Maybe Dan or you could give more detail about what you have in mind
> for using rsocket as a replacement in the future.
> We will also look into the implementation details in the meantime.

The migration/socket.c file is the entry point for the traditional TCP
based migration code. It uses the QIOChannelSocket class, which is
written against the traditional sockets APIs, and uses the QAPI
SocketAddress data type to configure it.

My thought was that SocketAddress could potentially be extended to
offer RDMA addressing, e.g.:


{ 'union': 'SocketAddress',
  'base': { 'type': 'SocketAddressType' },
  'discriminator': 'type',
  'data': { 'inet': 'InetSocketAddress',
            'unix': 'UnixSocketAddress',
            'vsock': 'VsockSocketAddress',
            'fd': 'FdSocketAddress',
            'rdma': 'InetSocketAddress' } }

And then QIOChannelSocket could also be extended to call the
alternative 'rsockets' APIs where needed. That would mean that the
existing sockets migration code would almost "just work" with
RDMA. Theoretically, any other part of QEMU using QIOChannelSocket
would then magically support RDMA too, with very little (if
any) extra work.
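
To make the dispatch idea concrete, here is a toy Python model; it is not
QEMU code.  The branch table copies the union above, where 'rdma' is only
the proposed addition, and api_for() is a hypothetical helper.  The
r-prefixed names are the librdmacm rsocket counterparts of the BSD socket
calls (rsocket(), rconnect(), rsend(), rrecv(), rclose()).

```python
# Branches of the SocketAddress union as sketched above.  The 'rdma'
# entry is the proposed extension and reuses the InetSocketAddress fields.
SOCKET_ADDRESS_BRANCHES = {
    "inet": "InetSocketAddress",
    "unix": "UnixSocketAddress",
    "vsock": "VsockSocketAddress",
    "fd": "FdSocketAddress",
    "rdma": "InetSocketAddress",
}

# BSD socket calls that have an r-prefixed equivalent in <rdma/rsocket.h>.
BSD_CALLS = ("socket", "connect", "send", "recv", "close")


def api_for(addr):
    """Map each BSD call name to the function a channel would invoke.

    Hypothetical helper: an 'rdma' address selects the rsocket variants,
    every other address type keeps the plain socket calls.
    """
    kind = addr["type"]
    if kind not in SOCKET_ADDRESS_BRANCHES:
        raise ValueError(f"unknown SocketAddress type: {kind!r}")
    prefix = "r" if kind == "rdma" else ""
    return {name: prefix + name for name in BSD_CALLS}


tcp = api_for({"type": "inet", "host": "10.43.3.145", "port": "52000"})
rdma = api_for({"type": "rdma", "host": "10.43.3.145", "port": "52000"})
print(tcp["connect"], rdma["connect"])
```

Callers only see the address type change; the choice between the two call
families stays inside the channel, which is what would let the existing
migration code keep working unmodified.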

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




Thread overview: 52+ messages
2024-03-28 13:02 [PATCH-for-9.1 v2 0/3] rdma: Remove RDMA subsystem and pvrdma device Philippe Mathieu-Daudé
2024-03-28 13:02 ` [PATCH-for-9.1 v2 1/3] hw/rdma: Remove pvrdma device and rdmacm-mux helper Philippe Mathieu-Daudé
2024-03-28 17:51   ` Thomas Huth
2024-03-28 13:02 ` [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling Philippe Mathieu-Daudé
2024-03-28 14:18   ` Fabiano Rosas
2024-03-28 15:01     ` Peter Xu
2024-03-28 15:22       ` Thomas Huth
2024-03-28 19:04         ` Peter Xu
2024-03-29  1:53       ` Zhijian Li (Fujitsu) via
2024-03-29 10:28         ` Philippe Mathieu-Daudé
2024-03-29 19:44           ` Daniel P. Berrangé
2024-04-01  7:55           ` Zhijian Li (Fujitsu) via
2024-04-01 21:26             ` Yu Zhang
2024-04-02 21:23               ` Peter Xu
2024-04-08 14:07                 ` Jinpu Wang
2024-04-08 16:18                   ` Peter Xu
2024-04-09  7:32                     ` Jinpu Wang
2024-04-09 19:46                       ` Peter Xu
2024-04-10  2:28                         ` Zhijian Li (Fujitsu) via
2024-04-10 13:49                           ` Peter Xu
2024-04-11 14:20                             ` Peter Xu
2024-04-11 16:36                               ` Yu Zhang
2024-04-12 14:04                                 ` Peter Xu
2024-04-29 13:08                                 ` Michael Galaxy
2024-04-29 14:56                                   ` Peter Xu
2024-04-29 20:45                                     ` Yu Zhang
2024-04-29 20:56                                       ` Michael Galaxy
2024-04-30  7:15                                     ` Markus Armbruster
2024-04-30  8:00                                       ` Daniel P. Berrangé
2024-05-01 15:31                                         ` Peter Xu
2024-05-01 15:59                                           ` Daniel P. Berrangé
2024-05-01 16:16                                             ` Peter Xu
2024-05-02 13:22                                               ` Michael Galaxy
2024-05-02 13:30                                                 ` Jinpu Wang
2024-05-02 16:19                                                   ` Peter Xu
2024-05-02 17:10                                                     ` Jinpu Wang
2024-05-03  6:40                                             ` Jinpu Wang
2024-05-03 14:33                                               ` Peter Xu
2024-05-06 10:08                                                 ` Jinpu Wang
2024-05-06 15:28                                                   ` Peter Xu
2024-05-07  4:52                                                     ` Jinpu Wang
2024-05-08 10:06                                                       ` Daniel P. Berrangé
2024-05-06  2:06                                           ` Gonglei (Arei) via
2024-05-06 15:18                                             ` Peter Xu
2024-05-07  1:50                                               ` Gonglei (Arei) via
2024-05-07 16:28                                                 ` Peter Xu
2024-04-11 14:42                         ` Jinpu Wang
2024-04-09  9:00                     ` Markus Armbruster
2024-03-28 13:02 ` [PATCH-for-9.1 v2 3/3] block/gluster: " Philippe Mathieu-Daudé
2024-03-28 17:54   ` Thomas Huth
2024-03-29  9:17 ` [PATCH-for-9.1 v2 0/3] rdma: Remove RDMA subsystem and pvrdma device Michael S. Tsirkin
2024-04-03  9:37 ` Philippe Mathieu-Daudé
