From: Juan Quintela <quintela@redhat.com>
To: qemu-devel@nongnu.org
Cc: "Li Zhijian" <lizhijian@cn.fujitsu.com>,
"Stefano Stabellini" <sstabellini@kernel.org>,
"Eduardo Habkost" <ehabkost@redhat.com>,
kvm@vger.kernel.org, "David Hildenbrand" <david@redhat.com>,
"Philippe Mathieu-Daudé" <philmd@redhat.com>,
"Paul Durrant" <paul@xen.org>,
"Richard Henderson" <richard.henderson@linaro.org>,
"Markus Armbruster" <armbru@redhat.com>,
"Peter Xu" <peterx@redhat.com>,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
"Juan Quintela" <quintela@redhat.com>,
"Paolo Bonzini" <pbonzini@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com>,
"Marc-André Lureau" <marcandre.lureau@redhat.com>,
"Anthony Perard" <anthony.perard@citrix.com>,
xen-devel@lists.xenproject.org, "Eric Blake" <eblake@redhat.com>
Subject: [PULL 01/20] migration/rdma: Fix out of order wrid
Date: Mon, 1 Nov 2021 23:08:53 +0100 [thread overview]
Message-ID: <20211101220912.10039-2-quintela@redhat.com> (raw)
In-Reply-To: <20211101220912.10039-1-quintela@redhat.com>
From: Li Zhijian <lizhijian@cn.fujitsu.com>
destination:
../qemu/build/qemu-system-x86_64 -enable-kvm -netdev tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive if=none,file=./Fedora-rdma-server-migration.qcow2,id=drive-virtio-disk0 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl -spice streaming-video=filter,port=5902,disable-ticketing -incoming rdma:192.168.22.23:8888
qemu-system-x86_64: -spice streaming-video=filter,port=5902,disable-ticketing: warning: short-form boolean option 'disable-ticketing' deprecated
Please use disable-ticketing=on instead
QEMU 6.0.50 monitor - type 'help' for more information
(qemu) trace-event qemu_rdma_block_for_wrid_miss on
(qemu) dest_init RDMA Device opened: kernel name rxe_eth0 uverbs device name uverbs2, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs2, infiniband class device path /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
qemu_rdma_block_for_wrid_miss A Wanted wrid CONTROL SEND (2000) but got CONTROL RECV (4000)
source:
../qemu/build/qemu-system-x86_64 -enable-kvm -netdev tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive if=none,file=./Fedora-rdma-server.qcow2,id=drive-virtio-disk0 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl -spice streaming-video=filter,port=5901,disable-ticketing -S
qemu-system-x86_64: -spice streaming-video=filter,port=5901,disable-ticketing: warning: short-form boolean option 'disable-ticketing' deprecated
Please use disable-ticketing=on instead
QEMU 6.0.50 monitor - type 'help' for more information
(qemu)
(qemu) trace-event qemu_rdma_block_for_wrid_miss on
(qemu) migrate -d rdma:192.168.22.23:8888
source_resolve_host RDMA Device opened: kernel name rxe_eth0 uverbs device name uverbs2, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs2, infiniband class device path /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
(qemu) qemu_rdma_block_for_wrid_miss A Wanted wrid WRITE RDMA (1) but got CONTROL RECV (4000)
NOTE: we use soft RoCE as the rdma device.
[root@iaas-rpma images]# rdma link show rxe_eth0/1
link rxe_eth0/1 state ACTIVE physical_state LINK_UP netdev eth0
This migration could not be completed when out of order(OOO) CQ event occurs.
The send queue and receive queue shared a same completion queue, and
qemu_rdma_block_for_wrid() will drop the CQs it's not interested in. But
the dropped CQs by qemu_rdma_block_for_wrid() could be later CQs it wants.
So in this case, qemu_rdma_block_for_wrid() will block forever.
OOO cases will occur in both source side and destination side. And a
forever blocking happens on only SEND and RECV are out of order. OOO between
'WRITE RDMA' and 'RECV' doesn't matter.
below the OOO sequence:
source destination
rdma_write_one() qemu_rdma_registration_handle()
1. S1: post_recv X D1: post_recv Y
2. wait for recv CQ event X
3. D2: post_send X ---------------+
4. wait for send CQ send event X (D2) |
5. recv CQ event X reaches (D2) |
6. +-S2: post_send Y |
7. | wait for send CQ event Y |
8. | recv CQ event Y (S2) (drop it) |
9. +-send CQ event Y reaches (S2) |
10. send CQ event X reaches (D2) -----+
11. wait recv CQ event Y (dropped by (8))
Although a hardware IB works fine in my a hundred of runs, the IB specification
doesn't guaratee the CQ order in such case.
Here we introduce a independent send completion queue to distinguish
ibv_post_send completion queue from the original mixed completion queue.
It helps us to poll the specific CQE we are really interested in.
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
---
migration/rdma.c | 138 ++++++++++++++++++++++++++++++++++-------------
1 file changed, 101 insertions(+), 37 deletions(-)
diff --git a/migration/rdma.c b/migration/rdma.c
index 2a3c7889b9..f5d3bbe7e9 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -358,9 +358,11 @@ typedef struct RDMAContext {
struct ibv_context *verbs;
struct rdma_event_channel *channel;
struct ibv_qp *qp; /* queue pair */
- struct ibv_comp_channel *comp_channel; /* completion channel */
+ struct ibv_comp_channel *recv_comp_channel; /* recv completion channel */
+ struct ibv_comp_channel *send_comp_channel; /* send completion channel */
struct ibv_pd *pd; /* protection domain */
- struct ibv_cq *cq; /* completion queue */
+ struct ibv_cq *recv_cq; /* recvieve completion queue */
+ struct ibv_cq *send_cq; /* send completion queue */
/*
* If a previous write failed (perhaps because of a failed
@@ -1059,21 +1061,34 @@ static int qemu_rdma_alloc_pd_cq(RDMAContext *rdma)
return -1;
}
- /* create completion channel */
- rdma->comp_channel = ibv_create_comp_channel(rdma->verbs);
- if (!rdma->comp_channel) {
- error_report("failed to allocate completion channel");
+ /* create receive completion channel */
+ rdma->recv_comp_channel = ibv_create_comp_channel(rdma->verbs);
+ if (!rdma->recv_comp_channel) {
+ error_report("failed to allocate receive completion channel");
goto err_alloc_pd_cq;
}
/*
- * Completion queue can be filled by both read and write work requests,
- * so must reflect the sum of both possible queue sizes.
+ * Completion queue can be filled by read work requests.
*/
- rdma->cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
- NULL, rdma->comp_channel, 0);
- if (!rdma->cq) {
- error_report("failed to allocate completion queue");
+ rdma->recv_cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
+ NULL, rdma->recv_comp_channel, 0);
+ if (!rdma->recv_cq) {
+ error_report("failed to allocate receive completion queue");
+ goto err_alloc_pd_cq;
+ }
+
+ /* create send completion channel */
+ rdma->send_comp_channel = ibv_create_comp_channel(rdma->verbs);
+ if (!rdma->send_comp_channel) {
+ error_report("failed to allocate send completion channel");
+ goto err_alloc_pd_cq;
+ }
+
+ rdma->send_cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
+ NULL, rdma->send_comp_channel, 0);
+ if (!rdma->send_cq) {
+ error_report("failed to allocate send completion queue");
goto err_alloc_pd_cq;
}
@@ -1083,11 +1098,19 @@ err_alloc_pd_cq:
if (rdma->pd) {
ibv_dealloc_pd(rdma->pd);
}
- if (rdma->comp_channel) {
- ibv_destroy_comp_channel(rdma->comp_channel);
+ if (rdma->recv_comp_channel) {
+ ibv_destroy_comp_channel(rdma->recv_comp_channel);
+ }
+ if (rdma->send_comp_channel) {
+ ibv_destroy_comp_channel(rdma->send_comp_channel);
+ }
+ if (rdma->recv_cq) {
+ ibv_destroy_cq(rdma->recv_cq);
+ rdma->recv_cq = NULL;
}
rdma->pd = NULL;
- rdma->comp_channel = NULL;
+ rdma->recv_comp_channel = NULL;
+ rdma->send_comp_channel = NULL;
return -1;
}
@@ -1104,8 +1127,8 @@ static int qemu_rdma_alloc_qp(RDMAContext *rdma)
attr.cap.max_recv_wr = 3;
attr.cap.max_send_sge = 1;
attr.cap.max_recv_sge = 1;
- attr.send_cq = rdma->cq;
- attr.recv_cq = rdma->cq;
+ attr.send_cq = rdma->send_cq;
+ attr.recv_cq = rdma->recv_cq;
attr.qp_type = IBV_QPT_RC;
ret = rdma_create_qp(rdma->cm_id, rdma->pd, &attr);
@@ -1496,14 +1519,14 @@ static void qemu_rdma_signal_unregister(RDMAContext *rdma, uint64_t index,
* (of any kind) has completed.
* Return the work request ID that completed.
*/
-static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
- uint32_t *byte_len)
+static uint64_t qemu_rdma_poll(RDMAContext *rdma, struct ibv_cq *cq,
+ uint64_t *wr_id_out, uint32_t *byte_len)
{
int ret;
struct ibv_wc wc;
uint64_t wr_id;
- ret = ibv_poll_cq(rdma->cq, 1, &wc);
+ ret = ibv_poll_cq(cq, 1, &wc);
if (!ret) {
*wr_id_out = RDMA_WRID_NONE;
@@ -1575,7 +1598,8 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
/* Wait for activity on the completion channel.
* Returns 0 on success, none-0 on error.
*/
-static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
+static int qemu_rdma_wait_comp_channel(RDMAContext *rdma,
+ struct ibv_comp_channel *comp_channel)
{
struct rdma_cm_event *cm_event;
int ret = -1;
@@ -1586,7 +1610,7 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
*/
if (rdma->migration_started_on_destination &&
migration_incoming_get_current()->state == MIGRATION_STATUS_ACTIVE) {
- yield_until_fd_readable(rdma->comp_channel->fd);
+ yield_until_fd_readable(comp_channel->fd);
} else {
/* This is the source side, we're in a separate thread
* or destination prior to migration_fd_process_incoming()
@@ -1597,7 +1621,7 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
*/
while (!rdma->error_state && !rdma->received_error) {
GPollFD pfds[2];
- pfds[0].fd = rdma->comp_channel->fd;
+ pfds[0].fd = comp_channel->fd;
pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
pfds[0].revents = 0;
@@ -1655,6 +1679,17 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
return rdma->error_state;
}
+static struct ibv_comp_channel *to_channel(RDMAContext *rdma, int wrid)
+{
+ return wrid < RDMA_WRID_RECV_CONTROL ? rdma->send_comp_channel :
+ rdma->recv_comp_channel;
+}
+
+static struct ibv_cq *to_cq(RDMAContext *rdma, int wrid)
+{
+ return wrid < RDMA_WRID_RECV_CONTROL ? rdma->send_cq : rdma->recv_cq;
+}
+
/*
* Block until the next work request has completed.
*
@@ -1675,13 +1710,15 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
struct ibv_cq *cq;
void *cq_ctx;
uint64_t wr_id = RDMA_WRID_NONE, wr_id_in;
+ struct ibv_comp_channel *ch = to_channel(rdma, wrid_requested);
+ struct ibv_cq *poll_cq = to_cq(rdma, wrid_requested);
- if (ibv_req_notify_cq(rdma->cq, 0)) {
+ if (ibv_req_notify_cq(poll_cq, 0)) {
return -1;
}
/* poll cq first */
while (wr_id != wrid_requested) {
- ret = qemu_rdma_poll(rdma, &wr_id_in, byte_len);
+ ret = qemu_rdma_poll(rdma, poll_cq, &wr_id_in, byte_len);
if (ret < 0) {
return ret;
}
@@ -1702,12 +1739,12 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
}
while (1) {
- ret = qemu_rdma_wait_comp_channel(rdma);
+ ret = qemu_rdma_wait_comp_channel(rdma, ch);
if (ret) {
goto err_block_for_wrid;
}
- ret = ibv_get_cq_event(rdma->comp_channel, &cq, &cq_ctx);
+ ret = ibv_get_cq_event(ch, &cq, &cq_ctx);
if (ret) {
perror("ibv_get_cq_event");
goto err_block_for_wrid;
@@ -1721,7 +1758,7 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
}
while (wr_id != wrid_requested) {
- ret = qemu_rdma_poll(rdma, &wr_id_in, byte_len);
+ ret = qemu_rdma_poll(rdma, poll_cq, &wr_id_in, byte_len);
if (ret < 0) {
goto err_block_for_wrid;
}
@@ -2437,13 +2474,21 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
rdma_destroy_qp(rdma->cm_id);
rdma->qp = NULL;
}
- if (rdma->cq) {
- ibv_destroy_cq(rdma->cq);
- rdma->cq = NULL;
+ if (rdma->recv_cq) {
+ ibv_destroy_cq(rdma->recv_cq);
+ rdma->recv_cq = NULL;
}
- if (rdma->comp_channel) {
- ibv_destroy_comp_channel(rdma->comp_channel);
- rdma->comp_channel = NULL;
+ if (rdma->send_cq) {
+ ibv_destroy_cq(rdma->send_cq);
+ rdma->send_cq = NULL;
+ }
+ if (rdma->recv_comp_channel) {
+ ibv_destroy_comp_channel(rdma->recv_comp_channel);
+ rdma->recv_comp_channel = NULL;
+ }
+ if (rdma->send_comp_channel) {
+ ibv_destroy_comp_channel(rdma->send_comp_channel);
+ rdma->send_comp_channel = NULL;
}
if (rdma->pd) {
ibv_dealloc_pd(rdma->pd);
@@ -3115,10 +3160,14 @@ static void qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc,
{
QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
if (io_read) {
- aio_set_fd_handler(ctx, rioc->rdmain->comp_channel->fd,
+ aio_set_fd_handler(ctx, rioc->rdmain->recv_comp_channel->fd,
+ false, io_read, io_write, NULL, opaque);
+ aio_set_fd_handler(ctx, rioc->rdmain->send_comp_channel->fd,
false, io_read, io_write, NULL, opaque);
} else {
- aio_set_fd_handler(ctx, rioc->rdmaout->comp_channel->fd,
+ aio_set_fd_handler(ctx, rioc->rdmaout->recv_comp_channel->fd,
+ false, io_read, io_write, NULL, opaque);
+ aio_set_fd_handler(ctx, rioc->rdmaout->send_comp_channel->fd,
false, io_read, io_write, NULL, opaque);
}
}
@@ -3332,7 +3381,22 @@ static size_t qemu_rdma_save_page(QEMUFile *f, void *opaque,
*/
while (1) {
uint64_t wr_id, wr_id_in;
- int ret = qemu_rdma_poll(rdma, &wr_id_in, NULL);
+ int ret = qemu_rdma_poll(rdma, rdma->recv_cq, &wr_id_in, NULL);
+ if (ret < 0) {
+ error_report("rdma migration: polling error! %d", ret);
+ goto err;
+ }
+
+ wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
+
+ if (wr_id == RDMA_WRID_NONE) {
+ break;
+ }
+ }
+
+ while (1) {
+ uint64_t wr_id, wr_id_in;
+ int ret = qemu_rdma_poll(rdma, rdma->send_cq, &wr_id_in, NULL);
if (ret < 0) {
error_report("rdma migration: polling error! %d", ret);
goto err;
--
2.33.1
next prev parent reply other threads:[~2021-11-01 22:15 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-11-01 22:08 [PULL 00/20] Migration 20211031 patches Juan Quintela
2021-11-01 22:08 ` Juan Quintela [this message]
2021-11-01 22:08 ` [PULL 02/20] KVM: introduce dirty_pages and kvm_dirty_ring_enabled Juan Quintela
2021-11-01 22:08 ` [PULL 03/20] memory: make global_dirty_tracking a bitmask Juan Quintela
2021-11-01 22:08 ` [PULL 04/20] migration/dirtyrate: introduce struct and adjust DirtyRateStat Juan Quintela
2021-11-04 20:54 ` Eric Blake
2021-11-01 22:08 ` [PULL 05/20] migration/dirtyrate: adjust order of registering thread Juan Quintela
2021-11-01 22:08 ` [PULL 06/20] migration/dirtyrate: move init step of calculation to main thread Juan Quintela
2021-11-01 22:08 ` [PULL 07/20] migration/dirtyrate: implement dirty-ring dirtyrate calculation Juan Quintela
2021-11-04 22:05 ` Philippe Mathieu-Daudé
2021-11-06 11:45 ` Juan Quintela
2021-11-01 22:09 ` [PULL 08/20] migration: Make migration blocker work for snapshots too Juan Quintela
2021-11-01 22:09 ` [PULL 09/20] migration: Add migrate_add_blocker_internal() Juan Quintela
2021-11-01 22:09 ` [PULL 10/20] dump-guest-memory: Block live migration Juan Quintela
2021-11-01 22:09 ` [PULL 11/20] memory: Introduce replay_discarded callback for RamDiscardManager Juan Quintela
2021-11-01 22:09 ` [PULL 12/20] virtio-mem: Implement replay_discarded RamDiscardManager callback Juan Quintela
2021-11-01 22:09 ` [PULL 13/20] migration/ram: Handle RAMBlocks with a RamDiscardManager on the migration source Juan Quintela
2021-11-01 22:09 ` [PULL 14/20] virtio-mem: Drop precopy notifier Juan Quintela
2021-11-01 22:09 ` [PULL 15/20] migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the destination Juan Quintela
2021-11-01 22:09 ` [PULL 16/20] migration: Simplify alignment and alignment checks Juan Quintela
2021-11-01 22:09 ` [PULL 17/20] migration/ram: Factor out populating pages readable in ram_block_populate_pages() Juan Quintela
2021-11-01 22:09 ` [PULL 18/20] migration/ram: Handle RAMBlocks with a RamDiscardManager on background snapshots Juan Quintela
2021-11-01 22:09 ` [PULL 19/20] memory: introduce total_dirty_pages to stat dirty pages Juan Quintela
2021-11-01 22:09 ` [PULL 20/20] migration/dirtyrate: implement dirty-bitmap dirtyrate calculation Juan Quintela
2021-11-02 15:45 ` [PULL 00/20] Migration 20211031 patches Richard Henderson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20211101220912.10039-2-quintela@redhat.com \
--to=quintela@redhat.com \
--cc=anthony.perard@citrix.com \
--cc=armbru@redhat.com \
--cc=david@redhat.com \
--cc=dgilbert@redhat.com \
--cc=eblake@redhat.com \
--cc=ehabkost@redhat.com \
--cc=kvm@vger.kernel.org \
--cc=lizhijian@cn.fujitsu.com \
--cc=marcandre.lureau@redhat.com \
--cc=mst@redhat.com \
--cc=paul@xen.org \
--cc=pbonzini@redhat.com \
--cc=peterx@redhat.com \
--cc=philmd@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=richard.henderson@linaro.org \
--cc=sstabellini@kernel.org \
--cc=xen-devel@lists.xenproject.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).