From: Aviad Yehezkel
Date: Thu, 17 May 2018 10:46:30 +0300
Subject: Re: [Qemu-devel] FW: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
To: 858585 jemmy
Cc: Aviad Yehezkel, Juan Quintela, qemu-devel, Gal Shachaf, Adi Dotan, Lidong Chen

On 5/17/2018 10:41 AM, 858585 jemmy wrote:
> On Thu, May 17, 2018 at 3:31 PM, Aviad Yehezkel
> wrote:
>>
>> On 5/17/2018 5:42 AM, 858585 jemmy wrote:
>>> On Wed, May 16, 2018 at 11:11 PM, Aviad Yehezkel
>>> wrote:
>>>> Hi Lidong and David,
>>>> Sorry for the late response; I had to ramp up on the migration code
>>>> and build a setup on my side.
>>>>
>>>> Please see below my comments for this patch.
>>>> For the RDMA post-copy patches I will comment next week, after
>>>> testing on the Mellanox side too.
>>>>
>>>> Thanks!
>>>>
>>>> On 5/16/2018 5:21 PM, Aviad Yehezkel wrote:
>>>>>
>>>>> -----Original Message-----
>>>>> From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com]
>>>>> Sent: Wednesday, May 16, 2018 4:13 PM
>>>>> To: 858585 jemmy
>>>>> Cc: Aviad Yehezkel; Juan Quintela; qemu-devel; Gal Shachaf; Adi Dotan; Lidong Chen
>>>>> Subject: Re: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
>>>>>
>>>>> * 858585 jemmy (jemmy858585@gmail.com) wrote:
>>>>>
>>>>>>>>>>> I wonder why dereg_mr takes so long - I could understand if
>>>>>>>>>>> reg_mr took a long time, but why for dereg, that sounds like the
>>>>>>>>>>> easy side.
>>>>>>>>>> I used perf to collect information when ibv_dereg_mr is invoked.
>>>>>>>>>>
>>>>>>>>>> -  9.95%  client2  [kernel.kallsyms]  [k] put_compound_page
>>>>>>>>>>    - put_compound_page
>>>>>>>>>>       - 98.45% put_page
>>>>>>>>>>            __ib_umem_release
>>>>>>>>>>            ib_umem_release
>>>>>>>>>>            dereg_mr
>>>>>>>>>>            mlx5_ib_dereg_mr
>>>>>>>>>>            ib_dereg_mr
>>>>>>>>>>            uverbs_free_mr
>>>>>>>>>>            remove_commit_idr_uobject
>>>>>>>>>>            _rdma_remove_commit_uobject
>>>>>>>>>>            rdma_remove_commit_uobject
>>>>>>>>>>            ib_uverbs_dereg_mr
>>>>>>>>>>            ib_uverbs_write
>>>>>>>>>>            vfs_write
>>>>>>>>>>            sys_write
>>>>>>>>>>            system_call_fastpath
>>>>>>>>>>            __GI___libc_write
>>>>>>>>>>            0
>>>>>>>>>>       + 1.55% __ib_umem_release
>>>>>>>>>> +  8.31%  client2  [kernel.kallsyms]  [k] compound_unlock_irqrestore
>>>>>>>>>> +  7.01%  client2  [kernel.kallsyms]  [k] page_waitqueue
>>>>>>>>>> +  7.00%  client2  [kernel.kallsyms]  [k] set_page_dirty
>>>>>>>>>> +  6.61%  client2  [kernel.kallsyms]  [k] unlock_page
>>>>>>>>>> +  6.33%  client2  [kernel.kallsyms]  [k] put_page_testzero
>>>>>>>>>> +  5.68%  client2  [kernel.kallsyms]  [k] set_page_dirty_lock
>>>>>>>>>> +  4.30%  client2  [kernel.kallsyms]  [k] __wake_up_bit
>>>>>>>>>> +  4.04%  client2  [kernel.kallsyms]  [k] free_pages_prepare
>>>>>>>>>> +  3.65%  client2  [kernel.kallsyms]  [k] release_pages
>>>>>>>>>> +  3.62%  client2  [kernel.kallsyms]  [k] arch_local_irq_save
>>>>>>>>>> +  3.35%  client2  [kernel.kallsyms]  [k] page_mapping
>>>>>>>>>> +  3.13%  client2  [kernel.kallsyms]  [k] get_pageblock_flags_group
>>>>>>>>>> +  3.09%  client2  [kernel.kallsyms]  [k] put_page
>>>>>>>>>>
>>>>>>>>>> The reason is that __ib_umem_release will loop many times for each page:
>>>>>>>>>>
>>>>>>>>>> static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
>>>>>>>>>> {
>>>>>>>>>>     struct scatterlist *sg;
>>>>>>>>>>     struct page *page;
>>>>>>>>>>     int i;
>>>>>>>>>>
>>>>>>>>>>     if (umem->nmap > 0)
>>>>>>>>>>         ib_dma_unmap_sg(dev, umem->sg_head.sgl,
>>>>>>>>>>                         umem->npages,
>>>>>>>>>>                         DMA_BIDIRECTIONAL);
>>>>>>>>>>
>>>>>>>>>>     for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {    << loop a lot of times for each page here
>>>>>>>>> Why 'lot of times for each page'? I don't know this code at all,
>>>>>>>>> but I'd expected once per page?
>>>>>>>> Sorry, once per page, but there are a lot of pages for a big
>>>>>>>> virtual machine.
>>>>>>> Ah OK; so yes it seems best if you can find a way to do the release
>>>>>>> in the migration thread then; still, maybe this is something some of
>>>>>>> the kernel people could look at speeding up?
>>>>>> The kernel code does not seem complex, but I have no idea how to
>>>>>> speed it up.
>>>>> Me neither; but I'll ask around.
>>>> I will ask internally and get back on this one.
>>>>>
>>>>>>>>> With your other kernel fix, does the problem of the missing
>>>>>>>>> RDMA_CM_EVENT_DISCONNECTED events go away?
>>>>>>>> Yes, after the kernel and qemu fixes, this issue never happens again.
>>>>>>> I'm confused; which qemu fix; my question was whether the kernel fix
>>>>>>> by itself fixed the problem of the missing event.
>>>>>> This qemu fix:
>>>>>> migration: update index field when delete or qsort RDMALocalBlock
>>>>> OK good; so then we shouldn't need this 2/2 patch.
>>>> It is good that the qemu fix solved this issue, but my colleagues and
>>>> I think we need the 2/2 patch anyway.
>>>> According to the IB spec, once the active side sends a DREQ message it
>>>> should wait for a DREP message, and only once that arrives should it
>>>> trigger a DISCONNECT event.
>>>> The DREP message can be dropped due to network issues.
>>>> For that case the spec defines a DREP_timeout state in the CM state
>>>> machine: if the DREP is dropped, we should get a timeout and a
>>>> TIMEWAIT_EXIT event will be triggered.
>>>> Unfortunately the current kernel CM implementation doesn't include the
>>>> DREP_timeout state, so in the above scenario we will not get a
>>>> DISCONNECT or TIMEWAIT_EXIT event.
>>>> (A similar scenario exists for the passive side.)
>>>> We think it is best to apply this patch until we enhance the kernel
>>>> code.
>>>>
>>> Hi Aviad:
>>> How long is the DREP_timeout?
>>> Does RDMA have a tool like tcpdump that I can use to confirm whether
>>> the DREP message was received?
>>>
>>> Thanks.
>> Are you running IB or RoCE?
> RoCE v2.
Then download the latest Wireshark; it should have RDMA RoCE support.
Capture packets on the associated Ethernet interface.
Let me know if you need help.
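For example (an illustration only, not a tested recipe: RoCE v2 is carried
over the IANA-assigned UDP port 4791, and eth0 below stands in for whatever
Ethernet interface your RDMA device is bound to):

    tcpdump -i eth0 -w roce.pcap udp port 4791

Opening roce.pcap in Wireshark should then show the dissected RoCE traffic,
including the CM DREQ/DREP exchange you want to confirm.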
>
>>>>>> This issue is also caused by some ram block not being released, but
>>>>>> I have not found the root cause.
>>>>> Hmm, we should try and track that down.
>>>>>
>>>>>>>> Do you think we should remove rdma_get_cm_event after
>>>>>>>> rdma_disconnect?
>>>>>>> I don't think so; if 'rdma_disconnect' is supposed to generate the
>>>>>>> event I think we're supposed to wait for it to know that the
>>>>>>> disconnect is really complete.
>>>>>> After moving qemu_fclose to the migration thread, it will not block
>>>>>> the main thread while waiting for the disconnection event.
>>>>> I'm not sure about moving the fclose to the migration thread; it
>>>>> worries me with the interaction with cancel and other failures.
>>>>>
>>>>> Dave
>>>>>
>>>>>>> Dave
>>>>>>>
>>>>>>>>> Dave
>>>>>>>>>
>>>>>>>>>>         page = sg_page(sg);
>>>>>>>>>>         if (umem->writable && dirty)
>>>>>>>>>>             set_page_dirty_lock(page);
>>>>>>>>>>         put_page(page);
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>>     sg_free_table(&umem->sg_head);
>>>>>>>>>>     return;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>> Dave
>>>>>>>>>>>
>>>>>>>>>>>>> Dave
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Anyway, it should not invoke rdma_get_cm_event in the main
>>>>>>>>>>>>>>>> thread, and the event channel is also destroyed in
>>>>>>>>>>>>>>>> qemu_rdma_cleanup.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Lidong Chen
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>  migration/rdma.c       | 12 ++----------
>>>>>>>>>>>>>>>>  migration/trace-events |  1 -
>>>>>>>>>>>>>>>>  2 files changed, 2 insertions(+), 11 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/migration/rdma.c b/migration/rdma.c
>>>>>>>>>>>>>>>> index 0dd4033..92e4d30 100644
>>>>>>>>>>>>>>>> --- a/migration/rdma.c
>>>>>>>>>>>>>>>> +++ b/migration/rdma.c
>>>>>>>>>>>>>>>> @@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  static void qemu_rdma_cleanup(RDMAContext *rdma)
>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>> -    struct rdma_cm_event *cm_event;
>>>>>>>>>>>>>>>> -    int ret, idx;
>>>>>>>>>>>>>>>> +    int idx;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>      if (rdma->cm_id && rdma->connected) {
>>>>>>>>>>>>>>>>          if ((rdma->error_state ||
>>>>>>>>>>>>>>>> @@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
>>>>>>>>>>>>>>>>              qemu_rdma_post_send_control(rdma, NULL, &head);
>>>>>>>>>>>>>>>>          }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -        ret = rdma_disconnect(rdma->cm_id);
>>>>>>>>>>>>>>>> -        if (!ret) {
>>>>>>>>>>>>>>>> -            trace_qemu_rdma_cleanup_waiting_for_disconnect();
>>>>>>>>>>>>>>>> -            ret = rdma_get_cm_event(rdma->channel, &cm_event);
>>>>>>>>>>>>>>>> -            if (!ret) {
>>>>>>>>>>>>>>>> -                rdma_ack_cm_event(cm_event);
>>>>>>>>>>>>>>>> -            }
>>>>>>>>>>>>>>>> -        }
>>>>>>>>>>>>>>>> +        rdma_disconnect(rdma->cm_id);
>>>>>>>>>>>>>>> I'm worried whether this change could break stuff:
>>>>>>>>>>>>>>> The docs say for rdma_disconnect that it flushes any posted
>>>>>>>>>>>>>>> work requests to the completion queue; so unless we wait for
>>>>>>>>>>>>>>> the event do we know the stuff has been flushed? In the
>>>>>>>>>>>>>>> normal non-cancel case I'm worried that means we could lose
>>>>>>>>>>>>>>> something.
>>>>>>>>>>>>>>> (But I don't know the rdma/infiniband specs well enough to
>>>>>>>>>>>>>>> know if it's really a problem.)
>>>>>>>>>>>>>> The qemu_fclose function invokes qemu_fflush(f) before
>>>>>>>>>>>>>> invoking f->ops->close, so I think it's safe for the normal
>>>>>>>>>>>>>> migration case.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As for the root cause of why the RDMA_CM_EVENT_DISCONNECTED
>>>>>>>>>>>>>> event is not received after rdma_disconnect, I have looped in
>>>>>>>>>>>>>> Aviad Yehezkel; Aviad will help us find the root cause.
>>>>
>>>> Regarding the previous discussion on that thread:
>>>> rdma_disconnect can be invoked even without invoking ibv_dereg_mr
>>>> first.
>>>>
>>>> The common teardown flow is:
>>>>
>>>> rdma_disconnect
>>>>     <-- here resources like QPs are still alive and can DMA to RAM
>>>> rdma_destroy_id
>>>>     <-- all CM resources are destroyed, but MRs can still be active
>>>> rdma_dereg_mr
>>>>
>>>>>>>>>>>>>>> Dave
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>          trace_qemu_rdma_cleanup_disconnect();
>>>>>>>>>>>>>>>>          rdma->connected = false;
>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>> diff --git a/migration/trace-events b/migration/trace-events
>>>>>>>>>>>>>>>> index d6be74b..64573ff 100644
>>>>>>>>>>>>>>>> --- a/migration/trace-events
>>>>>>>>>>>>>>>> +++ b/migration/trace-events
>>>>>>>>>>>>>>>> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
>>>>>>>>>>>>>>>>  qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
>>>>>>>>>>>>>>>>  qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
>>>>>>>>>>>>>>>>  qemu_rdma_cleanup_disconnect(void) ""
>>>>>>>>>>>>>>>> -qemu_rdma_cleanup_waiting_for_disconnect(void) ""
>>>>>>>>>>>>>>>>  qemu_rdma_close(void) ""
>>>>>>>>>>>>>>>>  qemu_rdma_connect_pin_all_requested(void) ""
>>>>>>>>>>>>>>>>  qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> 1.8.3.1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
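For reference, here is a minimal sketch of the kind of bounded wait discussed
above: after rdma_disconnect(), poll the CM event channel with a timeout
instead of blocking in rdma_get_cm_event(), so that a lost DREP (and hence a
missing DISCONNECTED event) cannot hang the caller. This is not the posted
patch; the helper name, the timeout handling, and the error returns are
invented for illustration:

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <rdma/rdma_cma.h>

/* Illustrative only: disconnect, then wait a bounded time for
 * RDMA_CM_EVENT_DISCONNECTED.  "cm_id" and "channel" are the
 * connection's cm_id and event channel; timeout_ms bounds each
 * poll() so a lost DREP cannot block us forever. */
static int disconnect_with_timeout(struct rdma_cm_id *cm_id,
                                   struct rdma_event_channel *channel,
                                   int timeout_ms)
{
    struct pollfd pfd = { .fd = channel->fd, .events = POLLIN };
    struct rdma_cm_event *event;
    int ret;

    ret = rdma_disconnect(cm_id);
    if (ret) {
        return ret;
    }

    /* Make rdma_get_cm_event() non-blocking; poll() provides the
     * bounded wait instead. */
    fcntl(channel->fd, F_SETFL, fcntl(channel->fd, F_GETFL) | O_NONBLOCK);

    for (;;) {
        ret = poll(&pfd, 1, timeout_ms);
        if (ret <= 0) {
            return -1;              /* timed out (or poll failed): give up */
        }
        if (rdma_get_cm_event(channel, &event)) {
            if (errno == EAGAIN) {
                continue;           /* woken without a complete event */
            }
            return -1;
        }
        if (event->event == RDMA_CM_EVENT_DISCONNECTED) {
            rdma_ack_cm_event(event);
            return 0;               /* disconnect confirmed */
        }
        rdma_ack_cm_event(event);   /* unrelated event: keep waiting */
    }
}

In the common teardown flow above, such a wait would replace the bare
rdma_disconnect step, before rdma_destroy_id and rdma_dereg_mr.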