Subject: Re: [PATCH v5 00/10] KVM: Dirty ring support (QEMU part)
From: Keqian Zhu
To: Peter Xu
Cc: Paolo Bonzini, Hyman, qemu-devel@nongnu.org, "Dr. David Alan Gilbert"
Date: Thu, 25 Mar 2021 09:21:53 +0800
Message-ID: <25da9dde-bd02-b557-66ed-06e4422c5634@huawei.com>
In-Reply-To: <20210324150943.GB219069@xz-x1>
References: <20210310203301.194842-1-peterx@redhat.com> <2e057323-8102-7bfc-051b-cd3950c93875@huawei.com> <20210322194533.GE16645@xz-x1> <20210323143429.GB6486@xz-x1> <5da1dd71-58e9-6579-c7c1-6cb60baf7ac1@huawei.com> <20210324150943.GB219069@xz-x1>
Peter,

On 2021/3/24 23:09, Peter Xu wrote:
> On Wed, Mar 24, 2021 at 10:56:22AM +0800, Keqian Zhu wrote:
>> Hi Peter,
>>
>> On 2021/3/23 22:34, Peter Xu wrote:
>>> Keqian,
>>>
>>> On Tue, Mar 23, 2021 at 02:40:43PM +0800, Keqian Zhu wrote:
>>>>>> The second question is that you observed a longer migration time (55s->73s) when the
>>>>>> guest has 24G RAM and the dirty rate is 800M/s. I am not clear about the reason. With
>>>>>> dirty ring enabled, QEMU can get dirty info faster, which means it handles dirty pages
>>>>>> more quickly, and the guest can be throttled, which means dirty pages are generated
>>>>>> more slowly. What's the rationale for the longer migration time?
>>>>>
>>>>> Because dirty ring is more sensitive to dirty rate, while dirty bitmap is more
>>>> Emm... sorry, I'm not very clear about this... I think a higher dirty rate doesn't cause
>>>> slower dirty_log_sync compared to legacy bitmap mode. Besides, a higher dirty rate means
>>>> we may have more full-exits, which can properly limit the dirty rate. So it seems that
>>>> dirty ring "prefers" a higher dirty rate.
>>>
>>> When I measured the 800MB/s it was in the guest, after throttling.
>>>
>>> Imagine another example: a VM with 1G memory keeps dirtying at 10GB/s. Dirty
>>> logging will need to collect even less for each iteration because the memory size
>>> shrank, and collect even less frequently due to the high dirty rate; however, dirty
>>> ring will use 100% CPU power to collect dirty pages because the ring stays full.
>> Looks good.
>>
>> We have many places to collect dirty pages: the background reaper, the vCPU exit handler,
>> and the migration thread. I think migration time is closely related to the migration thread.
>>
>> The migration thread calls kvm_dirty_ring_flush().
>> 1. kvm_cpu_synchronize_kick_all() waits for vCPUs to handle the full-exit.
>> 2. kvm_dirty_ring_reap() collects and resets dirty pages.
>> The above two operations take more time with a higher dirty rate.
>>
>> But I suddenly realize that the key problem may not be this. Though we have a separate
>> "reset" operation for the dirty ring, it is actually performed right after we collect the
>> dirty ring into the kvmslot. So dirty ring mode is like legacy bitmap mode without
>> manual_dirty_clear.
>>
>> If we could "reset" the dirty ring just before we really handle the dirty pages, we could
>> have a shorter migration time. But the design of the dirty ring doesn't allow this, because
>> we must perform the reset to make free space...
>
> This is a very good point.
>
> Dirty ring should have been better in quite some ways already, but from that
> PoV, as you said, it goes a bit backwards on reprotection of pages (not to
> mention we currently can't even reset the ring per-vcpu; that also doesn't
> fully match the locality that the rings provide; but Paolo and I discussed
> that issue, it's about TLB flush expensiveness, so we still need to think
> more about it..).
>
> Ideally the ring could have been both per-vcpu and bi-directional (then
> we'd have 2*N rings, N = vcpu number), so as to split the state transition into
> a "dirty ring" and a "reprotect ring"; that reprotect ring would then be the
> clear-dirty log. That'd look more like virtio's used ring. However, we'd still
> need to think about the TLB flush issue Paolo mentioned, as it exists with any
> per-vcpu flush model (each reprotect of a page needs a TLB flush of all vcpus).
>
> Or.. maybe we can make the flush ring a standalone one, so we'd need N dirty
> rings and one global flush ring.
Yep, having separate "reprotect" ring(s) is a good idea.
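
Just to make sure we are picturing the same thing, here is a rough sketch of the
"N per-vcpu dirty rings plus one global reprotect ring" layout. All names and
sizes below are made up for illustration; this is not the existing KVM dirty
ring ABI, only a model of the idea:

#include <stdint.h>

#define DIRTY_RING_SIZE     4096  /* example sizes only, not ABI */
#define REPROTECT_RING_SIZE 4096

struct dirty_gfn {
    uint32_t flags;   /* collected / reset state           */
    uint32_t slot;    /* memslot id                        */
    uint64_t offset;  /* page offset within the memslot    */
};

/* N of these, one per vcpu: the vcpu produces dirtied GFNs,
 * userspace (e.g. the migration thread) consumes them. */
struct dirty_ring {
    uint32_t head;    /* consumer (userspace) index        */
    uint32_t tail;    /* producer (vcpu) index             */
    struct dirty_gfn gfns[DIRTY_RING_SIZE];
};

/* One global "reprotect" ring: userspace pushes GFNs it has already
 * sent, the kernel write-protects them again and can batch the TLB
 * flush across all vcpus (the expensive part mentioned above). */
struct reprotect_ring {
    uint32_t head;    /* consumer (kernel) index           */
    uint32_t tail;    /* producer (userspace) index        */
    struct dirty_gfn gfns[REPROTECT_RING_SIZE];
};

With such a split, reprotecting a page no longer has to happen at collection
time, which is the ordering problem I mentioned above.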
>
> Anyway.. Before that, I'd still think the next step should be how to integrate
> QEMU to fully leverage the current ring model, so as to be able to throttle in
> a per-vcpu fashion.
>
> The major issues (IMHO) with huge VM migration are:
>
> 1. Convergence
> 2. Responsiveness
>
> Here we'll have a chance to solve (1) by heavily throttling the working vcpu
> threads, while still keeping (2) by not throttling user-interactive threads.
> I'm not sure whether this will really work as expected, but it shows what I'm
> thinking about. These may not matter a lot once the ring reset mechanism is
> further improved, which definitely sounds even better, but seems orthogonal.
>
> That's also why I think we should still merge this series first as a foundation
> for the rest.
I see.
>
>>
>>>
>>>>
>>>>> sensitive to memory footprint. In the above 24G mem + 800MB/s dirty rate
>>>>> condition, dirty bitmap seems to be more efficient; say, collecting the dirty
>>>>> bitmap of 24G mem (24G/4K/8 = 0.75MB) for each migration cycle is fast enough.
>>>>>
>>>>> Not to mention that the current implementation of dirty ring in QEMU is not
>>>>> complete - we still have two more layers of dirty bitmap, so it's actually a
>>>>> mixture of dirty bitmap and dirty ring. This series is more like a POC on
>>>>> the dirty ring interface, so as to let QEMU be able to run on KVM dirty ring.
>>>>> E.g., we won't have the hang issue when getting dirty pages since it's totally
>>>>> async; however, we'll still have some legacy dirty bitmap issues, e.g. the
>>>>> memory consumption of userspace dirty bitmaps is still linear in memory footprint.
>>>> The plan looks good and coordinated, but I have a concern. Our dirty ring actually
>>>> depends on the structure of the hardware logging buffer (PML buffer). We can't say it
>>>> can be properly adapted to all kinds of hardware designs in the future.
>>>
>>> Sorry, I don't get it - dirty ring can work with pure page wr-protect too?
>> Sure, it can. I just wanted to discuss the many possible kinds of hardware logging buffer.
>> However, I'd like to stop here; at least dirty ring works well with PML. :)
>
> I see your point. That'll be a good topic at least when we'd like to port
> dirty ring to other archs for sure. However, as you see, I hoped we could start to
> use dirty ring first, find issues, fix them, even redesign some of it, make it
> really beneficial at least on one arch; then it'll make more sense to port it,
> or attract people to port it. :)
>
> QEMU does not have a good solution for huge VM migration yet. Maybe dirty
> ring is a good start for it, maybe not (e.g., with uffd minor mode, postcopy has
> another chance). We'll see...
OK.

Thanks,
Keqian
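
PS: the bitmap-size figure quoted above (24G/4K/8 = 0.75MB) checks out. A
trivial snippet to spell out the arithmetic, assuming 4K pages and one bit
per page (just a worked example, not code from this series):

#include <stdio.h>

int main(void)
{
    unsigned long long mem_bytes   = 24ULL << 30;   /* 24 GiB of guest RAM */
    unsigned long long page_size   = 4096;          /* 4K pages            */
    unsigned long long pages       = mem_bytes / page_size;
    unsigned long long bitmap_size = pages / 8;     /* one bit per page    */

    /* Prints: 6291456 pages -> 786432 bytes (0.75 MiB) per full bitmap sync. */
    printf("%llu pages -> %llu bytes (%.2f MiB)\n",
           pages, bitmap_size, bitmap_size / (1024.0 * 1024.0));
    return 0;
}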