Re: [PATCH v5 00/10] KVM: Dirty ring support (QEMU part)

From: Keqian Zhu <zhukeqian1@huawei.com>
To: Peter Xu <peterx@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Hyman <huangy81@chinatelecom.cn>,
	qemu-devel@nongnu.org,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>
Subject: Re: [PATCH v5 00/10] KVM: Dirty ring support (QEMU part)
Date: Thu, 25 Mar 2021 09:21:53 +0800	[thread overview]
Message-ID: <25da9dde-bd02-b557-66ed-06e4422c5634@huawei.com> (raw)
In-Reply-To: <20210324150943.GB219069@xz-x1>

Peter,

On 2021/3/24 23:09, Peter Xu wrote:
> On Wed, Mar 24, 2021 at 10:56:22AM +0800, Keqian Zhu wrote:
>> Hi Peter,
>>
>> On 2021/3/23 22:34, Peter Xu wrote:
>>> Keqian,
>>>
>>> On Tue, Mar 23, 2021 at 02:40:43PM +0800, Keqian Zhu wrote:
>>>>>> The second question is that you observed longer migration time (55s->73s) when guest
>>>>>> has 24G ram and dirty rate is 800M/s. I am not clear about the reason. As with dirty
>>>>>> ring enabled, Qemu can get dirty info faster which means it handles dirty page more
>>>>>> quick, and guest can be throttled which means dirty page is generated slower. What's
>>>>>> the rationale for the longer migration time?
>>>>>
>>>>> Because dirty ring is more sensitive to dirty rate, while dirty bitmap is more
>>>> Emm... Sorry that I'm very clear about this... I think that higher dirty rate doesn't cause
>>>> slower dirty_log_sync compared to that of legacy bitmap mode. Besides, higher dirty rate
>>>> means we may have more full-exit, which can properly limit the dirty rate. So it seems that
>>>> dirty ring "prefers" higher dirty rate.
>>>
>>> When I measured the 800MB/s it's in the guest, after throttling.
>>>
>>> Imagine another example: a VM has 1G memory keep dirtying with 10GB/s.  Dirty
>>> logging will need to collect even less for each iteration because memory size
>>> shrinked, collect even less frequent due to the high dirty rate, however dirty
>>> ring will use 100% cpu power to collect dirty pages because the ring keeps full.
>> Looks good.
>>
>> We have many places to collect dirty pages: the background reaper, vCPU exit handler,
>> and the migration thread. I think migration time is closely related to the migration thread.
>>
>> The migration thread calls kvm_dirty_ring_flush().
>> 1. kvm_cpu_synchronize_kick_all() will wait vcpu handles full-exit.
>> 2. kvm_dirty_ring_reap() collects and resets dirty pages.
>> The above two operation will spend more time with higher dirty rate.
>>
>> But I suddenly realize that the key problem maybe not at this. Though we have separate
>> "reset" operation for dirty ring, actually it is performed right after we collect dirty
>> ring to kvmslot. So in dirty ring mode, it likes legacy bitmap mode without manual_dirty_clear.
>>
>> If we can "reset" dirty ring just before we really handle the dirty pages, we can have
>> shorter migration time. But the design of dirty ring doesn't allow this, because we must
>> perform reset to make free space...
> 
> This is a very good point.
> 
> Dirty ring should have been better in quite some ways already, but from that
> pov as you said it goes a bit backwards on reprotection of pages (not to
> mention currently we can't even reset the ring per-vcpu; that seems to be not
> fully matching the full locality that the rings have provided as well; but
> Paolo and I discussed with that issue, it's about TLB flush expensiveness, so
> we still need to think more of it..).
> 
> Ideally the ring could have been both per-vcpu but also bi-directional (then
> we'll have 2*N rings, N=vcpu number), so as to split the state transition into
> "dirty ring" and "reprotect ring", then that reprotect ring will be the clear
> dirty log.  That'll look more like virtio as used ring.  However we'll still
> need to think about the TLB flush issue too as Paolo used to mention, as
> that'll exist too with any per-vcpu flush model (each reprotect of page will
> need a tlb flush of all vcpus).
> 
> Or.. maybe we can make the flush ring a standalone one, so we need N dirty ring
> and one global flush ring.
Yep, have separate "reprotect" ring(s) is a good idea.

> 
> Anyway.. Before that, I'd still think the next step should be how to integrate
> qemu to fully leverage current ring model, so as to be able to throttle in
> per-vcpu fashion.
> 
> The major issue (IMHO) with huge VM migration is:
> 
>   1. Convergence
>   2. Responsiveness
> 
> Here we'll have a chance to solve (1) by highly throttle the working vcpu
> threads, meanwhile still keep (2) by not throttle user interactive threads.
> I'm not sure whether this will really work as expected, but just show what I'm
> thinking about it.  These may not matter a lot yet with further improving ring
> reset mechanism, which definitely sounds even better, but seems orthogonal.
> 
> That's also why I think we should still merge this series first as a fundation
> for the rest.
I see.

> 
>>
>>>
>>>>
>>>>> sensitive to memory footprint.  In above 24G mem + 800MB/s dirty rate
>>>>> condition, dirty bitmap seems to be more efficient, say, collecting dirty
>>>>> bitmap of 24G mem (24G/4K/8=0.75MB) for each migration cycle is fast enough.
>>>>>
>>>>> Not to mention that current implementation of dirty ring in QEMU is not
>>>>> complete - we still have two more layers of dirty bitmap, so it's actually a
>>>>> mixture of dirty bitmap and dirty ring.  This series is more like a POC on
>>>>> dirty ring interface, so as to let QEMU be able to run on KVM dirty ring.
>>>>> E.g., we won't have hang issue when getting dirty pages since it's totally
>>>>> async, however we'll still have some legacy dirty bitmap issues e.g. memory
>>>>> consumption of userspace dirty bitmaps are still linear to memory footprint.
>>>> The plan looks good and coordinated, but I have a concern. Our dirty ring actually depends
>>>> on the structure of hardware logging buffer (PML buffer). We can't say it can be properly
>>>> adapted to all kinds of hardware design in the future.
>>>
>>> Sorry I don't get it - dirty ring can work with pure page wr-protect too?
>> Sure, it can. I just want to discuss many possible kinds of hardware logging buffer.
>> However, I'd like to stop at this, at least dirty ring works well with PML. :)
> 
> I see your point.  That'll be a good topic at least when we'd like to port
> dirty ring to other archs for sure.  However as you see I hoped we can start to
> use dirty ring first, find issues, fix it, even redesign some of it, make it
> really beneficial at least on one arch, then it'll make more sense to port it,
> or attract people porting it. :)
> 
> QEMU does not yet have a good solution for huge vm migration yet.  Maybe dirty
> ring is a good start for it, maybe not (e.g., with uffd minor mode postcopy has
> the other chance).  We'll see...
OK.

Thanks,
Keqian