Re: [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM

From: zhukeqian <zhukeqian1@huawei.com>
To: Marc Zyngier <maz@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>, kvmarm@lists.cs.columbia.edu
Subject: Re: [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM
Date: Mon, 13 Jul 2020 10:47:25 +0800	[thread overview]
Message-ID: <4eee3e4c-db73-c4ce-ca3d-d665ee87d66a@huawei.com> (raw)
In-Reply-To: <015847afd67e8bd4f8a158b604854838@kernel.org>

Hi Marc,

Sorry for the delay reply.

On 2020/7/6 15:54, Marc Zyngier wrote:
> Hi Keqian,
> 
> On 2020-07-06 02:28, zhukeqian wrote:
>> Hi Catalin and Marc,
>>
>> On 2020/7/2 21:55, Keqian Zhu wrote:
>>> This patch series add support for dirty log based on HW DBM.
>>>
>>> It works well under some migration test cases, including VM with 4K
>>> pages or 2M THP. I checked the SHA256 hash digest of all memory and
>>> they keep same for source VM and destination VM, which means no dirty
>>> pages is missed under hardware DBM.
>>>
>>> Some key points:
>>>
>>> 1. Only support hardware updates of dirty status for PTEs. PMDs and PUDs
>>>    are not involved for now.
>>>
>>> 2. About *performance*: In RFC patch, I have mentioned that for every 64GB
>>>    memory, KVM consumes about 40ms to scan all PTEs to collect dirty log.
>>>    This patch solves this problem through two ways: HW/SW dynamic switch
>>>    and Multi-core offload.
>>>
>>>    HW/SW dynamic switch: Give userspace right to enable/disable hw dirty
>>>    log. This adds a new KVM cap named KVM_CAP_ARM_HW_DIRTY_LOG. We can
>>>    achieve this by change the kvm->arch.vtcr value and kick vCPUs out to
>>>    reload this value to VCTR_EL2. Then userspace can enable hw dirty log
>>>    at the begining and disable it when dirty pages is little and about to
>>>    stop VM, so VM downtime is not affected.
>>>
>>>    Multi-core offload: Offload the PT scanning workload to multi-core can
>>>    greatly reduce scanning time. To promise we can complete in time, I use
>>>    smp_call_fuction to realize this policy, which utilize IPI to dispatch
>>>    workload to other CPUs. Under 128U Kunpeng 920 platform, it just takes
>>>    about 5ms to scan PTs of 256 RAM (use mempress and almost all PTs have
>>>    been established). And We dispatch workload iterately (every CPU just
>>>    scan PTs of 512M RAM for each iteration), so it won't affect physical
>>>    CPUs seriously.
>>
>> What do you think of these two methods to solve high-cost PTs scaning? Maybe
>> you are waiting for PML like feature on ARM :-) , but for my test, DBM is usable
>> after these two methods applied.
> 
> Useable, maybe. But leaving to userspace the decision to switch from one
> mode to another isn't an acceptable outcome. Userspace doesn't need nor
> want to know about this.
> 
OK, maybe this is worth discussing. The switch logic can be encapsulated into Qemu
and can not be seen from VM users. Well, I think it maybe acceptable. :)

> Another thing is that sending IPIs all over to trigger scanning may
> work well on a system that runs a limited number of guests (or some
> other userspace, actually), but I seriously doubt that it is impact
> free once you start doing this on an otherwise loaded system.
> 
Yes, it is not suitable to send IPIs to all other physical CPUs. Currently I just
want to show you my idea and to prove it is effective. In real cloud product, we
have resource isolation mechanism, so we will have a bit worse result (compared to 5ms)
but we won't effect other VMs.

> You may have better results by having an alternative mapping of your
> S2 page tables so that they are accessible linearly, which would
> sidestep the PT parsing altogether, probably saving some cycles. But
Yeah, this is a good idea. But for my understanding, to make them linear, we have to preserve
enough physical memory at VM start (may waste much memory), and the effect of this optimization
*maybe* not obvious.

> this is still a marginal gain compared to the overall overhead of
> scanning 4kB of memory per 2MB of guest RAM, as opposed to 64 *bytes*
> per 2MB (assuming strict 4kB mappings at S2, no block mappings).
> 
I ever tested scanning PTs by reading only one byte of each PTE and the test result keeps same.
So, when we scan PTs using just one core, the bottle-neck is CPU speed, instead of memory bandwidth.

> Finally, this doesn't work with pages dirtied from DMA, which is the
> biggest problem. If you cannot track pages that are dirtied behind your
> back, what is the purpose of scanning the dirty bits?
> 
> As for a PML-like feature, this would only be useful if the SMMU
> architecture took part in it and provided consistent logging of
> the dirtied pages in the IPA space. Only having it at the CPU level
> would be making the exact same mistake.
Even SMMU is equipped with PML like feature, we still rely on device suspend to
avoid omitting dirty pages, so I think the only advantage of PML is reducing dirty
log sync time compared to multi-core offload. Maybe I missing something?

Thanks,
Keqian
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm