KVM ARM Archive on lore.kernel.org
 help / color / Atom feed
From: zhukeqian <zhukeqian1@huawei.com>
To: Marc Zyngier <maz@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>, kvmarm@lists.cs.columbia.edu
Subject: Re: [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM
Date: Mon, 13 Jul 2020 10:47:25 +0800
Message-ID: <4eee3e4c-db73-c4ce-ca3d-d665ee87d66a@huawei.com> (raw)
In-Reply-To: <015847afd67e8bd4f8a158b604854838@kernel.org>

Hi Marc,

Sorry for the delay reply.

On 2020/7/6 15:54, Marc Zyngier wrote:
> Hi Keqian,
> 
> On 2020-07-06 02:28, zhukeqian wrote:
>> Hi Catalin and Marc,
>>
>> On 2020/7/2 21:55, Keqian Zhu wrote:
>>> This patch series add support for dirty log based on HW DBM.
>>>
>>> It works well under some migration test cases, including VM with 4K
>>> pages or 2M THP. I checked the SHA256 hash digest of all memory and
>>> they keep same for source VM and destination VM, which means no dirty
>>> pages is missed under hardware DBM.
>>>
>>> Some key points:
>>>
>>> 1. Only support hardware updates of dirty status for PTEs. PMDs and PUDs
>>>    are not involved for now.
>>>
>>> 2. About *performance*: In RFC patch, I have mentioned that for every 64GB
>>>    memory, KVM consumes about 40ms to scan all PTEs to collect dirty log.
>>>    This patch solves this problem through two ways: HW/SW dynamic switch
>>>    and Multi-core offload.
>>>
>>>    HW/SW dynamic switch: Give userspace right to enable/disable hw dirty
>>>    log. This adds a new KVM cap named KVM_CAP_ARM_HW_DIRTY_LOG. We can
>>>    achieve this by change the kvm->arch.vtcr value and kick vCPUs out to
>>>    reload this value to VCTR_EL2. Then userspace can enable hw dirty log
>>>    at the begining and disable it when dirty pages is little and about to
>>>    stop VM, so VM downtime is not affected.
>>>
>>>    Multi-core offload: Offload the PT scanning workload to multi-core can
>>>    greatly reduce scanning time. To promise we can complete in time, I use
>>>    smp_call_fuction to realize this policy, which utilize IPI to dispatch
>>>    workload to other CPUs. Under 128U Kunpeng 920 platform, it just takes
>>>    about 5ms to scan PTs of 256 RAM (use mempress and almost all PTs have
>>>    been established). And We dispatch workload iterately (every CPU just
>>>    scan PTs of 512M RAM for each iteration), so it won't affect physical
>>>    CPUs seriously.
>>
>> What do you think of these two methods to solve high-cost PTs scaning? Maybe
>> you are waiting for PML like feature on ARM :-) , but for my test, DBM is usable
>> after these two methods applied.
> 
> Useable, maybe. But leaving to userspace the decision to switch from one
> mode to another isn't an acceptable outcome. Userspace doesn't need nor
> want to know about this.
> 
OK, maybe this is worth discussing. The switch logic can be encapsulated into Qemu
and can not be seen from VM users. Well, I think it maybe acceptable. :)

> Another thing is that sending IPIs all over to trigger scanning may
> work well on a system that runs a limited number of guests (or some
> other userspace, actually), but I seriously doubt that it is impact
> free once you start doing this on an otherwise loaded system.
> 
Yes, it is not suitable to send IPIs to all other physical CPUs. Currently I just
want to show you my idea and to prove it is effective. In real cloud product, we
have resource isolation mechanism, so we will have a bit worse result (compared to 5ms)
but we won't effect other VMs.

> You may have better results by having an alternative mapping of your
> S2 page tables so that they are accessible linearly, which would
> sidestep the PT parsing altogether, probably saving some cycles. But
Yeah, this is a good idea. But for my understanding, to make them linear, we have to preserve
enough physical memory at VM start (may waste much memory), and the effect of this optimization
*maybe* not obvious.

> this is still a marginal gain compared to the overall overhead of
> scanning 4kB of memory per 2MB of guest RAM, as opposed to 64 *bytes*
> per 2MB (assuming strict 4kB mappings at S2, no block mappings).
> 
I ever tested scanning PTs by reading only one byte of each PTE and the test result keeps same.
So, when we scan PTs using just one core, the bottle-neck is CPU speed, instead of memory bandwidth.

> Finally, this doesn't work with pages dirtied from DMA, which is the
> biggest problem. If you cannot track pages that are dirtied behind your
> back, what is the purpose of scanning the dirty bits?
> 
> As for a PML-like feature, this would only be useful if the SMMU
> architecture took part in it and provided consistent logging of
> the dirtied pages in the IPA space. Only having it at the CPU level
> would be making the exact same mistake.
Even SMMU is equipped with PML like feature, we still rely on device suspend to
avoid omitting dirty pages, so I think the only advantage of PML is reducing dirty
log sync time compared to multi-core offload. Maybe I missing something?

Thanks,
Keqian
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

  reply index

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-02 13:55 Keqian Zhu
2020-07-02 13:55 ` [PATCH v2 1/8] KVM: arm64: Set DBM bit for writable PTEs Keqian Zhu
2020-07-02 13:55 ` [PATCH v2 2/8] KVM: arm64: Scan PTEs to sync dirty log Keqian Zhu
2020-07-02 13:55 ` [PATCH v2 3/8] KVM: arm64: Modify stage2 young mechanism to support hw DBM Keqian Zhu
2020-07-02 13:55 ` [PATCH v2 4/8] KVM: arm64: Save stage2 PTE dirty status if it is covered Keqian Zhu
2020-07-02 13:55 ` [PATCH v2 5/8] KVM: arm64: Steply write protect page table by mask bit Keqian Zhu
2020-07-02 13:55 ` [PATCH v2 6/8] KVM: arm64: Add KVM_CAP_ARM_HW_DIRTY_LOG capability Keqian Zhu
2020-07-06  1:08   ` zhukeqian
2020-07-02 13:55 ` [PATCH v2 7/8] KVM: arm64: Sync dirty log parallel Keqian Zhu
2020-07-02 13:55 ` [PATCH v2 8/8] KVM: Omit dirty log sync in log clear if initially all set Keqian Zhu
2020-07-06  1:28 ` [PATCH v2 0/8] KVM: arm64: Support HW dirty log based on DBM zhukeqian
2020-07-06  7:54   ` Marc Zyngier
2020-07-13  2:47     ` zhukeqian [this message]
2020-07-13 14:53       ` Marc Zyngier
2020-07-28  2:11         ` zhukeqian
2020-07-28  7:52           ` Marc Zyngier
2020-07-28  8:32             ` zhukeqian

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4eee3e4c-db73-c4ce-ca3d-d665ee87d66a@huawei.com \
    --to=zhukeqian1@huawei.com \
    --cc=catalin.marinas@arm.com \
    --cc=kvmarm@lists.cs.columbia.edu \
    --cc=maz@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

KVM ARM Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/kvmarm/0 kvmarm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 kvmarm kvmarm/ https://lore.kernel.org/kvmarm \
		kvmarm@lists.cs.columbia.edu
	public-inbox-index kvmarm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/edu.columbia.cs.lists.kvmarm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git