From: Marc Zyngier <maz@kernel.org>
To: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: james.morse@arm.com, suzuki.poulose@arm.com,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.cs.columbia.edu,
	will@kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v4 04/39] KVM: arm64: Defer CMOs for locked memslots until a VCPU is run
Date: Sun, 17 Oct 2021 12:43:29 +0100
Message-ID: <871r4jq3ku.wl-maz@kernel.org>
In-Reply-To: <20210825161815.266051-5-alexandru.elisei@arm.com>
References: <20210825161815.266051-1-alexandru.elisei@arm.com>
	<20210825161815.266051-5-alexandru.elisei@arm.com>

On Wed, 25 Aug 2021 17:17:40 +0100,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> KVM relies on doing dcache maintenance on stage 2 faults to present to a
> guest running with the MMU off the same view of memory as userspace. For
> locked memslots, KVM so far has done the dcache maintenance when a memslot
> is locked, but that leaves KVM in a rather awkward position: what userspace
> writes to guest memory after the memslot is locked, but before a VCPU is
> run, might not be visible to the guest.
> 
> Fix this by deferring the dcache maintenance until the first VCPU is run.
> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  arch/arm64/include/asm/kvm_host.h |  7 ++++
>  arch/arm64/include/asm/kvm_mmu.h  |  5 +++
>  arch/arm64/kvm/arm.c              |  3 ++
>  arch/arm64/kvm/mmu.c              | 56 ++++++++++++++++++++++++++++---
>  4 files changed, 67 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 97ff3ed5d4b7..ed67f914d169 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -112,6 +112,10 @@ struct kvm_arch_memory_slot {
>  	u32 flags;
>  };
>  
> +/* kvm->arch.mmu_pending_ops flags */
> +#define KVM_LOCKED_MEMSLOT_FLUSH_DCACHE	0
> +#define KVM_MAX_MMU_PENDING_OPS		1
> +
>  struct kvm_arch {
>  	struct kvm_s2_mmu mmu;
>  
> @@ -135,6 +139,9 @@ struct kvm_arch {
>  	 */
>  	bool return_nisv_io_abort_to_user;
>  
> +	/* Defer MMU operations until a VCPU is run. */
> +	unsigned long mmu_pending_ops;

This has a funny taste of VM-wide requests...
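Purely to illustrate the flavour I mean, not a request to adopt it as
is -- none of the helpers below exist, the names are invented, and the
state would still live in the same mmu_pending_ops word:

/*
 * Invented names, for illustration only: a VM-scoped cousin of the
 * per-vcpu request machinery, built on the usual atomic bitops.
 */
#define KVM_VM_REQ_FLUSH_LOCKED_MEMSLOTS	0

static inline void kvm_make_vm_request(struct kvm *kvm, unsigned int req)
{
	set_bit(req, &kvm->arch.mmu_pending_ops);
}

static inline bool kvm_test_and_clear_vm_request(struct kvm *kvm,
						 unsigned int req)
{
	return test_and_clear_bit(req, &kvm->arch.mmu_pending_ops);
}

The producer (lock_memslot()) would make the request, and the consumer
would simply test-and-clear it.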
> +
>  	/*
>  	 * VM-wide PMU filter, implemented as a bitmap and big enough for
>  	 * up to 2^10 events (ARMv8.0) or 2^16 events (ARMv8.1+).
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index ef079b5eb475..525c223e769f 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -219,6 +219,11 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled);
>  int kvm_mmu_lock_memslot(struct kvm *kvm, u64 slot, u64 flags);
>  int kvm_mmu_unlock_memslot(struct kvm *kvm, u64 slot, u64 flags);
>  
> +#define kvm_mmu_has_pending_ops(kvm)	\
> +	(!bitmap_empty(&(kvm)->arch.mmu_pending_ops, KVM_MAX_MMU_PENDING_OPS))
> +
> +void kvm_mmu_perform_pending_ops(struct kvm *kvm);
> +
>  static inline unsigned int kvm_get_vmid_bits(void)
>  {
>  	int reg = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index efb3501c6016..144c982912d8 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -829,6 +829,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  	if (unlikely(!kvm_vcpu_initialized(vcpu)))
>  		return -ENOEXEC;
>  
> +	if (unlikely(kvm_mmu_has_pending_ops(vcpu->kvm)))
> +		kvm_mmu_perform_pending_ops(vcpu->kvm);
> +
>  	ret = kvm_vcpu_first_run_init(vcpu);

Is there any reason why this isn't done as part of the 'first run'
handling? I am refactoring that part to remove as many things as
possible from the fast path, and would love not to go back to piling
more stuff here.

Or do you expect this to happen more than once per VM (despite what
the comment says)?
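To make it concrete, the kind of thing I have in mind is below. This
is an untested sketch, the helper name is made up, and where exactly
it lands in kvm_vcpu_first_run_init() depends on how the refactoring
shakes out:

/*
 * Untested sketch: run the deferred MMU ops from the first-run path
 * instead of on every KVM_RUN ioctl. The body is just the two lines
 * this patch currently adds to kvm_arch_vcpu_ioctl_run().
 */
static void kvm_mmu_flush_deferred_ops(struct kvm_vcpu *vcpu)
{
	if (kvm_mmu_has_pending_ops(vcpu->kvm))
		kvm_mmu_perform_pending_ops(vcpu->kvm);
}

kvm_vcpu_first_run_init() would then call kvm_mmu_flush_deferred_ops()
once, early on, and the KVM_RUN fast path wouldn't grow another check.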
>  	if (ret)
>  		return ret;
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 59c2bfef2fd1..94fa08f3d9d3 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1253,6 +1253,41 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>  	return ret;
>  }
>  
> +/*
> + * It's safe to do the CMOs when the first VCPU is run because:
> + * - VCPUs cannot run until mmu_cmo_needed is cleared.
> + * - Memslots cannot be modified because we hold the kvm->slots_lock.

It would be good to document the expected locking order for this kind
of stuff.

> + *
> + * It's safe to periodically release the mmu_lock because:
> + * - VCPUs cannot run.
> + * - Any changes to the stage 2 tables triggered by the MMU notifiers also take
> + *   the mmu_lock, which means accesses will be serialized.
> + * - Stage 2 tables cannot be freed from under us as long as at least one VCPU
> + *   is live, which means that the VM will be live.
> + */
> +void kvm_mmu_perform_pending_ops(struct kvm *kvm)
> +{
> +	struct kvm_memory_slot *memslot;
> +
> +	mutex_lock(&kvm->slots_lock);
> +	if (!kvm_mmu_has_pending_ops(kvm))
> +		goto out_unlock;
> +
> +	if (test_bit(KVM_LOCKED_MEMSLOT_FLUSH_DCACHE, &kvm->arch.mmu_pending_ops)) {
> +		kvm_for_each_memslot(memslot, kvm_memslots(kvm)) {
> +			if (!memslot_is_locked(memslot))
> +				continue;
> +			stage2_flush_memslot(kvm, memslot);
> +		}
> +	}
> +
> +	bitmap_zero(&kvm->arch.mmu_pending_ops, KVM_MAX_MMU_PENDING_OPS);

clear_bit() instead? I understand that you want to support multiple
ops, but this looks odd.

> +
> +out_unlock:
> +	mutex_unlock(&kvm->slots_lock);
> +	return;
> +}
> +
>  static int try_rlimit_memlock(unsigned long npages)
>  {
>  	unsigned long lock_limit;
> @@ -1293,7 +1328,8 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
>  	struct kvm_memory_slot_page *page_entry;
>  	bool writable = flags & KVM_ARM_LOCK_MEM_WRITE;
>  	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> -	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> +	struct kvm_pgtable pgt;
> +	struct kvm_pgtable_mm_ops mm_ops;
>  	struct vm_area_struct *vma;
>  	unsigned long npages = memslot->npages;
>  	unsigned int pin_flags = FOLL_LONGTERM;
> @@ -1311,6 +1347,16 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
>  		pin_flags |= FOLL_WRITE;
>  	}
>  
> +	/*
> +	 * Make a copy of the stage 2 translation table struct to remove the
> +	 * dcache callback so we can postpone the cache maintenance operations
> +	 * until the first VCPU is run.
> +	 */
> +	mm_ops = *kvm->arch.mmu.pgt->mm_ops;
> +	mm_ops.dcache_clean_inval_poc = NULL;
> +	pgt = *kvm->arch.mmu.pgt;
> +	pgt.mm_ops = &mm_ops;

Huhuh... Can't really say I'm in love with this. Are you trying to
avoid a double dcache clean to PoC? Is this a performance or a
correctness issue?

> +
>  	hva = memslot->userspace_addr;
>  	ipa = memslot->base_gfn << PAGE_SHIFT;
>  
> @@ -1362,13 +1408,13 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
>  		goto out_err;
>  	}
>  
> -	ret = kvm_pgtable_stage2_map(pgt, ipa, PAGE_SIZE,
> +	ret = kvm_pgtable_stage2_map(&pgt, ipa, PAGE_SIZE,
>  				     page_to_phys(page_entry->page),
>  				     prot, &cache);
>  	spin_unlock(&kvm->mmu_lock);
>  
>  	if (ret) {
> -		kvm_pgtable_stage2_unmap(pgt, memslot->base_gfn << PAGE_SHIFT,
> +		kvm_pgtable_stage2_unmap(&pgt, memslot->base_gfn << PAGE_SHIFT,
>  					 i << PAGE_SHIFT);
>  		unpin_memslot_pages(memslot, writable);
>  		goto out_err;
> @@ -1387,7 +1433,7 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
>  	 */
>  	ret = account_locked_vm(current->mm, npages, true);
>  	if (ret) {
> -		kvm_pgtable_stage2_unmap(pgt, memslot->base_gfn << PAGE_SHIFT,
> +		kvm_pgtable_stage2_unmap(&pgt, memslot->base_gfn << PAGE_SHIFT,
>  					 npages << PAGE_SHIFT);
>  		unpin_memslot_pages(memslot, writable);
>  		goto out_err;
> @@ -1397,6 +1443,8 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
>  	if (writable)
>  		memslot->arch.flags |= KVM_MEMSLOT_LOCK_WRITE;
>  
> +	set_bit(KVM_LOCKED_MEMSLOT_FLUSH_DCACHE, &kvm->arch.mmu_pending_ops);
> +
>  	kvm_mmu_free_memory_cache(&cache);
>  
>  	return 0;

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.