Date: Wed, 09 Sep 2020 18:12:29 +0100
From: Marc Zyngier <maz@kernel.org>
To: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Will Deacon <will@kernel.org>, Quentin Perret <qperret@google.com>,
    James Morse <james.morse@arm.com>, Suzuki Poulose <suzuki.poulose@arm.com>,
    Catalin Marinas <catalin.marinas@arm.com>, Gavin Shan <gshan@redhat.com>,
    kernel-team@android.com, kvmarm@lists.cs.columbia.edu,
    linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH v4 17/21] KVM: arm64: Convert user_mem_abort() to generic page-table API
In-Reply-To: <2ae77a66-9cc4-f4e1-9e98-a50d5891cf20@arm.com>
References: <20200907152344.12978-1-will@kernel.org>
 <20200907152344.12978-18-will@kernel.org>
 <2ae77a66-9cc4-f4e1-9e98-a50d5891cf20@arm.com>

Hi Alex,

On 2020-09-09 15:20, Alexandru Elisei wrote:
> Hi Will,
>
> On 9/7/20 4:23 PM, Will Deacon wrote:
>> Convert user_mem_abort() to call kvm_pgtable_stage2_relax_perms() when
>> handling a stage-2 permission fault and kvm_pgtable_stage2_map() when
>> handling a stage-2 translation fault, rather than walking the page-table
>> manually.
>>
>> Cc: Marc Zyngier <maz@kernel.org>
>> Cc: Quentin Perret <qperret@google.com>
>> Reviewed-by: Gavin Shan <gshan@redhat.com>
>> Signed-off-by: Will Deacon <will@kernel.org>
>> ---
>>  arch/arm64/kvm/mmu.c | 124 +++++++++++++++----------------------------
>>  1 file changed, 44 insertions(+), 80 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 0af48f35c8dd..dc923e873dad 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1496,18 +1496,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  {
>>  	int ret;
>>  	bool write_fault, writable, force_pte = false;
>> -	bool exec_fault, needs_exec;
>> +	bool exec_fault;
>> +	bool device = false;
>>  	unsigned long mmu_seq;
>> -	gfn_t gfn = fault_ipa >> PAGE_SHIFT;
>>  	struct kvm *kvm = vcpu->kvm;
>>  	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
>>  	struct vm_area_struct *vma;
>>  	short vma_shift;
>> +	gfn_t gfn;
>>  	kvm_pfn_t pfn;
>> -	pgprot_t mem_type = PAGE_S2;
>>  	bool logging_active = memslot_is_logging(memslot);
>> -	unsigned long vma_pagesize, flags = 0;
>> -	struct kvm_s2_mmu *mmu = vcpu->arch.hw_mmu;
>> +	unsigned long vma_pagesize;
>> +	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>> +	struct kvm_pgtable *pgt;
>>
>>  	write_fault = kvm_is_write_fault(vcpu);
>>  	exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
>> @@ -1540,22 +1541,24 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  		vma_pagesize = PAGE_SIZE;
>>  	}
>>
>> -	/*
>> -	 * The stage2 has a minimum of 2 level table (For arm64 see
>> -	 * kvm_arm_setup_stage2()). Hence, we are guaranteed that we can
>> -	 * use PMD_SIZE huge mappings (even when the PMD is folded into PGD).
>> -	 * As for PUD huge maps, we must make sure that we have at least
>> -	 * 3 levels, i.e, PMD is not folded.
>> -	 */
>> -	if (vma_pagesize == PMD_SIZE ||
>> -	    (vma_pagesize == PUD_SIZE && kvm_stage2_has_pmd(kvm)))
>> -		gfn = (fault_ipa & huge_page_mask(hstate_vma(vma))) >> PAGE_SHIFT;
>> +	if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
>> +		fault_ipa &= huge_page_mask(hstate_vma(vma));
>
> This looks correct to me - if !kvm_stage2_has_pmd(), then PMD is folded
> onto PUD and PGD, and PMD_SIZE == PUD_SIZE. Also I like the fact that we
> update gfn **and** fault_ipa, the previous version updated only gfn,
> which made gfn != (fault_ipa >> PAGE_SHIFT).
>
>> +
>> +	gfn = fault_ipa >> PAGE_SHIFT;
>>  	mmap_read_unlock(current->mm);
>>
>> -	/* We need minimum second+third level pages */
>> -	ret = kvm_mmu_topup_memory_cache(memcache, kvm_mmu_cache_min_pages(kvm));
>> -	if (ret)
>> -		return ret;
>> +	/*
>> +	 * Permission faults just need to update the existing leaf entry,
>> +	 * and so normally don't require allocations from the memcache. The
>> +	 * only exception to this is when dirty logging is enabled at runtime
>> +	 * and a write fault needs to collapse a block entry into a table.
>> +	 */
>> +	if (fault_status != FSC_PERM || (logging_active && write_fault)) {
>> +		ret = kvm_mmu_topup_memory_cache(memcache,
>> +						 kvm_mmu_cache_min_pages(kvm));
>> +		if (ret)
>> +			return ret;
>> +	}
>
> I'm not 100% sure about this.
>
> I don't think we gain much over the previous code - if we had allocated
> cache objects which we hadn't used, we would have used them next time
> user_mem_abort() is called (kvm_mmu_topup_memory_cache() checks if we
> have the required number of objects in the cache and returns early).
>
> I'm not sure the condition is entirely correct either - if stage 2
> already has a mapping for the IPA and we only need to set write
> permissions, according to the condition above we still try to topup the
> cache, even though we don't strictly need to.

That's because if you are logging, you may have to split an existing
block mapping and map a single page instead. This requires (at least)
an extra level, and that's why you need to top-up the cache in this
case.
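Spelled out as a predicate (only an illustration of the intent, not code
from the patch; the helper and its name are made up):

static bool needs_memcache_topup(bool perm_fault, bool logging_active,
				 bool write_fault)
{
	/*
	 * perm_fault == (fault_status == FSC_PERM). A translation fault may
	 * have to install new table levels, and a write fault with dirty
	 * logging enabled may have to split a block mapping into page
	 * mappings, which also needs (at least) one extra level. A plain
	 * permission fault reuses the existing tables.
	 */
	return !perm_fault || (logging_active && write_fault);
}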
>
>>
>>  	mmu_seq = vcpu->kvm->mmu_notifier_seq;
>>  	/*
>> @@ -1578,28 +1581,20 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  		return -EFAULT;
>>
>>  	if (kvm_is_device_pfn(pfn)) {
>> -		mem_type = PAGE_S2_DEVICE;
>> -		flags |= KVM_S2PTE_FLAG_IS_IOMAP;
>> -	} else if (logging_active) {
>> -		/*
>> -		 * Faults on pages in a memslot with logging enabled
>> -		 * should not be mapped with huge pages (it introduces churn
>> -		 * and performance degradation), so force a pte mapping.
>> -		 */
>> -		flags |= KVM_S2_FLAG_LOGGING_ACTIVE;
>> -
>> +		device = true;
>> +	} else if (logging_active && !write_fault) {
>>  		/*
>>  		 * Only actually map the page as writable if this was a write
>>  		 * fault.
>>  		 */
>> -		if (!write_fault)
>> -			writable = false;
>> +		writable = false;
>>  	}
>>
>> -	if (exec_fault && is_iomap(flags))
>> +	if (exec_fault && device)
>>  		return -ENOEXEC;
>>
>>  	spin_lock(&kvm->mmu_lock);
>> +	pgt = vcpu->arch.hw_mmu->pgt;
>>  	if (mmu_notifier_retry(kvm, mmu_seq))
>>  		goto out_unlock;
>>
>> @@ -1610,62 +1605,31 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  	if (vma_pagesize == PAGE_SIZE && !force_pte)
>>  		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
>>  							   &pfn, &fault_ipa);
>> -	if (writable)
>> +	if (writable) {
>> +		prot |= KVM_PGTABLE_PROT_W;
>>  		kvm_set_pfn_dirty(pfn);
>> +		mark_page_dirty(kvm, gfn);
>
> The previous code called mark_page_dirty() only if the vma_pagesize ==
> PAGE_SIZE (and writable was true, obviously). Is this supposed to fix a
> bug?

No, this is actually introducing one. mark_page_dirty() checks that
there is an associated bitmap, and thus only happens when writing to a
single page, but we shouldn't do it for R/O memslots, which the current
code avoids. It should be guarded by logging_active.
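I.e. something along the lines of (untested, relative to the hunk quoted
above, only to illustrate the shape of the guard):

	if (writable) {
		prot |= KVM_PGTABLE_PROT_W;
		kvm_set_pfn_dirty(pfn);
		if (logging_active)
			mark_page_dirty(kvm, gfn);
	}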
>
>> +	}
>>
>> -	if (fault_status != FSC_PERM && !is_iomap(flags))
>> +	if (fault_status != FSC_PERM && !device)
>>  		clean_dcache_guest_page(pfn, vma_pagesize);
>>
>> -	if (exec_fault)
>> +	if (exec_fault) {
>> +		prot |= KVM_PGTABLE_PROT_X;
>>  		invalidate_icache_guest_page(pfn, vma_pagesize);
>> +	}
>>
>> -	/*
>> -	 * If we took an execution fault we have made the
>> -	 * icache/dcache coherent above and should now let the s2
>> -	 * mapping be executable.
>> -	 *
>> -	 * Write faults (!exec_fault && FSC_PERM) are orthogonal to
>> -	 * execute permissions, and we preserve whatever we have.
>> -	 */
>> -	needs_exec = exec_fault ||
>> -		(fault_status == FSC_PERM &&
>> -		 stage2_is_exec(mmu, fault_ipa, vma_pagesize));
>> -
>> -	if (vma_pagesize == PUD_SIZE) {
>> -		pud_t new_pud = kvm_pfn_pud(pfn, mem_type);
>> -
>> -		new_pud = kvm_pud_mkhuge(new_pud);
>> -		if (writable)
>> -			new_pud = kvm_s2pud_mkwrite(new_pud);
>> -
>> -		if (needs_exec)
>> -			new_pud = kvm_s2pud_mkexec(new_pud);
>> -
>> -		ret = stage2_set_pud_huge(mmu, memcache, fault_ipa, &new_pud);
>> -	} else if (vma_pagesize == PMD_SIZE) {
>> -		pmd_t new_pmd = kvm_pfn_pmd(pfn, mem_type);
>> -
>> -		new_pmd = kvm_pmd_mkhuge(new_pmd);
>> -
>> -		if (writable)
>> -			new_pmd = kvm_s2pmd_mkwrite(new_pmd);
>> -
>> -		if (needs_exec)
>> -			new_pmd = kvm_s2pmd_mkexec(new_pmd);
>> +	if (device)
>> +		prot |= KVM_PGTABLE_PROT_DEVICE;
>> +	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
>> +		prot |= KVM_PGTABLE_PROT_X;
>>
>> -	ret = stage2_set_pmd_huge(mmu, memcache, fault_ipa, &new_pmd);
>> +	if (fault_status == FSC_PERM && !(logging_active && writable)) {
>
> I don't understand the second part of the condition (!(logging_active &&
> writable)). With logging active, when we get a fault because of a missing
> stage 2 entry, we map the IPA as read-only at stage 2. If I understand
> this code correctly, when the guest then tries to write to the same IPA,
> writable == true and we map the IPA again instead of relaxing the
> permissions. Why is that?

See my reply above: logging means potentially adding a new level, so we
treat it as a new mapping altogether (break the block mapping, TLBI,
install the new mapping one level down).

All the other cases are happily handled by just relaxing the
permissions.
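In predicate form (again only an illustration, not code from the patch;
the helper and its name are made up):

static bool can_relax_perms(bool perm_fault, bool logging_active,
			    bool writable)
{
	/*
	 * perm_fault == (fault_status == FSC_PERM). A permission fault
	 * normally only needs the existing leaf entry's permissions relaxed.
	 * A write with dirty logging enabled is the exception: a block
	 * mapping may have to be broken up, so it is installed as a brand
	 * new mapping instead.
	 */
	return perm_fault && !(logging_active && writable);
}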
Thanks,

        M.
-- 
Jazz is not dead. It just smells funny...

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel