From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id A31D4C43334
	for <kvm@archiver.kernel.org>; Tue, 28 Jun 2022 17:58:30 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233621AbiF1R6a (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Tue, 28 Jun 2022 13:58:30 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44706 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233609AbiF1R6O (ORCPT <rfc822;kvm@vger.kernel.org>);
        Tue, 28 Jun 2022 13:58:14 -0400
Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6F80A6599
        for <kvm@vger.kernel.org>; Tue, 28 Jun 2022 10:58:00 -0700 (PDT)
Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140])
        (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
        (No client certificate requested)
        by dfw.source.kernel.org (Postfix) with ESMTPS id 0C69A619E1
        for <kvm@vger.kernel.org>; Tue, 28 Jun 2022 17:58:00 +0000 (UTC)
Received: by smtp.kernel.org (Postfix) with ESMTPSA id 3F616C3411D;
        Tue, 28 Jun 2022 17:57:57 +0000 (UTC)
Date:   Tue, 28 Jun 2022 18:57:53 +0100
From:   Catalin Marinas <catalin.marinas@arm.com>
To:     Peter Collingbourne <pcc@google.com>
Cc:     kvmarm@lists.cs.columbia.edu, Marc Zyngier <maz@kernel.org>,
        kvm@vger.kernel.org, Andy Lutomirski <luto@amacapital.net>,
        Linux ARM <linux-arm-kernel@lists.infradead.org>,
        Michael Roth <michael.roth@amd.com>,
        Chao Peng <chao.p.peng@linux.intel.com>,
        Will Deacon <will@kernel.org>,
        Evgenii Stepanov <eugenis@google.com>,
        Steven Price <steven.price@arm.com>
Subject: Re: [PATCH] KVM: arm64: permit MAP_SHARED mappings with MTE enabled
Message-ID: <YrtBIX0/0jyAdgnz@arm.com>
References: <20220623234944.141869-1-pcc@google.com>
 <YrXu0Uzi73pUDwye@arm.com>
 <CAMn1gO7-qVzZrAt63BJC-M8gKLw4=60iVUo6Eu8T_5y3AZnKcA@mail.gmail.com>
 <YrmXzHXv4babwbNZ@arm.com>
 <CAMn1gO5s2m-AkoYpY0dcLkKVyEAGeC2borZfgT09iqc=w_LZxQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAMn1gO5s2m-AkoYpY0dcLkKVyEAGeC2borZfgT09iqc=w_LZxQ@mail.gmail.com>
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

On Mon, Jun 27, 2022 at 11:16:17AM -0700, Peter Collingbourne wrote:
> On Mon, Jun 27, 2022 at 4:43 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Fri, Jun 24, 2022 at 02:50:53PM -0700, Peter Collingbourne wrote:
> > > On Fri, Jun 24, 2022 at 10:05 AM Catalin Marinas
> > > <catalin.marinas@arm.com> wrote:
> > > > + Steven as he added the KVM and swap support for MTE.
> > > >
> > > > On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
> > > > > Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
> > > > > depend on being able to map guest memory as MAP_SHARED. The current
> > > > > restriction on sharing MAP_SHARED pages with the guest is preventing
> > > > > the use of those features with MTE. Therefore, remove this restriction.
> > > >
> > > > We already have some corner cases where the PG_mte_tagged logic fails
> > > > even for MAP_PRIVATE (but page shared with CoW). Adding this on top for
> > > > KVM MAP_SHARED will potentially make things worse (or hard to reason
> > > > about; for example the VMM sets PROT_MTE as well). I'm more inclined to
> > > > get rid of PG_mte_tagged altogether, always zero (or restore) the tags
> > > > on user page allocation, copy them on write. For swap we can scan and if
> > > > all tags are 0 and just skip saving them.
> > >
> > > A problem with this approach is that it would conflict with any
> > > potential future changes that we might make that would require the
> > > kernel to avoid modifying the tags for non-PROT_MTE pages.
> >
> > Not if in all those cases we check VM_MTE_ALLOWED. We seem to have the
> > vma available where it matters. We can keep PG_mte_tagged around but
> > always set it on page allocation (e.g. when zeroing or CoW) and check
> > VM_MTE_ALLOWED rather than VM_MTE.
> 
> Right, but for avoiding tagging we would like that to apply to as many
> pages as possible. If we check VM_MTE_ALLOWED then the heap pages of
> those processes that are not using MTE would not be covered, which on
> a mostly non-MTE system would be a majority of pages.

By non-MTE system, I guess you mean a system that supports MTE but most
of the user apps don't use it. That's why it would be interesting to see
the effect of using DC GZVA instead of DC ZVA for page zeroing.

I suspect on Android you'd notice the fork() penalty a bit more with all
the copy-on-write having to copy tags. But we can't tell until we do
some benchmarks. If the penalty is indeed significant, we'll go back to
assessing the races here.

Another thing that won't happen for PG_mte_tagged currently is KSM page
merging. I had a patch to allow comparing the tags but eventually
dropped it (can dig it out).

> Over the weekend I thought of another policy, which would be similar
> to your original one. We can always tag pages which are mapped as
> MAP_SHARED. These pages are much less common than private pages, so
> the impact would be less. So the if statement in
> alloc_zeroed_user_highpage_movable would become:
> 
> if ((vma->vm_flags & VM_MTE) || (system_supports_mte() &&
>				   (vma->vm_flags & VM_SHARED)))
> 
> That would allow us to put basically any shared mapping in the guest
> address space without needing to deal with races in sanitise_mte_tags.

It's not just about VM_SHARED. A page can be effectively shared as a
result of a fork(). It is read-only in all processes but still shared
and one task may call mprotect(PROT_MTE).

Another case of sharing is between the VMM and the guest, though I think
an mprotect() in the VMM would trigger the unmapping of the guest
address, and pages mapped into guests already have PG_mte_tagged set.

We probably need to draw a state machine of all the cases. AFAICT, we
need to take into account at least the cases below (the list is probably
incomplete; I've been through most of them with Steven IIRC):

1. Private mappings with mixed PROT_MTE, CoW sharing and concurrent
   mprotect(PROT_MTE). One of the things I dislike is that a late
   tag clearing via set_pte_at() can happen without breaking the CoW
   mapping. It's a bit counter-intuitive: if you treat the tags as data
   (rather than some cache), you don't expect a read-only page to have
   its tags updated.

2. Shared mappings with concurrent mprotect(PROT_MTE).

3. Shared mapping restoring from swap.

4. Private mapping restoring from swap into CoW mapping.

5. KVM faults.

6. Concurrent ptrace accesses (or KVM tag copying).

What currently risks failing, I think, is breaking a CoW mapping with
concurrent mprotect(PROT_MTE): we set PG_mte_tagged before zeroing the
tags, so a concurrent copy may read stale tags. In sanitise_mte_tags() we
do this the other way around: clear tags first and then set the flag.
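
To illustrate the ordering, here is a minimal userspace sketch using C11
atomics. It is only a model, not kernel code: struct model_page,
model_tag_page() and model_copy_tags() are hypothetical names, the atomic
flag stands in for PG_mte_tagged and a byte array stands in for the tag
storage. The writer clears the tags and only then publishes the flag, so
a reader that observes the flag cannot copy stale tags.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_PAGE_SIZE	4096
#define MODEL_GRANULE	16
#define TAGS_PER_PAGE	256	/* one tag per 16-byte granule */

static_assert(TAGS_PER_PAGE == MODEL_PAGE_SIZE / MODEL_GRANULE,
	      "one tag per granule");

/* Hypothetical model of a page; this is not the kernel's struct page. */
struct model_page {
	atomic_bool tagged;			/* stands in for PG_mte_tagged */
	unsigned char tags[TAGS_PER_PAGE];	/* stands in for tag storage */
};

/* Clear the tags first, then publish the flag with release ordering. */
static void model_tag_page(struct model_page *p)
{
	memset(p->tags, 0, sizeof(p->tags));
	atomic_store_explicit(&p->tagged, true, memory_order_release);
}

/*
 * Copy the tags only if the flag is observed (acquire ordering).
 * Returns true if the tags were copied into dst.
 */
static bool model_copy_tags(struct model_page *src, unsigned char *dst)
{
	if (!atomic_load_explicit(&src->tagged, memory_order_acquire))
		return false;
	memcpy(dst, src->tags, TAGS_PER_PAGE);
	return true;
}
```

Swapping the two statements in model_tag_page() reproduces the problem
described above: a concurrent model_copy_tags() could see the flag and
copy whatever was in the tag storage before it was cleared.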

I think using another bit as a lock may solve most (all) of these but
another option is to treat the tags as data and make sure they are set
before mapping.
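
A rough userspace sketch of the "another bit as a lock" option, again
with C11 atomics. All names here (MODEL_LOCK, MODEL_TAGGED, struct
locked_page and the model_* helpers) are made up for illustration and
do not correspond to existing kernel interfaces. Whoever sets the lock
bit first initialises the tags and then publishes the tagged bit;
everyone else waits for the tagged bit instead of touching the tags.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_TAGGED	(1u << 0)	/* stands in for PG_mte_tagged */
#define MODEL_LOCK	(1u << 1)	/* hypothetical extra lock bit */

/* Hypothetical model of a page; this is not the kernel's struct page. */
struct locked_page {
	atomic_uint flags;
	unsigned char tags[256];
};

/* Returns true if the caller won the right to initialise the tags. */
static bool model_try_tag_lock(struct locked_page *p)
{
	unsigned int old = atomic_fetch_or_explicit(&p->flags, MODEL_LOCK,
						    memory_order_acquire);
	return (old & MODEL_LOCK) == 0;
}

/* Publish the tags: everything written before this is visible to readers. */
static void model_set_tagged(struct locked_page *p)
{
	atomic_fetch_or_explicit(&p->flags, MODEL_TAGGED,
				 memory_order_release);
}

static bool model_is_tagged(struct locked_page *p)
{
	return (atomic_load_explicit(&p->flags, memory_order_acquire) &
		MODEL_TAGGED) != 0;
}

/* Initialise the tags exactly once; losers wait for the winner. */
static void model_init_tags_once(struct locked_page *p)
{
	if (model_try_tag_lock(p)) {
		memset(p->tags, 0, sizeof(p->tags));
		model_set_tagged(p);
	} else {
		while (!model_is_tagged(p))
			;	/* spin until the winner has published */
	}
}
```

The lock bit is never cleared in this model, so a second call to
model_init_tags_once() on the same page leaves the existing tags alone.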

> We may consider going further than this and require all pages mapped
> into guests with MTE enabled to be PROT_MTE.

We discussed this when upstreaming KVM support and the idea got pushed
back. The main problem is that the VMM may use MTE for itself but can no
longer access the guest memory without the risk of taking a fault. We
don't have a match-all tag in user space and we can't teach the VMM to
use the PSTATE.TCO bit since driver emulation can be fairly generic.
And, of course, there's also the ABI change now.

> I think it would allow
> dropping sanitise_mte_tags entirely. This would not be a relaxation of
> the ABI but perhaps we can get away with it if, as Cornelia mentioned,
> QEMU does not currently support MTE, and since crosvm doesn't
> currently support it either there's no userspace to break AFAIK. This
> would also address a current weirdness in the API where it is possible
> for the underlying pages of a MAP_SHARED file mapping to become tagged
> via KVM, said tags are exposed to the guest and are discarded when the
> underlying page is paged out.

Ah, good point, shared file mappings are another reason we did not allow
MAP_SHARED and MTE for guest memory.

BTW, in user_mem_abort() we should probably check for VM_MTE_ALLOWED
irrespective of whether we allow MAP_SHARED or not.

> We can perhaps accomplish it by dropping
> support for KVM_CAP_ARM_MTE in the kernel and introducing something
> like a KVM_CAP_ARM_MTE_V2 with the new restriction.

That's an option for the ABI upgrade but we still need to solve the
potential races.

-- 
Catalin
