Subject: Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature
From: Steven Price <steven.price@arm.com>
To: Catalin Marinas
Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
 Suzuki K Poulose, kvmarm@lists.cs.columbia.edu,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
 Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel@nongnu.org,
 Juan Quintela, "Dr. David Alan Gilbert", Richard Henderson,
 Peter Maydell, Haibo Xu, Andrew Jones
Date: Thu, 6 May 2021 17:15:25 +0100
Message-ID: <329286e8-a8f3-ea1a-1802-58813255a4a5@arm.com>
References: <20210416154309.22129-1-steven.price@arm.com>
 <20210416154309.22129-3-steven.price@arm.com>
 <20210428170705.GB4022@arm.com>

On 04/05/2021 18:40, Catalin Marinas wrote:
> On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
>> On 28/04/2021 18:07, Catalin Marinas wrote:
>>> I probably asked already, but is the only way to map a standard RAM
>>> page (not device) in stage 2 via the fault handler? One case I had
>>> in mind was something like get_user_pages(), but it looks like that
>>> one doesn't call set_pte_at_notify(). There are a few other places
>>> where set_pte_at_notify() is called, and these may happen before we
>>> get a chance to fault on stage 2, effectively populating the entry
>>> (IIUC). If that's an issue, we could move the above loop and check
>>> closer to the actual pte setting, like kvm_pgtable_stage2_map().
>>
>> The only call sites of kvm_pgtable_stage2_map() are in mmu.c:
>>
>>  * kvm_phys_addr_ioremap() - maps as device in stage 2
>>
>>  * user_mem_abort() - handled above
>>
>>  * kvm_set_spte_handler() - ultimately called from the .change_pte()
>>    callback of the MMU notifier
>>
>> So the last one is potentially a problem. It's called via the MMU
>> notifiers in the case of set_pte_at_notify(). The users of that are:
>>
>>  * uprobe_write_opcode(): allocates a new page and performs a
>>    copy_highpage() to copy the data to the new page (which with MTE
>>    includes the tags, and will copy across the PG_mte_tagged flag).
>>
>>  * write_protect_page() (KSM): changes the permissions on the PTE,
>>    but it's still the same page, so nothing to do regarding MTE.
>
> My concern here is that the VMM has a stage 1 pte but we haven't yet
> faulted in at stage 2 via user_mem_abort(), so we don't have any
> stage 2 pte set. write_protect_page() comes in and sets the new stage
> 2 pte via the callback. I couldn't find any check in
> kvm_pgtable_stage2_map() for the old pte, so it will set the new
> stage 2 pte regardless. A subsequent guest read would no longer fault
> at stage 2.
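
For reference, the check being discussed currently lives in
user_mem_abort() and is roughly of the following shape (a sketch
rather than the exact hunk from the patch; kvm_has_mte() is introduced
earlier in this series):

	if (kvm_has_mte(kvm) && !device && pfn_valid(pfn)) {
		struct page *page = pfn_to_page(pfn);
		unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;

		/*
		 * Initialise the tags (and set PG_mte_tagged) before
		 * the page can be mapped into stage 2 as Normal Tagged.
		 */
		for (i = 0; i < nr_pages; i++, page++) {
			if (!test_bit(PG_mte_tagged, &page->flags)) {
				mte_clear_page_tags(page_address(page));
				set_bit(PG_mte_tagged, &page->flags);
			}
		}
	}

Moving this (or an equivalent check) into kvm_pgtable_stage2_map()
would make the .change_pte() path go through the same initialisation.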
>
>> * replace_page() (KSM): if the page has MTE tags then the MTE
>>   version of memcmp_pages() will report a mismatch, so the only
>>   caller (try_to_merge_one_page()) will never call this on a page
>>   with tags.
>>
>> * wp_page_copy(): this one is more interesting - if we go down the
>>   cow_user_page() path with an old page then everything is safe
>>   (tags are copied over). The is_zero_pfn() case worries me a bit -
>>   a new page is allocated, but I can't instantly see anything to
>>   zero out the tags (and set PG_mte_tagged).
>
> True, I think tag zeroing happens only if we map it as PROT_MTE in
> the VMM.
>
>> * migrate_vma_insert_page(): I think migration should be safe as the
>>   tags should be copied.
>>
>> So wp_page_copy() looks suspicious.
>>
>> kvm_pgtable_stage2_map() looks like a good place for the checks: it
>> should work and is a more obvious location for them.
>
> That would be my preference. It also matches the stage 1 set_pte_at().
>
>>> While the set_pte_at() race on the page flags is somewhat clearer,
>>> we may still have a race here with the VMM's set_pte_at() if the
>>> page is mapped as tagged. KVM has its own mmu_lock but it wouldn't
>>> be held when handling the VMM page tables (well, not always, see
>>> below).
>>>
>>> gfn_to_pfn_prot() ends up calling get_user_pages*(). At least the
>>> slow path (hva_to_pfn_slow()) ends up with FOLL_TOUCH in gup and
>>> the VMM pte would be set, tags cleared (if PROT_MTE), before the
>>> stage 2 pte. I'm not sure whether get_user_page_fast_only() does
>>> the same.
>>>
>>> The race with an mprotect(PROT_MTE) in the VMM is fine I think, as
>>> the KVM mmu notifier is invoked before set_pte_at(), and racing
>>> with another user_mem_abort() is serialised by the KVM mmu_lock.
>>> The subsequent set_pte_at() would see PG_mte_tagged set either by
>>> the current CPU or by the one it was racing with.
>>
>> Given the changes to set_pte_at(), which mean that tags are restored
>> from swap even if !PROT_MTE, the only race I can see remaining is
>> the creation of new PROT_MTE mappings. As you mention, an attempt to
>> change mappings in the VMM's memory space should involve an MMU
>> notifier call, which I think serialises this. So the remaining issue
>> is doing this in a separate address space.
>>
>> So I guess the potential problem is:
>>
>>  * allocate memory MAP_SHARED but !PROT_MTE
>>  * fork()
>>  * VM causes a fault in the parent address space
>>  * child does an mprotect(PROT_MTE)
>>
>> With the last two potentially racing. Sadly I can't see a good way
>> of handling that.
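
To spell that sequence out as (hypothetical, error handling elided)
userspace - the PROT_MTE value is hard-coded here since libc headers
may not carry the arm64 definition:

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <unistd.h>

	#ifndef PROT_MTE
	#define PROT_MTE 0x20		/* arm64-specific */
	#endif

	int main(void)
	{
		/* VMM allocates guest memory: shared but not tagged */
		char *guest_mem = mmap(NULL, 4096,
				       PROT_READ | PROT_WRITE,
				       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

		if (fork() == 0) {
			/* child: make the shared mapping tagged */
			mprotect(guest_mem, 4096,
				 PROT_READ | PROT_WRITE | PROT_MTE);
			_exit(0);
		}

		/*
		 * parent: stands in for the guest touching the page;
		 * in reality KVM's user_mem_abort() tests/sets
		 * PG_mte_tagged here, racing with the child's
		 * mprotect() above.
		 */
		guest_mem[0] = 1;
		return 0;
	}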
> Ah, the mmap lock doesn't help as they are different processes
> (mprotect() acquires it as a writer).
>
> I wonder whether this is racy even in the absence of KVM. If both
> parent and child do an mprotect(PROT_MTE), one of them may be reading
> stale tags for a brief period.
>
> Maybe we should revisit whether shared MTE pages are of any use,
> though it's an ABI change (not bad if no-one is relying on this).
> However...

Shared MTE pages are certainly hard to use correctly (e.g. see the
discussions about the VMM accessing guest memory). But I guess that
boat has sailed.

> Thinking about this, we have a similar problem with PG_dcache_clean
> and two processes doing mprotect(PROT_EXEC). One of them could see
> the flag set and skip the I-cache maintenance while the other
> executes stale instructions. change_pte_range() could acquire the
> page lock if the page is VM_SHARED (my preferred core mm fix). It
> doesn't immediately solve the MTE/KVM case, but we could at least
> take the page lock via user_mem_abort().

For PG_dcache_clean, AFAICS the solution is actually simple: split the
test and set parts, i.e.:

	if (!test_bit(PG_dcache_clean, &page->flags)) {
		/* do the maintenance first, only then mark clean */
		sync_icache_aliases(page_address(page), page_size(page));
		set_bit(PG_dcache_clean, &page->flags);
	}

There isn't a problem with repeating the sync_icache_aliases() call in
the case of a race. Or am I missing something?

> Or maybe we just document this behaviour as racy, both for PROT_EXEC
> and PROT_MTE mappings, and be done with it. The minor issue with
> PROT_MTE is the potential leaking of tags (it's harder to leak
> information through the I-cache).

This is the real issue I see: the race in the PROT_MTE case results in
either a data leak (syncing after setting the bit) or data loss
(syncing before setting the bit). Short of serialising through a
spinlock (in mte_sync_tags()) I haven't been able to come up with any
way of closing the race. And given the change to set_pte_at() to call
mte_sync_tags() even if the PTE isn't PROT_MTE, such a lock is likely
to seriously hurt performance.

Steve
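
P.S. For illustration, the sort of spinlock serialisation I mean -
purely hypothetical, and exactly the kind of global lock whose cost
worries me:

	static DEFINE_SPINLOCK(mte_sync_lock);	/* hypothetical */

	/* per page, inside mte_sync_tags(): */
	spin_lock(&mte_sync_lock);
	if (!test_bit(PG_mte_tagged, &page->flags)) {
		/*
		 * Initialise the tags before publishing the flag (the
		 * restore-from-swap path is omitted for brevity).
		 */
		mte_clear_page_tags(page_address(page));
		set_bit(PG_mte_tagged, &page->flags);
	}
	spin_unlock(&mte_sync_lock);

The split test/set under the lock closes the leak/loss window, but
every set_pte_at() would then be taking a global lock.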