From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ben Gardon
Date: Tue, 19 Apr 2022 10:57:05 -0700
Subject: Re: [RFC PATCH 00/17] KVM: arm64: Parallelize stage 2 fault handling
To: Oliver Upton
Cc: kvmarm@lists.cs.columbia.edu, kvm, Marc Zyngier, James Morse, Alexandru Elisei, Suzuki K Poulose, linux-arm-kernel@lists.infradead.org, Peter Shier, Ricardo Koller, Reiji Watanabe, Paolo Bonzini, Sean Christopherson, David Matlack
References: <20220415215901.1737897-1-oupton@google.com>
In-Reply-To: <20220415215901.1737897-1-oupton@google.com>
X-Mailing-List: kvm@vger.kernel.org

On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton wrote:
>
> Presently KVM only takes a read lock for stage 2 faults if it believes
> the fault can be fixed by relaxing permissions on a PTE (write unprotect
> for dirty logging). Otherwise, stage 2 faults grab the write lock, which
> predictably can pile up all the vCPUs in a sufficiently large VM.
>
> The x86 port of KVM has what it calls the TDP MMU. Basically, it is an
> MMU protected by the combination of a read-write lock and RCU, allowing
> page walkers to traverse in parallel.
>
> This series is strongly inspired by the mechanics of the TDP MMU,
> making use of RCU to protect parallel walks.
> Note that the TLB
> invalidation mechanics are a bit different between x86 and ARM, so we
> need to use the 'break-before-make' sequence to split/collapse a
> block/table mapping, respectively.
>
> Nonetheless, using atomics on the break side allows fault handlers to
> acquire exclusive access to a PTE (lets just call it locked). Once the
> PTE lock is acquired it is then safe to assume exclusive access.
>
> Special consideration is required when pruning the page tables in
> parallel. Suppose we are collapsing a table into a block. Allowing
> parallel faults means that a software walker could be in the middle of
> a lower level traversal when the table is unlinked. Table
> walkers that prune the paging structures must now 'lock' all descendent
> PTEs, effectively asserting exclusive ownership of the substructure
> (no other walker can install something to an already locked pte).
>
> Additionally, for parallel walks we need to punt the freeing of table
> pages to the next RCU sync, as there could be multiple observers of the
> table until all walkers exit the RCU critical section. For this I
> decided to cram an rcu_head into page private data for every table page.
> We wind up spending a bit more on table pages now, but lazily allocating
> for rcu callbacks probably doesn't make a lot of sense. Not only would
> we need a large cache of them (think about installing a level 1 block)
> to wire up callbacks on all descendent tables, but we also then need to
> spend memory to actually free memory.

FWIW we used a similar approach in early versions of the TDP MMU, but
instead of page->private we used page->lru so that more metadata could
be stored in page->private. Ultimately that ended up being too limiting
and we decided to switch to just using the associated struct
kvm_mmu_page as the list element. I don't know if ARM has an equivalent
construct though.

>
> I tried to organize these patches as best I could w/o introducing
> intermediate breakage.
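
On the "locked PTE" idea quoted above, here is a minimal userspace
sketch, assuming C11 atomics stand in for the kernel's cmpxchg(). The
names pte_try_lock() and pte_make() and the PTE_LOCKED encoding are
hypothetical, chosen only for illustration; they are not taken from the
series.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef _Atomic uint64_t pte_t;

/*
 * Hypothetical marker for "a walker owns this entry". Bit 0 (valid) is
 * clear, so hardware would treat a locked entry as invalid.
 */
#define PTE_LOCKED UINT64_C(0x400)

/*
 * "Break": swap the value the walker observed for the locked marker.
 * Fails if the entry is already locked or changed since it was read.
 */
static bool pte_try_lock(pte_t *ptep, uint64_t observed)
{
	if (observed == PTE_LOCKED)
		return false;
	return atomic_compare_exchange_strong(ptep, &observed, PTE_LOCKED);
}

/*
 * "Make": publish the new entry, which also drops ownership. Real code
 * would do its TLB invalidation between the break and this store.
 */
static void pte_make(pte_t *ptep, uint64_t new_pte)
{
	assert(atomic_load(ptep) == PTE_LOCKED);
	atomic_store_explicit(ptep, new_pte, memory_order_release);
}
```

A walker that loses the compare-and-swap would presumably back off and
retry the fault rather than spin, but that policy is an assumption here.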
>
> The first 5 patches are meant mostly as prepatory reworks, and, in the
> case of RCU a nop.
>
> Patch 6 is quite large, but I had a hard time deciding how to change the
> way we link/unlink tables to use atomics without breaking things along
> the way.
>
> Patch 7 probably should come before patch 6, as it informs the other
> read-side fault (perm relax) about when a map is in progress so it'll
> back off.
>
> Patches 8-10 take care of the pruning case, actually locking the child ptes
> instead of simply dropping table page references along the way. Note
> that we cannot assume a pte points to a table/page at this point, hence
> the same helper is called for pre- and leaf-traversal. Guide the
> recursion based on what got yanked from the PTE.
>
> Patches 11-14 wire up everything to schedule rcu callbacks on
> to-be-freed table pages. rcu_barrier() is called on the way out from
> tearing down a stage 2 page table to guarantee all memory associated
> with the VM has actually been cleaned up.
>
> Patches 15-16 loop in the fault handler to the new table traversal game.
>
> Lastly, patch 17 is a nasty bit of debugging residue to spot possible
> table page leaks. Please don't laugh ;-)
>
> Smoke tested with KVM selftests + kvm_page_table_test w/ 2M hugetlb to
> exercise the table pruning code. Haven't done anything beyond this,
> sending as an RFC now to get eyes on the code.
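
The pruning step described for patches 8-10 (lock every child entry and
let the yanked value guide the recursion) can be sketched in userspace
roughly as follows. This is an illustration under stated assumptions,
not the series' code: the 4-entry tables, the ENT_* encoding, and the
lock_subtree() and table_init() helpers are all hypothetical, and the
sketch assumes no other walker already owns a child slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define PTRS_PER_TABLE 4		/* tiny tables keep the sketch readable */
#define ENT_VALID  ((uintptr_t)1)	/* entry maps something */
#define ENT_TABLE  ((uintptr_t)2)	/* valid entry points at a child table */
#define ENT_LOCKED ((uintptr_t)4)	/* hypothetical marker, valid bit clear */

struct table {
	_Atomic uintptr_t slot[PTRS_PER_TABLE];
};

static void table_init(struct table *t)
{
	for (size_t i = 0; i < PTRS_PER_TABLE; i++)
		atomic_init(&t->slot[i], 0);
}

static int is_table(uintptr_t ent)
{
	return (ent & (ENT_VALID | ENT_TABLE)) == (ENT_VALID | ENT_TABLE);
}

/*
 * Lock every slot of an unlinked table. The old value returned by the
 * exchange decides whether to descend: only a table entry has children.
 * Returns the number of slots that were locked.
 */
static size_t lock_subtree(struct table *t)
{
	size_t locked = 0;

	assert(t != NULL);
	for (size_t i = 0; i < PTRS_PER_TABLE; i++) {
		uintptr_t old = atomic_exchange(&t->slot[i], ENT_LOCKED);

		locked++;
		if (is_table(old))
			locked += lock_subtree((struct table *)
					       (old & ~(ENT_VALID | ENT_TABLE)));
	}
	return locked;
}
```

Once the exchange has replaced an entry, a late walker that reaches the
slot sees only the locked marker and cannot install anything beneath it.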
>
> Applies to commit fb649bda6f56 ("Merge tag 'block-5.18-2022-04-15' of
> git://git.kernel.dk/linux-block")
>
> Oliver Upton (17):
>   KVM: arm64: Directly read owner id field in stage2_pte_is_counted()
>   KVM: arm64: Only read the pte once per visit
>   KVM: arm64: Return the next table from map callbacks
>   KVM: arm64: Protect page table traversal with RCU
>   KVM: arm64: Take an argument to indicate parallel walk
>   KVM: arm64: Implement break-before-make sequence for parallel walks
>   KVM: arm64: Enlighten perm relax path about parallel walks
>   KVM: arm64: Spin off helper for initializing table pte
>   KVM: arm64: Tear down unlinked page tables in parallel walk
>   KVM: arm64: Assume a table pte is already owned in post-order
>     traversal
>   KVM: arm64: Move MMU cache init/destroy into helpers
>   KVM: arm64: Stuff mmu page cache in sub struct
>   KVM: arm64: Setup cache for stage2 page headers
>   KVM: arm64: Punt last page reference to rcu callback for parallel walk
>   KVM: arm64: Allow parallel calls to kvm_pgtable_stage2_map()
>   KVM: arm64: Enable parallel stage 2 MMU faults
>   TESTONLY: KVM: arm64: Add super lazy accounting of stage 2 table pages
>
>  arch/arm64/include/asm/kvm_host.h     |   5 +-
>  arch/arm64/include/asm/kvm_mmu.h      |   2 +
>  arch/arm64/include/asm/kvm_pgtable.h  |  14 +-
>  arch/arm64/kvm/arm.c                  |   4 +-
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c |  13 +-
>  arch/arm64/kvm/hyp/nvhe/setup.c       |  13 +-
>  arch/arm64/kvm/hyp/pgtable.c          | 518 +++++++++++++++++++-------
>  arch/arm64/kvm/mmu.c                  | 120 ++++--
>  8 files changed, 503 insertions(+), 186 deletions(-)
>
> --
> 2.36.0.rc0.470.gd361397f0d-goog
>