* [PATCH v4 00/20] KVM: Extend Eager Page Splitting to the shadow MMU
@ 2022-04-22 21:05 ` David Matlack
  0 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

This series extends KVM's Eager Page Splitting to also split huge pages
mapped by the shadow MMU, specifically **nested MMUs**.

For background on Eager Page Splitting, see:
 - Proposal: https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/
 - TDP MMU support: https://lore.kernel.org/kvm/20220119230739.2234394-1-dmatlack@google.com/

Splitting huge pages mapped by the shadow MMU is more complicated than in
the TDP MMU, but it is also more important for performance as the shadow
MMU handles huge page write-protection faults under the write lock.  See
the Performance section for more details.

The extra complexity of splitting huge pages mapped by the shadow MMU
comes from a few places:

(1) The shadow MMU has a limit on the number of shadow pages that are
    allowed to be allocated. So, as a policy, Eager Page Splitting
    refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
    pages available.

(2) Huge pages may be mapped by indirect shadow pages which may have access
    permission constraints from the guest (unlike the TDP MMU which is
    ACC_ALL by default).

(3) Splitting a huge page may end up re-using an existing lower level
    shadow page table. This is unlike the TDP MMU, which always allocates
    new shadow page tables when splitting.

(4) When installing the lower level SPTEs, they must be added to the
    rmap which may require allocating additional pte_list_desc structs.

In Google's internal implementation of Eager Page Splitting, we do not
handle cases (3) and (4), and instead opt to skip splitting entirely
(case 3) or to only partially split (case 4). This series handles the
additional cases, which requires an additional 4KiB of memory per VM to
store the extra pte_list_desc cache. However, it also avoids the need
for TLB flushes in most cases and allows KVM to split more pages mapped
by shadow paging.
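
To make the resource checks in cases (1) and (4) concrete, below is a
minimal, illustrative sketch of the kind of gate a splitter has to apply
before touching a huge SPTE. The function name eps_can_split_huge_page()
and the min_descs parameter are hypothetical and are not the code added by
this series; KVM_MIN_FREE_MMU_PAGES, kvm_mmu_available_pages() and
struct kvm_mmu_memory_cache are existing KVM symbols.

static bool eps_can_split_huge_page(struct kvm *kvm,
				    struct kvm_mmu_memory_cache *pte_list_desc_cache,
				    int min_descs)
{
	/* Case (1): refuse to split when close to the shadow page limit. */
	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
		return false;

	/*
	 * Case (4): only proceed if enough pte_list_desc objects have been
	 * pre-allocated to rmap the new lower-level SPTEs. The cache is
	 * topped up beforehand, outside the MMU lock when possible.
	 */
	if (pte_list_desc_cache->nobjs < min_descs)
		return false;

	return true;
}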

The bulk of this series is just refactoring the existing MMU code in
preparation for splitting, specifically to make it possible to operate
on the MMU outside of a vCPU context.

Motivation
----------

During dirty logging, VMs using the shadow MMU suffer from:

(1) Write-protection faults on huge pages that take the MMU lock to
    unmap the huge page, map a 4KiB page, and update the dirty log.

(2) Non-present faults caused by (1) that take the MMU lock to map in
    the missing page.

(3) Write-protection faults on 4KiB pages that take the MMU lock to
    make the page writable and update the dirty log. [Note: These faults
    only take the MMU lock during shadow paging.]

The lock contention from (1), (2) and (3) can severely degrade
application performance to the point of failure.  Eager page splitting
eliminates (1) by moving the splitting of huge pages off the vCPU
threads onto the thread invoking VM-ioctls to configure dirty logging,
and eliminates (2) by fully splitting each huge page into its
constituent small pages. (3) is still a concern for shadow paging
workloads (e.g. nested virtualization) but is not addressed by this
series.

Splitting in the VM-ioctl thread is useful because it can run in the
background without interrupting vCPU execution. However, it does take
the MMU lock so it may introduce some extra contention if vCPUs are
hammering the MMU lock. This is offset by the fact that eager page
splitting drops the MMU lock after splitting each SPTE if there is any
contention, and the fact that eager page splitting is reducing the MMU
lock contention from (1) and (2) above. Even workloads that only write
to 5% of their memory see massive MMU lock contention reduction during
dirty logging thanks to Eager Page Splitting (see Performance data
below).
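
The lock-dropping behaviour mentioned above follows KVM's usual
yield-on-contention pattern. The sketch below is schematic only:
eps_next_huge_sptep() and eps_split_one_huge_spte() are placeholders
rather than functions from this series, while need_resched(),
rwlock_needbreak() and cond_resched_rwlock_write() are the existing
kernel helpers such a loop would lean on.

static void eps_split_memslot(struct kvm *kvm,
			      const struct kvm_memory_slot *slot)
{
	u64 *huge_sptep;

	write_lock(&kvm->mmu_lock);

	/* Hypothetical iterator over the slot's remaining huge SPTEs. */
	while ((huge_sptep = eps_next_huge_sptep(kvm, slot)) != NULL) {
		eps_split_one_huge_spte(kvm, huge_sptep);

		/*
		 * If another thread is waiting on the MMU lock, or this
		 * thread needs to reschedule, drop the lock between SPTEs
		 * so vCPUs can make progress.
		 */
		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
			cond_resched_rwlock_write(&kvm->mmu_lock);
	}

	write_unlock(&kvm->mmu_lock);
}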

A downside of Eager Page Splitting is that it splits all huge pages,
which may include ranges of memory that are never written to by the
guest and thus could theoretically stay huge. Workloads that write to
only a fraction of their memory may see higher TLB miss costs with Eager
Page Splitting enabled. However, that is secondary to the application
failure that otherwise may occur without Eager Page Splitting.

Further work is necessary to improve the TLB miss performance for
read-heavy workloads, such as performing dirty logging at 2M instead of 4K.

Performance
-----------

To measure the performance impact of Eager Page Splitting I ran
dirty_log_perf_test with support for a new flag, -n, that causes each vCPU
thread to run in L2 instead of L1. This support will be sent out in a
separate series.

To measure the impact on customer performance, we can look at the time
it takes all vCPUs to dirty memory after dirty logging has been enabled.
Without Eager Page Splitting enabled, such dirtying must take faults to
split huge pages and bottleneck on the MMU lock.

For write-heavy workloads, there is not as much benefit since nested MMUs
still have to take the write-lock when resolving 4K write-protection
faults (case (3) in the Motivation section). But read-heavy workloads
greatly benefit.

             | Config: tdp_mmu=Y, nested, 100% writes                  |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.367445635s               | 0.359880160s               |
4            | 0.503976497s               | 0.418760595s               |
8            | 1.328792652s               | 1.442455382s               |
16           | 4.609457301s               | 3.649754574s               |
32           | 8.751328485s               | 7.659014140s               |
64           | 20.438482174s              | 17.890019577s              |

             | Config: tdp_mmu=Y, nested, 50% writes                   |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.374082549s               | 0.189881327s               |
4            | 0.498175012s               | 0.216221200s               |
8            | 1.848155856s               | 0.525316794s               |
16           | 4.387725630s               | 1.844867390s               |
32           | 9.153260046s               | 4.061645844s               |
64           | 20.077600588s              | 8.825413269s               |

             | Config: tdp_mmu=Y, nested, 5% writes                    |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.386395635s               | 0.023315599s               |
4            | 0.495352933s               | 0.024971794s               |
8            | 1.568730321s               | 0.052010563s               |
16           | 4.258323166s               | 0.174402708s               |
32           | 9.260176347s               | 0.377929203s               |
64           | 19.861473882s              | 0.905998574s               |

Eager Page Splitting does increase the time it takes to enable dirty
logging when not using initially-all-set, since that's when KVM splits
huge pages. However, this runs in parallel with vCPU execution and drops
the MMU lock whenever there is contention.

             | Config: tdp_mmu=Y, nested, 100% writes                  |
             | Enabling dirty logging time                             |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.001330088s               | 0.018624938s               |
4            | 0.002763111s               | 0.037247815s               |
8            | 0.005220762s               | 0.074637543s               |
16           | 0.010381925s               | 0.149096917s               |
32           | 0.022109466s               | 0.307983859s               |
64           | 0.085547182s               | 0.854228170s               |

Similarly, Eager Page Splitting increases the time it takes to clear the
dirty log when using initially-all-set. The first time userspace
clears the dirty log, KVM will split huge pages:

             | Config: tdp_mmu=Y, nested, 100% writes initially-all-set |
             | Iteration 1 clear dirty log time                        |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.001947098s               | 0.019836052s               |
4            | 0.003817996s               | 0.039574178s               |
8            | 0.007673616s               | 0.079118964s               |
16           | 0.015733003s               | 0.158006697s               |
32           | 0.031728367s               | 0.330793049s               |
64           | 0.108699714s               | 0.891762988s               |

Subsequent calls to clear the dirty log incur almost no additional cost
since KVM can very quickly determine there are no more huge pages to
split via the RMAP. This is unlike the TDP MMU which must re-traverse
the entire page table to check for huge pages.
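
For illustration, the reason this re-check is cheap is that the shadow MMU
keeps per-gfn, per-level rmaps, and an empty rmap head means nothing is
mapped at that level and therefore nothing is left to split. The sketch
below assumes a placeholder slot_rmap_for_gfn() lookup (not a real KVM
function); struct kvm_rmap_head, its val encoding of "no SPTEs", and
KVM_PAGES_PER_HPAGE() are existing KVM definitions.

static bool eps_range_has_huge_sptes(const struct kvm_memory_slot *slot,
				     gfn_t start, gfn_t end, int level)
{
	gfn_t gfn;

	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level)) {
		/* Placeholder for KVM's internal gfn-to-rmap lookup. */
		struct kvm_rmap_head *rmap_head = slot_rmap_for_gfn(slot, gfn, level);

		/* val == 0 means no SPTEs map this gfn at this level. */
		if (rmap_head->val)
			return true;
	}

	return false;
}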

             | Config: tdp_mmu=Y, nested, 100% writes initially-all-set |
             | Iteration 2 clear dirty log time                        |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.009585296s               | 0.009931437s               |
4            | 0.019188984s               | 0.019842738s               |
8            | 0.038568630s               | 0.039951832s               |
16           | 0.077188525s               | 0.079536780s               |
32           | 0.156728329s               | 0.163612725s               |
64           | 0.418679324s               | 0.337336844s               |

Testing
-------

 - Ran all kvm-unit-tests and KVM selftests.

 - Booted a 32-bit non-PAE kernel with shadow paging to verify the
   quadrant change.

 - Ran dirty_log_perf_test with support for a new flag, -n, that causes
   each vCPU thread to run in L2 instead of L1. This support will be
   sent out in a separate series.

 - Tested VM live migration with nested MMUs and huge pages. The live
   migration setup consisted of an 8 vCPU 8 GiB VM running on an Intel
   Cascade Lake host and backed by 1GiB HugeTLBFS memory.  The VM was
   running Debian 10.  Inside that VM was a 6 vCPU 4 GiB nested VM, also
   running Debian 10 and backed by 2M HugeTLBFS. Inside the nested VM ran a
   workload that aggressively accessed memory across 6 threads.
   Tracepoints during the migration confirmed eager page splitting
   occurred, both for the direct TDP MMU mappings, and the nested MMU
   mappings.

Version Log
-----------

v4:
 - Limit eager page splitting to nested MMUs [Sean]
 - Use memory caches for SP allocation [Sean]
 - Use kvm_mmu_get_page() with NULL vCPU for EPS [Sean]
 - Use u64 instead of bit field for shadow translation entry [Sean]
 - Add Sean's R-b to "Use a bool" patch.
 - Fix printf warning in "Cache access bits" patch.
 - Fix asymmetrical pr_err_ratelimited() + WARN() [Sean]
 - Drop unnecessary unsync check for huge pages [Sean]
 - Eliminate use of we in comments and change logs [Sean]
 - Allocate objects arrays dynamically [Ben]

v3: https://lore.kernel.org/kvm/20220401175554.1931568-1-dmatlack@google.com/
 - Add R-b tags from Peter.
 - Explain direct SPs in indirect MMUs in commit message [Peter]
 - Change BUG_ON() to WARN_ON_ONCE() in quadrant calculation [me]
 - Eliminate unnecessary gotos [Peter]
 - Drop mmu_alloc_pte_list_desc() [Peter]
 - Also update access cache in mmu_set_spte() if was_rmapped [Peter]
 - Fix number of gfn bits in shadowed_translation cache [Peter]
 - Pass sp to make_huge_page_split_spte() to derive level and exec [me]
 - Eliminate flush var in kvm_rmap_zap_collapsible_sptes() [Peter]
 - Drop NULL pte_list_desc cache fallback [Peter]
 - Fix get_access to return sp->role.access. [me]
 - Re-use split cache across calls to CLEAR_DIRTY_LOG for better perf [me]
 - Top-up the split cache outside of the MMU lock when possible [me]
 - Refactor prepare_to_split_huge_page() into try_split_huge_page() [me]
 - Collapse PATCH 20, 23, and 24 to avoid intermediate complexity [Peter]
 - Update the RISC-V function stage2_ioremap() [Anup]

v2: https://lore.kernel.org/kvm/20220311002528.2230172-1-dmatlack@google.com/
 - Add performance data for workloads that mix reads and writes [Peter]
 - Collect R-b tags from Ben and Sean.
 - Fix quadrant calculation when deriving role from parent [Sean]
 - Tweak new shadow page function names [Sean]
 - Move set_page_private() to allocation functions [Ben]
 - Only zap collapsible SPTEs up to MAX_LEVEL-1 [Ben]
 - Always top-up pte_list_desc cache to reduce complexity [Ben]
 - Require mmu cache capacity field to be initialized and add WARN()
   to reduce chance of programmer error [Marc]
 - Fix up kvm_mmu_memory_cache struct initialization in arm64 [Marc]

v1: https://lore.kernel.org/kvm/20220203010051.2813563-1-dmatlack@google.com/


David Matlack (20):
  KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
  KVM: x86/mmu: Use a bool for direct
  KVM: x86/mmu: Derive shadow MMU page role from parent
  KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  KVM: x86/mmu: Consolidate shadow page allocation and initialization
  KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
  KVM: x86/mmu: Move guest PT write-protection to account_shadowed()
  KVM: x86/mmu: Pass memory caches to allocate SPs separately
  KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page()
  KVM: x86/mmu: Pass kvm pointer separately from vcpu to
    kvm_mmu_find_shadow_page()
  KVM: x86/mmu: Allow for NULL vcpu pointer in
    __kvm_mmu_get_shadow_page()
  KVM: x86/mmu: Pass const memslot to rmap_add()
  KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  KVM: x86/mmu: Update page stats in __rmap_add()
  KVM: x86/mmu: Cache the access bits of shadowed translations
  KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU
  KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU
  KVM: x86/mmu: Refactor drop_large_spte()
  KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs

 .../admin-guide/kernel-parameters.txt         |   3 +-
 arch/arm64/kvm/arm.c                          |   1 +
 arch/arm64/kvm/mmu.c                          |   5 +-
 arch/mips/kvm/mips.c                          |   2 +
 arch/riscv/kvm/mmu.c                          |  14 +-
 arch/riscv/kvm/vcpu.c                         |   1 +
 arch/x86/include/asm/kvm_host.h               |  22 +-
 arch/x86/kvm/mmu/mmu.c                        | 711 ++++++++++++++----
 arch/x86/kvm/mmu/mmu_internal.h               |  17 +-
 arch/x86/kvm/mmu/paging_tmpl.h                |  17 +-
 arch/x86/kvm/mmu/spte.c                       |  13 +-
 arch/x86/kvm/mmu/spte.h                       |   2 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    |   2 +-
 arch/x86/kvm/x86.c                            |   6 +
 include/linux/kvm_types.h                     |   9 +-
 virt/kvm/kvm_main.c                           |  17 +-
 16 files changed, 675 insertions(+), 167 deletions(-)


base-commit: 150866cd0ec871c765181d145aa0912628289c8a
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


* [PATCH v4 01/20] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Commit fb58a9c345f6 ("KVM: x86/mmu: Optimize MMU page cache lookup for
fully direct MMUs") skipped the unsync checks and write flood clearing
for fully direct MMUs. We can extend this further to skip the checks for
all direct shadow pages. Direct shadow pages in indirect MMUs (i.e.
shadow paging) are used when shadowing a guest huge page with smaller
pages. Such direct shadow pages, like their counterparts in fully direct
MMUs, are never marked unsync and never have a non-zero write-flooding count.

Checking sp->role.direct also generates better code than checking
direct_map because, due to register pressure, direct_map has to get
shoved onto the stack and then pulled back off.

No functional change intended.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 69a30d6d1e2b..3de4cce317e4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2028,7 +2028,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     int direct,
 					     unsigned int access)
 {
-	bool direct_mmu = vcpu->arch.mmu->root_role.direct;
 	union kvm_mmu_page_role role;
 	struct hlist_head *sp_list;
 	unsigned quadrant;
@@ -2070,7 +2069,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			continue;
 		}
 
-		if (direct_mmu)
+		/* unsync and write-flooding only apply to indirect SPs. */
+		if (sp->role.direct)
 			goto trace_get_page;
 
 		if (sp->unsync) {

base-commit: 150866cd0ec871c765181d145aa0912628289c8a
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


* [PATCH v4 02/20] KVM: x86/mmu: Use a bool for direct
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

The parameter "direct" can either be true or false, and all of the
callers pass in a bool variable or true/false literal, so just use the
type bool.

No functional change intended.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3de4cce317e4..dc20eccd6a77 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1703,7 +1703,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 	mmu_spte_clear_no_track(parent_pte);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
+static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, bool direct)
 {
 	struct kvm_mmu_page *sp;
 
@@ -2025,7 +2025,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     gfn_t gfn,
 					     gva_t gaddr,
 					     unsigned level,
-					     int direct,
+					     bool direct,
 					     unsigned int access)
 {
 	union kvm_mmu_page_role role;
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


* [PATCH v4 03/20] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Instead of computing the shadow page role from scratch for every new
page, derive most of the information from the parent shadow page.  This
avoids redundant calculations and reduces the number of parameters to
kvm_mmu_get_page().

Preemptively split out the role calculation to a separate function for
use in a following commit.

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c         | 96 +++++++++++++++++++++++-----------
 arch/x86/kvm/mmu/paging_tmpl.h |  9 ++--
 2 files changed, 71 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index dc20eccd6a77..4249a771818b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2021,31 +2021,15 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
-					     gfn_t gfn,
-					     gva_t gaddr,
-					     unsigned level,
-					     bool direct,
-					     unsigned int access)
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+					     union kvm_mmu_page_role role)
 {
-	union kvm_mmu_page_role role;
 	struct hlist_head *sp_list;
-	unsigned quadrant;
 	struct kvm_mmu_page *sp;
 	int ret;
 	int collisions = 0;
 	LIST_HEAD(invalid_list);
 
-	role = vcpu->arch.mmu->root_role;
-	role.level = level;
-	role.direct = direct;
-	role.access = access;
-	if (role.has_4_byte_gpte) {
-		quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
-		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
-		role.quadrant = quadrant;
-	}
-
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
 		if (sp->gfn != gfn) {
@@ -2063,7 +2047,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			 * Unsync pages must not be left as is, because the new
 			 * upper-level page will be write-protected.
 			 */
-			if (level > PG_LEVEL_4K && sp->unsync)
+			if (role.level > PG_LEVEL_4K && sp->unsync)
 				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 							 &invalid_list);
 			continue;
@@ -2104,14 +2088,14 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 
 	++vcpu->kvm->stat.mmu_cache_miss;
 
-	sp = kvm_mmu_alloc_page(vcpu, direct);
+	sp = kvm_mmu_alloc_page(vcpu, role.direct);
 
 	sp->gfn = gfn;
 	sp->role = role;
 	hlist_add_head(&sp->hash_link, sp_list);
-	if (!direct) {
+	if (!role.direct) {
 		account_shadowed(vcpu->kvm, sp);
-		if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
+		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
 	}
 	trace_kvm_mmu_get_page(sp, true);
@@ -2123,6 +2107,51 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
+static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
+{
+	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
+	union kvm_mmu_page_role role;
+
+	role = parent_sp->role;
+	role.level--;
+	role.access = access;
+	role.direct = direct;
+
+	/*
+	 * If the guest has 4-byte PTEs then that means it's using 32-bit,
+	 * 2-level, non-PAE paging. KVM shadows such guests using 4 PAE page
+	 * directories, each mapping 1/4 of the guest's linear address space
+	 * (1GiB). The shadow pages for those 4 page directories are
+	 * pre-allocated and assigned a separate quadrant in their role.
+	 *
+	 * Since this role is for a child shadow page and there are only 2
+	 * levels, this must be a PG_LEVEL_4K shadow page. Here the quadrant
+	 * will either be 0 or 1 because it maps 1/2 of the address space mapped
+	 * by the guest's PG_LEVEL_4K page table (or 4MiB huge page) that it is
+	 * shadowing. In this case, the quadrant can be derived by the index of
+	 * the SPTE that points to the new child shadow page in the page
+	 * directory (parent_sp). Specifically, every 2 SPTEs in parent_sp
+	 * shadow one half of a guest's page table (or 4MiB huge page) so the
+	 * quadrant is just the parity of the index of the SPTE.
+	 */
+	if (role.has_4_byte_gpte) {
+		WARN_ON_ONCE(role.level != PG_LEVEL_4K);
+		role.quadrant = (sptep - parent_sp->spt) % 2;
+	}
+
+	return role;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
+						 u64 *sptep, gfn_t gfn,
+						 bool direct, u32 access)
+{
+	union kvm_mmu_page_role role;
+
+	role = kvm_mmu_child_role(sptep, direct, access);
+	return kvm_mmu_get_page(vcpu, gfn, role);
+}
+
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
 					struct kvm_vcpu *vcpu, hpa_t root,
 					u64 addr)
@@ -2927,8 +2956,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (is_shadow_present_pte(*it.sptep))
 			continue;
 
-		sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
-				      it.level - 1, true, ACC_ALL);
+		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
 
 		link_shadow_page(vcpu, it.sptep, sp);
 		if (fault->is_tdp && fault->huge_page_disallowed &&
@@ -3310,12 +3338,21 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
 	return ret;
 }
 
-static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
+static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
 			    u8 level, bool direct)
 {
+	union kvm_mmu_page_role role;
 	struct kvm_mmu_page *sp;
 
-	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
+	role = vcpu->arch.mmu->root_role;
+	role.level = level;
+	role.direct = direct;
+	role.access = ACC_ALL;
+
+	if (role.has_4_byte_gpte)
+		role.quadrant = quadrant;
+
+	sp = kvm_mmu_get_page(vcpu, gfn, role);
 	++sp->root_count;
 
 	return __pa(sp->spt);
@@ -3349,8 +3386,8 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 		for (i = 0; i < 4; ++i) {
 			WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
 
-			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
-					      i << 30, PT32_ROOT_LEVEL, true);
+			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), i,
+					      PT32_ROOT_LEVEL, true);
 			mmu->pae_root[i] = root | PT_PRESENT_MASK |
 					   shadow_me_mask;
 		}
@@ -3519,8 +3556,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 			root_gfn = pdptrs[i] >> PAGE_SHIFT;
 		}
 
-		root = mmu_alloc_root(vcpu, root_gfn, i << 30,
-				      PT32_ROOT_LEVEL, false);
+		root = mmu_alloc_root(vcpu, root_gfn, i, PT32_ROOT_LEVEL, false);
 		mmu->pae_root[i] = root | pm_mask;
 	}
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 66f1acf153c4..a8a755e1561d 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -648,8 +648,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		if (!is_shadow_present_pte(*it.sptep)) {
 			table_gfn = gw->table_gfn[it.level - 2];
 			access = gw->pt_access[it.level - 2];
-			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
-					      it.level-1, false, access);
+			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
+						  false, access);
+
 			/*
 			 * We must synchronize the pagetable before linking it
 			 * because the guest doesn't need to flush tlb when
@@ -705,8 +706,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		drop_large_spte(vcpu, it.sptep);
 
 		if (!is_shadow_present_pte(*it.sptep)) {
-			sp = kvm_mmu_get_page(vcpu, base_gfn, fault->addr,
-					      it.level - 1, true, direct_access);
+			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
+						  true, direct_access);
 			link_shadow_page(vcpu, it.sptep, sp);
 			if (fault->huge_page_disallowed &&
 			    fault->req_level >= it.level)
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 03/20] KVM: x86/mmu: Derive shadow MMU page role from parent
@ 2022-04-22 21:05   ` David Matlack
  0 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	David Matlack, Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

Instead of computing the shadow page role from scratch for every new
page, derive most of the information from the parent shadow page.  This
avoids redundant calculations and reduces the number of parameters to
kvm_mmu_get_page().

Preemptively split out the role calculation to a separate function for
use in a following commit.

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c         | 96 +++++++++++++++++++++++-----------
 arch/x86/kvm/mmu/paging_tmpl.h |  9 ++--
 2 files changed, 71 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index dc20eccd6a77..4249a771818b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2021,31 +2021,15 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
-					     gfn_t gfn,
-					     gva_t gaddr,
-					     unsigned level,
-					     bool direct,
-					     unsigned int access)
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+					     union kvm_mmu_page_role role)
 {
-	union kvm_mmu_page_role role;
 	struct hlist_head *sp_list;
-	unsigned quadrant;
 	struct kvm_mmu_page *sp;
 	int ret;
 	int collisions = 0;
 	LIST_HEAD(invalid_list);
 
-	role = vcpu->arch.mmu->root_role;
-	role.level = level;
-	role.direct = direct;
-	role.access = access;
-	if (role.has_4_byte_gpte) {
-		quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
-		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
-		role.quadrant = quadrant;
-	}
-
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
 		if (sp->gfn != gfn) {
@@ -2063,7 +2047,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			 * Unsync pages must not be left as is, because the new
 			 * upper-level page will be write-protected.
 			 */
-			if (level > PG_LEVEL_4K && sp->unsync)
+			if (role.level > PG_LEVEL_4K && sp->unsync)
 				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 							 &invalid_list);
 			continue;
@@ -2104,14 +2088,14 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 
 	++vcpu->kvm->stat.mmu_cache_miss;
 
-	sp = kvm_mmu_alloc_page(vcpu, direct);
+	sp = kvm_mmu_alloc_page(vcpu, role.direct);
 
 	sp->gfn = gfn;
 	sp->role = role;
 	hlist_add_head(&sp->hash_link, sp_list);
-	if (!direct) {
+	if (!role.direct) {
 		account_shadowed(vcpu->kvm, sp);
-		if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
+		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
 	}
 	trace_kvm_mmu_get_page(sp, true);
@@ -2123,6 +2107,51 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
+static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
+{
+	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
+	union kvm_mmu_page_role role;
+
+	role = parent_sp->role;
+	role.level--;
+	role.access = access;
+	role.direct = direct;
+
+	/*
+	 * If the guest has 4-byte PTEs then that means it's using 32-bit,
+	 * 2-level, non-PAE paging. KVM shadows such guests using 4 PAE page
+	 * directories, each mapping 1/4 of the guest's linear address space
+	 * (1GiB). The shadow pages for those 4 page directories are
+	 * pre-allocated and assigned a separate quadrant in their role.
+	 *
+	 * Since this role is for a child shadow page and there are only 2
+	 * levels, this must be a PG_LEVEL_4K shadow page. Here the quadrant
+	 * will either be 0 or 1 because it maps 1/2 of the address space mapped
+	 * by the guest's PG_LEVEL_4K page table (or 4MiB huge page) that it is
+	 * shadowing. In this case, the quadrant can be derived by the index of
+	 * the SPTE that points to the new child shadow page in the page
+	 * directory (parent_sp). Specifically, every 2 SPTEs in parent_sp
+	 * shadow one half of a guest's page table (or 4MiB huge page) so the
+	 * quadrant is just the parity of the index of the SPTE.
+	 */
+	if (role.has_4_byte_gpte) {
+		WARN_ON_ONCE(role.level != PG_LEVEL_4K);
+		role.quadrant = (sptep - parent_sp->spt) % 2;
+	}
+
+	return role;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
+						 u64 *sptep, gfn_t gfn,
+						 bool direct, u32 access)
+{
+	union kvm_mmu_page_role role;
+
+	role = kvm_mmu_child_role(sptep, direct, access);
+	return kvm_mmu_get_page(vcpu, gfn, role);
+}
+
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
 					struct kvm_vcpu *vcpu, hpa_t root,
 					u64 addr)
@@ -2927,8 +2956,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (is_shadow_present_pte(*it.sptep))
 			continue;
 
-		sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
-				      it.level - 1, true, ACC_ALL);
+		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
 
 		link_shadow_page(vcpu, it.sptep, sp);
 		if (fault->is_tdp && fault->huge_page_disallowed &&
@@ -3310,12 +3338,21 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
 	return ret;
 }
 
-static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
+static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
 			    u8 level, bool direct)
 {
+	union kvm_mmu_page_role role;
 	struct kvm_mmu_page *sp;
 
-	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
+	role = vcpu->arch.mmu->root_role;
+	role.level = level;
+	role.direct = direct;
+	role.access = ACC_ALL;
+
+	if (role.has_4_byte_gpte)
+		role.quadrant = quadrant;
+
+	sp = kvm_mmu_get_page(vcpu, gfn, role);
 	++sp->root_count;
 
 	return __pa(sp->spt);
@@ -3349,8 +3386,8 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 		for (i = 0; i < 4; ++i) {
 			WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
 
-			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
-					      i << 30, PT32_ROOT_LEVEL, true);
+			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), i,
+					      PT32_ROOT_LEVEL, true);
 			mmu->pae_root[i] = root | PT_PRESENT_MASK |
 					   shadow_me_mask;
 		}
@@ -3519,8 +3556,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 			root_gfn = pdptrs[i] >> PAGE_SHIFT;
 		}
 
-		root = mmu_alloc_root(vcpu, root_gfn, i << 30,
-				      PT32_ROOT_LEVEL, false);
+		root = mmu_alloc_root(vcpu, root_gfn, i, PT32_ROOT_LEVEL, false);
 		mmu->pae_root[i] = root | pm_mask;
 	}
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 66f1acf153c4..a8a755e1561d 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -648,8 +648,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		if (!is_shadow_present_pte(*it.sptep)) {
 			table_gfn = gw->table_gfn[it.level - 2];
 			access = gw->pt_access[it.level - 2];
-			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
-					      it.level-1, false, access);
+			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
+						  false, access);
+
 			/*
 			 * We must synchronize the pagetable before linking it
 			 * because the guest doesn't need to flush tlb when
@@ -705,8 +706,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		drop_large_spte(vcpu, it.sptep);
 
 		if (!is_shadow_present_pte(*it.sptep)) {
-			sp = kvm_mmu_get_page(vcpu, base_gfn, fault->addr,
-					      it.level - 1, true, direct_access);
+			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
+						  true, direct_access);
 			link_shadow_page(vcpu, it.sptep, sp);
 			if (fault->huge_page_disallowed &&
 			    fault->req_level >= it.level)
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread
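
To make the quadrant derivation in kvm_mmu_child_role() above concrete: for
4-byte-GPTE guests the quadrant of a child shadow page is just the parity of
the index of the SPTE pointing to it within its parent. The following is a
minimal standalone sketch of that computation, not kernel code; the 512-entry
table and the unsigned long SPTE type are simplifying assumptions.

#include <stdio.h>

#define SPTES_PER_SP 512

/* Parity of the SPTE's index in its parent shadow page selects the quadrant. */
static int child_quadrant(const unsigned long *spt, const unsigned long *sptep)
{
	return (int)((sptep - spt) % 2);
}

int main(void)
{
	unsigned long spt[SPTES_PER_SP] = { 0 };

	/* SPTE 0 -> quadrant 0, SPTE 1 -> quadrant 1, SPTE 2 -> quadrant 0. */
	printf("%d %d %d\n", child_quadrant(spt, &spt[0]),
	       child_quadrant(spt, &spt[1]), child_quadrant(spt, &spt[2]));
	return 0;
}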

* [PATCH v4 04/20] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Decompose kvm_mmu_get_page() into separate helper functions to increase
readability and prepare for allocating shadow pages without a vcpu
pointer.

Specifically, pull the guts of kvm_mmu_get_page() into 2 helper
functions:

kvm_mmu_find_shadow_page() -
  Walks the page hash checking for any existing mmu pages that match the
  given gfn and role.

kvm_mmu_alloc_shadow_page() -
  Allocates and initializes an entirely new kvm_mmu_page. This currently
  requires a vcpu pointer for allocation and looking up the memslot, but
  that will be removed in a future commit.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 52 +++++++++++++++++++++++++++++++-----------
 1 file changed, 39 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4249a771818b..5582badf4947 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2021,16 +2021,16 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-					     union kvm_mmu_page_role role)
+static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
+						     gfn_t gfn,
+						     struct hlist_head *sp_list,
+						     union kvm_mmu_page_role role)
 {
-	struct hlist_head *sp_list;
 	struct kvm_mmu_page *sp;
 	int ret;
 	int collisions = 0;
 	LIST_HEAD(invalid_list);
 
-	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
 		if (sp->gfn != gfn) {
 			collisions++;
@@ -2055,7 +2055,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 		/* unsync and write-flooding only apply to indirect SPs. */
 		if (sp->role.direct)
-			goto trace_get_page;
+			goto out;
 
 		if (sp->unsync) {
 			/*
@@ -2081,14 +2081,26 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 		__clear_sp_write_flooding_count(sp);
 
-trace_get_page:
-		trace_kvm_mmu_get_page(sp, false);
 		goto out;
 	}
 
+	sp = NULL;
 	++vcpu->kvm->stat.mmu_cache_miss;
 
-	sp = kvm_mmu_alloc_page(vcpu, role.direct);
+out:
+	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+
+	if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions)
+		vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions;
+	return sp;
+}
+
+static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
+						      gfn_t gfn,
+						      struct hlist_head *sp_list,
+						      union kvm_mmu_page_role role)
+{
+	struct kvm_mmu_page *sp = kvm_mmu_alloc_page(vcpu, role.direct);
 
 	sp->gfn = gfn;
 	sp->role = role;
@@ -2098,12 +2110,26 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
 	}
-	trace_kvm_mmu_get_page(sp, true);
-out:
-	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
 
-	if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions)
-		vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions;
+	return sp;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+					     union kvm_mmu_page_role role)
+{
+	struct hlist_head *sp_list;
+	struct kvm_mmu_page *sp;
+	bool created = false;
+
+	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+
+	sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role);
+	if (!sp) {
+		created = true;
+		sp = kvm_mmu_alloc_shadow_page(vcpu, gfn, sp_list, role);
+	}
+
+	trace_kvm_mmu_get_page(sp, created);
 	return sp;
 }
 
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread
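
After this patch kvm_mmu_get_page() is reduced to a lookup-or-create wrapper
around the two new helpers. The following is a self-contained sketch of that
control flow with toy types; none of these names exist in KVM, they only
mirror the shape of the diff above.

#include <stdbool.h>
#include <stdio.h>

struct toy_sp {
	unsigned long gfn;
	bool in_use;
};

static struct toy_sp table[16];

/* Mirrors kvm_mmu_find_shadow_page(): return NULL on a cache miss. */
static struct toy_sp *toy_find(unsigned long gfn)
{
	for (int i = 0; i < 16; i++)
		if (table[i].in_use && table[i].gfn == gfn)
			return &table[i];
	return NULL;
}

/* Mirrors kvm_mmu_alloc_shadow_page(): grab a free slot and initialize it. */
static struct toy_sp *toy_alloc(unsigned long gfn)
{
	for (int i = 0; i < 16; i++) {
		if (!table[i].in_use) {
			table[i].in_use = true;
			table[i].gfn = gfn;
			return &table[i];
		}
	}
	return NULL;
}

/* Mirrors the new kvm_mmu_get_page(): find, allocate on a miss, then trace. */
static struct toy_sp *toy_get(unsigned long gfn)
{
	bool created = false;
	struct toy_sp *sp = toy_find(gfn);

	if (!sp) {
		created = true;
		sp = toy_alloc(gfn);
	}
	printf("gfn %lu: created=%d\n", gfn, created);
	return sp;
}

int main(void)
{
	toy_get(42);	/* miss: allocates a new entry */
	toy_get(42);	/* hit: reuses the existing entry */
	return 0;
}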

* [PATCH v4 05/20] KVM: x86/mmu: Consolidate shadow page allocation and initialization
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Consolidate kvm_mmu_alloc_page() and kvm_mmu_alloc_shadow_page() under
the latter so that all shadow page allocation and initialization happens
in one place.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 39 +++++++++++++++++----------------------
 1 file changed, 17 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5582badf4947..7d03320f6e08 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1703,27 +1703,6 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 	mmu_spte_clear_no_track(parent_pte);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, bool direct)
-{
-	struct kvm_mmu_page *sp;
-
-	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
-	if (!direct)
-		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
-	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
-
-	/*
-	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
-	 * depends on valid pages being added to the head of the list.  See
-	 * comments in kvm_zap_obsolete_pages().
-	 */
-	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
-	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
-	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
-	return sp;
-}
-
 static void mark_unsync(u64 *spte);
 static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
 {
@@ -2100,7 +2079,23 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
 						      struct hlist_head *sp_list,
 						      union kvm_mmu_page_role role)
 {
-	struct kvm_mmu_page *sp = kvm_mmu_alloc_page(vcpu, role.direct);
+	struct kvm_mmu_page *sp;
+
+	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
+	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	if (!role.direct)
+		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
+
+	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+
+	/*
+	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
+	 * depends on valid pages being added to the head of the list.  See
+	 * comments in kvm_zap_obsolete_pages().
+	 */
+	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
+	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
+	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
 
 	sp->gfn = gfn;
 	sp->role = role;
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 06/20] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Rename 2 functions:

  kvm_mmu_get_page() -> kvm_mmu_get_shadow_page()
  kvm_mmu_free_page() -> kvm_mmu_free_shadow_page()

This change makes it clear that these functions deal with shadow pages
rather than struct pages. It also aligns these functions with the naming
scheme for kvm_mmu_find_shadow_page() and kvm_mmu_alloc_shadow_page().

Prefer "shadow_page" over the shorter "sp" since these are core
functions and the line lengths aren't terrible.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7d03320f6e08..fa7846760887 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1665,7 +1665,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
 	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
-static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
+static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
 {
 	MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
 	hlist_del(&sp->hash_link);
@@ -2109,8 +2109,9 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-					     union kvm_mmu_page_role role)
+static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+						    gfn_t gfn,
+						    union kvm_mmu_page_role role)
 {
 	struct hlist_head *sp_list;
 	struct kvm_mmu_page *sp;
@@ -2170,7 +2171,7 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
 	union kvm_mmu_page_role role;
 
 	role = kvm_mmu_child_role(sptep, direct, access);
-	return kvm_mmu_get_page(vcpu, gfn, role);
+	return kvm_mmu_get_shadow_page(vcpu, gfn, role);
 }
 
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
@@ -2446,7 +2447,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 
 	list_for_each_entry_safe(sp, nsp, invalid_list, link) {
 		WARN_ON(!sp->role.invalid || sp->root_count);
-		kvm_mmu_free_page(sp);
+		kvm_mmu_free_shadow_page(sp);
 	}
 }
 
@@ -3373,7 +3374,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
 	if (role.has_4_byte_gpte)
 		role.quadrant = quadrant;
 
-	sp = kvm_mmu_get_page(vcpu, gfn, role);
+	sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
 	++sp->root_count;
 
 	return __pa(sp->spt);
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 07/20] KVM: x86/mmu: Move guest PT write-protection to account_shadowed()
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Move the code that write-protects newly-shadowed guest page tables into
account_shadowed(). This avoids an extra gfn-to-memslot lookup and is a
more logical place for this code to live. But most importantly, this
reduces kvm_mmu_alloc_shadow_page()'s reliance on having a struct
kvm_vcpu pointer, which will be necessary when creating new shadow pages
during VM ioctls for eager page splitting.

Note, it is safe to drop the role.level == PG_LEVEL_4K check since
account_shadowed() returns early if role.level > PG_LEVEL_4K.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fa7846760887..4f894db88bbf 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -807,6 +807,9 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 						    KVM_PAGE_TRACK_WRITE);
 
 	kvm_mmu_gfn_disallow_lpage(slot, gfn);
+
+	if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
+		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
 }
 
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
@@ -2100,11 +2103,9 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
 	sp->gfn = gfn;
 	sp->role = role;
 	hlist_add_head(&sp->hash_link, sp_list);
-	if (!role.direct) {
+
+	if (!role.direct)
 		account_shadowed(vcpu->kvm, sp);
-		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
-			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
-	}
 
 	return sp;
 }
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread
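
The "safe to drop the level check" note is the usual guard-hoisting argument:
once the callee bails out early for every level the caller was filtering on,
the caller-side filter is redundant. A toy sketch of that equivalence follows;
it is standalone code, not KVM, and the names are made up for illustration.

#include <stdio.h>

#define TOY_LEVEL_4K 1

static void toy_account(int level)
{
	/* Return early for levels above 4K, as the commit message notes
	 * account_shadowed() does. */
	if (level > TOY_LEVEL_4K)
		return;

	printf("write-protecting the level-%d guest page table\n", level);
}

int main(void)
{
	/*
	 * Before: the caller only called toy_account() when level == 1.
	 * After: calling unconditionally is equivalent, because the callee
	 * returns early for every other level.
	 */
	for (int level = 1; level <= 3; level++)
		toy_account(level);
	return 0;
}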

* [PATCH v4 08/20] KVM: x86/mmu: Pass memory caches to allocate SPs separately
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Refactor kvm_mmu_alloc_shadow_page() to take, as a parameter, the caches
from which it will allocate the various pieces of memory for shadow
pages, rather than deriving them from the vcpu pointer. This will be
useful in a future commit where shadow pages are allocated during VM
ioctls for eager page splitting, and thus will use a different set of
caches.

Preemptively pull the caches out all the way to
kvm_mmu_get_shadow_page() since eager page splitting will not be calling
kvm_mmu_alloc_shadow_page() directly.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 36 +++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4f894db88bbf..943595ed0ba1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2077,17 +2077,25 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
+/* Caches used when allocating a new shadow page. */
+struct shadow_page_caches {
+	struct kvm_mmu_memory_cache *page_header_cache;
+	struct kvm_mmu_memory_cache *shadow_page_cache;
+	struct kvm_mmu_memory_cache *gfn_array_cache;
+};
+
 static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
+						      struct shadow_page_caches *caches,
 						      gfn_t gfn,
 						      struct hlist_head *sp_list,
 						      union kvm_mmu_page_role role)
 {
 	struct kvm_mmu_page *sp;
 
-	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
+	sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
 	if (!role.direct)
-		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
+		sp->gfns = kvm_mmu_memory_cache_alloc(caches->gfn_array_cache);
 
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
@@ -2110,9 +2118,10 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
-						    gfn_t gfn,
-						    union kvm_mmu_page_role role)
+static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+						      struct shadow_page_caches *caches,
+						      gfn_t gfn,
+						      union kvm_mmu_page_role role)
 {
 	struct hlist_head *sp_list;
 	struct kvm_mmu_page *sp;
@@ -2123,13 +2132,26 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
 	sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role);
 	if (!sp) {
 		created = true;
-		sp = kvm_mmu_alloc_shadow_page(vcpu, gfn, sp_list, role);
+		sp = kvm_mmu_alloc_shadow_page(vcpu, caches, gfn, sp_list, role);
 	}
 
 	trace_kvm_mmu_get_page(sp, created);
 	return sp;
 }
 
+static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+						    gfn_t gfn,
+						    union kvm_mmu_page_role role)
+{
+	struct shadow_page_caches caches = {
+		.page_header_cache = &vcpu->arch.mmu_page_header_cache,
+		.shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
+		.gfn_array_cache = &vcpu->arch.mmu_gfn_array_cache,
+	};
+
+	return __kvm_mmu_get_shadow_page(vcpu, &caches, gfn, role);
+}
+
 static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
 {
 	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread
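
The struct shadow_page_caches introduced here is a small parameter object:
each caller bundles its memory sources once and the allocator stays agnostic
about where the memory comes from. Below is a self-contained sketch of that
pattern with toy types; the "split" caller is hypothetical and only mirrors
the direction described in the commit message.

#include <stdio.h>
#include <stdlib.h>

/* Stand-in for kvm_mmu_memory_cache: a label plus a trivial allocator. */
struct toy_cache {
	const char *name;
};

static void *toy_cache_alloc(struct toy_cache *cache, size_t size)
{
	printf("allocating %zu bytes from %s\n", size, cache->name);
	return calloc(1, size);
}

/* Mirrors struct shadow_page_caches from the diff above, with toy types. */
struct toy_sp_caches {
	struct toy_cache *header_cache;
	struct toy_cache *page_cache;
};

static void toy_alloc_sp(struct toy_sp_caches *caches)
{
	/* Allocate and immediately release, just to show which caches are used. */
	free(toy_cache_alloc(caches->header_cache, 64));
	free(toy_cache_alloc(caches->page_cache, 4096));
}

int main(void)
{
	struct toy_cache vcpu_hdr = { "vcpu header cache" };
	struct toy_cache vcpu_page = { "vcpu page cache" };
	struct toy_cache split_hdr = { "split header cache" };
	struct toy_cache split_page = { "split page cache" };

	/* A vCPU fault path and a (hypothetical) eager-splitting path simply
	 * pass different caches; the allocator itself does not change. */
	struct toy_sp_caches vcpu_caches = { &vcpu_hdr, &vcpu_page };
	struct toy_sp_caches split_caches = { &split_hdr, &split_page };

	toy_alloc_sp(&vcpu_caches);
	toy_alloc_sp(&split_caches);
	return 0;
}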

* [PATCH v4 09/20] KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page()
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

The vcpu pointer in kvm_mmu_alloc_shadow_page() is only used to get the
kvm pointer. So drop the vcpu pointer and just pass in the kvm pointer.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 943595ed0ba1..df110e0d57cf 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2084,7 +2084,7 @@ struct shadow_page_caches {
 	struct kvm_mmu_memory_cache *gfn_array_cache;
 };
 
-static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
+static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
 						      struct shadow_page_caches *caches,
 						      gfn_t gfn,
 						      struct hlist_head *sp_list,
@@ -2104,16 +2104,16 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
 	 * depends on valid pages being added to the head of the list.  See
 	 * comments in kvm_zap_obsolete_pages().
 	 */
-	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
-	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
-	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
+	sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;
+	list_add(&sp->link, &kvm->arch.active_mmu_pages);
+	kvm_mod_used_mmu_pages(kvm, +1);
 
 	sp->gfn = gfn;
 	sp->role = role;
 	hlist_add_head(&sp->hash_link, sp_list);
 
 	if (!role.direct)
-		account_shadowed(vcpu->kvm, sp);
+		account_shadowed(kvm, sp);
 
 	return sp;
 }
@@ -2132,7 +2132,7 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
 	sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role);
 	if (!sp) {
 		created = true;
-		sp = kvm_mmu_alloc_shadow_page(vcpu, caches, gfn, sp_list, role);
+		sp = kvm_mmu_alloc_shadow_page(vcpu->kvm, caches, gfn, sp_list, role);
 	}
 
 	trace_kvm_mmu_get_page(sp, created);
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 10/20] KVM: x86/mmu: Pass kvm pointer separately from vcpu to kvm_mmu_find_shadow_page()
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Get the kvm pointer from the caller, rather than deriving it from
vcpu->kvm, and plumb the kvm pointer all the way from
kvm_mmu_get_shadow_page(). With this change in place, the vcpu pointer
is only needed to sync indirect shadow pages. In other words,
__kvm_mmu_get_shadow_page() can now be used to get *direct* shadow pages
without a vcpu pointer. This enables eager page splitting, which needs
to allocate direct shadow pages during VM ioctls.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index df110e0d57cf..04029c01aebd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2003,7 +2003,8 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
+static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
+						     struct kvm_vcpu *vcpu,
 						     gfn_t gfn,
 						     struct hlist_head *sp_list,
 						     union kvm_mmu_page_role role)
@@ -2013,7 +2014,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
 	int collisions = 0;
 	LIST_HEAD(invalid_list);
 
-	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
+	for_each_valid_sp(kvm, sp, sp_list) {
 		if (sp->gfn != gfn) {
 			collisions++;
 			continue;
@@ -2030,7 +2031,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
 			 * upper-level page will be write-protected.
 			 */
 			if (role.level > PG_LEVEL_4K && sp->unsync)
-				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
+				kvm_mmu_prepare_zap_page(kvm, sp,
 							 &invalid_list);
 			continue;
 		}
@@ -2058,7 +2059,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
 
 			WARN_ON(!list_empty(&invalid_list));
 			if (ret > 0)
-				kvm_flush_remote_tlbs(vcpu->kvm);
+				kvm_flush_remote_tlbs(kvm);
 		}
 
 		__clear_sp_write_flooding_count(sp);
@@ -2067,13 +2068,13 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
 	}
 
 	sp = NULL;
-	++vcpu->kvm->stat.mmu_cache_miss;
+	++kvm->stat.mmu_cache_miss;
 
 out:
-	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+	kvm_mmu_commit_zap_page(kvm, &invalid_list);
 
-	if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions)
-		vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions;
+	if (collisions > kvm->stat.max_mmu_page_hash_collisions)
+		kvm->stat.max_mmu_page_hash_collisions = collisions;
 	return sp;
 }
 
@@ -2118,7 +2119,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
 	return sp;
 }
 
-static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
+						      struct kvm_vcpu *vcpu,
 						      struct shadow_page_caches *caches,
 						      gfn_t gfn,
 						      union kvm_mmu_page_role role)
@@ -2127,12 +2129,12 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
 	struct kvm_mmu_page *sp;
 	bool created = false;
 
-	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+	sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 
-	sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role);
+	sp = kvm_mmu_find_shadow_page(kvm, vcpu, gfn, sp_list, role);
 	if (!sp) {
 		created = true;
-		sp = kvm_mmu_alloc_shadow_page(vcpu->kvm, caches, gfn, sp_list, role);
+		sp = kvm_mmu_alloc_shadow_page(kvm, caches, gfn, sp_list, role);
 	}
 
 	trace_kvm_mmu_get_page(sp, created);
@@ -2149,7 +2151,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
 		.gfn_array_cache = &vcpu->arch.mmu_gfn_array_cache,
 	};
 
-	return __kvm_mmu_get_shadow_page(vcpu, &caches, gfn, role);
+	return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
 }
 
 static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 11/20] KVM: x86/mmu: Allow for NULL vcpu pointer in __kvm_mmu_get_shadow_page()
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Allow the vcpu pointer in __kvm_mmu_get_shadow_page() to be NULL. Rename
it to vcpu_or_null to prevent future commits from accidentally taking a
dependency on it without first considering the NULL case.

The vcpu pointer is only used for syncing indirect shadow pages in
kvm_mmu_find_shadow_page(). A vcpu pointer is not required for
correctness since unsync pages can simply be zapped. But this should
never occur in practice, since the only use-case for passing a NULL vCPU
pointer is eager page splitting, which will only request direct shadow
pages (which can never be unsync).

Even though __kvm_mmu_get_shadow_page() can gracefully handle a NULL
vcpu, add a WARN() that will fire if __kvm_mmu_get_shadow_page() is ever
called to get an indirect shadow page with a NULL vCPU pointer, since
zapping unsync SPs is a performance overhead that should be considered.
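
For illustration, a minimal standalone C model of the sync-or-zap
decision; all types and helpers below are stand-ins, not KVM symbols:

#include <stdio.h>

/* Stand-in for kvm_mmu_page: only tracks whether the page is unsync. */
struct toy_sp { int unsync; };

/* Stand-in for mmu->sync_page(): returns < 0 if the page cannot be synced. */
static int toy_sync(struct toy_sp *sp) { sp->unsync = 0; return 0; }

/* Stand-in for kvm_mmu_prepare_zap_page(). */
static void toy_zap(struct toy_sp *sp)
{
        printf("zapping SP (unsync=%d)\n", sp->unsync);
}

/*
 * Models __kvm_sync_page(): without a vCPU the sync cannot be performed,
 * so ret stays -1 and the unsync shadow page is simply zapped instead.
 */
static int toy_sync_page(int have_vcpu, struct toy_sp *sp)
{
        int ret = -1;

        if (have_vcpu)
                ret = toy_sync(sp);

        if (ret < 0)
                toy_zap(sp);

        return ret;
}

int main(void)
{
        struct toy_sp sp = { .unsync = 1 };

        toy_sync_page(0, &sp);  /* NULL-vCPU case: falls back to zapping */
        toy_sync_page(1, &sp);  /* vCPU available: page is synced */
        return 0;
}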

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 40 ++++++++++++++++++++++++++++++++--------
 1 file changed, 32 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 04029c01aebd..21407bd4435a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1845,16 +1845,27 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 	  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)])	\
 		if ((_sp)->gfn != (_gfn) || (_sp)->role.direct) {} else
 
-static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
-			 struct list_head *invalid_list)
+static int __kvm_sync_page(struct kvm *kvm, struct kvm_vcpu *vcpu_or_null,
+			   struct kvm_mmu_page *sp,
+			   struct list_head *invalid_list)
 {
-	int ret = vcpu->arch.mmu->sync_page(vcpu, sp);
+	int ret = -1;
+
+	if (vcpu_or_null)
+		ret = vcpu_or_null->arch.mmu->sync_page(vcpu_or_null, sp);
 
 	if (ret < 0)
-		kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
+		kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
+
 	return ret;
 }
 
+static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+			 struct list_head *invalid_list)
+{
+	return __kvm_sync_page(vcpu->kvm, vcpu, sp, invalid_list);
+}
+
 static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
 					struct list_head *invalid_list,
 					bool remote_flush)
@@ -2004,7 +2015,7 @@ static void clear_sp_write_flooding_count(u64 *spte)
 }
 
 static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
-						     struct kvm_vcpu *vcpu,
+						     struct kvm_vcpu *vcpu_or_null,
 						     gfn_t gfn,
 						     struct hlist_head *sp_list,
 						     union kvm_mmu_page_role role)
@@ -2053,7 +2064,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
 			 * If the sync fails, the page is zapped.  If so, break
 			 * in order to rebuild it.
 			 */
-			ret = kvm_sync_page(vcpu, sp, &invalid_list);
+			ret = __kvm_sync_page(kvm, vcpu_or_null, sp, &invalid_list);
 			if (ret < 0)
 				break;
 
@@ -2120,7 +2131,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
 }
 
 static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
-						      struct kvm_vcpu *vcpu,
+						      struct kvm_vcpu *vcpu_or_null,
 						      struct shadow_page_caches *caches,
 						      gfn_t gfn,
 						      union kvm_mmu_page_role role)
@@ -2129,9 +2140,22 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
 	struct kvm_mmu_page *sp;
 	bool created = false;
 
+	/*
+	 * A vCPU pointer should always be provided when getting indirect
+	 * shadow pages, as that shadow page may already exist and need to be
+	 * synced using the vCPU pointer (see __kvm_sync_page()). Direct shadow
+	 * pages are never unsync and thus do not require a vCPU pointer.
+	 *
+	 * No need to panic here as __kvm_sync_page() falls back to zapping an
+	 * unsync page if the vCPU pointer is NULL. But still WARN() since
+	 * such zapping will impact performance and this situation is never
+	 * expected to occur in practice.
+	 */
+	WARN_ON(!vcpu_or_null && !role.direct);
+
 	sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 
-	sp = kvm_mmu_find_shadow_page(kvm, vcpu, gfn, sp_list, role);
+	sp = kvm_mmu_find_shadow_page(kvm, vcpu_or_null, gfn, sp_list, role);
 	if (!sp) {
 		created = true;
 		sp = kvm_mmu_alloc_shadow_page(kvm, caches, gfn, sp_list, role);
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 12/20] KVM: x86/mmu: Pass const memslot to rmap_add()
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

rmap_add() only uses the slot to call gfn_to_rmap() which takes a const
memslot.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 21407bd4435a..142c2f357f1b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1595,7 +1595,7 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 
 #define RMAP_RECYCLE_THRESHOLD 1000
 
-static void rmap_add(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
+static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
 		     u64 *spte, gfn_t gfn)
 {
 	struct kvm_mmu_page *sp;
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 13/20] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Allow adding new entries to the rmap and linking shadow pages without a
struct kvm_vcpu pointer by moving the implementation of rmap_add() and
link_shadow_page() into inner helper functions.

No functional change intended.
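
The pattern is the usual one of "wrapper extracts the vCPU-owned
resource, inner helper takes it explicitly". A standalone C sketch under
that assumption (names are illustrative, not the kernel's):

#include <stdio.h>
#include <stdlib.h>

/* Stand-ins for kvm_mmu_memory_cache and kvm_vcpu. */
struct toy_cache { const char *name; };
struct toy_vcpu  { struct toy_cache pte_list_desc_cache; };

/* Stand-in for kvm_mmu_memory_cache_alloc(). */
static void *toy_cache_alloc(struct toy_cache *cache)
{
        printf("allocating from %s\n", cache->name);
        return malloc(16);
}

/* Inner helper: usable from any context that can supply a cache. */
static void __toy_rmap_add(struct toy_cache *cache, unsigned long spte)
{
        void *desc = toy_cache_alloc(cache);

        /* ... link desc for spte into the rmap chain ... */
        printf("added spte %#lx\n", spte);
        free(desc);
}

/* vCPU wrapper: fault-path callers keep their existing interface. */
static void toy_rmap_add(struct toy_vcpu *vcpu, unsigned long spte)
{
        __toy_rmap_add(&vcpu->pte_list_desc_cache, spte);
}

int main(void)
{
        struct toy_vcpu vcpu = { .pte_list_desc_cache = { "vcpu cache" } };
        struct toy_cache split_cache = { "eager-split cache" };

        toy_rmap_add(&vcpu, 0x1000);            /* vCPU (fault) context */
        __toy_rmap_add(&split_cache, 0x2000);   /* VM-ioctl context, no vCPU */
        return 0;
}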

Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 142c2f357f1b..ed23782eef8a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -722,11 +722,6 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
-static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
-{
-	return kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_list_desc_cache);
-}
-
 static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 {
 	kmem_cache_free(pte_list_desc_cache, pte_list_desc);
@@ -873,7 +868,7 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
 /*
  * Returns the number of pointers in the rmap chain, not counting the new one.
  */
-static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
+static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 			struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
@@ -884,7 +879,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 		rmap_head->val = (unsigned long)spte;
 	} else if (!(rmap_head->val & 1)) {
 		rmap_printk("%p %llx 1->many\n", spte, *spte);
-		desc = mmu_alloc_pte_list_desc(vcpu);
+		desc = kvm_mmu_memory_cache_alloc(cache);
 		desc->sptes[0] = (u64 *)rmap_head->val;
 		desc->sptes[1] = spte;
 		desc->spte_count = 2;
@@ -896,7 +891,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 		while (desc->spte_count == PTE_LIST_EXT) {
 			count += PTE_LIST_EXT;
 			if (!desc->more) {
-				desc->more = mmu_alloc_pte_list_desc(vcpu);
+				desc->more = kvm_mmu_memory_cache_alloc(cache);
 				desc = desc->more;
 				desc->spte_count = 0;
 				break;
@@ -1595,8 +1590,10 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 
 #define RMAP_RECYCLE_THRESHOLD 1000
 
-static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
-		     u64 *spte, gfn_t gfn)
+static void __rmap_add(struct kvm *kvm,
+		       struct kvm_mmu_memory_cache *cache,
+		       const struct kvm_memory_slot *slot,
+		       u64 *spte, gfn_t gfn)
 {
 	struct kvm_mmu_page *sp;
 	struct kvm_rmap_head *rmap_head;
@@ -1605,15 +1602,21 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
 	sp = sptep_to_sp(spte);
 	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
-	rmap_count = pte_list_add(vcpu, spte, rmap_head);
+	rmap_count = pte_list_add(cache, spte, rmap_head);
 
 	if (rmap_count > RMAP_RECYCLE_THRESHOLD) {
-		kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
+		kvm_unmap_rmapp(kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
 		kvm_flush_remote_tlbs_with_address(
-				vcpu->kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
+				kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
 	}
 }
 
+static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
+		     u64 *spte, gfn_t gfn)
+{
+	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn);
+}
+
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
@@ -1684,13 +1687,13 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn)
 	return hash_64(gfn, KVM_MMU_HASH_SHIFT);
 }
 
-static void mmu_page_add_parent_pte(struct kvm_vcpu *vcpu,
+static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache,
 				    struct kvm_mmu_page *sp, u64 *parent_pte)
 {
 	if (!parent_pte)
 		return;
 
-	pte_list_add(vcpu, parent_pte, &sp->parent_ptes);
+	pte_list_add(cache, parent_pte, &sp->parent_ptes);
 }
 
 static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp,
@@ -2286,8 +2289,8 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 	__shadow_walk_next(iterator, *iterator->sptep);
 }
 
-static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
-			     struct kvm_mmu_page *sp)
+static void __link_shadow_page(struct kvm_mmu_memory_cache *cache, u64 *sptep,
+			       struct kvm_mmu_page *sp)
 {
 	u64 spte;
 
@@ -2297,12 +2300,17 @@ static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
 
 	mmu_spte_set(sptep, spte);
 
-	mmu_page_add_parent_pte(vcpu, sp, sptep);
+	mmu_page_add_parent_pte(cache, sp, sptep);
 
 	if (sp->unsync_children || sp->unsync)
 		mark_unsync(sptep);
 }
 
+static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, struct kvm_mmu_page *sp)
+{
+	__link_shadow_page(&vcpu->arch.mmu_pte_list_desc_cache, sptep, sp);
+}
+
 static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 				   unsigned direct_access)
 {
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 14/20] KVM: x86/mmu: Update page stats in __rmap_add()
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Update the page stats in __rmap_add() rather than at the call site. This
will avoid having to manually update page stats when splitting huge
pages in a subsequent commit.

No functional change intended.
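
In other words, the accounting moves to where the rmap entry is actually
created, so any future caller of the inner helper (e.g. eager page
splitting) is counted automatically. A toy model, with made-up names:

#include <stdio.h>

static long toy_page_stats[5];          /* stand-in for kvm->stat.pages[] */

/* The helper itself does the accounting ... */
static void toy_rmap_add(int level)
{
        toy_page_stats[level]++;
        /* ... and then inserts the SPTE into the rmap chain ... */
}

/* ... so every caller, present or future, updates the stats for free. */
int main(void)
{
        toy_rmap_add(1);        /* fault path */
        toy_rmap_add(1);        /* eager page splitting */
        printf("4K mappings: %ld\n", toy_page_stats[1]);
        return 0;
}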

Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ed23782eef8a..6a190fe6c9dc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1601,6 +1601,8 @@ static void __rmap_add(struct kvm *kvm,
 
 	sp = sptep_to_sp(spte);
 	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
+	kvm_update_page_stats(kvm, sp->role.level, 1);
+
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
 	rmap_count = pte_list_add(cache, spte, rmap_head);
 
@@ -2818,7 +2820,6 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 
 	if (!was_rmapped) {
 		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
-		kvm_update_page_stats(vcpu->kvm, level, 1);
 		rmap_add(vcpu, slot, sptep, gfn);
 	}
 
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 15/20] KVM: x86/mmu: Cache the access bits of shadowed translations
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Splitting huge pages requires allocating/finding shadow pages to replace
the huge page. Shadow pages are keyed, in part, off the guest access
permissions they are shadowing. For fully direct MMUs, there is no
shadowing so the access bits in the shadow page role are always ACC_ALL.
But during shadow paging, the guest can enforce whatever access
permissions it wants.

When KVM is resolving a fault, it walks the guest page tables to
determine the guest access permissions. But that is difficult to plumb
when splitting huge pages outside of a fault context, e.g. for eager
page splitting.

To enable eager page splitting, KVM can cache the shadowed (guest)
access permissions whenever it updates the shadow page tables (e.g.
during fault, or FNAME(sync_page)). In fact KVM already does this to
cache the shadowed GFN using the gfns array in the shadow page.
The access permissions take up only 3 bits, leaving 61 bits for the GFN,
which is more than enough. So this change does not require any
additional memory.
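
For reference, here is a standalone sketch of the intended layout,
assuming x86's 4KiB PAGE_SHIFT of 12 and KVM's 3-bit ACC_* encoding; the
names below are illustrative only:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_PAGE_SHIFT  12              /* 4KiB pages */
#define TOY_ACC_ALL     0x7ULL          /* exec | write | user */

/* Pack a shadowed GFN and its guest access bits into one 64-bit entry. */
static uint64_t toy_pack(uint64_t gfn, uint64_t access)
{
        assert((access & ~TOY_ACC_ALL) == 0);
        return (gfn << TOY_PAGE_SHIFT) | access;
}

static uint64_t toy_gfn(uint64_t entry)    { return entry >> TOY_PAGE_SHIFT; }
static uint64_t toy_access(uint64_t entry) { return entry & TOY_ACC_ALL; }

int main(void)
{
        uint64_t e = toy_pack(0xabcd, 0x5);     /* exec + user, not writable */

        printf("gfn=%#llx access=%#llx\n",
               (unsigned long long)toy_gfn(e),
               (unsigned long long)toy_access(e));
        return 0;
}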

Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.

While here, preemptively fix up the WARN_ON() that detects gfn
mismatches in direct SPs. The WARN_ON() was paired with a
pr_err_ratelimited(), which means that users could sometimes see the
WARN without the accompanying error message. Fix this by outputting the
error message as part of the WARN splat.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu/mmu.c          | 82 +++++++++++++++++++++++++--------
 arch/x86/kvm/mmu/mmu_internal.h | 17 ++++++-
 arch/x86/kvm/mmu/paging_tmpl.h  |  8 +++-
 4 files changed, 85 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2c20f715f009..15131aa05701 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -691,7 +691,7 @@ struct kvm_vcpu_arch {
 
 	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
-	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
+	struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
 	/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6a190fe6c9dc..ed65899d15a2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -705,7 +705,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 	if (r)
 		return r;
 	if (maybe_indirect) {
-		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_gfn_array_cache,
+		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
 					       PT64_ROOT_MAX_LEVEL);
 		if (r)
 			return r;
@@ -718,7 +718,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
-	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
@@ -730,23 +730,64 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 {
 	if (!sp->role.direct)
-		return sp->gfns[index];
+		return sp->shadowed_translation[index] >> PAGE_SHIFT;
 
 	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
 }
 
-static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
+static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index, u32 access)
 {
 	if (!sp->role.direct) {
-		sp->gfns[index] = gfn;
+		sp->shadowed_translation[index] &= PAGE_MASK;
+		sp->shadowed_translation[index] |= access;
 		return;
 	}
 
-	if (WARN_ON(gfn != kvm_mmu_page_get_gfn(sp, index)))
-		pr_err_ratelimited("gfn mismatch under direct page %llx "
-				   "(expected %llx, got %llx)\n",
-				   sp->gfn,
-				   kvm_mmu_page_get_gfn(sp, index), gfn);
+	WARN(access != sp->role.access,
+	     "access mismatch under direct page %llx (expected %u, got %u)\n",
+	     sp->gfn, sp->role.access, access);
+}
+
+static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index, gfn_t gfn, u32 access)
+{
+	if (!sp->role.direct) {
+		sp->shadowed_translation[index] = (gfn << PAGE_SHIFT) | access;
+		return;
+	}
+
+	WARN(access != sp->role.access,
+	     "access mismatch under direct page %llx (expected %u, got %u)\n",
+	     sp->gfn, sp->role.access, access);
+
+	WARN(gfn != kvm_mmu_page_get_gfn(sp, index),
+	     "gfn mismatch under direct page %llx (expected %llx, got %llx)\n",
+	     sp->gfn, kvm_mmu_page_get_gfn(sp, index), gfn);
+}
+
+/*
+ * For leaf SPTEs, fetch the *guest* access permissions being shadowed. Note
+ * that the SPTE itself may have a more constrained access permissions that
+ * what the guest enforces. For example, a guest may create an executable
+ * huge PTE but KVM may disallow execution to mitigate iTLB multihit.
+ */
+static u32 kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
+{
+	if (!sp->role.direct)
+		return sp->shadowed_translation[index] & ACC_ALL;
+
+	/*
+	 * For direct MMUs (e.g. TDP or non-paging guests) there are no *guest*
+	 * access permissions being shadowed, which is equivalent to ACC_ALL.
+	 *
+	 * For indirect MMUs (shadow paging), direct shadow pages exist when KVM
+	 * is shadowing a guest huge page with smaller pages, since the guest
+	 * huge page is being directly mapped. In this case the guest access
+	 * permissions being shadowed are the access permissions of the huge
+	 * page.
+	 *
+	 * In both cases, sp->role.access contains the correct access bits.
+	 */
+	return sp->role.access;
 }
 
 /*
@@ -1593,14 +1634,14 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 static void __rmap_add(struct kvm *kvm,
 		       struct kvm_mmu_memory_cache *cache,
 		       const struct kvm_memory_slot *slot,
-		       u64 *spte, gfn_t gfn)
+		       u64 *spte, gfn_t gfn, u32 access)
 {
 	struct kvm_mmu_page *sp;
 	struct kvm_rmap_head *rmap_head;
 	int rmap_count;
 
 	sp = sptep_to_sp(spte);
-	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
+	kvm_mmu_page_set_translation(sp, spte - sp->spt, gfn, access);
 	kvm_update_page_stats(kvm, sp->role.level, 1);
 
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
@@ -1614,9 +1655,9 @@ static void __rmap_add(struct kvm *kvm,
 }
 
 static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
-		     u64 *spte, gfn_t gfn)
+		     u64 *spte, gfn_t gfn, u32 access)
 {
-	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn);
+	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn, access);
 }
 
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
@@ -1680,7 +1721,7 @@ static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
 	list_del(&sp->link);
 	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct)
-		free_page((unsigned long)sp->gfns);
+		free_page((unsigned long)sp->shadowed_translation);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -2098,7 +2139,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
 struct shadow_page_caches {
 	struct kvm_mmu_memory_cache *page_header_cache;
 	struct kvm_mmu_memory_cache *shadow_page_cache;
-	struct kvm_mmu_memory_cache *gfn_array_cache;
+	struct kvm_mmu_memory_cache *shadowed_info_cache;
 };
 
 static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
@@ -2112,7 +2153,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
 	sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
 	if (!role.direct)
-		sp->gfns = kvm_mmu_memory_cache_alloc(caches->gfn_array_cache);
+		sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
 
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
@@ -2177,7 +2218,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
 	struct shadow_page_caches caches = {
 		.page_header_cache = &vcpu->arch.mmu_page_header_cache,
 		.shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
-		.gfn_array_cache = &vcpu->arch.mmu_gfn_array_cache,
+		.shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
 	};
 
 	return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
@@ -2820,7 +2861,10 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 
 	if (!was_rmapped) {
 		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
-		rmap_add(vcpu, slot, sptep, gfn);
+		rmap_add(vcpu, slot, sptep, gfn, pte_access);
+	} else {
+		/* Already rmapped but the pte_access bits may have changed. */
+		kvm_mmu_page_set_access(sp, sptep - sp->spt, pte_access);
 	}
 
 	return ret;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1bff453f7cbe..1614a577a1cb 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -53,8 +53,21 @@ struct kvm_mmu_page {
 	gfn_t gfn;
 
 	u64 *spt;
-	/* hold the gfn of each spte inside spt */
-	gfn_t *gfns;
+
+	/*
+	 * Stores the result of the guest translation being shadowed by each
+	 * SPTE.  KVM shadows two types of guest translations: nGPA -> GPA
+	 * (shadow EPT/NPT) and GVA -> GPA (traditional shadow paging). In both
+	 * cases the result of the translation is a GPA and a set of access
+	 * constraints.
+	 *
+	 * The GFN is stored in the upper bits (PAGE_SHIFT) and the shadowed
+	 * access permissions are stored in the lower bits. Note, for
+	 * convenience and uniformity across guests, the access permissions are
+	 * stored in KVM format (e.g.  ACC_EXEC_MASK) not the raw guest format.
+	 */
+	u64 *shadowed_translation;
+
 	/* Currently serving as active root */
 	union {
 		int root_count;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index a8a755e1561d..97bf53b29b88 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -978,7 +978,8 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 }
 
 /*
- * Using the cached information from sp->gfns is safe because:
+ * Using the information in sp->shadowed_translation (kvm_mmu_page_get_gfn()
+ * and kvm_mmu_page_get_access()) is safe because:
  * - The spte has a reference to the struct page, so the pfn for a given gfn
  *   can't change unless all sptes pointing to it are nuked first.
  *
@@ -1052,12 +1053,15 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		if (sync_mmio_spte(vcpu, &sp->spt[i], gfn, pte_access))
 			continue;
 
-		if (gfn != sp->gfns[i]) {
+		if (gfn != kvm_mmu_page_get_gfn(sp, i)) {
 			drop_spte(vcpu->kvm, &sp->spt[i]);
 			flush = true;
 			continue;
 		}
 
+		if (pte_access != kvm_mmu_page_get_access(sp, i))
+			kvm_mmu_page_set_access(sp, i, pte_access);
+
 		sptep = &sp->spt[i];
 		spte = *sptep;
 		host_writable = spte & shadow_host_writable_mask;
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 16/20] KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Currently make_huge_page_split_spte() assumes execute permissions can be
granted to any 4K SPTE when splitting huge pages. This is true for the
TDP MMU but is not necessarily true for the shadow MMU, since KVM may be
shadowing a non-executable huge page.

To fix this, pass in the child shadow page where the huge page will be
split and derive the execution permission from the shadow page's role.
This is correct because huge pages are always split with a direct
shadow page, and thus the shadow page role contains the correct access
permissions.

No functional change intended.
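
A standalone sketch of the new derivation, with placeholder names
standing in for the relevant kvm_mmu_page_role fields:

#include <stdbool.h>
#include <stdio.h>

#define TOY_ACC_EXEC_MASK 0x1u          /* mirrors KVM's ACC_EXEC_MASK */

/* Placeholder for the bits of the child SP's role that matter here. */
struct toy_role {
        unsigned int access : 3;        /* guest access bits being shadowed */
        unsigned int level  : 4;        /* level of the child shadow page */
};

/*
 * With the child SP's role available, both the target level and whether
 * execution may be granted come straight from the role, rather than
 * assuming execution is always allowed as the TDP-MMU-only code could.
 */
static void toy_split_params(struct toy_role role)
{
        bool exec_allowed = role.access & TOY_ACC_EXEC_MASK;
        int child_level = role.level;

        printf("child_level=%d exec_allowed=%d\n", child_level, exec_allowed);
}

int main(void)
{
        struct toy_role r = { .access = 0x6, .level = 1 };  /* write|user, no exec */

        toy_split_params(r);
        return 0;
}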

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/spte.c    | 13 +++++++------
 arch/x86/kvm/mmu/spte.h    |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.c |  2 +-
 3 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4739b53c9734..9db98fbeee61 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -215,10 +215,11 @@ static u64 make_spte_executable(u64 spte)
  * This is used during huge page splitting to build the SPTEs that make up the
  * new page table.
  */
-u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
+u64 make_huge_page_split_spte(u64 huge_spte, struct kvm_mmu_page *sp, int index)
 {
+	bool exec_allowed = sp->role.access & ACC_EXEC_MASK;
+	int child_level = sp->role.level;
 	u64 child_spte;
-	int child_level;
 
 	if (WARN_ON_ONCE(!is_shadow_present_pte(huge_spte)))
 		return 0;
@@ -227,7 +228,6 @@ u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
 		return 0;
 
 	child_spte = huge_spte;
-	child_level = huge_level - 1;
 
 	/*
 	 * The child_spte already has the base address of the huge page being
@@ -240,10 +240,11 @@ u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
 		child_spte &= ~PT_PAGE_SIZE_MASK;
 
 		/*
-		 * When splitting to a 4K page, mark the page executable as the
-		 * NX hugepage mitigation no longer applies.
+		 * When splitting to a 4K page where execution is allowed, mark
+		 * the page executable as the NX hugepage mitigation no longer
+		 * applies.
 		 */
-		if (is_nx_huge_page_enabled())
+		if (exec_allowed && is_nx_huge_page_enabled())
 			child_spte = make_spte_executable(child_spte);
 	}
 
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 73f12615416f..921ea77f1b5e 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -415,7 +415,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
 	       bool host_writable, u64 *new_spte);
-u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index);
+u64 make_huge_page_split_spte(u64 huge_spte, struct kvm_mmu_page *sp, int index);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
 u64 mark_spte_for_access_track(u64 spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 566548a3efa7..110a34ca41c2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1469,7 +1469,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	 * not been linked in yet and thus is not reachable from any other CPU.
 	 */
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++)
-		sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i);
+		sp->spt[i] = make_huge_page_split_spte(huge_spte, sp, i);
 
 	/*
 	 * Replace the huge spte with a pointer to the populated lower level
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 17/20] KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU (i.e.
in the rmap). This is fine for now since KVM never creates intermediate huge
pages during dirty logging, i.e. a 1GiB page is never partially split to
a 2MiB page.

However, this will stop being true once the shadow MMU participates in
eager page splitting, which can in fact leave behind partially split
huge pages. In preparation for that change, make the shadow MMU iterate
over all necessary levels when zapping collapsible SPTEs.

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ed65899d15a2..479c581e8a96 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6098,18 +6098,25 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 	return need_tlb_flush;
 }
 
+static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
+					   const struct kvm_memory_slot *slot)
+{
+	/*
+	 * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
+	 * pages that are already mapped at the maximum possible level.
+	 */
+	if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
+			      PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1,
+			      true))
+		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+}
+
 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *slot)
 {
 	if (kvm_memslots_have_rmaps(kvm)) {
 		write_lock(&kvm->mmu_lock);
-		/*
-		 * Zap only 4k SPTEs since the legacy MMU only supports dirty
-		 * logging at a 4k granularity and never creates collapsible
-		 * 2m SPTEs during dirty logging.
-		 */
-		if (slot_handle_level_4k(kvm, slot, kvm_mmu_zap_collapsible_spte, true))
-			kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+		kvm_rmap_zap_collapsible_sptes(kvm, slot);
 		write_unlock(&kvm->mmu_lock);
 	}
 
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 18/20] KVM: x86/mmu: Refactor drop_large_spte()
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

drop_large_spte() drops a large SPTE if it exists and then flushes TLBs.
Its helper function, __drop_large_spte(), does the drop without the
flush.

In preparation for eager page splitting, which will need to sometimes
flush when dropping large SPTEs (and sometimes not), push the flushing
logic down into __drop_large_spte() and add a bool parameter to control
it.

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 479c581e8a96..a5961c17eb36 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1183,28 +1183,29 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
 		rmap_remove(kvm, sptep);
 }
 
-
-static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
+static void __drop_large_spte(struct kvm *kvm, u64 *sptep, bool flush)
 {
-	if (is_large_pte(*sptep)) {
-		WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K);
-		drop_spte(kvm, sptep);
-		return true;
-	}
+	struct kvm_mmu_page *sp;
 
-	return false;
-}
+	if (!is_large_pte(*sptep))
+		return;
 
-static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
-{
-	if (__drop_large_spte(vcpu->kvm, sptep)) {
-		struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+	sp = sptep_to_sp(sptep);
+	WARN_ON(sp->role.level == PG_LEVEL_4K);
 
-		kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
+	drop_spte(kvm, sptep);
+
+	if (flush) {
+		kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
 			KVM_PAGES_PER_HPAGE(sp->role.level));
 	}
 }
 
+static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
+{
+	return __drop_large_spte(vcpu->kvm, sptep, true);
+}
+
 /*
  * Write-protect on the specified @sptep, @pt_protect indicates whether
  * spte write-protection is caused by protecting shadow page table.
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 19/20] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
declaration time rather than being fixed for all declarations. This will
be used in a follow-up commit to declare a cache in x86 with a capacity
of 512+ objects without having to increase the capacity of all caches in
KVM.

This change requires that each cache now specify its capacity at runtime,
since the cache struct itself no longer has a fixed capacity known at
compile time. To protect against someone accidentally defining a
kvm_mmu_memory_cache struct directly (without the extra storage), this
commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().

In order to support different capacities, this commit changes the
objects pointer array to be dynamically allocated the first time the
cache is topped-up.

An alternative would be to lay out the objects array after the
kvm_mmu_memory_cache struct, which can be done at compile time. But that
change, unfortunately, adds some grottiness to arm64 and riscv, which
use a function-local (i.e. stack-allocated) kvm_mmu_memory_cache
struct. Since C does not allow anonymous structs in functions, the new
wrapper struct that contains kvm_mmu_memory_cache and the objects
pointer array must be named, which means dealing with an outer and
inner struct. The outer struct can't be dropped since then there would
be no guarantee the kvm_mmu_memory_cache struct and objects array would
be laid out consecutively on the stack.
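
For illustration, a rough sketch of that rejected alternative (the wrapper
struct name and example function below are hypothetical, not part of this
series, and how the cache would be wired up to its trailing storage is
omitted):

	/* Hypothetical wrapper needed for stack-allocated caches. */
	struct kvm_mmu_memory_cache_with_storage {
		struct kvm_mmu_memory_cache cache;
		void *storage[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
	};

	static int example_topup_on_stack(void)
	{
		struct kvm_mmu_memory_cache_with_storage c = {
			.cache.gfp_zero = __GFP_ZERO,
		};

		/* Every caller has to reach through the inner struct. */
		return kvm_mmu_topup_memory_cache(&c.cache, 1);
	}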

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/arm64/kvm/arm.c      |  1 +
 arch/arm64/kvm/mmu.c      |  5 ++++-
 arch/mips/kvm/mips.c      |  2 ++
 arch/riscv/kvm/mmu.c      | 14 +++++++-------
 arch/riscv/kvm/vcpu.c     |  1 +
 arch/x86/kvm/mmu/mmu.c    |  9 +++++++++
 include/linux/kvm_types.h |  9 +++++++--
 virt/kvm/kvm_main.c       | 20 ++++++++++++++++++--
 8 files changed, 49 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 523bc934fe2f..66f6706a1037 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -320,6 +320,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.target = -1;
 	bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
 
+	vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
 
 	/* Set up the timer */
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 53ae2c0640bc..2f2ef6b60ff4 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -764,7 +764,10 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 {
 	phys_addr_t addr;
 	int ret = 0;
-	struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
+	struct kvm_mmu_memory_cache cache = {
+		.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
+		.gfp_zero = __GFP_ZERO,
+	};
 	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
 				     KVM_PGTABLE_PROT_R |
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index a25e0b73ee70..45c7179144dc 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -387,6 +387,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	if (err)
 		goto out_free_gebase;
 
+	vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
+
 	return 0;
 
 out_free_gebase:
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index f80a34fbf102..0e042f40d737 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -347,10 +347,10 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
 	int ret = 0;
 	unsigned long pfn;
 	phys_addr_t addr, end;
-	struct kvm_mmu_memory_cache pcache;
-
-	memset(&pcache, 0, sizeof(pcache));
-	pcache.gfp_zero = __GFP_ZERO;
+	struct kvm_mmu_memory_cache cache = {
+		.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
+		.gfp_zero = __GFP_ZERO,
+	};
 
 	end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
 	pfn = __phys_to_pfn(hpa);
@@ -361,12 +361,12 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
 		if (!writable)
 			pte = pte_wrprotect(pte);
 
-		ret = kvm_mmu_topup_memory_cache(&pcache, stage2_pgd_levels);
+		ret = kvm_mmu_topup_memory_cache(&cache, stage2_pgd_levels);
 		if (ret)
 			goto out;
 
 		spin_lock(&kvm->mmu_lock);
-		ret = stage2_set_pte(kvm, 0, &pcache, addr, &pte);
+		ret = stage2_set_pte(kvm, 0, &cache, addr, &pte);
 		spin_unlock(&kvm->mmu_lock);
 		if (ret)
 			goto out;
@@ -375,7 +375,7 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
 	}
 
 out:
-	kvm_mmu_free_memory_cache(&pcache);
+	kvm_mmu_free_memory_cache(&cache);
 	return ret;
 }
 
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 6785aef4cbd4..bbcb9d4a04fb 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -94,6 +94,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 
 	/* Mark this VCPU never ran */
 	vcpu->arch.ran_atleast_once = false;
+	vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
 
 	/* Setup ISA features available to VCPU */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a5961c17eb36..5b1458b911ab 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5719,12 +5719,21 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
 {
 	int ret;
 
+	vcpu->arch.mmu_pte_list_desc_cache.capacity =
+		KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
 	vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
 
+	vcpu->arch.mmu_page_header_cache.capacity =
+		KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
 	vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
 
+	vcpu->arch.mmu_shadowed_info_cache.capacity =
+		KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
+
+	vcpu->arch.mmu_shadow_page_cache.capacity =
+		KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
 
 	vcpu->arch.mmu = &vcpu->arch.root_mmu;
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index ac1ebb37a0ff..549103a4f7bc 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -83,14 +83,19 @@ struct gfn_to_pfn_cache {
  * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
  * holding MMU locks.  Note, these caches act more like prefetch buffers than
  * classical caches, i.e. objects are not returned to the cache on being freed.
+ *
+ * The storage for the cache object pointers is allocated dynamically when the
+ * cache is topped-up. The capacity field defines the number of object
+ * pointers the allocated array can hold.
  */
 struct kvm_mmu_memory_cache {
 	int nobjs;
+	int capacity;
 	gfp_t gfp_zero;
 	struct kmem_cache *kmem_cache;
-	void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
+	void **objects;
 };
-#endif
+#endif /* KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE */
 
 #define HALT_POLL_HIST_COUNT			32
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dfb7dabdbc63..29f84d7c1950 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -371,12 +371,23 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
 
 int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
 {
+	gfp_t gfp = GFP_KERNEL_ACCOUNT;
 	void *obj;
 
 	if (mc->nobjs >= min)
 		return 0;
-	while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
-		obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
+
+	if (WARN_ON(mc->capacity == 0))
+		return -EINVAL;
+
+	if (!mc->objects) {
+		mc->objects = kvmalloc_array(mc->capacity, sizeof(void *), gfp);
+		if (!mc->objects)
+			return -ENOMEM;
+	}
+
+	while (mc->nobjs < mc->capacity) {
+		obj = mmu_memory_cache_alloc_obj(mc, gfp);
 		if (!obj)
 			return mc->nobjs >= min ? 0 : -ENOMEM;
 		mc->objects[mc->nobjs++] = obj;
@@ -397,6 +408,11 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
 		else
 			free_page((unsigned long)mc->objects[--mc->nobjs]);
 	}
+
+	kvfree(mc->objects);
+
+	/* Note, must set to NULL to avoid use-after-free in the next top-up. */
+	mc->objects = NULL;
 }
 
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 20/20] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
  2022-04-22 21:05 ` David Matlack
@ 2022-04-22 21:05   ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Add support for Eager Page Splitting pages that are mapped by nested
MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB
pages, and then splitting all 2MiB pages to 4KiB pages.
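
A condensed sketch of that walk order (the full implementation, including
the locking and retry handling, is in the mmu.c hunk below):

	/* Split 1GiB pages to 2MiB, then 2MiB pages to 4KiB. */
	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--)
		slot_handle_level_range(kvm, slot,
					nested_mmu_try_split_huge_pages,
					level, level, start, end - 1,
					true, false);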

Note, Eager Page Splitting is limited to nested MMUs as a policy rather
than due to any technical reason (the sp->role.guest_mode check could
just be deleted and Eager Page Splitting would work correctly for all
shadow MMU pages). There is really no reason to support Eager Page
Splitting for tdp_mmu=N, since such support will eventually be phased
out, and there is no current use case for Eager Page Splitting on
hosts where TDP is either disabled or unavailable in hardware.
Furthermore, future improvements to nested MMU scalability may diverge
the code from the legacy shadow paging implementation. These
improvements will be simpler to make if Eager Page Splitting does not
have to worry about legacy shadow paging.

Splitting huge pages mapped by nested MMUs requires dealing with some
extra complexity beyond that of the TDP MMU:

(1) The shadow MMU has a limit on the number of shadow pages that are
    allowed to be allocated. So, as a policy, Eager Page Splitting
    refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
    pages available.

(2) Splitting a huge page may end up re-using an existing lower level
    shadow page tables. This is unlike the TDP MMU which always allocates
    new shadow page tables when splitting.

(3) When installing the lower level SPTEs, they must be added to the
    rmap which may require allocating additional pte_list_desc structs.

Case (2) is especially interesting since it may require a TLB flush,
unlike the TDP MMU which can fully split huge pages without any TLB
flushes. Specifically, an existing lower level page table may point to
even lower level page tables that are not fully populated, effectively
unmapping a portion of the huge page, which requires a flush.

This commit performs such flushes after dropping the huge page and
before installing the lower level page table. This TLB flush could
instead be delayed until the MMU lock is about to be dropped, which
would batch flushes for multiple splits.  However these flushes should
be rare in practice (a huge page must be aliased in multiple SPTEs and
have been split for NX Huge Pages in only some of them). Flushing
immediately is simpler to plumb and also reduces the chances of tripping
over a CPU bug (e.g. see iTLB multihit).
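
The flush decision and its placement, condensed from
nested_mmu_split_huge_page() in the hunk below (a sketch, not the complete
function):

	/* A present non-leaf SPTE may map only part of the huge page. */
	if (is_shadow_present_pte(*sptep))
		flush |= !is_last_spte(*sptep, sp->role.level);
	...
	/*
	 * Drop the huge SPTE, flushing only if a portion was unmapped,
	 * then link in the now-populated lower level page table.
	 */
	__drop_large_spte(kvm, huge_sptep, flush);
	__link_shadow_page(cache, huge_sptep, sp);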

Suggested-by: Peter Feiner <pfeiner@google.com>
[ This commit is based off of the original implementation of Eager Page
  Splitting from Peter in Google's kernel from 2016. ]
Signed-off-by: David Matlack <dmatlack@google.com>
---
 .../admin-guide/kernel-parameters.txt         |   3 +-
 arch/x86/include/asm/kvm_host.h               |  20 ++
 arch/x86/kvm/mmu/mmu.c                        | 276 +++++++++++++++++-
 arch/x86/kvm/x86.c                            |   6 +
 4 files changed, 296 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3f1cc5e317ed..bc3ad3d4df0b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2387,8 +2387,7 @@
 			the KVM_CLEAR_DIRTY ioctl, and only for the pages being
 			cleared.
 
-			Eager page splitting currently only supports splitting
-			huge pages mapped by the TDP MMU.
+			Eager page splitting is only supported when kvm.tdp_mmu=Y.
 
 			Default is Y (on).
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 15131aa05701..5df4dff385a1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1240,6 +1240,24 @@ struct kvm_arch {
 	hpa_t	hv_root_tdp;
 	spinlock_t hv_root_tdp_lock;
 #endif
+
+	/*
+	 * Memory caches used to allocate shadow pages when performing eager
+	 * page splitting. No need for a shadowed_info_cache since eager page
+	 * splitting only allocates direct shadow pages.
+	 */
+	struct kvm_mmu_memory_cache split_shadow_page_cache;
+	struct kvm_mmu_memory_cache split_page_header_cache;
+
+	/*
+	 * Memory cache used to allocate pte_list_desc structs while splitting
+	 * huge pages. In the worst case, to split one huge page, 512
+	 * pte_list_desc structs are needed to add each lower level leaf sptep
+	 * to the rmap plus 1 to extend the parent_ptes rmap of the lower level
+	 * page table.
+	 */
+#define SPLIT_DESC_CACHE_CAPACITY 513
+	struct kvm_mmu_memory_cache split_desc_cache;
 };
 
 struct kvm_vm_stat {
@@ -1608,6 +1626,8 @@ void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
 
+void free_split_caches(struct kvm *kvm);
+
 int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
 
 int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5b1458b911ab..9a59ec9b78fb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5897,6 +5897,18 @@ int kvm_mmu_init_vm(struct kvm *kvm)
 	node->track_write = kvm_mmu_pte_write;
 	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
 	kvm_page_track_register_notifier(kvm, node);
+
+	kvm->arch.split_page_header_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
+	kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
+	kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
+
+	kvm->arch.split_shadow_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
+	kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
+
+	kvm->arch.split_desc_cache.capacity = SPLIT_DESC_CACHE_CAPACITY;
+	kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
+	kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
+
 	return 0;
 }
 
@@ -6028,15 +6040,256 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
+void free_split_caches(struct kvm *kvm)
+{
+	kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
+	kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
+	kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
+}
+
+static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
+{
+	return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
+}
+
+static bool need_topup_split_caches_or_resched(struct kvm *kvm)
+{
+	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
+		return true;
+
+	/*
+	 * In the worst case, SPLIT_DESC_CACHE_CAPACITY descriptors are needed
+	 * to split a single huge page. Calculating how many are actually needed
+	 * is possible but not worth the complexity.
+	 */
+	return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_CAPACITY) ||
+		need_topup(&kvm->arch.split_page_header_cache, 1) ||
+		need_topup(&kvm->arch.split_shadow_page_cache, 1);
+}
+
+static int topup_split_caches(struct kvm *kvm)
+{
+	int r;
+
+	r = kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache,
+				       SPLIT_DESC_CACHE_CAPACITY);
+	if (r)
+		return r;
+
+	r = kvm_mmu_topup_memory_cache(&kvm->arch.split_page_header_cache, 1);
+	if (r)
+		return r;
+
+	return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
+}
+
+static struct kvm_mmu_page *nested_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
+{
+	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+	struct shadow_page_caches caches = {};
+	union kvm_mmu_page_role role;
+	unsigned int access;
+	gfn_t gfn;
+
+	gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
+	access = kvm_mmu_page_get_access(huge_sp, huge_sptep - huge_sp->spt);
+
+	/*
+	 * Note, huge page splitting always uses direct shadow pages, regardless
+	 * of whether the huge page itself is mapped by a direct or indirect
+	 * shadow page, since the huge page region itself is being directly
+	 * mapped with smaller pages.
+	 */
+	role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);
+
+	/* Direct SPs do not require a shadowed_info_cache. */
+	caches.page_header_cache = &kvm->arch.split_page_header_cache;
+	caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
+
+	/* Safe to pass NULL for vCPU since requesting a direct SP. */
+	return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
+}
+
+static void nested_mmu_split_huge_page(struct kvm *kvm,
+				       const struct kvm_memory_slot *slot,
+				       u64 *huge_sptep)
+
+{
+	struct kvm_mmu_memory_cache *cache = &kvm->arch.split_desc_cache;
+	u64 huge_spte = READ_ONCE(*huge_sptep);
+	struct kvm_mmu_page *sp;
+	bool flush = false;
+	u64 *sptep, spte;
+	gfn_t gfn;
+	int index;
+
+	sp = nested_mmu_get_sp_for_split(kvm, huge_sptep);
+
+	for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
+		sptep = &sp->spt[index];
+		gfn = kvm_mmu_page_get_gfn(sp, index);
+
+		/*
+		 * The SP may already have populated SPTEs, e.g. if this huge
+		 * page is aliased by multiple sptes with the same access
+		 * permissions. These entries are guaranteed to map the same
+		 * gfn-to-pfn translation since the SP is direct, so no need to
+		 * modify them.
+		 *
+		 * However, if a given SPTE points to a lower level page table,
+		 * that lower level page table may only be partially populated.
+		 * Installing such SPTEs would effectively unmap a portion of the
+		 * huge page, which requires a TLB flush.
+		 */
+		if (is_shadow_present_pte(*sptep)) {
+			flush |= !is_last_spte(*sptep, sp->role.level);
+			continue;
+		}
+
+		spte = make_huge_page_split_spte(huge_spte, sp, index);
+		mmu_spte_set(sptep, spte);
+		__rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access);
+	}
+
+	/*
+	 * Replace the huge spte with a pointer to the populated lower level
+	 * page table. If the lower-level page table identically maps the huge
+	 * page (i.e. no memory is unmapped), there's no need for a TLB flush.
+	 * Otherwise, flush TLBs after dropping the huge page and before
+	 * installing the shadow page table.
+	 */
+	__drop_large_spte(kvm, huge_sptep, flush);
+	__link_shadow_page(cache, huge_sptep, sp);
+}
+
+static int nested_mmu_try_split_huge_page(struct kvm *kvm,
+					  const struct kvm_memory_slot *slot,
+					  u64 *huge_sptep)
+{
+	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+	int level, r = 0;
+	gfn_t gfn;
+	u64 spte;
+
+	/* Grab information for the tracepoint before dropping the MMU lock. */
+	gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
+	level = huge_sp->role.level;
+	spte = *huge_sptep;
+
+	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) {
+		r = -ENOSPC;
+		goto out;
+	}
+
+	if (need_topup_split_caches_or_resched(kvm)) {
+		write_unlock(&kvm->mmu_lock);
+		cond_resched();
+		/*
+		 * If the topup succeeds, return -EAGAIN to indicate that the
+		 * rmap iterator should be restarted because the MMU lock was
+		 * dropped.
+		 */
+		r = topup_split_caches(kvm) ?: -EAGAIN;
+		write_lock(&kvm->mmu_lock);
+		goto out;
+	}
+
+	nested_mmu_split_huge_page(kvm, slot, huge_sptep);
+
+out:
+	trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
+	return r;
+}
+
+static bool nested_mmu_skip_split_huge_page(u64 *huge_sptep)
+{
+	struct kvm_mmu_page *sp = sptep_to_sp(huge_sptep);
+
+	/* TDP MMU is enabled, so rmap should only contain nested MMU SPs. */
+	if (WARN_ON_ONCE(!sp->role.guest_mode))
+		return true;
+
+	/* The rmaps should never contain non-leaf SPTEs. */
+	if (WARN_ON_ONCE(!is_large_pte(*huge_sptep)))
+		return true;
+
+	/* SPs with level >PG_LEVEL_4K should never be unsync. */
+	if (WARN_ON_ONCE(sp->unsync))
+		return true;
+
+	/* Don't bother splitting huge pages on invalid SPs. */
+	if (sp->role.invalid)
+		return true;
+
+	return false;
+}
+
+static bool nested_mmu_try_split_huge_pages(struct kvm *kvm,
+					    struct kvm_rmap_head *rmap_head,
+					    const struct kvm_memory_slot *slot)
+{
+	struct rmap_iterator iter;
+	u64 *huge_sptep;
+	int r;
+
+restart:
+	for_each_rmap_spte(rmap_head, &iter, huge_sptep) {
+		if (nested_mmu_skip_split_huge_page(huge_sptep))
+			continue;
+
+		r = nested_mmu_try_split_huge_page(kvm, slot, huge_sptep);
+
+		/*
+		 * The split succeeded or needs to be retried because the MMU
+		 * lock was dropped. Either way, restart the iterator to get it
+		 * back into a consistent state.
+		 */
+		if (!r || r == -EAGAIN)
+			goto restart;
+
+		/* The split failed and shouldn't be retried (e.g. -ENOMEM). */
+		break;
+	}
+
+	return false;
+}
+
+static void kvm_nested_mmu_try_split_huge_pages(struct kvm *kvm,
+						const struct kvm_memory_slot *slot,
+						gfn_t start, gfn_t end,
+						int target_level)
+{
+	int level;
+
+	/*
+	 * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
+	 * down to the target level. This ensures pages are recursively split
+	 * all the way to the target level. There's no need to split pages
+	 * already at the target level.
+	 */
+	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
+		slot_handle_level_range(kvm, slot,
+					nested_mmu_try_split_huge_pages,
+					level, level, start, end - 1,
+					true, false);
+	}
+}
+
 /* Must be called with the mmu_lock held in write-mode. */
 void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot,
 				   u64 start, u64 end,
 				   int target_level)
 {
-	if (is_tdp_mmu_enabled(kvm))
-		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end,
-						 target_level, false);
+	if (!is_tdp_mmu_enabled(kvm))
+		return;
+
+	kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level,
+					 false);
+
+	if (kvm_memslots_have_rmaps(kvm))
+		kvm_nested_mmu_try_split_huge_pages(kvm, memslot, start, end,
+						    target_level);
 
 	/*
 	 * A TLB flush is unnecessary at this point for the same reasons as in
@@ -6051,10 +6304,19 @@ void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
 	u64 start = memslot->base_gfn;
 	u64 end = start + memslot->npages;
 
-	if (is_tdp_mmu_enabled(kvm)) {
-		read_lock(&kvm->mmu_lock);
-		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
-		read_unlock(&kvm->mmu_lock);
+	if (!is_tdp_mmu_enabled(kvm))
+		return;
+
+	read_lock(&kvm->mmu_lock);
+	kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level,
+					 true);
+	read_unlock(&kvm->mmu_lock);
+
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
+		kvm_nested_mmu_try_split_huge_pages(kvm, memslot, start, end,
+						    target_level);
+		write_unlock(&kvm->mmu_lock);
 	}
 
 	/*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ab336f7c82e4..e123e24a130f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12161,6 +12161,12 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 		 * page faults will create the large-page sptes.
 		 */
 		kvm_mmu_zap_collapsible_sptes(kvm, new);
+
+		/*
+		 * Free any memory left behind by eager page splitting. Ignore
+		 * the module parameter since userspace might have changed it.
+		 */
+		free_split_caches(kvm);
 	} else {
 		/*
 		 * Initially-all-set does not require write protecting any page,
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 20/20] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
@ 2022-04-22 21:05   ` David Matlack
  0 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-04-22 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	David Matlack, Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

Add support for Eager Page Splitting pages that are mapped by nested
MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB
pages, and then splitting all 2MiB pages to 4KiB pages.

Note, Eager Page Splitting is limited to nested MMUs as a policy rather
than due to any technical reason (the sp->role.guest_mode check could
just be deleted and Eager Page Splitting would work correctly for all
shadow MMU pages). There is really no reason to support Eager Page
Splitting for tdp_mmu=N, since such support will eventually be phased
out, and there is no current use case for Eager Page Splitting on
hosts where TDP is either disabled or unavailable in hardware.
Furthermore, future improvements to nested MMU scalability may diverge
the code from the legacy shadow paging implementation. These
improvements will be simpler to make if Eager Page Splitting does not
have to worry about legacy shadow paging.

Splitting huge pages mapped by nested MMUs requires dealing with some
extra complexity beyond that of the TDP MMU:

(1) The shadow MMU has a limit on the number of shadow pages that are
    allowed to be allocated. So, as a policy, Eager Page Splitting
    refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
    pages available.

(2) Splitting a huge page may end up re-using an existing lower level
    shadow page tables. This is unlike the TDP MMU which always allocates
    new shadow page tables when splitting.

(3) When installing the lower level SPTEs, they must be added to the
    rmap which may require allocating additional pte_list_desc structs.

Case (2) is especially interesting since it may require a TLB flush,
unlike the TDP MMU which can fully split huge pages without any TLB
flushes. Specifically, an existing lower level page table may point to
even lower level page tables that are not fully populated, effectively
unmapping a portion of the huge page, which requires a flush.

This commit performs such flushes after dropping the huge page and
before installing the lower level page table. This TLB flush could
instead be delayed until the MMU lock is about to be dropped, which
would batch flushes for multiple splits.  However these flushes should
be rare in practice (a huge page must be aliased in multiple SPTEs and
have been split for NX Huge Pages in only some of them). Flushing
immediately is simpler to plumb and also reduces the chances of tripping
over a CPU bug (e.g. see iTLB multihit).

Suggested-by: Peter Feiner <pfeiner@google.com>
[ This commit is based off of the original implementation of Eager Page
  Splitting from Peter in Google's kernel from 2016. ]
Signed-off-by: David Matlack <dmatlack@google.com>
---
 .../admin-guide/kernel-parameters.txt         |   3 +-
 arch/x86/include/asm/kvm_host.h               |  20 ++
 arch/x86/kvm/mmu/mmu.c                        | 276 +++++++++++++++++-
 arch/x86/kvm/x86.c                            |   6 +
 4 files changed, 296 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3f1cc5e317ed..bc3ad3d4df0b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2387,8 +2387,7 @@
 			the KVM_CLEAR_DIRTY ioctl, and only for the pages being
 			cleared.
 
-			Eager page splitting currently only supports splitting
-			huge pages mapped by the TDP MMU.
+			Eager page splitting is only supported when kvm.tdp_mmu=Y.
 
 			Default is Y (on).
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 15131aa05701..5df4dff385a1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1240,6 +1240,24 @@ struct kvm_arch {
 	hpa_t	hv_root_tdp;
 	spinlock_t hv_root_tdp_lock;
 #endif
+
+	/*
+	 * Memory caches used to allocate shadow pages when performing eager
+	 * page splitting. No need for a shadowed_info_cache since eager page
+	 * splitting only allocates direct shadow pages.
+	 */
+	struct kvm_mmu_memory_cache split_shadow_page_cache;
+	struct kvm_mmu_memory_cache split_page_header_cache;
+
+	/*
+	 * Memory cache used to allocate pte_list_desc structs while splitting
+	 * huge pages. In the worst case, to split one huge page, 512
+	 * pte_list_desc structs are needed to add each lower level leaf sptep
+	 * to the rmap plus 1 to extend the parent_ptes rmap of the lower level
+	 * page table.
+	 */
+#define SPLIT_DESC_CACHE_CAPACITY 513
+	struct kvm_mmu_memory_cache split_desc_cache;
 };
 
 struct kvm_vm_stat {
@@ -1608,6 +1626,8 @@ void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
 
+void free_split_caches(struct kvm *kvm);
+
 int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
 
 int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5b1458b911ab..9a59ec9b78fb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5897,6 +5897,18 @@ int kvm_mmu_init_vm(struct kvm *kvm)
 	node->track_write = kvm_mmu_pte_write;
 	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
 	kvm_page_track_register_notifier(kvm, node);
+
+	kvm->arch.split_page_header_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
+	kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
+	kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
+
+	kvm->arch.split_shadow_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
+	kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
+
+	kvm->arch.split_desc_cache.capacity = SPLIT_DESC_CACHE_CAPACITY;
+	kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
+	kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
+
 	return 0;
 }
 
@@ -6028,15 +6040,256 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
+void free_split_caches(struct kvm *kvm)
+{
+	kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
+	kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
+	kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
+}
+
+static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
+{
+	return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
+}
+
+static bool need_topup_split_caches_or_resched(struct kvm *kvm)
+{
+	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
+		return true;
+
+	/*
+	 * In the worst case, SPLIT_DESC_CACHE_CAPACITY descriptors are needed
+	 * to split a single huge page. Calculating how many are actually needed
+	 * is possible but not worth the complexity.
+	 */
+	return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_CAPACITY) ||
+		need_topup(&kvm->arch.split_page_header_cache, 1) ||
+		need_topup(&kvm->arch.split_shadow_page_cache, 1);
+}
+
+static int topup_split_caches(struct kvm *kvm)
+{
+	int r;
+
+	r = kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache,
+				       SPLIT_DESC_CACHE_CAPACITY);
+	if (r)
+		return r;
+
+	r = kvm_mmu_topup_memory_cache(&kvm->arch.split_page_header_cache, 1);
+	if (r)
+		return r;
+
+	return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
+}
+
+static struct kvm_mmu_page *nested_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
+{
+	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+	struct shadow_page_caches caches = {};
+	union kvm_mmu_page_role role;
+	unsigned int access;
+	gfn_t gfn;
+
+	gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
+	access = kvm_mmu_page_get_access(huge_sp, huge_sptep - huge_sp->spt);
+
+	/*
+	 * Note, huge page splitting always uses direct shadow pages, regardless
+	 * of whether the huge page itself is mapped by a direct or indirect
+	 * shadow page, since the huge page region itself is being directly
+	 * mapped with smaller pages.
+	 */
+	role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);
+
+	/* Direct SPs do not require a shadowed_info_cache. */
+	caches.page_header_cache = &kvm->arch.split_page_header_cache;
+	caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
+
+	/* Safe to pass NULL for vCPU since requesting a direct SP. */
+	return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
+}
+
+static void nested_mmu_split_huge_page(struct kvm *kvm,
+				       const struct kvm_memory_slot *slot,
+				       u64 *huge_sptep)
+
+{
+	struct kvm_mmu_memory_cache *cache = &kvm->arch.split_desc_cache;
+	u64 huge_spte = READ_ONCE(*huge_sptep);
+	struct kvm_mmu_page *sp;
+	bool flush = false;
+	u64 *sptep, spte;
+	gfn_t gfn;
+	int index;
+
+	sp = nested_mmu_get_sp_for_split(kvm, huge_sptep);
+
+	for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
+		sptep = &sp->spt[index];
+		gfn = kvm_mmu_page_get_gfn(sp, index);
+
+		/*
+		 * The SP may already have populated SPTEs, e.g. if this huge
+		 * page is aliased by multiple sptes with the same access
+		 * permissions. These entries are guaranteed to map the same
+		 * gfn-to-pfn translation since the SP is direct, so no need to
+		 * modify them.
+		 *
+		 * However, if a given SPTE points to a lower level page table,
+		 * that lower level page table may only be partially populated.
+		 * Installing such SPTEs would effectively unmap a portion of the
+		 * huge page, which requires a TLB flush.
+		 */
+		if (is_shadow_present_pte(*sptep)) {
+			flush |= !is_last_spte(*sptep, sp->role.level);
+			continue;
+		}
+
+		spte = make_huge_page_split_spte(huge_spte, sp, index);
+		mmu_spte_set(sptep, spte);
+		__rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access);
+	}
+
+	/*
+	 * Replace the huge spte with a pointer to the populated lower level
+	 * page table. If the lower-level page table identically maps the huge
+	 * page (i.e. no memory is unmapped), there's no need for a TLB flush.
+	 * Otherwise, flush TLBs after dropping the huge page and before
+	 * installing the shadow page table.
+	 */
+	__drop_large_spte(kvm, huge_sptep, flush);
+	__link_shadow_page(cache, huge_sptep, sp);
+}
+
+static int nested_mmu_try_split_huge_page(struct kvm *kvm,
+					  const struct kvm_memory_slot *slot,
+					  u64 *huge_sptep)
+{
+	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+	int level, r = 0;
+	gfn_t gfn;
+	u64 spte;
+
+	/* Grab information for the tracepoint before dropping the MMU lock. */
+	gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
+	level = huge_sp->role.level;
+	spte = *huge_sptep;
+
+	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) {
+		r = -ENOSPC;
+		goto out;
+	}
+
+	if (need_topup_split_caches_or_resched(kvm)) {
+		write_unlock(&kvm->mmu_lock);
+		cond_resched();
+		/*
+		 * If the topup succeeds, return -EAGAIN to indicate that the
+		 * rmap iterator should be restarted because the MMU lock was
+		 * dropped.
+		 */
+		r = topup_split_caches(kvm) ?: -EAGAIN;
+		write_lock(&kvm->mmu_lock);
+		goto out;
+	}
+
+	nested_mmu_split_huge_page(kvm, slot, huge_sptep);
+
+out:
+	trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
+	return r;
+}
+
+static bool nested_mmu_skip_split_huge_page(u64 *huge_sptep)
+{
+	struct kvm_mmu_page *sp = sptep_to_sp(huge_sptep);
+
+	/* TDP MMU is enabled, so rmap should only contain nested MMU SPs. */
+	if (WARN_ON_ONCE(!sp->role.guest_mode))
+		return true;
+
+	/* The rmaps should never contain non-leaf SPTEs. */
+	if (WARN_ON_ONCE(!is_large_pte(*huge_sptep)))
+		return true;
+
+	/* SPs with level > PG_LEVEL_4K should never be unsync. */
+	if (WARN_ON_ONCE(sp->unsync))
+		return true;
+
+	/* Don't bother splitting huge pages on invalid SPs. */
+	if (sp->role.invalid)
+		return true;
+
+	return false;
+}
+
+static bool nested_mmu_try_split_huge_pages(struct kvm *kvm,
+					    struct kvm_rmap_head *rmap_head,
+					    const struct kvm_memory_slot *slot)
+{
+	struct rmap_iterator iter;
+	u64 *huge_sptep;
+	int r;
+
+restart:
+	for_each_rmap_spte(rmap_head, &iter, huge_sptep) {
+		if (nested_mmu_skip_split_huge_page(huge_sptep))
+			continue;
+
+		r = nested_mmu_try_split_huge_page(kvm, slot, huge_sptep);
+
+		/*
+		 * The split succeeded or needs to be retried because the MMU
+		 * lock was dropped. Either way, restart the iterator to get it
+		 * back into a consistent state.
+		 */
+		if (!r || r == -EAGAIN)
+			goto restart;
+
+		/* The split failed and shouldn't be retried (e.g. -ENOMEM). */
+		break;
+	}
+
+	return false;
+}
+
+static void kvm_nested_mmu_try_split_huge_pages(struct kvm *kvm,
+						const struct kvm_memory_slot *slot,
+						gfn_t start, gfn_t end,
+						int target_level)
+{
+	int level;
+
+	/*
+	 * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
+	 * down to the target level. This ensures pages are recursively split
+	 * all the way to the target level. There's no need to split pages
+	 * already at the target level.
+	 */
+	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
+		slot_handle_level_range(kvm, slot,
+					nested_mmu_try_split_huge_pages,
+					level, level, start, end - 1,
+					true, false);
+	}
+}
+
 /* Must be called with the mmu_lock held in write-mode. */
 void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot,
 				   u64 start, u64 end,
 				   int target_level)
 {
-	if (is_tdp_mmu_enabled(kvm))
-		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end,
-						 target_level, false);
+	if (!is_tdp_mmu_enabled(kvm))
+		return;
+
+	kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level,
+					 false);
+
+	if (kvm_memslots_have_rmaps(kvm))
+		kvm_nested_mmu_try_split_huge_pages(kvm, memslot, start, end,
+						    target_level);
 
 	/*
 	 * A TLB flush is unnecessary at this point for the same reasons as in
@@ -6051,10 +6304,19 @@ void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
 	u64 start = memslot->base_gfn;
 	u64 end = start + memslot->npages;
 
-	if (is_tdp_mmu_enabled(kvm)) {
-		read_lock(&kvm->mmu_lock);
-		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
-		read_unlock(&kvm->mmu_lock);
+	if (!is_tdp_mmu_enabled(kvm))
+		return;
+
+	read_lock(&kvm->mmu_lock);
+	kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level,
+					 true);
+	read_unlock(&kvm->mmu_lock);
+
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
+		kvm_nested_mmu_try_split_huge_pages(kvm, memslot, start, end,
+						    target_level);
+		write_unlock(&kvm->mmu_lock);
 	}
 
 	/*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ab336f7c82e4..e123e24a130f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12161,6 +12161,12 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 		 * page faults will create the large-page sptes.
 		 */
 		kvm_mmu_zap_collapsible_sptes(kvm, new);
+
+		/*
+		 * Free any memory left behind by eager page splitting. Ignore
+		 * the module parameter since userspace might have changed it.
+		 */
+		free_split_caches(kvm);
 	} else {
 		/*
 		 * Initially-all-set does not require write protecting any page,
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog
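
As a rough usage sketch of how the paths above get exercised (the slot id, GPA,
size, and the vm_fd/backing variables are illustrative assumptions, not values
from this series): toggling dirty logging on a memslot drives
kvm_mmu_slot_try_split_huge_pages() on enable and free_split_caches() on
disable.

	struct kvm_userspace_memory_region region = {
		.slot = 0,                              /* assumed slot id */
		.flags = KVM_MEM_LOG_DIRTY_PAGES,       /* enable dirty logging */
		.guest_phys_addr = 0x100000000ull,      /* assumed GPA */
		.memory_size = 1ull << 30,              /* assumed 1GiB slot */
		.userspace_addr = (__u64)(unsigned long)backing, /* assumed HVA */
	};

	/*
	 * Enabling dirty logging triggers eager page splitting here (when the
	 * initially-all-set capability is not in use; otherwise splitting
	 * happens at clear-dirty-log time, per the documentation hunk above).
	 */
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

	/*
	 * Clearing the flag disables dirty logging; KVM then zaps collapsible
	 * SPTEs and frees the split caches via free_split_caches().
	 */
	region.flags = 0;
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);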


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 19/20] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-04-22 21:05   ` David Matlack
@ 2022-04-23  8:08     ` kernel test robot
  -1 siblings, 0 replies; 120+ messages in thread
From: kernel test robot @ 2022-04-23  8:08 UTC (permalink / raw)
  To: David Matlack, Paolo Bonzini
  Cc: kbuild-all, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, Peter Xu,
	maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Hi David,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on 150866cd0ec871c765181d145aa0912628289c8a]

url:    https://github.com/intel-lab-lkp/linux/commits/David-Matlack/KVM-Extend-Eager-Page-Splitting-to-the-shadow-MMU/20220423-062108
base:   150866cd0ec871c765181d145aa0912628289c8a
config: riscv-randconfig-r005-20220422 (https://download.01.org/0day-ci/archive/20220423/202204231516.bclimUe4-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/871c5afc76a6f414c03f433d06bacfd928910b1b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review David-Matlack/KVM-Extend-Eager-Page-Splitting-to-the-shadow-MMU/20220423-062108
        git checkout 871c5afc76a6f414c03f433d06bacfd928910b1b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross W=1 O=build_dir ARCH=riscv SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

   arch/riscv/kvm/mmu.c: In function 'stage2_ioremap':
>> arch/riscv/kvm/mmu.c:364:56: error: 'struct kvm_mmu_memory_cache' has no member named 'cache'
     364 |                 ret = kvm_mmu_topup_memory_cache(&cache.cache, stage2_pgd_levels);
         |                                                        ^
   arch/riscv/kvm/mmu.c:369:52: error: 'struct kvm_mmu_memory_cache' has no member named 'cache'
     369 |                 ret = stage2_set_pte(kvm, 0, &cache.cache, addr, &pte);
         |                                                    ^
   arch/riscv/kvm/mmu.c:378:41: error: 'struct kvm_mmu_memory_cache' has no member named 'cache'
     378 |         kvm_mmu_free_memory_cache(&cache.cache);
         |                                         ^
>> arch/riscv/kvm/mmu.c:350:37: warning: variable 'cache' set but not used [-Wunused-but-set-variable]
     350 |         struct kvm_mmu_memory_cache cache = {
         |                                     ^~~~~


vim +364 arch/riscv/kvm/mmu.c

   342	
   343	static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
   344				  unsigned long size, bool writable)
   345	{
   346		pte_t pte;
   347		int ret = 0;
   348		unsigned long pfn;
   349		phys_addr_t addr, end;
 > 350		struct kvm_mmu_memory_cache cache = {
   351			.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
   352			.gfp_zero = __GFP_ZERO,
   353		};
   354	
   355		end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
   356		pfn = __phys_to_pfn(hpa);
   357	
   358		for (addr = gpa; addr < end; addr += PAGE_SIZE) {
   359			pte = pfn_pte(pfn, PAGE_KERNEL);
   360	
   361			if (!writable)
   362				pte = pte_wrprotect(pte);
   363	
 > 364			ret = kvm_mmu_topup_memory_cache(&cache.cache, stage2_pgd_levels);
   365			if (ret)
   366				goto out;
   367	
   368			spin_lock(&kvm->mmu_lock);
   369			ret = stage2_set_pte(kvm, 0, &cache.cache, addr, &pte);
   370			spin_unlock(&kvm->mmu_lock);
   371			if (ret)
   372				goto out;
   373	
   374			pfn++;
   375		}
   376	
   377	out:
   378		kvm_mmu_free_memory_cache(&cache.cache);
   379		return ret;
   380	}
   381	
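
The errors above are the leftover "cache.cache" accesses: per line 350 the
local is now a bare struct kvm_mmu_memory_cache, so it has no "cache" member.
A sketch of the kind of fixup the report implies (not a patch from this
series) is to pass the cache directly:

	-		ret = kvm_mmu_topup_memory_cache(&cache.cache, stage2_pgd_levels);
	+		ret = kvm_mmu_topup_memory_cache(&cache, stage2_pgd_levels);
	...
	-		ret = stage2_set_pte(kvm, 0, &cache.cache, addr, &pte);
	+		ret = stage2_set_pte(kvm, 0, &cache, addr, &pte);
	...
	-	kvm_mmu_free_memory_cache(&cache.cache);
	+	kvm_mmu_free_memory_cache(&cache);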

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 19/20] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-04-22 21:05   ` David Matlack
@ 2022-04-24 15:21     ` kernel test robot
  -1 siblings, 0 replies; 120+ messages in thread
From: kernel test robot @ 2022-04-24 15:21 UTC (permalink / raw)
  To: David Matlack, Paolo Bonzini
  Cc: llvm, kbuild-all, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, Peter Xu,
	maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Hi David,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on 150866cd0ec871c765181d145aa0912628289c8a]

url:    https://github.com/intel-lab-lkp/linux/commits/David-Matlack/KVM-Extend-Eager-Page-Splitting-to-the-shadow-MMU/20220423-062108
base:   150866cd0ec871c765181d145aa0912628289c8a
config: riscv-randconfig-r016-20220424 (https://download.01.org/0day-ci/archive/20220424/202204242355.T1SzNT9S-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 1cddcfdc3c683b393df1a5c9063252eb60e52818)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install riscv cross compiling tool for clang build
        # apt-get install binutils-riscv64-linux-gnu
        # https://github.com/intel-lab-lkp/linux/commit/871c5afc76a6f414c03f433d06bacfd928910b1b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review David-Matlack/KVM-Extend-Eager-Page-Splitting-to-the-shadow-MMU/20220423-062108
        git checkout 871c5afc76a6f414c03f433d06bacfd928910b1b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=riscv SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> arch/riscv/kvm/mmu.c:364:43: error: no member named 'cache' in 'struct kvm_mmu_memory_cache'
                   ret = kvm_mmu_topup_memory_cache(&cache.cache, stage2_pgd_levels);
                                                     ~~~~~ ^
   arch/riscv/kvm/mmu.c:369:39: error: no member named 'cache' in 'struct kvm_mmu_memory_cache'
                   ret = stage2_set_pte(kvm, 0, &cache.cache, addr, &pte);
                                                 ~~~~~ ^
   arch/riscv/kvm/mmu.c:378:35: error: no member named 'cache' in 'struct kvm_mmu_memory_cache'
           kvm_mmu_free_memory_cache(&cache.cache);
                                      ~~~~~ ^
   3 errors generated.


vim +364 arch/riscv/kvm/mmu.c

   342	
   343	static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
   344				  unsigned long size, bool writable)
   345	{
   346		pte_t pte;
   347		int ret = 0;
   348		unsigned long pfn;
   349		phys_addr_t addr, end;
   350		struct kvm_mmu_memory_cache cache = {
   351			.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
   352			.gfp_zero = __GFP_ZERO,
   353		};
   354	
   355		end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
   356		pfn = __phys_to_pfn(hpa);
   357	
   358		for (addr = gpa; addr < end; addr += PAGE_SIZE) {
   359			pte = pfn_pte(pfn, PAGE_KERNEL);
   360	
   361			if (!writable)
   362				pte = pte_wrprotect(pte);
   363	
 > 364			ret = kvm_mmu_topup_memory_cache(&cache.cache, stage2_pgd_levels);
   365			if (ret)
   366				goto out;
   367	
   368			spin_lock(&kvm->mmu_lock);
   369			ret = stage2_set_pte(kvm, 0, &cache.cache, addr, &pte);
   370			spin_unlock(&kvm->mmu_lock);
   371			if (ret)
   372				goto out;
   373	
   374			pfn++;
   375		}
   376	
   377	out:
   378		kvm_mmu_free_memory_cache(&cache.cache);
   379		return ret;
   380	}
   381	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 03/20] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-04-22 21:05   ` David Matlack
@ 2022-05-05 21:50     ` Sean Christopherson
  -1 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2022-05-05 21:50 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 22, 2022, David Matlack wrote:
> Instead of computing the shadow page role from scratch for every new
> page, derive most of the information from the parent shadow page.  This
> avoids redundant calculations and reduces the number of parameters to
> kvm_mmu_get_page().
> 
> Preemptively split out the role calculation to a separate function for
> use in a following commit.
> 
> No functional change intended.
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c         | 96 +++++++++++++++++++++++-----------
>  arch/x86/kvm/mmu/paging_tmpl.h |  9 ++--
>  2 files changed, 71 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index dc20eccd6a77..4249a771818b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2021,31 +2021,15 @@ static void clear_sp_write_flooding_count(u64 *spte)
>  	__clear_sp_write_flooding_count(sptep_to_sp(spte));
>  }
>  
> -static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> -					     gfn_t gfn,
> -					     gva_t gaddr,
> -					     unsigned level,
> -					     bool direct,
> -					     unsigned int access)
> +static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> +					     union kvm_mmu_page_role role)
>  {
> -	union kvm_mmu_page_role role;
>  	struct hlist_head *sp_list;
> -	unsigned quadrant;
>  	struct kvm_mmu_page *sp;
>  	int ret;
>  	int collisions = 0;
>  	LIST_HEAD(invalid_list);
>  
> -	role = vcpu->arch.mmu->root_role;
> -	role.level = level;
> -	role.direct = direct;
> -	role.access = access;
> -	if (role.has_4_byte_gpte) {
> -		quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
> -		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
> -		role.quadrant = quadrant;
> -	}
> -

When you rebase to kvm/queue, the helper will need to deal with

	if (level <= vcpu->arch.mmu->cpu_role.base.level)
		role.passthrough = 0;

KVM should never create a passthrough huge page, so I believe it's just a matter
of adding yet another boolean param to kvm_mmu_child_role().


>  	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>  	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
>  		if (sp->gfn != gfn) {
> @@ -2063,7 +2047,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  			 * Unsync pages must not be left as is, because the new
>  			 * upper-level page will be write-protected.
>  			 */
> -			if (level > PG_LEVEL_4K && sp->unsync)
> +			if (role.level > PG_LEVEL_4K && sp->unsync)
>  				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
>  							 &invalid_list);
>  			continue;

...

> @@ -3310,12 +3338,21 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
>  	return ret;
>  }
>  
> -static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
> +static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
>  			    u8 level, bool direct)
>  {
> +	union kvm_mmu_page_role role;
>  	struct kvm_mmu_page *sp;
>  
> -	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
> +	role = vcpu->arch.mmu->root_role;
> +	role.level = level;
> +	role.direct = direct;
> +	role.access = ACC_ALL;
> +
> +	if (role.has_4_byte_gpte)
> +		role.quadrant = quadrant;

Maybe add a comment explaining the PAE and 32-bit paging paths share a call for
allocating PDPTEs?  Otherwise it looks like passing a non-zero quadrant when the
guest doesn't have 4-byte PTEs should be a bug.

Hmm, even better, if the check is moved to the caller, then this can be:

	role.level = level;
	role.direct = direct;
	role.access = ACC_ALL;
	role.quadrant = quadrant;

	WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
	WARN_ON_ONCE(direct && role.has_4_byte_gpte);

and no comment is necessary.

> +
> +	sp = kvm_mmu_get_page(vcpu, gfn, role);
>  	++sp->root_count;
>  
>  	return __pa(sp->spt);
> @@ -3349,8 +3386,8 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>  		for (i = 0; i < 4; ++i) {
>  			WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
>  
> -			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
> -					      i << 30, PT32_ROOT_LEVEL, true);
> +			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), i,

The @quadrant here can be hardcoded to '0', as has_4_byte_gpte is guaranteed to be
false if the MMU is direct.  And then in the indirect path, set gva (and then
quadrant) based on 'i' iff the guest is using 32-bit paging.

Probably worth making it a separate patch just in case I'm forgetting something.
Lightly tested...

--
From: Sean Christopherson <seanjc@google.com>
Date: Thu, 5 May 2022 14:19:35 -0700
Subject: [PATCH] KVM: x86/mmu: Pass '0' for @gva when allocating root with
 8-byte gpte

Pass '0' instead of the "real" gva when allocating a direct PAE root,
a.k.a. a direct PDPTE, and when allocating indirect roots that shadow
64-bit / 8-byte GPTEs.

The @gva is only needed if the root is shadowing 32-bit paging in the
guest, in which case KVM needs to use different shadow pages for each of
the two 4-byte GPTEs covered by KVM's 8-byte PAE SPTE.

For direct MMUs, there's obviously no shadowing, and for indirect MMUs with
8-byte GPTEs the guest and shadow page tables cover the same ranges, so there
is no quadrant to select.

In anticipation of moving the quadrant logic into mmu_alloc_root(), WARN
if a non-zero @gva is passed for !4-byte GPTEs, and WARN if 4-byte GPTEs
are ever combined with a direct root (there's no shadowing, so TDP roots
should ignore the GPTE size).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index dc20eccd6a77..6dfa3cfa8394 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3313,8 +3313,12 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
 static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
 			    u8 level, bool direct)
 {
+	union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
 	struct kvm_mmu_page *sp;

+	WARN_ON_ONCE(gva && !role.has_4_byte_gpte);
+	WARN_ON_ONCE(direct && role.has_4_byte_gpte);
+
 	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
 	++sp->root_count;

@@ -3349,8 +3353,8 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 		for (i = 0; i < 4; ++i) {
 			WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));

-			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
-					      i << 30, PT32_ROOT_LEVEL, true);
+			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), 0,
+					      PT32_ROOT_LEVEL, true);
 			mmu->pae_root[i] = root | PT_PRESENT_MASK |
 					   shadow_me_mask;
 		}
@@ -3435,6 +3439,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	u64 pdptrs[4], pm_mask;
 	gfn_t root_gfn, root_pgd;
 	hpa_t root;
+	gva_t gva;
 	unsigned i;
 	int r;

@@ -3508,6 +3513,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 		}
 	}

+	gva = 0;
 	for (i = 0; i < 4; ++i) {
 		WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));

@@ -3517,9 +3523,11 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 				continue;
 			}
 			root_gfn = pdptrs[i] >> PAGE_SHIFT;
+		} else if (mmu->cpu_role.base.level == PT32_ROOT_LEVEL) {
+			gva = i << 30;
 		}

-		root = mmu_alloc_root(vcpu, root_gfn, i << 30,
+		root = mmu_alloc_root(vcpu, root_gfn, gva,
 				      PT32_ROOT_LEVEL, false);
 		mmu->pae_root[i] = root | pm_mask;
 	}

base-commit: 8bae380ad7dd3c31266d3685841ea4ce574d462d
--


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 04/20] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  2022-04-22 21:05   ` David Matlack
@ 2022-05-05 21:58     ` Sean Christopherson
  -1 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2022-05-05 21:58 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 22, 2022, David Matlack wrote:
> Decompose kvm_mmu_get_page() into separate helper functions to increase
> readability and prepare for allocating shadow pages without a vcpu
> pointer.
> 
> Specifically, pull the guts of kvm_mmu_get_page() into 2 helper
> functions:
> 
> kvm_mmu_find_shadow_page() -
>   Walks the page hash checking for any existing mmu pages that match the
>   given gfn and role.
> 
> kvm_mmu_alloc_shadow_page()
>   Allocates and initializes an entirely new kvm_mmu_page. This currently
>   requires a vcpu pointer for allocation and looking up the memslot but
>   that will be removed in a future commit.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

Nice!

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 05/20] KVM: x86/mmu: Consolidate shadow page allocation and initialization
  2022-04-22 21:05   ` David Matlack
@ 2022-05-05 22:10     ` Sean Christopherson
  -1 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2022-05-05 22:10 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 22, 2022, David Matlack wrote:
> Consolidate kvm_mmu_alloc_page() and kvm_mmu_alloc_shadow_page() under
> the latter so that all shadow page allocation and initialization happens
> in one place.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 39 +++++++++++++++++----------------------
>  1 file changed, 17 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 5582badf4947..7d03320f6e08 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1703,27 +1703,6 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
>  	mmu_spte_clear_no_track(parent_pte);
>  }
>  
> -static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, bool direct)
> -{
> -	struct kvm_mmu_page *sp;
> -
> -	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> -	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> -	if (!direct)
> -		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
> -	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> -
> -	/*
> -	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
> -	 * depends on valid pages being added to the head of the list.  See
> -	 * comments in kvm_zap_obsolete_pages().
> -	 */
> -	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> -	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
> -	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
> -	return sp;
> -}
> -
>  static void mark_unsync(u64 *spte);
>  static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
>  {
> @@ -2100,7 +2079,23 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
>  						      struct hlist_head *sp_list,
>  						      union kvm_mmu_page_role role)
>  {
> -	struct kvm_mmu_page *sp = kvm_mmu_alloc_page(vcpu, role.direct);
> +	struct kvm_mmu_page *sp;
> +
> +	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> +	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> +	if (!role.direct)
> +		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
> +
> +	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> +
> +	/*
> +	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
> +	 * depends on valid pages being added to the head of the list.  See
> +	 * comments in kvm_zap_obsolete_pages().
> +	 */
> +	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> +	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
> +	kvm_mod_used_mmu_pages(vcpu->kvm, +1);

To reduce churn later on, what about opportunistically grabbing vcpu->kvm in a
local variable in this patch?

Actually, at that point, it's a super trivial change, so you can probably just drop 

	KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page()

and do the vcpu/kvm swap as part of

	KVM: x86/mmu: Pass memory caches to allocate SPs separately

>  	sp->gfn = gfn;
>  	sp->role = role;
> -- 
> 2.36.0.rc2.479.g8af0fa9b8e-goog
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 06/20] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
  2022-04-22 21:05   ` David Matlack
@ 2022-05-05 22:15     ` Sean Christopherson
  -1 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2022-05-05 22:15 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 22, 2022, David Matlack wrote:
> Rename 2 functions:
> 
>   kvm_mmu_get_page() -> kvm_mmu_get_shadow_page()
>   kvm_mmu_free_page() -> kvm_mmu_free_shadow_page()
> 
> This change makes it clear that these functions deal with shadow pages
> rather than struct pages. It also aligns these functions with the naming
> scheme for kvm_mmu_find_shadow_page() and kvm_mmu_alloc_shadow_page().
> 
> Prefer "shadow_page" over the shorter "sp" since these are core
> functions and the line lengths aren't terrible.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 07/20] KVM: x86/mmu: Move guest PT write-protection to account_shadowed()
  2022-04-22 21:05   ` David Matlack
@ 2022-05-05 22:51     ` Sean Christopherson
  -1 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2022-05-05 22:51 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 22, 2022, David Matlack wrote:
> Move the code that write-protects newly-shadowed guest page tables into
> account_shadowed(). This avoids an extra gfn-to-memslot lookup and is a
> more logical place for this code to live. But most importantly, this
> reduces kvm_mmu_alloc_shadow_page()'s reliance on having a struct
> kvm_vcpu pointer, which will be necessary when creating new shadow pages
> during VM ioctls for eager page splitting.
> 
> Note, it is safe to drop the role.level == PG_LEVEL_4K check since
> account_shadowed() returns early if role.level > PG_LEVEL_4K.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

Reviewed-by: Sean Christopherson <seanjc@google.com>

>  arch/x86/kvm/mmu/mmu.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index fa7846760887..4f894db88bbf 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -807,6 +807,9 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
>  						    KVM_PAGE_TRACK_WRITE);
>  
>  	kvm_mmu_gfn_disallow_lpage(slot, gfn);
> +
> +	if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
> +		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
>  }
>  
>  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> @@ -2100,11 +2103,9 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
>  	sp->gfn = gfn;
>  	sp->role = role;
>  	hlist_add_head(&sp->hash_link, sp_list);
> -	if (!role.direct) {
> +
> +	if (!role.direct)
>  		account_shadowed(vcpu->kvm, sp);
> -		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))

Huh.  Two thoughts.

1. Can you add a patch to drop kvm_vcpu_write_protect_gfn() entirely, i.e. convert
   mmu_sync_children() to use kvm_mmu_slot_gfn_write_protect?  It's largely a moot
   point, but only because mmu_sync_children() only operates on shadow pages that
   are relevant to the current non-SMM/SMM role.  And _that_ holds true only because
   KVM does kvm_mmu_reset_context() and drops roots for the vCPU on SMM transitions.

   That'd be a good opportunity to move this pair into a helper (a sketch of one follows thought 2):

   	slots = kvm_memslots_for_spte_role(kvm, sp->role);
	slot = __gfn_to_memslot(slots, gfn);

2. This got me thinking...  Write-protecting for shadow paging should NOT be
   associated with the vCPU or even the role.  The SMM memslots conceptually
   operate on the same guest physical memory, SMM is just given access to memory
   that is not present in the non-SMM memslots.

   If KVM creates SPTEs for non-SMM, then it needs to write-protect _all_ memslots
   that contain the relevant gfn, e.g. if the guest takes an SMI and modifies the
   non-SMM page tables, then KVM needs to trap and emulate (or unsync) those writes.

   The mess "works" because no sane SMM handler (kind of a contradiction in terms)
   will modify non-SMM page tables, and vice versa.

   The entire duplicate memslots approach is flawed.  Better would have been to
   make SMM a flag and hide SMM-only memslots, not duplicated everything...
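
   For reference, a minimal sketch of the helper suggested in (1), with a
   hypothetical name (only the wrapper is new; the body is the two calls
   quoted above):

	/* Hypothetical helper name; resolves the memslot for @sp's role. */
	static struct kvm_memory_slot *sp_gfn_to_memslot(struct kvm *kvm,
							 struct kvm_mmu_page *sp,
							 gfn_t gfn)
	{
		struct kvm_memslots *slots = kvm_memslots_for_spte_role(kvm, sp->role);

		return __gfn_to_memslot(slots, gfn);
	}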

> -			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
> -	}
>  
>  	return sp;
>  }
> -- 
> 2.36.0.rc2.479.g8af0fa9b8e-goog
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 08/20] KVM: x86/mmu: Pass memory caches to allocate SPs separately
  2022-04-22 21:05   ` David Matlack
@ 2022-05-05 23:00     ` Sean Christopherson
  -1 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2022-05-05 23:00 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 22, 2022, David Matlack wrote:
> Refactor kvm_mmu_alloc_shadow_page() to receive the caches from which it
> will allocate the various pieces of memory for shadow pages as a
> parameter, rather than deriving them from the vcpu pointer. This will be
> useful in a future commit where shadow pages are allocated during VM
> ioctls for eager page splitting, and thus will use a different set of
> caches.
> 
> Preemptively pull the caches out all the way to
> kvm_mmu_get_shadow_page() since eager page splitting will not be calling
> kvm_mmu_alloc_shadow_page() directly.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

...

>  static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
> +						      struct shadow_page_caches *caches,

Definitely worth doing the "kvm" capture in an earlier patch, and doing the s/vcpu/kvm
here, the diff on top is tiny.  The shortlog/changelog would need minor tweaks, but
that's not a big deal.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index da1c3cf91778..15784bab985f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2084,13 +2084,12 @@ struct shadow_page_caches {
        struct kvm_mmu_memory_cache *gfn_array_cache;
 };

-static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
+static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
                                                      struct shadow_page_caches *caches,
                                                      gfn_t gfn,
                                                      struct hlist_head *sp_list,
                                                      union kvm_mmu_page_role role)
 {
-       struct kvm *kvm = vcpu->kvm;
        struct kvm_mmu_page *sp;

        sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
@@ -2133,7 +2132,7 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
        sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role);
        if (!sp) {
                created = true;
-               sp = kvm_mmu_alloc_shadow_page(vcpu, caches, gfn, sp_list, role);
+               sp = kvm_mmu_alloc_shadow_page(vcpu->kvm, caches, gfn, sp_list, role);
        }

        trace_kvm_mmu_get_page(sp, created);


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/20] KVM: x86/mmu: Allow for NULL vcpu pointer in __kvm_mmu_get_shadow_page()
  2022-04-22 21:05   ` David Matlack
@ 2022-05-05 23:33     ` Sean Christopherson
  -1 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2022-05-05 23:33 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 22, 2022, David Matlack wrote:
> Allow the vcpu pointer in __kvm_mmu_get_shadow_page() to be NULL. Rename
> it to vcpu_or_null to prevent future commits from accidentally taking
> dependency on it without first considering the NULL case.
> 
> The vcpu pointer is only used for syncing indirect shadow pages in
> kvm_mmu_find_shadow_page(). A vcpu pointer is not required for
> correctness since unsync pages can simply be zapped. But this should
> never occur in practice, since the only use-case for passing a NULL vCPU
> pointer is eager page splitting which will only request direct shadow
> pages (which can never be unsync).
> 
> Even though __kvm_mmu_get_shadow_page() can gracefully handle a NULL
> vcpu, add a WARN() that will fire if __kvm_mmu_get_shadow_page() is ever
> called to get an indirect shadow page with a NULL vCPU pointer, since
> zapping unsync SPs is a performance overhead that should be considered.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 40 ++++++++++++++++++++++++++++++++--------
>  1 file changed, 32 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 04029c01aebd..21407bd4435a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1845,16 +1845,27 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>  	  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)])	\
>  		if ((_sp)->gfn != (_gfn) || (_sp)->role.direct) {} else
>  
> -static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> -			 struct list_head *invalid_list)
> +static int __kvm_sync_page(struct kvm *kvm, struct kvm_vcpu *vcpu_or_null,
> +			   struct kvm_mmu_page *sp,
> +			   struct list_head *invalid_list)
>  {
> -	int ret = vcpu->arch.mmu->sync_page(vcpu, sp);
> +	int ret = -1;
> +
> +	if (vcpu_or_null)

This should never happen.  I like the idea of warning early, but I really don't
like that the WARN is far removed from the code that actually depends on @vcpu
being non-NULL. Case in point, KVM should have bailed on the WARN and never
reached this point.  And the inner __kvm_sync_page() is completely unnecessary.

I also don't love the vcpu_or_null terminology; I get the intent, but it doesn't
really help the reader understand why/when it's NULL.

I played around with casting, e.g. to/from an unsigned long or void *, to prevent
usage, but that doesn't work very well because 'unsigned long' ends up being
awkward/confusing, and 'void *' is easily lost on a function call.  And both
lose type safety :-(

All in all, I think I'd prefer this patch to simply be a KVM_BUG_ON() if
kvm_mmu_find_shadow_page() encounters an unsync page.  Less churn, and IMO there's
no real loss in robustness, e.g. we'd really have to screw up code review and
testing to introduce a null vCPU pointer dereference in this code.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3d102522804a..5aed9265f592 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2041,6 +2041,13 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
                        goto out;

                if (sp->unsync) {
+                       /*
+                        * Getting indirect shadow pages without a vCPU pointer
+                        * is not supported, i.e. this should never happen.
+                        */
+                       if (KVM_BUG_ON(!vcpu, kvm))
+                               break;
+
                        /*
                         * The page is good, but is stale.  kvm_sync_page does
                         * get the latest guest state, but (unlike mmu_unsync_children)


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 13/20] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  2022-04-22 21:05   ` David Matlack
@ 2022-05-05 23:46     ` Sean Christopherson
  -1 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2022-05-05 23:46 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 22, 2022, David Matlack wrote:
> -static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
> -			     struct kvm_mmu_page *sp)
> +static void __link_shadow_page(struct kvm_mmu_memory_cache *cache, u64 *sptep,
> +			       struct kvm_mmu_page *sp)
>  {
>  	u64 spte;
>  
> @@ -2297,12 +2300,17 @@ static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
>  
>  	mmu_spte_set(sptep, spte);
>  
> -	mmu_page_add_parent_pte(vcpu, sp, sptep);
> +	mmu_page_add_parent_pte(cache, sp, sptep);
>  
>  	if (sp->unsync_children || sp->unsync)
>  		mark_unsync(sptep);
>  }
>  
> +static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, struct kvm_mmu_page *sp)

Nit, would prefer to wrap here, especially since __link_shadow_page() wraps.
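
For illustration, a wrapped version would look like this (body unchanged from
the patch, only the parameter list reflowed to match __link_shadow_page()):

	static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
				     struct kvm_mmu_page *sp)
	{
		__link_shadow_page(&vcpu->arch.mmu_pte_list_desc_cache, sptep, sp);
	}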

> +{
> +	__link_shadow_page(&vcpu->arch.mmu_pte_list_desc_cache, sptep, sp);
> +}
> +
>  static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>  				   unsigned direct_access)
>  {
> -- 
> 2.36.0.rc2.479.g8af0fa9b8e-goog
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 15/20] KVM: x86/mmu: Cache the access bits of shadowed translations
  2022-04-22 21:05   ` David Matlack
@ 2022-05-06 19:47     ` Sean Christopherson
  -1 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2022-05-06 19:47 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 22, 2022, David Matlack wrote:
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index a8a755e1561d..97bf53b29b88 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -978,7 +978,8 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  }
>  
>  /*
> - * Using the cached information from sp->gfns is safe because:
> + * Using the information in sp->shadowed_translation (kvm_mmu_page_get_gfn()
> + * and kvm_mmu_page_get_access()) is safe because:
>   * - The spte has a reference to the struct page, so the pfn for a given gfn
>   *   can't change unless all sptes pointing to it are nuked first.
>   *
> @@ -1052,12 +1053,15 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  		if (sync_mmio_spte(vcpu, &sp->spt[i], gfn, pte_access))
>  			continue;
>  
> -		if (gfn != sp->gfns[i]) {
> +		if (gfn != kvm_mmu_page_get_gfn(sp, i)) {
>  			drop_spte(vcpu->kvm, &sp->spt[i]);
>  			flush = true;
>  			continue;
>  		}
>  
> +		if (pte_access != kvm_mmu_page_get_access(sp, i))
> +			kvm_mmu_page_set_access(sp, i, pte_access);

Very tangentially related, and more an FYI than anything else (I'll send a patch
separately)...   Replying here because this got me wondering about the validity of
pte_access.

There's an existing bug for nEPT here, which I proved after 3 hours of fighting
with KUT's EPT tests (ugh).

  1. execute-only EPT entries are supported
>   2. L1 creates an upper-level RW EPTE and a 4kb leaf RW EPTE
  3. L2 accesses the address; KVM installs a SPTE
  4. L1 modifies the leaf EPTE to be X-only, and does INVEPT
  5. ept_sync_page() creates a SPTE with pte_access=0 / RWX=0

The logic for pte_access (just above this code) is:

		pte_access = sp->role.access;
		pte_access &= FNAME(gpte_access)(gpte);

The parent guest EPTE is 'RW', and so sp->role.access is 'RW'.  When the new 'X'
EPTE is ANDed with the 'RW' parent protections, the result is a RWX=0 SPTE.  This
is only possible if execute-only is supported, because otherwise PTEs are always
readable, i.e. shadow_present_mask is non-zero.

I don't think anything bad happens per se, but it's odd to have a !PRESENT in
hardware, shadow-present SPTE.  I think the correct/easiest fix is:

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index b025decf610d..f8ea881cfce6 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1052,7 +1052,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
                if (sync_mmio_spte(vcpu, &sp->spt[i], gfn, pte_access))
                        continue;

-               if (gfn != sp->gfns[i]) {
+               if ((!pte_access && !shadow_present_mask) || gfn != sp->gfns[i]) {
                        drop_spte(vcpu->kvm, &sp->spt[i]);
                        flush = true;
                        continue;
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 75c9e87d446a..9ad60662beac 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -101,6 +101,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
        u64 spte = SPTE_MMU_PRESENT_MASK;
        bool wrprot = false;

+       WARN_ON_ONCE(!pte_access && !shadow_present_mask);
+
        if (sp->role.ad_disabled)
                spte |= SPTE_TDP_AD_DISABLED_MASK;
        else if (kvm_mmu_page_ad_need_write_protect(sp))


> +
>  		sptep = &sp->spt[i];
>  		spte = *sptep;
>  		host_writable = spte & shadow_host_writable_mask;
> -- 
> 2.36.0.rc2.479.g8af0fa9b8e-goog
> 

^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 01/20] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
  2022-04-22 21:05   ` David Matlack
@ 2022-05-07  7:46     ` Lai Jiangshan
  -1 siblings, 0 replies; 120+ messages in thread
From: Lai Jiangshan @ 2022-05-07  7:46 UTC (permalink / raw)
  To: David Matlack, Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner


On 2022/4/23 05:05, David Matlack wrote:
> Commit fb58a9c345f6 ("KVM: x86/mmu: Optimize MMU page cache lookup for
> fully direct MMUs") skipped the unsync checks and write flood clearing
> for full direct MMUs. We can extend this further to skip the checks for
> all direct shadow pages. Direct shadow pages in indirect MMUs (i.e.
> shadow paging) are used when shadowing a guest huge page with smaller
> pages. Such direct shadow pages, like their counterparts in fully direct
> MMUs, are never marked unsynced or have a non-zero write-flooding count.
>
> Checking sp->role.direct also generates better code than checking
> direct_map because, due to register pressure, direct_map has to get
> shoved onto the stack and then pulled back off.
>
> No functional change intended.
>
> Reviewed-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Matlack <dmatlack@google.com>


Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 02/20] KVM: x86/mmu: Use a bool for direct
  2022-04-22 21:05   ` David Matlack
@ 2022-05-07  7:46     ` Lai Jiangshan
  -1 siblings, 0 replies; 120+ messages in thread
From: Lai Jiangshan @ 2022-05-07  7:46 UTC (permalink / raw)
  To: David Matlack, Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner


On 2022/4/23 05:05, David Matlack wrote:
> The parameter "direct" can either be true or false, and all of the
> callers pass in a bool variable or true/false literal, so just use the
> type bool.
>
> No functional change intended.
>
> Reviewed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---


Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 20/20] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
  2022-04-22 21:05   ` David Matlack
@ 2022-05-07  7:51     ` Lai Jiangshan
  -1 siblings, 0 replies; 120+ messages in thread
From: Lai Jiangshan @ 2022-05-07  7:51 UTC (permalink / raw)
  To: David Matlack, Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner


On 2022/4/23 05:05, David Matlack wrote:
> Add support for Eager Page Splitting pages that are mapped by nested
> MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB
> pages, and then splitting all 2MiB pages to 4KiB pages.
>
> Note, Eager Page Splitting is limited to nested MMUs as a policy rather
> than due to any technical reason (the sp->role.guest_mode check could
> just be deleted and Eager Page Splitting would work correctly for all
> shadow MMU pages). There is really no reason to support Eager Page
> Splitting for tdp_mmu=N, since such support will eventually be phased
> out, and there is no current use case supporting Eager Page Splitting on
> hosts where TDP is either disabled or unavailable in hardware.
> Furthermore, future improvements to nested MMU scalability may diverge
> the code from the legacy shadow paging implementation. These
> improvements will be simpler to make if Eager Page Splitting does not
> have to worry about legacy shadow paging.
>
> Splitting huge pages mapped by nested MMUs requires dealing with some
> extra complexity beyond that of the TDP MMU:
>
> (1) The shadow MMU has a limit on the number of shadow pages that are
>      allowed to be allocated. So, as a policy, Eager Page Splitting
>      refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
>      pages available.
>
> (2) Splitting a huge page may end up re-using an existing lower level
>      shadow page table. This is unlike the TDP MMU which always allocates
>      new shadow page tables when splitting.
>
> (3) When installing the lower level SPTEs, they must be added to the
>      rmap which may require allocating additional pte_list_desc structs.
>
> Case (2) is especially interesting since it may require a TLB flush,
> unlike the TDP MMU which can fully split huge pages without any TLB
> flushes. Specifically, an existing lower level page table may point to
> even lower level page tables that are not fully populated, effectively
> unmapping a portion of the huge page, which requires a flush.
>
> This commit performs such flushes after dropping the huge page and
> before installing the lower level page table. This TLB flush could
> instead be delayed until the MMU lock is about to be dropped, which
> would batch flushes for multiple splits.  However these flushes should
> be rare in practice (a huge page must be aliased in multiple SPTEs and
> have been split for NX Huge Pages in only some of them). Flushing
> immediately is simpler to plumb and also reduces the chances of tripping
> over a CPU bug (e.g. see iTLB multihit).
>
> Suggested-by: Peter Feiner <pfeiner@google.com>
> [ This commit is based off of the original implementation of Eager Page
>    Splitting from Peter in Google's kernel from 2016. ]
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>   .../admin-guide/kernel-parameters.txt         |   3 +-
>   arch/x86/include/asm/kvm_host.h               |  20 ++
>   arch/x86/kvm/mmu/mmu.c                        | 276 +++++++++++++++++-
>   arch/x86/kvm/x86.c                            |   6 +
>   4 files changed, 296 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 3f1cc5e317ed..bc3ad3d4df0b 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2387,8 +2387,7 @@
>   			the KVM_CLEAR_DIRTY ioctl, and only for the pages being
>   			cleared.
>   
> -			Eager page splitting currently only supports splitting
> -			huge pages mapped by the TDP MMU.
> +			Eager page splitting is only supported when kvm.tdp_mmu=Y.
>   
>   			Default is Y (on).
>   
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 15131aa05701..5df4dff385a1 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1240,6 +1240,24 @@ struct kvm_arch {
>   	hpa_t	hv_root_tdp;
>   	spinlock_t hv_root_tdp_lock;
>   #endif
> +
> +	/*
> +	 * Memory caches used to allocate shadow pages when performing eager
> +	 * page splitting. No need for a shadowed_info_cache since eager page
> +	 * splitting only allocates direct shadow pages.
> +	 */
> +	struct kvm_mmu_memory_cache split_shadow_page_cache;
> +	struct kvm_mmu_memory_cache split_page_header_cache;
> +
> +	/*
> +	 * Memory cache used to allocate pte_list_desc structs while splitting
> +	 * huge pages. In the worst case, to split one huge page, 512
> +	 * pte_list_desc structs are needed to add each lower level leaf sptep
> +	 * to the rmap plus 1 to extend the parent_ptes rmap of the lower level
> +	 * page table.
> +	 */
> +#define SPLIT_DESC_CACHE_CAPACITY 513
> +	struct kvm_mmu_memory_cache split_desc_cache;
>   };
>   
>   


I think it needs to document that the topup operations for these caches are protected by kvm->slots_lock.
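
E.g. the comment in kvm_arch could grow a line along these lines (exact wording
is just a suggestion):

	/*
	 * Memory caches used to allocate shadow pages when performing eager
	 * page splitting. No need for a shadowed_info_cache since eager page
	 * splitting only allocates direct shadow pages.
	 *
	 * All of these caches are topped up and consumed under kvm->slots_lock.
	 */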


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 03/20] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-04-22 21:05   ` David Matlack
@ 2022-05-07  8:28     ` Lai Jiangshan
  -1 siblings, 0 replies; 120+ messages in thread
From: Lai Jiangshan @ 2022-05-07  8:28 UTC (permalink / raw)
  To: David Matlack, Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner



On 2022/4/23 05:05, David Matlack wrote:
> Instead of computing the shadow page role from scratch for every new
> page, derive most of the information from the parent shadow page.  This
> avoids redundant calculations and reduces the number of parameters to
> kvm_mmu_get_page().
> 
> Preemptively split out the role calculation to a separate function for
> use in a following commit.
> 
> No functional change intended.
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>   arch/x86/kvm/mmu/mmu.c         | 96 +++++++++++++++++++++++-----------
>   arch/x86/kvm/mmu/paging_tmpl.h |  9 ++--
>   2 files changed, 71 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index dc20eccd6a77..4249a771818b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2021,31 +2021,15 @@ static void clear_sp_write_flooding_count(u64 *spte)
>   	__clear_sp_write_flooding_count(sptep_to_sp(spte));
>   }
>   
> -static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> -					     gfn_t gfn,
> -					     gva_t gaddr,
> -					     unsigned level,
> -					     bool direct,
> -					     unsigned int access)
> +static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> +					     union kvm_mmu_page_role role)
>   {
> -	union kvm_mmu_page_role role;
>   	struct hlist_head *sp_list;
> -	unsigned quadrant;
>   	struct kvm_mmu_page *sp;
>   	int ret;
>   	int collisions = 0;
>   	LIST_HEAD(invalid_list);
>   
> -	role = vcpu->arch.mmu->root_role;
> -	role.level = level;
> -	role.direct = direct;
> -	role.access = access;
> -	if (role.has_4_byte_gpte) {
> -		quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
> -		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
> -		role.quadrant = quadrant;
> -	}


I don't think replacing it with kvm_mmu_child_role() can reduce any calculations.

role.level, role.direct, role.access and role.quadrant still need to be
calculated.  And the old code is still in mmu_alloc_root().

I think kvm_mmu_get_shadow_page() can keep those parameters and
kvm_mmu_child_role() is only introduced for nested_mmu_get_sp_for_split().

Both kvm_mmu_get_shadow_page() and nested_mmu_get_sp_for_split() call
__kvm_mmu_get_shadow_page() which uses role as a parameter.
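
Roughly, the shape would be something like this (signatures are illustrative
only, not the actual code in the series; cache/list arguments elided):

	static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
							    gfn_t gfn, gva_t gaddr,
							    unsigned int level,
							    bool direct,
							    unsigned int access)
	{
		union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;

		role.level = level;
		role.direct = direct;
		role.access = access;
		if (role.has_4_byte_gpte) {
			unsigned int quadrant;

			quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
			quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
			role.quadrant = quadrant;
		}

		/* The vCPU path keeps the old parameter list and builds the role here. */
		return __kvm_mmu_get_shadow_page(vcpu, gfn, role);
	}

	/*
	 * Eager page splitting would instead build a child role via
	 * kvm_mmu_child_role() and call __kvm_mmu_get_shadow_page() with it.
	 */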



> -
>   	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>   	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
>   		if (sp->gfn != gfn) {
> @@ -2063,7 +2047,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>   			 * Unsync pages must not be left as is, because the new
>   			 * upper-level page will be write-protected.
>   			 */
> -			if (level > PG_LEVEL_4K && sp->unsync)
> +			if (role.level > PG_LEVEL_4K && sp->unsync)
>   				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
>   							 &invalid_list);
>   			continue;
> @@ -2104,14 +2088,14 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>   
>   	++vcpu->kvm->stat.mmu_cache_miss;
>   
> -	sp = kvm_mmu_alloc_page(vcpu, direct);
> +	sp = kvm_mmu_alloc_page(vcpu, role.direct);
>   
>   	sp->gfn = gfn;
>   	sp->role = role;
>   	hlist_add_head(&sp->hash_link, sp_list);
> -	if (!direct) {
> +	if (!role.direct) {
>   		account_shadowed(vcpu->kvm, sp);
> -		if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
> +		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
>   			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
>   	}
>   	trace_kvm_mmu_get_page(sp, true);
> @@ -2123,6 +2107,51 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>   	return sp;
>   }
>   
> +static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
> +{
> +	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
> +	union kvm_mmu_page_role role;
> +
> +	role = parent_sp->role;
> +	role.level--;
> +	role.access = access;
> +	role.direct = direct;
> +
> +	/*
> +	 * If the guest has 4-byte PTEs then that means it's using 32-bit,
> +	 * 2-level, non-PAE paging. KVM shadows such guests using 4 PAE page
> +	 * directories, each mapping 1/4 of the guest's linear address space
> +	 * (1GiB). The shadow pages for those 4 page directories are
> +	 * pre-allocated and assigned a separate quadrant in their role.


It is not going to be true after the patchset:
[PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots

https://lore.kernel.org/lkml/20220503150735.32723-1-jiangshanlai@gmail.com/

The shadow pages for those 4 page directories are also allocated on demand.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 15/20] KVM: x86/mmu: Cache the access bits of shadowed translations
  2022-04-22 21:05   ` David Matlack
@ 2022-05-09 16:10     ` Sean Christopherson
  -1 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2022-05-09 16:10 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 22, 2022, David Matlack wrote:
> @@ -2820,7 +2861,10 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
>  
>  	if (!was_rmapped) {
>  		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
> -		rmap_add(vcpu, slot, sptep, gfn);
> +		rmap_add(vcpu, slot, sptep, gfn, pte_access);
> +	} else {
> +		/* Already rmapped but the pte_access bits may have changed. */
> +		kvm_mmu_page_set_access(sp, sptep - sp->spt, pte_access);
>  	}
>  
>  	return ret;

...

> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index a8a755e1561d..97bf53b29b88 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -978,7 +978,8 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  }
>  
>  /*
> - * Using the cached information from sp->gfns is safe because:
> + * Using the information in sp->shadowed_translation (kvm_mmu_page_get_gfn()
> + * and kvm_mmu_page_get_access()) is safe because:
>   * - The spte has a reference to the struct page, so the pfn for a given gfn
>   *   can't change unless all sptes pointing to it are nuked first.
>   *
> @@ -1052,12 +1053,15 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  		if (sync_mmio_spte(vcpu, &sp->spt[i], gfn, pte_access))
>  			continue;
>  
> -		if (gfn != sp->gfns[i]) {
> +		if (gfn != kvm_mmu_page_get_gfn(sp, i)) {
>  			drop_spte(vcpu->kvm, &sp->spt[i]);
>  			flush = true;
>  			continue;
>  		}
>  
> +		if (pte_access != kvm_mmu_page_get_access(sp, i))

I think it makes sense to do this unconditionally, same as mmu_set_spte().  Or
make the mmu_set_spte() case conditional.  I don't have a strong preference either
way, but the two callers should be consistent with each other.
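
I.e. (untested) either drop the check here:

	kvm_mmu_page_set_access(sp, i, pte_access);

or add the equivalent check on the mmu_set_spte() side:

	if (pte_access != kvm_mmu_page_get_access(sp, sptep - sp->spt))
		kvm_mmu_page_set_access(sp, sptep - sp->spt, pte_access);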

> +			kvm_mmu_page_set_access(sp, i, pte_access);
> +
>  		sptep = &sp->spt[i];
>  		spte = *sptep;
>  		host_writable = spte & shadow_host_writable_mask;
> -- 
> 2.36.0.rc2.479.g8af0fa9b8e-goog
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 16/20] KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU
  2022-04-22 21:05   ` David Matlack
@ 2022-05-09 16:22     ` Sean Christopherson
  -1 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2022-05-09 16:22 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 22, 2022, David Matlack wrote:
> Currently make_huge_page_split_spte() assumes execute permissions can be
> granted to any 4K SPTE when splitting huge pages. This is true for the
> TDP MMU but is not necessarily true for the shadow MMU, since KVM may be
> shadowing a non-executable huge page.
> 
> To fix this, pass in the child shadow page where the huge page will be
> split and derive the execution permission from the shadow page's role.
> This is correct because huge pages are always split with direct shadow
> page and thus the shadow page role contains the correct access
> permissions.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/spte.c    | 13 +++++++------
>  arch/x86/kvm/mmu/spte.h    |  2 +-
>  arch/x86/kvm/mmu/tdp_mmu.c |  2 +-
>  3 files changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 4739b53c9734..9db98fbeee61 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -215,10 +215,11 @@ static u64 make_spte_executable(u64 spte)
>   * This is used during huge page splitting to build the SPTEs that make up the
>   * new page table.
>   */
> -u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
> +u64 make_huge_page_split_spte(u64 huge_spte, struct kvm_mmu_page *sp, int index)

Rather than pass in @sp, what about passing in @role?  Then the need for
exec_allowed and child_level goes away (for whatever reason I reacted to the
"allowed" part of exec_allowed).

E.g.

---
 arch/x86/kvm/mmu/spte.c    | 11 +++++------
 arch/x86/kvm/mmu/spte.h    |  3 ++-
 arch/x86/kvm/mmu/tdp_mmu.c |  2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 9db98fbeee61..1b766e381727 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -215,10 +215,9 @@ static u64 make_spte_executable(u64 spte)
  * This is used during huge page splitting to build the SPTEs that make up the
  * new page table.
  */
-u64 make_huge_page_split_spte(u64 huge_spte, struct kvm_mmu_page *sp, int index)
+u64 make_huge_page_split_spte(u64 huge_spte, union kvm_mmu_page_role role,
+			      int index)
 {
-	bool exec_allowed = sp->role.access & ACC_EXEC_MASK;
-	int child_level = sp->role.level;
 	u64 child_spte;

 	if (WARN_ON_ONCE(!is_shadow_present_pte(huge_spte)))
@@ -234,9 +233,9 @@ u64 make_huge_page_split_spte(u64 huge_spte, struct kvm_mmu_page *sp, int index)
 	 * split. So we just have to OR in the offset to the page at the next
 	 * lower level for the given index.
 	 */
-	child_spte |= (index * KVM_PAGES_PER_HPAGE(child_level)) << PAGE_SHIFT;
+	child_spte |= (index * KVM_PAGES_PER_HPAGE(role.level)) << PAGE_SHIFT;

-	if (child_level == PG_LEVEL_4K) {
+	if (role.level == PG_LEVEL_4K) {
 		child_spte &= ~PT_PAGE_SIZE_MASK;

 		/*
@@ -244,7 +243,7 @@ u64 make_huge_page_split_spte(u64 huge_spte, struct kvm_mmu_page *sp, int index)
 		 * the page executable as the NX hugepage mitigation no longer
 		 * applies.
 		 */
-		if (exec_allowed && is_nx_huge_page_enabled())
+		if ((role.access & ACC_EXEC_MASK) && is_nx_huge_page_enabled())
 			child_spte = make_spte_executable(child_spte);
 	}

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 921ea77f1b5e..80d36d0d9def 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -415,7 +415,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
 	       bool host_writable, u64 *new_spte);
-u64 make_huge_page_split_spte(u64 huge_spte, struct kvm_mmu_page *sp, int index);
+u64 make_huge_page_split_spte(u64 huge_spte, union kvm_mmu_page_role role,
+			      int index);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
 u64 mark_spte_for_access_track(u64 spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 110a34ca41c2..c4c4bad69f38 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1469,7 +1469,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	 * not been linked in yet and thus is not reachable from any other CPU.
 	 */
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++)
-		sp->spt[i] = make_huge_page_split_spte(huge_spte, sp, i);
+		sp->spt[i] = make_huge_page_split_spte(huge_spte, sp->role, i);

 	/*
 	 * Replace the huge spte with a pointer to the populated lower level

base-commit: 721828e2397ab854b536de3ea10a9bc7962091a9
--

^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 17/20] KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU
  2022-04-22 21:05   ` David Matlack
@ 2022-05-09 16:31     ` Sean Christopherson
  -1 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2022-05-09 16:31 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

Maybe a slight tweak to the shortlog?  "Zap collapsible SPTEs at all levels in
the shadow MMU" left me wondering "when is KVM zapping at all levels?"

  KVM: x86/mmu: Zap all possible levels in shadow MMU when collapsing SPTEs

On Fri, Apr 22, 2022, David Matlack wrote:
> Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU (i.e.
> in the rmap). This is fine for now KVM never creates intermediate huge
> pages during dirty logging, i.e. a 1GiB page is never partially split to
> a 2MiB page.

"partially" is really confusing.  I think what you mean is that KVM can split a
1gb to a 2mb page, and not split all the way down to 4kb.  But "partially" makes
it sound like KVM ends up with a huge SPTE that is half split or something.  I
think you can just avoid that altogether and be more explicit:

  i.e. a 1GiB page is never split to just 2MiB; dirty logging always splits
  down to 4KiB pages.

> However, this will stop being true once the shadow MMU participates in
> eager page splitting, which can in fact leave behind partially split

"partially" again.  Maybe

  which can in fact leave behind 2MiB pages after splitting 1GiB huge pages.

> huge pages. In preparation for that change, change the shadow MMU to
> iterate over all necessary levels when zapping collapsible SPTEs.
> 
> No functional change intended.
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index ed65899d15a2..479c581e8a96 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6098,18 +6098,25 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>  	return need_tlb_flush;
>  }
>  
> +static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
> +					   const struct kvm_memory_slot *slot)
> +{
> +	/*
> +	 * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
> +	 * pages that are already mapped at the maximum possible level.
> +	 */
> +	if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
> +			      PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1,
> +			      true))
> +		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> +}
> +
>  void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>  				   const struct kvm_memory_slot *slot)
>  {
>  	if (kvm_memslots_have_rmaps(kvm)) {
>  		write_lock(&kvm->mmu_lock);
> -		/*
> -		 * Zap only 4k SPTEs since the legacy MMU only supports dirty
> -		 * logging at a 4k granularity and never creates collapsible
> -		 * 2m SPTEs during dirty logging.
> -		 */
> -		if (slot_handle_level_4k(kvm, slot, kvm_mmu_zap_collapsible_spte, true))
> -			kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> +		kvm_rmap_zap_collapsible_sptes(kvm, slot);
>  		write_unlock(&kvm->mmu_lock);
>  	}
>  
> -- 
> 2.36.0.rc2.479.g8af0fa9b8e-goog
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 18/20] KVM: x86/mmu: Refactor drop_large_spte()
  2022-04-22 21:05   ` David Matlack
@ 2022-05-09 16:36     ` Sean Christopherson
  -1 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2022-05-09 16:36 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Fri, Apr 22, 2022, David Matlack wrote:
> drop_large_spte() drops a large SPTE if it exists and then flushes TLBs.
> Its helper function, __drop_large_spte(), does the drop without the
> flush.
> 
> In preparation for eager page splitting, which will need to sometimes
> flush when dropping large SPTEs (and sometimes not), push the flushing
> logic down into __drop_large_spte() and add a bool parameter to control
> it.
> 
> No functional change intended.
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 29 +++++++++++++++--------------
>  1 file changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 479c581e8a96..a5961c17eb36 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1183,28 +1183,29 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
>  		rmap_remove(kvm, sptep);
>  }
>  
> -
> -static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
> +static void __drop_large_spte(struct kvm *kvm, u64 *sptep, bool flush)
>  {
> -	if (is_large_pte(*sptep)) {
> -		WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K);
> -		drop_spte(kvm, sptep);
> -		return true;
> -	}
> +	struct kvm_mmu_page *sp;
>  
> -	return false;
> -}
> +	if (!is_large_pte(*sptep))
> +		return;
>  
> -static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
> -{
> -	if (__drop_large_spte(vcpu->kvm, sptep)) {
> -		struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> +	sp = sptep_to_sp(sptep);
> +	WARN_ON(sp->role.level == PG_LEVEL_4K);
>  
> -		kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
> +	drop_spte(kvm, sptep);
> +
> +	if (flush) {

Unnecessary curly braces.
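
I.e. (taking the lines from the patch as-is, minus the braces):

	if (flush)
		kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
						   KVM_PAGES_PER_HPAGE(sp->role.level));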

> +		kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
>  			KVM_PAGES_PER_HPAGE(sp->role.level));
>  	}
>  }
>  
> +static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
> +{
> +	return __drop_large_spte(vcpu->kvm, sptep, true);
> +}
> +
>  /*
>   * Write-protect on the specified @sptep, @pt_protect indicates whether
>   * spte write-protection is caused by protecting shadow page table.
> -- 
> 2.36.0.rc2.479.g8af0fa9b8e-goog
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 20/20] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
  2022-04-22 21:05   ` David Matlack
@ 2022-05-09 16:48     ` Sean Christopherson
  -1 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2022-05-09 16:48 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 22, 2022, David Matlack wrote:
> +static bool need_topup_split_caches_or_resched(struct kvm *kvm)
> +{
> +	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> +		return true;
> +
> +	/*
> +	 * In the worst case, SPLIT_DESC_CACHE_CAPACITY descriptors are needed
> +	 * to split a single huge page. Calculating how many are actually needed
> +	 * is possible but not worth the complexity.
> +	 */
> +	return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_CAPACITY) ||
> +		need_topup(&kvm->arch.split_page_header_cache, 1) ||
> +		need_topup(&kvm->arch.split_shadow_page_cache, 1);

Uber nit that Paolo will make fun of me for... please align the indentation:

	return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_CAPACITY) ||
	       need_topup(&kvm->arch.split_page_header_cache, 1) ||
	       need_topup(&kvm->arch.split_shadow_page_cache, 1);

> +static void nested_mmu_split_huge_page(struct kvm *kvm,
> +				       const struct kvm_memory_slot *slot,
> +				       u64 *huge_sptep)
> +
> +{
> +	struct kvm_mmu_memory_cache *cache = &kvm->arch.split_desc_cache;
> +	u64 huge_spte = READ_ONCE(*huge_sptep);
> +	struct kvm_mmu_page *sp;
> +	bool flush = false;
> +	u64 *sptep, spte;
> +	gfn_t gfn;
> +	int index;
> +
> +	sp = nested_mmu_get_sp_for_split(kvm, huge_sptep);
> +
> +	for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
> +		sptep = &sp->spt[index];
> +		gfn = kvm_mmu_page_get_gfn(sp, index);
> +
> +		/*
> +		 * The SP may already have populated SPTEs, e.g. if this huge
> +		 * page is aliased by multiple sptes with the same access
> +		 * permissions. These entries are guaranteed to map the same
> +		 * gfn-to-pfn translation since the SP is direct, so no need to
> +		 * modify them.
> +		 *
> +		 * However, if a given SPTE points to a lower level page table,
> +		 * that lower level page table may only be partially populated.
> +		 * Installing such SPTEs would effectively unmap a potion of the
> +		 * huge page, which requires a TLB flush.

Maybe explain why a TLB flush is required?  E.g. "which requires a TLB flush as
a subsequent mmu_notifier event on the unmapped region would fail to detect the
need to flush".

> +static bool nested_mmu_skip_split_huge_page(u64 *huge_sptep)

"skip" is kinda odd terminology.  It reads like a command, but it's actually
querying state _and_ it's returning a boolean, which I've learned to hate :-)

I don't see any reason for a helper; there's one caller and it can just do
"continue" directly.

> +static void kvm_nested_mmu_try_split_huge_pages(struct kvm *kvm,
> +						const struct kvm_memory_slot *slot,
> +						gfn_t start, gfn_t end,
> +						int target_level)
> +{
> +	int level;
> +
> +	/*
> +	 * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
> +	 * down to the target level. This ensures pages are recursively split
> +	 * all the way to the target level. There's no need to split pages
> +	 * already at the target level.
> +	 */
> +	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {

Unnecessary braces.
> +		slot_handle_level_range(kvm, slot,
> +					nested_mmu_try_split_huge_pages,
> +					level, level, start, end - 1,
> +					true, false);

IMO it's worth running over by 4 chars to drop 2 lines:

	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--)
		slot_handle_level_range(kvm, slot, nested_mmu_try_split_huge_pages,
					level, level, start, end - 1, true, false);
> +	}
> +}
> +
>  /* Must be called with the mmu_lock held in write-mode. */

Add a lockdep assertion, not a comment.
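
I.e. something like:

	lockdep_assert_held_write(&kvm->mmu_lock);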

>  void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
>  				   const struct kvm_memory_slot *memslot,
>  				   u64 start, u64 end,
>  				   int target_level)
>  {
> -	if (is_tdp_mmu_enabled(kvm))
> -		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end,
> -						 target_level, false);
> +	if (!is_tdp_mmu_enabled(kvm))
> +		return;
> +
> +	kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level,
> +					 false);
> +
> +	if (kvm_memslots_have_rmaps(kvm))
> +		kvm_nested_mmu_try_split_huge_pages(kvm, memslot, start, end,
> +						    target_level);
>  
>  	/*
>  	 * A TLB flush is unnecessary at this point for the same resons as in
> @@ -6051,10 +6304,19 @@ void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
>  	u64 start = memslot->base_gfn;
>  	u64 end = start + memslot->npages;
>  
> -	if (is_tdp_mmu_enabled(kvm)) {
> -		read_lock(&kvm->mmu_lock);
> -		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
> -		read_unlock(&kvm->mmu_lock);
> +	if (!is_tdp_mmu_enabled(kvm))
> +		return;
> +
> +	read_lock(&kvm->mmu_lock);
> +	kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level,
> +					 true);

Eh, let this poke out.

> +	read_unlock(&kvm->mmu_lock);
> +
> +	if (kvm_memslots_have_rmaps(kvm)) {
> +		write_lock(&kvm->mmu_lock);
> +		kvm_nested_mmu_try_split_huge_pages(kvm, memslot, start, end,
> +						    target_level);
> +		write_unlock(&kvm->mmu_lock);

Super duper nit: all other flows do rmaps first, then TDP MMU.  Might as well keep
that ordering here, otherwise it suggests there's a reason to be different.
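
E.g. (rough, untested sketch that only swaps the order of the existing calls;
the early is_tdp_mmu_enabled() return earlier in the function would need to be
reworked to match):

	if (kvm_memslots_have_rmaps(kvm)) {
		write_lock(&kvm->mmu_lock);
		kvm_nested_mmu_try_split_huge_pages(kvm, memslot, start, end,
						    target_level);
		write_unlock(&kvm->mmu_lock);
	}

	if (is_tdp_mmu_enabled(kvm)) {
		read_lock(&kvm->mmu_lock);
		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end,
						 target_level, true);
		read_unlock(&kvm->mmu_lock);
	}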

>  	}
>  
>  	/*
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ab336f7c82e4..e123e24a130f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12161,6 +12161,12 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
>  		 * page faults will create the large-page sptes.
>  		 */
>  		kvm_mmu_zap_collapsible_sptes(kvm, new);
> +
> +		/*
> +		 * Free any memory left behind by eager page splitting. Ignore
> +		 * the module parameter since userspace might have changed it.
> +		 */
> +		free_split_caches(kvm);
>  	} else {
>  		/*
>  		 * Initially-all-set does not require write protecting any page,
> -- 
> 2.36.0.rc2.479.g8af0fa9b8e-goog
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 05/20] KVM: x86/mmu: Consolidate shadow page allocation and initialization
  2022-05-05 22:10     ` Sean Christopherson
@ 2022-05-09 20:53       ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-05-09 20:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Thu, May 5, 2022 at 3:10 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 22, 2022, David Matlack wrote:
> > Consolidate kvm_mmu_alloc_page() and kvm_mmu_alloc_shadow_page() under
> > the latter so that all shadow page allocation and initialization happens
> > in one place.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 39 +++++++++++++++++----------------------
> >  1 file changed, 17 insertions(+), 22 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 5582badf4947..7d03320f6e08 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1703,27 +1703,6 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
> >       mmu_spte_clear_no_track(parent_pte);
> >  }
> >
> > -static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, bool direct)
> > -{
> > -     struct kvm_mmu_page *sp;
> > -
> > -     sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > -     sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> > -     if (!direct)
> > -             sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
> > -     set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> > -
> > -     /*
> > -      * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
> > -      * depends on valid pages being added to the head of the list.  See
> > -      * comments in kvm_zap_obsolete_pages().
> > -      */
> > -     sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> > -     list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
> > -     kvm_mod_used_mmu_pages(vcpu->kvm, +1);
> > -     return sp;
> > -}
> > -
> >  static void mark_unsync(u64 *spte);
> >  static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
> >  {
> > @@ -2100,7 +2079,23 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
> >                                                     struct hlist_head *sp_list,
> >                                                     union kvm_mmu_page_role role)
> >  {
> > -     struct kvm_mmu_page *sp = kvm_mmu_alloc_page(vcpu, role.direct);
> > +     struct kvm_mmu_page *sp;
> > +
> > +     sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > +     sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> > +     if (!role.direct)
> > +             sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
> > +
> > +     set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> > +
> > +     /*
> > +      * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
> > +      * depends on valid pages being added to the head of the list.  See
> > +      * comments in kvm_zap_obsolete_pages().
> > +      */
> > +     sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> > +     list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
> > +     kvm_mod_used_mmu_pages(vcpu->kvm, +1);
>
> To reduce churn later on, what about opportunistically grabbing vcpu->kvm in a
> local variable in this patch?
>
> Actually, at that point, it's a super trivial change, so you can probably just drop
>
>         KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page()
>
> and do the vcpu/kvm swap as part of
>
>         KVM: x86/mmu: Pass memory caches to allocate SPs separately

I'm not sure it's any less churn; it's just doing the same amount of
changes in fewer commits. Is there a benefit to using fewer commits? I
can only think of downsides (harder to review, harder to bisect).

>
> >       sp->gfn = gfn;
> >       sp->role = role;
> > --
> > 2.36.0.rc2.479.g8af0fa9b8e-goog
> >

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 03/20] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-05-07  8:28     ` Lai Jiangshan
@ 2022-05-09 21:04       ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-05-09 21:04 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, Peter Xu,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Sat, May 7, 2022 at 1:28 AM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
>
>
>
> On 2022/4/23 05:05, David Matlack wrote:
> > Instead of computing the shadow page role from scratch for every new
> > page, derive most of the information from the parent shadow page.  This
> > avoids redundant calculations and reduces the number of parameters to
> > kvm_mmu_get_page().
> >
> > Preemptively split out the role calculation to a separate function for
> > use in a following commit.
> >
> > No functional change intended.
> >
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >   arch/x86/kvm/mmu/mmu.c         | 96 +++++++++++++++++++++++-----------
> >   arch/x86/kvm/mmu/paging_tmpl.h |  9 ++--
> >   2 files changed, 71 insertions(+), 34 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index dc20eccd6a77..4249a771818b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2021,31 +2021,15 @@ static void clear_sp_write_flooding_count(u64 *spte)
> >       __clear_sp_write_flooding_count(sptep_to_sp(spte));
> >   }
> >
> > -static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > -                                          gfn_t gfn,
> > -                                          gva_t gaddr,
> > -                                          unsigned level,
> > -                                          bool direct,
> > -                                          unsigned int access)
> > +static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> > +                                          union kvm_mmu_page_role role)
> >   {
> > -     union kvm_mmu_page_role role;
> >       struct hlist_head *sp_list;
> > -     unsigned quadrant;
> >       struct kvm_mmu_page *sp;
> >       int ret;
> >       int collisions = 0;
> >       LIST_HEAD(invalid_list);
> >
> > -     role = vcpu->arch.mmu->root_role;
> > -     role.level = level;
> > -     role.direct = direct;
> > -     role.access = access;
> > -     if (role.has_4_byte_gpte) {
> > -             quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
> > -             quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
> > -             role.quadrant = quadrant;
> > -     }
>
>
> I don't think replacing it with kvm_mmu_child_role() reduces any calculations.
>
> role.level, role.direct, role.access and role.quadrant still need to be
> calculated.  And the old code is still in mmu_alloc_root().

You are correct. Instead of saying "less calculations" I should have
said "eliminates the dependency on vcpu->arch.mmu->root_role".

>
> I think kvm_mmu_get_shadow_page() can keep those parameters and
> kvm_mmu_child_role() is only introduced for nested_mmu_get_sp_for_split().
>
> Both kvm_mmu_get_shadow_page() and nested_mmu_get_sp_for_split() call
> __kvm_mmu_get_shadow_page() which uses role as a parameter.

I agree this would work, but I think the end result would be more
difficult to read.

The way I've implemented it, there are two ways the SP roles are calculated:

1. For root SPs: Derive the role from vCPU root_role and caller-provided inputs.
2. For child SPs: Derive the role from parent SP and caller-provided inputs.

Your proposal would still have two ways to calculate the SP role, but
split along a different dimension:

1. For vCPUs creating SPs: Derive the role from vCPU root_role and
caller-provided inputs.
2. For Eager Page Splitting creating SPs: Derive the role from parent
SP and caller-provided inputs.

With your proposal, it is less obvious that eager page splitting is
going to end up with the correct role, whereas if we use the same
calculation for all child SPs, it is immediately obvious.
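
(Illustration only, not code from the series: the point above is that the
eager page splitting path can reuse exactly the same derivation. A
hypothetical split-side helper, with a made-up name, would be nothing more
than:)

	static union kvm_mmu_page_role split_child_role(u64 *huge_sptep, u32 access)
	{
		/* Same derivation the vCPU fault path uses for child SPs. */
		return kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);
	}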

>
>
>
> > -
> >       sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> >       for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> >               if (sp->gfn != gfn) {
> > @@ -2063,7 +2047,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >                        * Unsync pages must not be left as is, because the new
> >                        * upper-level page will be write-protected.
> >                        */
> > -                     if (level > PG_LEVEL_4K && sp->unsync)
> > +                     if (role.level > PG_LEVEL_4K && sp->unsync)
> >                               kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
> >                                                        &invalid_list);
> >                       continue;
> > @@ -2104,14 +2088,14 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >
> >       ++vcpu->kvm->stat.mmu_cache_miss;
> >
> > -     sp = kvm_mmu_alloc_page(vcpu, direct);
> > +     sp = kvm_mmu_alloc_page(vcpu, role.direct);
> >
> >       sp->gfn = gfn;
> >       sp->role = role;
> >       hlist_add_head(&sp->hash_link, sp_list);
> > -     if (!direct) {
> > +     if (!role.direct) {
> >               account_shadowed(vcpu->kvm, sp);
> > -             if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
> > +             if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
> >                       kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
> >       }
> >       trace_kvm_mmu_get_page(sp, true);
> > @@ -2123,6 +2107,51 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >       return sp;
> >   }
> >
> > +static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
> > +{
> > +     struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
> > +     union kvm_mmu_page_role role;
> > +
> > +     role = parent_sp->role;
> > +     role.level--;
> > +     role.access = access;
> > +     role.direct = direct;
> > +
> > +     /*
> > +      * If the guest has 4-byte PTEs then that means it's using 32-bit,
> > +      * 2-level, non-PAE paging. KVM shadows such guests using 4 PAE page
> > +      * directories, each mapping 1/4 of the guest's linear address space
> > +      * (1GiB). The shadow pages for those 4 page directories are
> > +      * pre-allocated and assigned a separate quadrant in their role.
>
>
> It is not going to be true in patchset:
> [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots
>
> https://lore.kernel.org/lkml/20220503150735.32723-1-jiangshanlai@gmail.com/
>
> The shadow pages for those 4 page directories are also allocated on demand.

Ack. I can even just drop this sentence in v5; it's just background information.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 07/20] KVM: x86/mmu: Move guest PT write-protection to account_shadowed()
  2022-05-05 22:51     ` Sean Christopherson
@ 2022-05-09 21:18       ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-05-09 21:18 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Thu, May 5, 2022 at 3:51 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 22, 2022, David Matlack wrote:
> > Move the code that write-protects newly-shadowed guest page tables into
> > account_shadowed(). This avoids an extra gfn-to-memslot lookup and is a
> > more logical place for this code to live. But most importantly, this
> > reduces kvm_mmu_alloc_shadow_page()'s reliance on having a struct
> > kvm_vcpu pointer, which will be necessary when creating new shadow pages
> > during VM ioctls for eager page splitting.
> >
> > Note, it is safe to drop the role.level == PG_LEVEL_4K check since
> > account_shadowed() returns early if role.level > PG_LEVEL_4K.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
>
> Reviewed-by: Sean Christopherson <seanjc@google.com>
>
> >  arch/x86/kvm/mmu/mmu.c | 9 +++++----
> >  1 file changed, 5 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index fa7846760887..4f894db88bbf 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -807,6 +807,9 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
> >                                                   KVM_PAGE_TRACK_WRITE);
> >
> >       kvm_mmu_gfn_disallow_lpage(slot, gfn);
> > +
> > +     if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
> > +             kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
> >  }
> >
> >  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> > @@ -2100,11 +2103,9 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
> >       sp->gfn = gfn;
> >       sp->role = role;
> >       hlist_add_head(&sp->hash_link, sp_list);
> > -     if (!role.direct) {
> > +
> > +     if (!role.direct)
> >               account_shadowed(vcpu->kvm, sp);
> > -             if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
>
> Huh.  Two thoughts.
>
> 1. Can you add a patch to drop kvm_vcpu_write_protect_gfn() entirely, i.e. convert
>    mmu_sync_children() to use kvm_mmu_slot_gfn_write_protect?  It's largely a moot
>    point, but only because mmu_sync_children() only operates on shadow pages that
>    are relevant to the current non-SMM/SMM role.  And _that_ holds true only because
>    KVM does kvm_mmu_reset_context() and drops roots for the vCPU on SMM transitions.
>
>    That'd be a good opportunity to move this pair into a helper:
>
>         slots = kvm_memslots_for_spte_role(kvm, sp->role);
>         slot = __gfn_to_memslot(slots, gfn);

Sounds reasonable, but let's do that in a separate series since this is
already on v4. I wouldn't say it's obvious that using the role to
get the memslot will give the same result as using the vCPU, although
that does look to be true.
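
(For reference, the helper Sean sketches above would look something like the
following; the function name is made up here and nothing in this thread
actually adds it.)

	static struct kvm_memory_slot *sp_gfn_to_memslot(struct kvm *kvm,
							 struct kvm_mmu_page *sp,
							 gfn_t gfn)
	{
		/* Pick the SMM vs. non-SMM memslots based on the SP's role. */
		struct kvm_memslots *slots = kvm_memslots_for_spte_role(kvm, sp->role);

		return __gfn_to_memslot(slots, gfn);
	}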

>
> 2. This got me thinking...  Write-protecting for shadow paging should NOT be
>    associated with the vCPU or even the role.  The SMM memslots conceptually
>    operate on the same guest physical memory, SMM is just given access to memory
>    that is not present in the non-SMM memslots.
>
>    If KVM creates SPTEs for non-SMM, then it needs to write-protect _all_ memslots
>    that contain the relevant gfn, e.g. if the guest takes an SMI and modifies the
>    non-SMM page tables, then KVM needs to trap and emulate (or unsync) those writes.
>
>    The mess "works" because no sane SMM handler (kind of a contradiction in terms)
>    will modify non-SMM page tables, and vice versa.
>
>    The entire duplicate memslots approach is flawed.  Better would have been to
>    make SMM a flag and hide SMM-only memslots, not duplicate everything...
>
> > -                     kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
> > -     }
> >
> >       return sp;
> >  }
> > --
> > 2.36.0.rc2.479.g8af0fa9b8e-goog
> >

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/20] KVM: x86/mmu: Allow for NULL vcpu pointer in __kvm_mmu_get_shadow_page()
  2022-05-05 23:33     ` Sean Christopherson
@ 2022-05-09 21:26       ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-05-09 21:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Thu, May 5, 2022 at 4:33 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 22, 2022, David Matlack wrote:
> > Allow the vcpu pointer in __kvm_mmu_get_shadow_page() to be NULL. Rename
> > it to vcpu_or_null to prevent future commits from accidentally taking a
> > dependency on it without first considering the NULL case.
> >
> > The vcpu pointer is only used for syncing indirect shadow pages in
> > kvm_mmu_find_shadow_page(). A vcpu pointer is not required for
> > correctness since unsync pages can simply be zapped. But this should
> > never occur in practice, since the only use-case for passing a NULL vCPU
> > pointer is eager page splitting which will only request direct shadow
> > pages (which can never be unsync).
> >
> > Even though __kvm_mmu_get_shadow_page() can gracefully handle a NULL
> > vcpu, add a WARN() that will fire if __kvm_mmu_get_shadow_page() is ever
> > called to get an indirect shadow page with a NULL vCPU pointer, since
> > zapping unsync SPs is a performance overhead that should be considered.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 40 ++++++++++++++++++++++++++++++++--------
> >  1 file changed, 32 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 04029c01aebd..21407bd4435a 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1845,16 +1845,27 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
> >         &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)])     \
> >               if ((_sp)->gfn != (_gfn) || (_sp)->role.direct) {} else
> >
> > -static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> > -                      struct list_head *invalid_list)
> > +static int __kvm_sync_page(struct kvm *kvm, struct kvm_vcpu *vcpu_or_null,
> > +                        struct kvm_mmu_page *sp,
> > +                        struct list_head *invalid_list)
> >  {
> > -     int ret = vcpu->arch.mmu->sync_page(vcpu, sp);
> > +     int ret = -1;
> > +
> > +     if (vcpu_or_null)
>
> This should never happen.  I like the idea of warning early, but I really don't
> like that the WARN is far removed from the code that actually depends on @vcpu
> being non-NULL. Case in point, KVM should have bailed on the WARN and never
> reached this point.  And the inner __kvm_sync_page() is completely unnecessary.

Yeah that's fair.

>
> I also don't love the vcpu_or_null terminology; I get the intent, but it doesn't
> really help the reader understand why/when it's NULL.

Eh, I don't think it needs to encode why or when. It just needs to
flag to the reader (and future code authors) that this vcpu pointer
(unlike all other vcpu pointers in KVM) is NULL in certain cases.

>
> I played around with casting, e.g. to/from an unsigned long or void *, to prevent
> usage, but that doesn't work very well because 'unsigned long' ends up being
> awkward/confusing, and 'void *' is easily lost on a function call.  And both
> lose type safety :-(

Yet another shortcoming of C :(

(The other being our earlier discussion about the RET_PF* return codes
getting easily misinterpreted as KVM's magic return-to-user /
continue-running-guest return codes.)

Makes me miss Rust!

>
> All in all, I think I'd prefer this patch to simply be a KVM_BUG_ON() if
> kvm_mmu_find_shadow_page() encounters an unsync page.  Less churn, and IMO there's
> no real loss in robustness, e.g. we'd really have to screw up code review and
> testing to introduce a null vCPU pointer dereference in this code.

Agreed about moving the check here and dropping __kvm_sync_page(). But
I would prefer to retain the vcpu_or_null name (or at least something
other than "vcpu" to indicate there's something non-standard about
this pointer).
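
(Putting the two preferences together, i.e. the KVM_BUG_ON() placement from
Sean's diff quoted below plus a pointer name that flags the NULL case, would
look roughly like this; a sketch only, not what either mail proposes verbatim.)

	/* Inside kvm_mmu_find_shadow_page(), at the sp->unsync check: */
	if (sp->unsync) {
		/*
		 * Getting indirect shadow pages without a vCPU pointer is not
		 * supported, i.e. this should never happen.
		 */
		if (KVM_BUG_ON(!vcpu_or_null, kvm))
			break;

		/* ... existing sync handling continues unchanged ... */
	}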

>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3d102522804a..5aed9265f592 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2041,6 +2041,13 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
>                         goto out;
>
>                 if (sp->unsync) {
> +                       /*
> +                        * Getting indirect shadow pages without a vCPU pointer
> +                        * is not supported, i.e. this should never happen.
> +                        */
> +                       if (KVM_BUG_ON(!vcpu, kvm))
> +                               break;
> +
>                         /*
>                          * The page is good, but is stale.  kvm_sync_page does
>                          * get the latest guest state, but (unlike mmu_unsync_children)
>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 13/20] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  2022-05-05 23:46     ` Sean Christopherson
@ 2022-05-09 21:27       ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-05-09 21:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Thu, May 5, 2022 at 4:46 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 22, 2022, David Matlack wrote:
> > -static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
> > -                          struct kvm_mmu_page *sp)
> > +static void __link_shadow_page(struct kvm_mmu_memory_cache *cache, u64 *sptep,
> > +                            struct kvm_mmu_page *sp)
> >  {
> >       u64 spte;
> >
> > @@ -2297,12 +2300,17 @@ static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
> >
> >       mmu_spte_set(sptep, spte);
> >
> > -     mmu_page_add_parent_pte(vcpu, sp, sptep);
> > +     mmu_page_add_parent_pte(cache, sp, sptep);
> >
> >       if (sp->unsync_children || sp->unsync)
> >               mark_unsync(sptep);
> >  }
> >
> > +static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, struct kvm_mmu_page *sp)
>
> Nit, would prefer to wrap here, especially since __link_shadow_page() wraps.

Will do.
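
(The payoff of the __link_shadow_page() split, for anyone skimming: a caller
with no vCPU can pass its own pte_list_desc cache. A sketch with made-up
names; the real eager-page-splitting caller arrives later in the series.)

	/* Hypothetical non-vCPU caller, e.g. splitting from a VM ioctl. */
	static void split_link_shadow_page(struct kvm_mmu_memory_cache *split_cache,
					   u64 *huge_sptep, struct kvm_mmu_page *sp)
	{
		/* Identical to the vCPU path, just with the ioctl's own cache. */
		__link_shadow_page(split_cache, huge_sptep, sp);
	}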

>
> > +{
> > +     __link_shadow_page(&vcpu->arch.mmu_pte_list_desc_cache, sptep, sp);
> > +}
> > +
> >  static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> >                                  unsigned direct_access)
> >  {
> > --
> > 2.36.0.rc2.479.g8af0fa9b8e-goog
> >

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 15/20] KVM: x86/mmu: Cache the access bits of shadowed translations
  2022-05-09 16:10     ` Sean Christopherson
@ 2022-05-09 21:29       ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-05-09 21:29 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Mon, May 9, 2022 at 9:10 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 22, 2022, David Matlack wrote:
> > @@ -2820,7 +2861,10 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> >
> >       if (!was_rmapped) {
> >               WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
> > -             rmap_add(vcpu, slot, sptep, gfn);
> > +             rmap_add(vcpu, slot, sptep, gfn, pte_access);
> > +     } else {
> > +             /* Already rmapped but the pte_access bits may have changed. */
> > +             kvm_mmu_page_set_access(sp, sptep - sp->spt, pte_access);
> >       }
> >
> >       return ret;
>
> ...
>
> > diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> > index a8a755e1561d..97bf53b29b88 100644
> > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > @@ -978,7 +978,8 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> >  }
> >
> >  /*
> > - * Using the cached information from sp->gfns is safe because:
> > + * Using the information in sp->shadowed_translation (kvm_mmu_page_get_gfn()
> > + * and kvm_mmu_page_get_access()) is safe because:
> >   * - The spte has a reference to the struct page, so the pfn for a given gfn
> >   *   can't change unless all sptes pointing to it are nuked first.
> >   *
> > @@ -1052,12 +1053,15 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> >               if (sync_mmio_spte(vcpu, &sp->spt[i], gfn, pte_access))
> >                       continue;
> >
> > -             if (gfn != sp->gfns[i]) {
> > +             if (gfn != kvm_mmu_page_get_gfn(sp, i)) {
> >                       drop_spte(vcpu->kvm, &sp->spt[i]);
> >                       flush = true;
> >                       continue;
> >               }
> >
> > +             if (pte_access != kvm_mmu_page_get_access(sp, i))
>
> I think it makes sense to do this unconditionally, same as mmu_set_spte().  Or
> make the mmu_set_spte() case conditional.  I don't have a strong preference either
> way, but the two callers should be consistent with each other.

I'll make them both unconditional.
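
(Concretely, "both unconditional" means dropping the != check in
FNAME(sync_page) so it mirrors mmu_set_spte(), i.e. roughly:)

		if (gfn != kvm_mmu_page_get_gfn(sp, i)) {
			drop_spte(vcpu->kvm, &sp->spt[i]);
			flush = true;
			continue;
		}

		/* Unconditionally refresh the cached access bits. */
		kvm_mmu_page_set_access(sp, i, pte_access);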

>
> > +                     kvm_mmu_page_set_access(sp, i, pte_access);
> > +
> >               sptep = &sp->spt[i];
> >               spte = *sptep;
> >               host_writable = spte & shadow_host_writable_mask;
> > --
> > 2.36.0.rc2.479.g8af0fa9b8e-goog
> >

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 17/20] KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU
  2022-05-09 16:31     ` Sean Christopherson
@ 2022-05-09 21:34       ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-05-09 21:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Mon, May 9, 2022 at 9:31 AM Sean Christopherson <seanjc@google.com> wrote:
>
> Maybe a slight tweak to the shortlog?  "Zap collapsible SPTEs at all levels in
> the shadow MMU" left me wondering "when is KVM zapping at all levels?"
>
>   KVM: x86/mmu: Zap all possible levels in shadow MMU when collapsing SPTEs
>
> On Fri, Apr 22, 2022, David Matlack wrote:
> > Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU (i.e.
> > in the rmap). This is fine for now since KVM never creates intermediate huge
> > pages during dirty logging, i.e. a 1GiB page is never partially split to
> > a 2MiB page.
>
> "partially" is really confusing.  I think what you mean is that KVM can split a
> 1gb to a 2mb page, and not split all the way down to 4kb.  But "partially" makes
> it sound like KVM ends up with a huge SPTE that is half split or something.  I
> think you can just avoid that altogether and be more explicit:
>
>   i.e. a 1GiB page is never split to just 2MiB, dirty logging always splits
>   down to 4KiB pages.
>
> > However, this will stop being true once the shadow MMU participates in
> > eager page splitting, which can in fact leave behind partially split
>
> "partially" again.  Maybe
>
>   which can in fact leave behind 2MiB pages after splitting 1GiB huge pages.

Looks good, I'll incorporate these edits into v5.

>
> > huge pages. In preparation for that change, change the shadow MMU to
> > iterate over all necessary levels when zapping collapsible SPTEs.
> >
> > No functional change intended.
> >
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++-------
> >  1 file changed, 14 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index ed65899d15a2..479c581e8a96 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6098,18 +6098,25 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> >       return need_tlb_flush;
> >  }
> >
> > +static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
> > +                                        const struct kvm_memory_slot *slot)
> > +{
> > +     /*
> > +      * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
> > +      * pages that are already mapped at the maximum possible level.
> > +      */
> > +     if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
> > +                           PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1,
> > +                           true))
> > +             kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > +}
> > +
> >  void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
> >                                  const struct kvm_memory_slot *slot)
> >  {
> >       if (kvm_memslots_have_rmaps(kvm)) {
> >               write_lock(&kvm->mmu_lock);
> > -             /*
> > -              * Zap only 4k SPTEs since the legacy MMU only supports dirty
> > -              * logging at a 4k granularity and never creates collapsible
> > -              * 2m SPTEs during dirty logging.
> > -              */
> > -             if (slot_handle_level_4k(kvm, slot, kvm_mmu_zap_collapsible_spte, true))
> > -                     kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > +             kvm_rmap_zap_collapsible_sptes(kvm, slot);
> >               write_unlock(&kvm->mmu_lock);
> >       }
> >
> > --
> > 2.36.0.rc2.479.g8af0fa9b8e-goog
> >

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 20/20] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
  2022-05-07  7:51     ` Lai Jiangshan
@ 2022-05-09 21:40       ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-05-09 21:40 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, Peter Xu,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Sat, May 7, 2022 at 12:51 AM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
>
>
> On 2022/4/23 05:05, David Matlack wrote:
> > Add support for Eager Page Splitting pages that are mapped by nested
> > MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB
> > pages, and then splitting all 2MiB pages to 4KiB pages.
> >
> > Note, Eager Page Splitting is limited to nested MMUs as a policy rather
> > than due to any technical reason (the sp->role.guest_mode check could
> > just be deleted and Eager Page Splitting would work correctly for all
> > shadow MMU pages). There is really no reason to support Eager Page
> > Splitting for tdp_mmu=N, since such support will eventually be phased
> > out, and there is no current use case for supporting Eager Page Splitting on
> > hosts where TDP is either disabled or unavailable in hardware.
> > Furthermore, future improvements to nested MMU scalability may diverge
> > the code from the legacy shadow paging implementation. These
> > improvements will be simpler to make if Eager Page Splitting does not
> > have to worry about legacy shadow paging.
> >
> > Splitting huge pages mapped by nested MMUs requires dealing with some
> > extra complexity beyond that of the TDP MMU:
> >
> > (1) The shadow MMU has a limit on the number of shadow pages that are
> >      allowed to be allocated. So, as a policy, Eager Page Splitting
> >      refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
> >      pages available.
> >
> > (2) Splitting a huge page may end up re-using an existing lower level
> >      shadow page tables. This is unlike the TDP MMU which always allocates
> >      new shadow page tables when splitting.
> >
> > (3) When installing the lower level SPTEs, they must be added to the
> >      rmap which may require allocating additional pte_list_desc structs.
> >
> > Case (2) is especially interesting since it may require a TLB flush,
> > unlike the TDP MMU which can fully split huge pages without any TLB
> > flushes. Specifically, an existing lower level page table may point to
> > even lower level page tables that are not fully populated, effectively
> > unmapping a portion of the huge page, which requires a flush.
> >
> > This commit performs such flushes after dropping the huge page and
> > before installing the lower level page table. This TLB flush could
> > instead be delayed until the MMU lock is about to be dropped, which
> > would batch flushes for multiple splits.  However these flushes should
> > be rare in practice (a huge page must be aliased in multiple SPTEs and
> > have been split for NX Huge Pages in only some of them). Flushing
> > immediately is simpler to plumb and also reduces the chances of tripping
> > over a CPU bug (e.g. see iTLB multihit).
> >
> > Suggested-by: Peter Feiner <pfeiner@google.com>
> > [ This commit is based off of the original implementation of Eager Page
> >    Splitting from Peter in Google's kernel from 2016. ]
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >   .../admin-guide/kernel-parameters.txt         |   3 +-
> >   arch/x86/include/asm/kvm_host.h               |  20 ++
> >   arch/x86/kvm/mmu/mmu.c                        | 276 +++++++++++++++++-
> >   arch/x86/kvm/x86.c                            |   6 +
> >   4 files changed, 296 insertions(+), 9 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 3f1cc5e317ed..bc3ad3d4df0b 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -2387,8 +2387,7 @@
> >                       the KVM_CLEAR_DIRTY ioctl, and only for the pages being
> >                       cleared.
> >
> > -                     Eager page splitting currently only supports splitting
> > -                     huge pages mapped by the TDP MMU.
> > +                     Eager page splitting is only supported when kvm.tdp_mmu=Y.
> >
> >                       Default is Y (on).
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 15131aa05701..5df4dff385a1 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1240,6 +1240,24 @@ struct kvm_arch {
> >       hpa_t   hv_root_tdp;
> >       spinlock_t hv_root_tdp_lock;
> >   #endif
> > +
> > +     /*
> > +      * Memory caches used to allocate shadow pages when performing eager
> > +      * page splitting. No need for a shadowed_info_cache since eager page
> > +      * splitting only allocates direct shadow pages.
> > +      */
> > +     struct kvm_mmu_memory_cache split_shadow_page_cache;
> > +     struct kvm_mmu_memory_cache split_page_header_cache;
> > +
> > +     /*
> > +      * Memory cache used to allocate pte_list_desc structs while splitting
> > +      * huge pages. In the worst case, to split one huge page, 512
> > +      * pte_list_desc structs are needed to add each lower level leaf sptep
> > +      * to the rmap plus 1 to extend the parent_ptes rmap of the lower level
> > +      * page table.
> > +      */
> > +#define SPLIT_DESC_CACHE_CAPACITY 513
> > +     struct kvm_mmu_memory_cache split_desc_cache;
> >   };
> >
> >
>
>
I think it needs to document that the topup operations for these caches are
protected by kvm->slots_lock.

Will do. Thanks!
>
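For reference, a minimal sketch of the documentation being asked for,
assuming the existing comment simply gains a note about kvm->slots_lock
(the exact wording and placement in v5 may differ); the same note would
apply to split_page_header_cache and split_shadow_page_cache:

	/*
	 * Memory cache used to allocate pte_list_desc structs while splitting
	 * huge pages. In the worst case, to split one huge page, 512
	 * pte_list_desc structs are needed to add each lower level leaf sptep
	 * to the rmap plus 1 to extend the parent_ptes rmap of the lower level
	 * page table.
	 *
	 * Topped up and freed under the protection of kvm->slots_lock.
	 */
#define SPLIT_DESC_CACHE_CAPACITY 513
	struct kvm_mmu_memory_cache split_desc_cache;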

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 20/20] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
  2022-05-09 16:48     ` Sean Christopherson
@ 2022-05-09 21:44       ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-05-09 21:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Mon, May 9, 2022 at 9:48 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 22, 2022, David Matlack wrote:
> > +static bool need_topup_split_caches_or_resched(struct kvm *kvm)
> > +{
> > +     if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> > +             return true;
> > +
> > +     /*
> > +      * In the worst case, SPLIT_DESC_CACHE_CAPACITY descriptors are needed
> > +      * to split a single huge page. Calculating how many are actually needed
> > +      * is possible but not worth the complexity.
> > +      */
> > +     return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_CAPACITY) ||
> > +             need_topup(&kvm->arch.split_page_header_cache, 1) ||
> > +             need_topup(&kvm->arch.split_shadow_page_cache, 1);
>
> Uber nit that Paolo will make fun of me for... please align indentation
>
>         return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_CAPACITY) ||
>                need_topup(&kvm->arch.split_page_header_cache, 1) ||
>                need_topup(&kvm->arch.split_shadow_page_cache, 1);

Will do.

>
> > +static void nested_mmu_split_huge_page(struct kvm *kvm,
> > +                                    const struct kvm_memory_slot *slot,
> > +                                    u64 *huge_sptep)
> > +
> > +{
> > +     struct kvm_mmu_memory_cache *cache = &kvm->arch.split_desc_cache;
> > +     u64 huge_spte = READ_ONCE(*huge_sptep);
> > +     struct kvm_mmu_page *sp;
> > +     bool flush = false;
> > +     u64 *sptep, spte;
> > +     gfn_t gfn;
> > +     int index;
> > +
> > +     sp = nested_mmu_get_sp_for_split(kvm, huge_sptep);
> > +
> > +     for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
> > +             sptep = &sp->spt[index];
> > +             gfn = kvm_mmu_page_get_gfn(sp, index);
> > +
> > +             /*
> > +              * The SP may already have populated SPTEs, e.g. if this huge
> > +              * page is aliased by multiple sptes with the same access
> > +              * permissions. These entries are guaranteed to map the same
> > +              * gfn-to-pfn translation since the SP is direct, so no need to
> > +              * modify them.
> > +              *
> > +              * However, if a given SPTE points to a lower level page table,
> > +              * that lower level page table may only be partially populated.
> > +              * Installing such SPTEs would effectively unmap a portion of the
> > +              * huge page, which requires a TLB flush.
>
> Maybe explain why a TLB flush is required?  E.g. "which requires a TLB flush as
> a subsequent mmu_notifier event on the unmapped region would fail to detect the
> need to flush".

Will do.
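For reference, a sketch of that comment with the suggested clause folded
in (the final wording in v5 may differ):

		/*
		 * However, if a given SPTE points to a lower level page table,
		 * that lower level page table may only be partially populated.
		 * Installing such SPTEs would effectively unmap a portion of
		 * the huge page, which requires a TLB flush as a subsequent
		 * mmu_notifier event on the unmapped region would fail to
		 * detect the need to flush.
		 */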

>
> > +static bool nested_mmu_skip_split_huge_page(u64 *huge_sptep)
>
> "skip" is kinda odd terminology.  It reads like a command, but it's actually
> querying state _and_ it's returning a boolean, which I've learned to hate :-)
>
> I don't see any reason for a helper, there's one caller and it can just do
> "continue" directly.

Will do.

>
> > +static void kvm_nested_mmu_try_split_huge_pages(struct kvm *kvm,
> > +                                             const struct kvm_memory_slot *slot,
> > +                                             gfn_t start, gfn_t end,
> > +                                             int target_level)
> > +{
> > +     int level;
> > +
> > +     /*
> > +      * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
> > +      * down to the target level. This ensures pages are recursively split
> > +      * all the way to the target level. There's no need to split pages
> > +      * already at the target level.
> > +      */
> > +     for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
>
> Unnecessary braces.

The brace is unnecessary, but when the inner statement is split across
multiple lines I tend to prefer using braces. (That's why I did the
same in the other patch and you had the same feedback.) I couldn't
find any guidance about this in CodingStyle so I'm fine with getting
rid of the braces if that's what you prefer.

> > +             slot_handle_level_range(kvm, slot,
> > +                                     nested_mmu_try_split_huge_pages,
> > +                                     level, level, start, end - 1,
> > +                                     true, false);
>
> IMO it's worth running over by 4 chars to drop 2 lines:

Will do.

>
>         for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--)
>                 slot_handle_level_range(kvm, slot, nested_mmu_try_split_huge_pages,
>                                         level, level, start, end - 1, true, false);
> > +     }
> > +}
> > +
> >  /* Must be called with the mmu_lock held in write-mode. */
>
> Add a lockdep assertion, not a comment.

Agreed but this is an existing comment, so better left to a separate patch.
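For reference, a minimal sketch of the assertion being requested, using
lockdep_assert_held_write(), the generic lockdep helper for rwlocks held
for write; whether it replaces or merely supplements the comment is up
to that separate patch:

	void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
					  const struct kvm_memory_slot *memslot,
					  u64 start, u64 end,
					  int target_level)
	{
		lockdep_assert_held_write(&kvm->mmu_lock);

		if (!is_tdp_mmu_enabled(kvm))
			return;

		/* ... rest of the function unchanged from the hunk above ... */
	}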

>
> >  void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
> >                                  const struct kvm_memory_slot *memslot,
> >                                  u64 start, u64 end,
> >                                  int target_level)
> >  {
> > -     if (is_tdp_mmu_enabled(kvm))
> > -             kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end,
> > -                                              target_level, false);
> > +     if (!is_tdp_mmu_enabled(kvm))
> > +             return;
> > +
> > +     kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level,
> > +                                      false);
> > +
> > +     if (kvm_memslots_have_rmaps(kvm))
> > +             kvm_nested_mmu_try_split_huge_pages(kvm, memslot, start, end,
> > +                                                 target_level);
> >
> >       /*
> >       * A TLB flush is unnecessary at this point for the same reasons as in
> > @@ -6051,10 +6304,19 @@ void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
> >       u64 start = memslot->base_gfn;
> >       u64 end = start + memslot->npages;
> >
> > -     if (is_tdp_mmu_enabled(kvm)) {
> > -             read_lock(&kvm->mmu_lock);
> > -             kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
> > -             read_unlock(&kvm->mmu_lock);
> > +     if (!is_tdp_mmu_enabled(kvm))
> > +             return;
> > +
> > +     read_lock(&kvm->mmu_lock);
> > +     kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level,
> > +                                      true);
>
> Eh, let this poke out.

Will do :)

>
> > +     read_unlock(&kvm->mmu_lock);
> > +
> > +     if (kvm_memslots_have_rmaps(kvm)) {
> > +             write_lock(&kvm->mmu_lock);
> > +             kvm_nested_mmu_try_split_huge_pages(kvm, memslot, start, end,
> > +                                                 target_level);
> > +             write_unlock(&kvm->mmu_lock);
>
> Super duper nit: all other flows do rmaps first, then TDP MMU.  Might as well keep
> that ordering here, otherwise it suggests there's a reason to be different.

Will do.
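Concretely, a sketch of the reordered tail of
kvm_mmu_slot_try_split_huge_pages(), built purely from the hunks quoted
above, with the rmap walk ahead of the TDP MMU walk (v5 may of course
differ in the details):

	if (kvm_memslots_have_rmaps(kvm)) {
		write_lock(&kvm->mmu_lock);
		kvm_nested_mmu_try_split_huge_pages(kvm, memslot, start, end,
						    target_level);
		write_unlock(&kvm->mmu_lock);
	}

	read_lock(&kvm->mmu_lock);
	kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
	read_unlock(&kvm->mmu_lock);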
>
> >       }
> >
> >       /*
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index ab336f7c82e4..e123e24a130f 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -12161,6 +12161,12 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
> >                * page faults will create the large-page sptes.
> >                */
> >               kvm_mmu_zap_collapsible_sptes(kvm, new);
> > +
> > +             /*
> > +              * Free any memory left behind by eager page splitting. Ignore
> > +              * the module parameter since userspace might have changed it.
> > +              */
> > +             free_split_caches(kvm);
> >       } else {
> >               /*
> >                * Initially-all-set does not require write protecting any page,
> > --
> > 2.36.0.rc2.479.g8af0fa9b8e-goog
> >

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 03/20] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-05-05 21:50     ` Sean Christopherson
@ 2022-05-09 22:10       ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-05-09 22:10 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Thu, May 5, 2022 at 2:50 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 22, 2022, David Matlack wrote:
> > Instead of computing the shadow page role from scratch for every new
> > page, derive most of the information from the parent shadow page.  This
> > avoids redundant calculations and reduces the number of parameters to
> > kvm_mmu_get_page().
> >
> > Preemptively split out the role calculation to a separate function for
> > use in a following commit.
> >
> > No functional change intended.
> >
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c         | 96 +++++++++++++++++++++++-----------
> >  arch/x86/kvm/mmu/paging_tmpl.h |  9 ++--
> >  2 files changed, 71 insertions(+), 34 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index dc20eccd6a77..4249a771818b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2021,31 +2021,15 @@ static void clear_sp_write_flooding_count(u64 *spte)
> >       __clear_sp_write_flooding_count(sptep_to_sp(spte));
> >  }
> >
> > -static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > -                                          gfn_t gfn,
> > -                                          gva_t gaddr,
> > -                                          unsigned level,
> > -                                          bool direct,
> > -                                          unsigned int access)
> > +static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> > +                                          union kvm_mmu_page_role role)
> >  {
> > -     union kvm_mmu_page_role role;
> >       struct hlist_head *sp_list;
> > -     unsigned quadrant;
> >       struct kvm_mmu_page *sp;
> >       int ret;
> >       int collisions = 0;
> >       LIST_HEAD(invalid_list);
> >
> > -     role = vcpu->arch.mmu->root_role;
> > -     role.level = level;
> > -     role.direct = direct;
> > -     role.access = access;
> > -     if (role.has_4_byte_gpte) {
> > -             quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
> > -             quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
> > -             role.quadrant = quadrant;
> > -     }
> > -
>
> When you rebase to kvm/queue, the helper will need to deal with
>
>         if (level <= vcpu->arch.mmu->cpu_role.base.level)
>                 role.passthrough = 0;
>
> KVM should never create a passthrough huge page, so I believe it's just a matter
> of adding yet another boolean param to kvm_mmu_child_role().

+Lai Jiangshan

It looks like only root pages can be passthrough, so
kvm_mmu_child_role() can hard-code passthrough to 0. Lai do you agree?
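A rough sketch of what that would look like, assuming kvm_mmu_child_role()
derives the child role from the parent shadow page (the helper's exact
parameter list in the series may differ):

	static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
							  unsigned int access)
	{
		struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
		union kvm_mmu_page_role role = parent_sp->role;

		role.level--;
		role.direct = direct;
		role.access = access;
		/* Only root pages can be passthrough, so hard-code it to 0. */
		role.passthrough = 0;

		/* Quadrant handling for shadowed 32-bit paging omitted here. */
		return role;
	}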

>
>
> >       sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> >       for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> >               if (sp->gfn != gfn) {
> > @@ -2063,7 +2047,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >                        * Unsync pages must not be left as is, because the new
> >                        * upper-level page will be write-protected.
> >                        */
> > -                     if (level > PG_LEVEL_4K && sp->unsync)
> > +                     if (role.level > PG_LEVEL_4K && sp->unsync)
> >                               kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
> >                                                        &invalid_list);
> >                       continue;
>
> ...
>
> > @@ -3310,12 +3338,21 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
> >       return ret;
> >  }
> >
> > -static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
> > +static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
> >                           u8 level, bool direct)
> >  {
> > +     union kvm_mmu_page_role role;
> >       struct kvm_mmu_page *sp;
> >
> > -     sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
> > +     role = vcpu->arch.mmu->root_role;
> > +     role.level = level;
> > +     role.direct = direct;
> > +     role.access = ACC_ALL;
> > +
> > +     if (role.has_4_byte_gpte)
> > +             role.quadrant = quadrant;
>
> Maybe add a comment explaining the PAE and 32-bit paging paths share a call for
> allocating PDPTEs?  Otherwise it looks like passing a non-zero quadrant when the
> guest doesn't have 4-byte PTEs should be a bug.
>
> Hmm, even better, if the check is moved to the caller, then this can be:
>
>         role.level = level;
>         role.direct = direct;
>         role.access = ACC_ALL;
>         role.quadrant = quadrant;
>
>         WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
>         WARN_ON_ONCE(direct && role.has_4_byte_gpte);
>
> and no comment is necessary.

Sure.

>
> > +
> > +     sp = kvm_mmu_get_page(vcpu, gfn, role);
> >       ++sp->root_count;
> >
> >       return __pa(sp->spt);
> > @@ -3349,8 +3386,8 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> >               for (i = 0; i < 4; ++i) {
> >                       WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
> >
> > -                     root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
> > -                                           i << 30, PT32_ROOT_LEVEL, true);
> > +                     root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), i,
>
> The @quadrant here can be hardcoded to '0', has_4_byte_gpte is guaranteed to be
> false if the MMU is direct.  And then in the indirect path, set gva (and then
> quadrant) based on 'i' iff the guest is using 32-bit paging.
>
> Probably worth making it a separate patch just in case I'm forgetting something.
> Lightly tested...

Looks sane. I'll incorporate something like this in v5.

>
> --
> From: Sean Christopherson <seanjc@google.com>
> Date: Thu, 5 May 2022 14:19:35 -0700
> Subject: [PATCH] KVM: x86/mmu: Pass '0' for @gva when allocating root with
>  8-byte gpte
>
> Pass '0' instead of the "real" gva when allocating a direct PAE root,
> a.k.a. a direct PDPTE, and when allocating indirect roots that shadow
> 64-bit / 8-byte GPTEs.
>
> The @gva is only needed if the root is shadowing 32-bit paging in the
> guest, in which case KVM needs to use different shadow pages for each of
> the two 4-byte GPTEs covered by KVM's 8-byte PAE SPTE.
>
> For direct MMUs, there's obviously no shadowing, and for indirect MMUs that
> shadow 8-byte GPTEs the quadrant is always zero, so no @gva is needed.
>
> In anticipation of moving the quadrant logic into mmu_alloc_root(), WARN
> if a non-zero @gva is passed for !4-byte GPTEs, and WARN if 4-byte GPTEs
> are ever combined with a direct root (there's no shadowing, so TDP roots
> should ignore the GPTE size).
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index dc20eccd6a77..6dfa3cfa8394 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3313,8 +3313,12 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
>  static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
>                             u8 level, bool direct)
>  {
> +       union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
>         struct kvm_mmu_page *sp;
>
> +       WARN_ON_ONCE(gva && !role.has_4_byte_gpte);
> +       WARN_ON_ONCE(direct && role.has_4_byte_gpte);
> +
>         sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
>         ++sp->root_count;
>
> @@ -3349,8 +3353,8 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>                 for (i = 0; i < 4; ++i) {
>                         WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
>
> -                       root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
> -                                             i << 30, PT32_ROOT_LEVEL, true);
> +                       root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), 0,
> +                                             PT32_ROOT_LEVEL, true);
>                         mmu->pae_root[i] = root | PT_PRESENT_MASK |
>                                            shadow_me_mask;
>                 }
> @@ -3435,6 +3439,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>         u64 pdptrs[4], pm_mask;
>         gfn_t root_gfn, root_pgd;
>         hpa_t root;
> +       gva_t gva;
>         unsigned i;
>         int r;
>
> @@ -3508,6 +3513,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>                 }
>         }
>
> +       gva = 0;
>         for (i = 0; i < 4; ++i) {
>                 WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
>
> @@ -3517,9 +3523,11 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>                                 continue;
>                         }
>                         root_gfn = pdptrs[i] >> PAGE_SHIFT;
> +               } else if (mmu->cpu_role.base.level == PT32_ROOT_LEVEL) {
> +                       gva = i << 30;
>                 }
>
> -               root = mmu_alloc_root(vcpu, root_gfn, i << 30,
> +               root = mmu_alloc_root(vcpu, root_gfn, gva,
>                                       PT32_ROOT_LEVEL, false);
>                 mmu->pae_root[i] = root | pm_mask;
>         }
>
> base-commit: 8bae380ad7dd3c31266d3685841ea4ce574d462d
> --
>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 03/20] KVM: x86/mmu: Derive shadow MMU page role from parent
@ 2022-05-09 22:10       ` David Matlack
  0 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-05-09 22:10 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Thu, May 5, 2022 at 2:50 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 22, 2022, David Matlack wrote:
> > Instead of computing the shadow page role from scratch for every new
> > page, derive most of the information from the parent shadow page.  This
> > avoids redundant calculations and reduces the number of parameters to
> > kvm_mmu_get_page().
> >
> > Preemptively split out the role calculation to a separate function for
> > use in a following commit.
> >
> > No functional change intended.
> >
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c         | 96 +++++++++++++++++++++++-----------
> >  arch/x86/kvm/mmu/paging_tmpl.h |  9 ++--
> >  2 files changed, 71 insertions(+), 34 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index dc20eccd6a77..4249a771818b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2021,31 +2021,15 @@ static void clear_sp_write_flooding_count(u64 *spte)
> >       __clear_sp_write_flooding_count(sptep_to_sp(spte));
> >  }
> >
> > -static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > -                                          gfn_t gfn,
> > -                                          gva_t gaddr,
> > -                                          unsigned level,
> > -                                          bool direct,
> > -                                          unsigned int access)
> > +static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> > +                                          union kvm_mmu_page_role role)
> >  {
> > -     union kvm_mmu_page_role role;
> >       struct hlist_head *sp_list;
> > -     unsigned quadrant;
> >       struct kvm_mmu_page *sp;
> >       int ret;
> >       int collisions = 0;
> >       LIST_HEAD(invalid_list);
> >
> > -     role = vcpu->arch.mmu->root_role;
> > -     role.level = level;
> > -     role.direct = direct;
> > -     role.access = access;
> > -     if (role.has_4_byte_gpte) {
> > -             quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
> > -             quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
> > -             role.quadrant = quadrant;
> > -     }
> > -
>
> When you rebase to kvm/queue, the helper will need to deal with
>
>         if (level <= vcpu->arch.mmu->cpu_role.base.level)
>                 role.passthrough = 0;
>
> KVM should never create a passthrough huge page, so I believe it's just a matter
> of adding yet another boolean param to kvm_mmu_child_role().

+Lai Jiangshan

It looks like only root pages can be passthrough, so
kvm_mmu_child_role() can hard-code passthrough to 0. Lai do you agree?
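
For reference, a rough sketch of what that could look like -- illustrative
only, not the exact code from the series, with the quadrant handling left out:

        static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
                                                           u32 access)
        {
                struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
                union kvm_mmu_page_role role = parent_sp->role;

                role.level--;
                role.direct = direct;
                role.access = access;
                /* Only root pages can be passthrough, so children never are. */
                role.passthrough = 0;

                return role;
        }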

>
>
> >       sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> >       for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> >               if (sp->gfn != gfn) {
> > @@ -2063,7 +2047,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >                        * Unsync pages must not be left as is, because the new
> >                        * upper-level page will be write-protected.
> >                        */
> > -                     if (level > PG_LEVEL_4K && sp->unsync)
> > +                     if (role.level > PG_LEVEL_4K && sp->unsync)
> >                               kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
> >                                                        &invalid_list);
> >                       continue;
>
> ...
>
> > @@ -3310,12 +3338,21 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
> >       return ret;
> >  }
> >
> > -static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
> > +static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
> >                           u8 level, bool direct)
> >  {
> > +     union kvm_mmu_page_role role;
> >       struct kvm_mmu_page *sp;
> >
> > -     sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
> > +     role = vcpu->arch.mmu->root_role;
> > +     role.level = level;
> > +     role.direct = direct;
> > +     role.access = ACC_ALL;
> > +
> > +     if (role.has_4_byte_gpte)
> > +             role.quadrant = quadrant;
>
> Maybe add a comment explaining the PAE and 32-bit paging paths share a call for
> allocating PDPTEs?  Otherwise it looks like passing a non-zero quadrant when the
> guest doesn't have 4-byte PTEs should be a bug.
>
> Hmm, even better, if the check is moved to the caller, then this can be:
>
>         role.level = level;
>         role.direct = direct;
>         role.access = ACC_ALL;
>         role.quadrant = quadrant;
>
>         WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
>         WARN_ON_ONCE(direct && role.has_4_byte_gpte);
>
> and no comment is necessary.

Sure.

>
> > +
> > +     sp = kvm_mmu_get_page(vcpu, gfn, role);
> >       ++sp->root_count;
> >
> >       return __pa(sp->spt);
> > @@ -3349,8 +3386,8 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> >               for (i = 0; i < 4; ++i) {
> >                       WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
> >
> > -                     root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
> > -                                           i << 30, PT32_ROOT_LEVEL, true);
> > +                     root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), i,
>
> The @quadrant here can be hardcoded to '0' since has_4_byte_gpte is guaranteed to be
> false if the MMU is direct.  And then in the indirect path, set gva (and then
> quadrant) based on 'i' iff the guest is using 32-bit paging.
>
> Probably worth making it a separate patch just in case I'm forgetting something.
> Lightly tested...

Looks sane. I'll incorporate something like this in v5.

>
> --
> From: Sean Christopherson <seanjc@google.com>
> Date: Thu, 5 May 2022 14:19:35 -0700
> Subject: [PATCH] KVM: x86/mmu: Pass '0' for @gva when allocating root with
>  8-byte gpte
>
> Pass '0' instead of the "real" gva when allocating a direct PAE root,
> a.k.a. a direct PDPTE, and when allocating indirect roots that shadow
> 64-bit / 8-byte GPTEs.
>
> The @gva is only needed if the root is shadowing 32-bit paging in the
> guest, in which case KVM needs to use different shadow pages for each of
> the two 4-byte GPTEs covered by KVM's 8-byte PAE SPTE.
>
> For direct MMUs, there's obviously no shadowing, and for indirect MMUs with 8-byte
> GPTEs each shadow page covers the same range as the guest page table it shadows, so
> there's no quadrant to track.
>
> In anticipation of moving the quadrant logic into mmu_alloc_root(), WARN
> if a non-zero @gva is passed for !4-byte GPTEs, and WARN if 4-byte GPTEs
> are ever combined with a direct root (there's no shadowing, so TDP roots
> should ignore the GPTE size).
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index dc20eccd6a77..6dfa3cfa8394 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3313,8 +3313,12 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
>  static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
>                             u8 level, bool direct)
>  {
> +       union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
>         struct kvm_mmu_page *sp;
>
> +       WARN_ON_ONCE(gva && !role.has_4_byte_gpte);
> +       WARN_ON_ONCE(direct && role.has_4_byte_gpte);
> +
>         sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
>         ++sp->root_count;
>
> @@ -3349,8 +3353,8 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>                 for (i = 0; i < 4; ++i) {
>                         WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
>
> -                       root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
> -                                             i << 30, PT32_ROOT_LEVEL, true);
> +                       root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), 0,
> +                                             PT32_ROOT_LEVEL, true);
>                         mmu->pae_root[i] = root | PT_PRESENT_MASK |
>                                            shadow_me_mask;
>                 }
> @@ -3435,6 +3439,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>         u64 pdptrs[4], pm_mask;
>         gfn_t root_gfn, root_pgd;
>         hpa_t root;
> +       gva_t gva;
>         unsigned i;
>         int r;
>
> @@ -3508,6 +3513,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>                 }
>         }
>
> +       gva = 0;
>         for (i = 0; i < 4; ++i) {
>                 WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
>
> @@ -3517,9 +3523,11 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>                                 continue;
>                         }
>                         root_gfn = pdptrs[i] >> PAGE_SHIFT;
> +               } else if (mmu->cpu_role.base.level == PT32_ROOT_LEVEL) {
> +                       gva = i << 30;
>                 }
>
> -               root = mmu_alloc_root(vcpu, root_gfn, i << 30,
> +               root = mmu_alloc_root(vcpu, root_gfn, gva,
>                                       PT32_ROOT_LEVEL, false);
>                 mmu->pae_root[i] = root | pm_mask;
>         }
>
> base-commit: 8bae380ad7dd3c31266d3685841ea4ce574d462d
> --
>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 20/20] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
  2022-05-09 21:44       ` David Matlack
@ 2022-05-09 22:47         ` Sean Christopherson
  -1 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2022-05-09 22:47 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Mon, May 09, 2022, David Matlack wrote:
> On Mon, May 9, 2022 at 9:48 AM Sean Christopherson <seanjc@google.com> wrote:
> > > +static void kvm_nested_mmu_try_split_huge_pages(struct kvm *kvm,
> > > +                                             const struct kvm_memory_slot *slot,
> > > +                                             gfn_t start, gfn_t end,
> > > +                                             int target_level)
> > > +{
> > > +     int level;
> > > +
> > > +     /*
> > > +      * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
> > > +      * down to the target level. This ensures pages are recursively split
> > > +      * all the way to the target level. There's no need to split pages
> > > +      * already at the target level.
> > > +      */
> > > +     for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
> >
> > Unnecessary braces.
> 
> The brace is unnecessary, but when the inner statement is split across
> multiple lines I tend to prefer using braces. (That's why I did the
> same in the other patch and you had the same feedback.) I couldn't
> find any guidance about this in CodingStyle so I'm fine with getting
> rid of the braces if that's what you prefer.

The style varies by subsystem, e.g. I believe perf requires braces in this case.
Absent a "hard" rule, I value consistency above all else, e.g. because KVM doesn't
(usually) include the braces, I started looking for the second statement, i.e. the
lack of an opening brace is an indicator (to me at least) that a loop/if contains
a single statement.

I actually like Golang's forced braces, but mostly because they are 100% mandatory
and so all code is consistent.
 
> > > +             slot_handle_level_range(kvm, slot,
> > > +                                     nested_mmu_try_split_huge_pages,
> > > +                                     level, level, start, end - 1,
> > > +                                     true, false);
> >
> > IMO it's worth running over by 4 chars to drop 2 lines:
> 
> Will do.
> 
> >
> >         for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--)
> >                 slot_handle_level_range(kvm, slot, nested_mmu_try_split_huge_pages,
> >                                         level, level, start, end - 1, true, false);
> > > +     }
> > > +}
> > > +
> > >  /* Must be called with the mmu_lock held in write-mode. */
> >
> > Add a lockdep assertion, not a comment.
> 
> Agreed but this is an existing comment, so better left to a separate patch.

Doh, I mistook the /* for a +.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/20] KVM: x86/mmu: Allow for NULL vcpu pointer in __kvm_mmu_get_shadow_page()
  2022-05-09 21:26       ` David Matlack
@ 2022-05-09 22:56         ` Sean Christopherson
  -1 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2022-05-09 22:56 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Mon, May 09, 2022, David Matlack wrote:
> On Thu, May 5, 2022 at 4:33 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Apr 22, 2022, David Matlack wrote:
> > > Allow the vcpu pointer in __kvm_mmu_get_shadow_page() to be NULL. Rename
> > > it to vcpu_or_null to prevent future commits from accidentally taking
> > > dependency on it without first considering the NULL case.
> > >
> > > The vcpu pointer is only used for syncing indirect shadow pages in
> > > kvm_mmu_find_shadow_page(). A vcpu pointer is not required for
> > > correctness since unsync pages can simply be zapped. But this should
> > > never occur in practice, since the only use-case for passing a NULL vCPU
> > > pointer is eager page splitting which will only request direct shadow
> > > pages (which can never be unsync).
> > >
> > > Even though __kvm_mmu_get_shadow_page() can gracefully handle a NULL
> > > vcpu, add a WARN() that will fire if __kvm_mmu_get_shadow_page() is ever
> > > called to get an indirect shadow page with a NULL vCPU pointer, since
> > > zapping unsync SPs is a performance overhead that should be considered.
> > >
> > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c | 40 ++++++++++++++++++++++++++++++++--------
> > >  1 file changed, 32 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 04029c01aebd..21407bd4435a 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -1845,16 +1845,27 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
> > >         &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)])     \
> > >               if ((_sp)->gfn != (_gfn) || (_sp)->role.direct) {} else
> > >
> > > -static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> > > -                      struct list_head *invalid_list)
> > > +static int __kvm_sync_page(struct kvm *kvm, struct kvm_vcpu *vcpu_or_null,
> > > +                        struct kvm_mmu_page *sp,
> > > +                        struct list_head *invalid_list)
> > >  {
> > > -     int ret = vcpu->arch.mmu->sync_page(vcpu, sp);
> > > +     int ret = -1;
> > > +
> > > +     if (vcpu_or_null)
> >
> > This should never happen.  I like the idea of warning early, but I really don't
> > like that the WARN is far removed from the code that actually depends on @vcpu
> > being non-NULL. Case in point, KVM should have bailed on the WARN and never
> > reached this point.  And the inner __kvm_sync_page() is completely unnecessary.
> 
> Yeah that's fair.
> 
> >
> > I also don't love the vcpu_or_null terminology; I get the intent, but it doesn't
> > really help the reader understand why/when it's NULL.
> 
> Eh, I don't think it needs to encode why or when. It just needs to
> flag to the reader (and future code authors) that this vcpu pointer
> (unlike all other vcpu pointers in KVM) is NULL in certain cases.

My objection is that without the why/when, developers that aren't familiar with
this code won't know the rules for using vcpu_or_null.  E.g. I don't want to end
up with

	if (vcpu_or_null)
		do x;
	else
		do y;

because inevitably it'll become unclear whether or not that code is actually _correct_.
It might not #GP on a NULL pointer, but it doesn't mean it's correct.

> > I played around with casting, e.g. to/from an unsigned long or void *, to prevent
> > usage, but that doesn't work very well because 'unsigned long' ends up being
> > awkward/confusing, and 'void *' is easily lost on a function call.  And both
> > lose type safety :-(
> 
> Yet another shortcoming of C :(

And lack of closures, which would work very well here.

> (The other being our other discussion about the RET_PF* return codes
> getting easily misinterpreted as KVM's magic return-to-user /
> continue-running-guest return codes.)
> 
> Makes me miss Rust!
> 
> >
> > All in all, I think I'd prefer this patch to simply be a KVM_BUG_ON() if
> > kvm_mmu_find_shadow_page() encounters an unsync page.  Less churn, and IMO there's
> > no real loss in robustness, e.g. we'd really have to screw up code review and
> > testing to introduce a null vCPU pointer dereference in this code.
> 
> Agreed about moving the check here and dropping __kvm_sync_page(). But
> I would prefer to retain the vcpu_or_null name (or at least something
> other than "vcpu" to indicate there's something non-standard about
> this pointer).

The least awful idea I've come up with is wrapping the vCPU in a struct, e.g.

	struct sync_page_info {
		void *vcpu;
	}

That provides the contextual information I want, and also provides the hint that
something is odd about the vcpu, which you want.  It's like a very poor man's closure :-)
	
The struct could even be passed by value to avoid the minuscule overhead, and to
make readers look extra hard because it's that much more weird.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3d102522804a..068be77a4fff 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2003,8 +2003,13 @@ static void clear_sp_write_flooding_count(u64 *spte)
        __clear_sp_write_flooding_count(sptep_to_sp(spte));
 }

+/* Wrapper to make it difficult to dereference a potentially NULL @vcpu. */
+struct sync_page_info {
+       void *vcpu;
+};
+
 static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
-                                                    struct kvm_vcpu *vcpu,
+                                                    struct sync_page_info spi,
                                                     gfn_t gfn,
                                                     struct hlist_head *sp_list,
                                                     union kvm_mmu_page_role role)
@@ -2041,6 +2046,13 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
                        goto out;

                if (sp->unsync) {
+                       /*
+                        * Getting indirect shadow pages without a valid @spi
+                        * is not supported, i.e. this should never happen.
+                        */
+                       if (KVM_BUG_ON(!spi.vcpu, kvm))
+                               break;
+
                        /*
                         * The page is good, but is stale.  kvm_sync_page does
                         * get the latest guest state, but (unlike mmu_unsync_children)
@@ -2053,7 +2065,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
                         * If the sync fails, the page is zapped.  If so, break
                         * in order to rebuild it.
                         */
-                       ret = kvm_sync_page(vcpu, sp, &invalid_list);
+                       ret = kvm_sync_page(spi.vcpu, sp, &invalid_list);
                        if (ret < 0)
                                break;

@@ -2120,7 +2132,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
 }

 static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
-                                                     struct kvm_vcpu *vcpu,
+                                                     struct sync_page_info spi,
                                                      struct shadow_page_caches *caches,
                                                      gfn_t gfn,
                                                      union kvm_mmu_page_role role)
@@ -2131,7 +2143,7 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,

        sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];

-       sp = kvm_mmu_find_shadow_page(kvm, vcpu, gfn, sp_list, role);
+       sp = kvm_mmu_find_shadow_page(kvm, spi, gfn, sp_list, role);
        if (!sp) {
                created = true;
                sp = kvm_mmu_alloc_shadow_page(kvm, caches, gfn, sp_list, role);
@@ -2151,7 +2163,11 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
                .gfn_array_cache = &vcpu->arch.mmu_gfn_array_cache,
        };

-       return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
+       struct sync_page_info spi = {
+               .vcpu = vcpu,
+       };
+
+       return __kvm_mmu_get_shadow_page(vcpu->kvm, spi, &caches, gfn, role);
 }

 static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/20] KVM: x86/mmu: Allow for NULL vcpu pointer in __kvm_mmu_get_shadow_page()
  2022-05-09 22:56         ` Sean Christopherson
@ 2022-05-09 23:59           ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-05-09 23:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Mon, May 9, 2022 at 3:57 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, May 09, 2022, David Matlack wrote:
> > On Thu, May 5, 2022 at 4:33 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Fri, Apr 22, 2022, David Matlack wrote:
> > > > Allow the vcpu pointer in __kvm_mmu_get_shadow_page() to be NULL. Rename
> > > > it to vcpu_or_null to prevent future commits from accidentally taking
> > > > dependency on it without first considering the NULL case.
> > > >
> > > > The vcpu pointer is only used for syncing indirect shadow pages in
> > > > kvm_mmu_find_shadow_page(). A vcpu pointer is not required for
> > > > correctness since unsync pages can simply be zapped. But this should
> > > > never occur in practice, since the only use-case for passing a NULL vCPU
> > > > pointer is eager page splitting which will only request direct shadow
> > > > pages (which can never be unsync).
> > > >
> > > > Even though __kvm_mmu_get_shadow_page() can gracefully handle a NULL
> > > > vcpu, add a WARN() that will fire if __kvm_mmu_get_shadow_page() is ever
> > > > called to get an indirect shadow page with a NULL vCPU pointer, since
> > > > zapping unsync SPs is a performance overhead that should be considered.
> > > >
> > > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > > ---
> > > >  arch/x86/kvm/mmu/mmu.c | 40 ++++++++++++++++++++++++++++++++--------
> > > >  1 file changed, 32 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 04029c01aebd..21407bd4435a 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -1845,16 +1845,27 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
> > > >         &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)])     \
> > > >               if ((_sp)->gfn != (_gfn) || (_sp)->role.direct) {} else
> > > >
> > > > -static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> > > > -                      struct list_head *invalid_list)
> > > > +static int __kvm_sync_page(struct kvm *kvm, struct kvm_vcpu *vcpu_or_null,
> > > > +                        struct kvm_mmu_page *sp,
> > > > +                        struct list_head *invalid_list)
> > > >  {
> > > > -     int ret = vcpu->arch.mmu->sync_page(vcpu, sp);
> > > > +     int ret = -1;
> > > > +
> > > > +     if (vcpu_or_null)
> > >
> > > This should never happen.  I like the idea of warning early, but I really don't
> > > like that the WARN is far removed from the code that actually depends on @vcpu
> > > being non-NULL. Case in point, KVM should have bailed on the WARN and never
> > > reached this point.  And the inner __kvm_sync_page() is completely unnecessary.
> >
> > Yeah that's fair.
> >
> > >
> > > I also don't love the vcpu_or_null terminology; I get the intent, but it doesn't
> > > really help the reader understand why/when it's NULL.
> >
> > Eh, I don't think it needs to encode why or when. It just needs to
> > flag to the reader (and future code authors) that this vcpu pointer
> > (unlike all other vcpu pointers in KVM) is NULL in certain cases.
>
> My objection is that without the why/when, developers that aren't familiar with
> this code won't know the rules for using vcpu_or_null.  E.g. I don't want to end
> up with
>
>         if (vcpu_or_null)
>                 do x;
>         else
>                 do y;
>
> because inevitably it'll become unclear whether or not that code is actually _correct_.
> It might not #GP on a NULL pointer, but it doesn't mean it's correct.

Ah, right. And that's actually why I put the big comment and WARN in
__kvm_mmu_get_shadow_page(). Readers could easily jump to where
vcpu_or_null is passed in and see the rules around it. But if we move
the WARN to the kvm_sync_page() call, I agree it will be harder for
readers to know the rules and "vcpu_or_null" starts to become a risky
variable.

>
> > > I played around with casting, e.g. to/from an unsigned long or void *, to prevent
> > > usage, but that doesn't work very well because 'unsigned long' ends up being
> > > awkward/confusing, and 'void *' is easily lost on a function call.  And both
> > > lose type safety :-(
> >
> > Yet another shortcoming of C :(
>
> And lack of closures, which would work very well here.
>
> > (The other being our other discussion about the RET_PF* return codes
> > getting easily misinterpreted as KVM's magic return-to-user /
> > continue-running-guest return codes.)
> >
> > Makes me miss Rust!
> >
> > >
> > > All in all, I think I'd prefer this patch to simply be a KVM_BUG_ON() if
> > > kvm_mmu_find_shadow_page() encounters an unsync page.  Less churn, and IMO there's
> > > no real loss in robustness, e.g. we'd really have to screw up code review and
> > > testing to introduce a null vCPU pointer dereference in this code.
> >
> > Agreed about moving the check here and dropping __kvm_sync_page(). But
> > I would prefer to retain the vcpu_or_null name (or at least something
> > other than "vcpu" to indicate there's something non-standard about
> > this pointer).
>
> The least awful idea I've come up with is wrapping the vCPU in a struct, e.g.
>
>         struct sync_page_info {
>                 void *vcpu;
>         }
>
> That provides the contextual information I want, and also provides the hint that
> something is odd about the vcpu, which you want.  It's like a very poor man's closure :-)
>
> The struct could even be passed by value to avoid the minuscule overhead, and to
> make readers look extra hard because it's that much more weird.

Interesting idea. I'll give it a shot.
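
To sketch how the split side would look under that scheme (hypothetical, and
assuming the sync_page_info wrapper from your diff below): eager page splitting
would pass an empty wrapper, so spi.vcpu stays NULL and any unsync hit trips
the KVM_BUG_ON:

        /* Hypothetical split-side caller; "split_caches" is illustrative. */
        struct sync_page_info spi = {};

        sp = __kvm_mmu_get_shadow_page(kvm, spi, &split_caches, gfn, role);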

>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3d102522804a..068be77a4fff 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2003,8 +2003,13 @@ static void clear_sp_write_flooding_count(u64 *spte)
>         __clear_sp_write_flooding_count(sptep_to_sp(spte));
>  }
>
> +/* Wrapper to make it difficult to dereference a potentially NULL @vcpu. */
> +struct sync_page_info {
> +       void *vcpu;
> +};
> +
>  static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
> -                                                    struct kvm_vcpu *vcpu,
> +                                                    struct sync_page_info spi,
>                                                      gfn_t gfn,
>                                                      struct hlist_head *sp_list,
>                                                      union kvm_mmu_page_role role)
> @@ -2041,6 +2046,13 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
>                         goto out;
>
>                 if (sp->unsync) {
> +                       /*
> +                        * Getting indirect shadow pages without a valid @spi
> +                        * is not supported, i.e. this should never happen.
> +                        */
> +                       if (KVM_BUG_ON(!spi.vcpu, kvm))
> +                               break;
> +
>                         /*
>                          * The page is good, but is stale.  kvm_sync_page does
>                          * get the latest guest state, but (unlike mmu_unsync_children)
> @@ -2053,7 +2065,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
>                          * If the sync fails, the page is zapped.  If so, break
>                          * in order to rebuild it.
>                          */
> -                       ret = kvm_sync_page(vcpu, sp, &invalid_list);
> +                       ret = kvm_sync_page(spi.vcpu, sp, &invalid_list);
>                         if (ret < 0)
>                                 break;
>
> @@ -2120,7 +2132,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
>  }
>
>  static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
> -                                                     struct kvm_vcpu *vcpu,
> +                                                     struct sync_page_info spi,
>                                                       struct shadow_page_caches *caches,
>                                                       gfn_t gfn,
>                                                       union kvm_mmu_page_role role)
> @@ -2131,7 +2143,7 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
>
>         sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>
> -       sp = kvm_mmu_find_shadow_page(kvm, vcpu, gfn, sp_list, role);
> +       sp = kvm_mmu_find_shadow_page(kvm, spi, gfn, sp_list, role);
>         if (!sp) {
>                 created = true;
>                 sp = kvm_mmu_alloc_shadow_page(kvm, caches, gfn, sp_list, role);
> @@ -2151,7 +2163,11 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
>                 .gfn_array_cache = &vcpu->arch.mmu_gfn_array_cache,
>         };
>
> -       return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
> +       struct sync_page_info spi = {
> +               .vcpu = vcpu,
> +       };
> +
> +       return __kvm_mmu_get_shadow_page(vcpu->kvm, spi, &caches, gfn, role);
>  }
>
>  static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 03/20] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-05-09 22:10       ` David Matlack
@ 2022-05-10  2:38         ` Lai Jiangshan
  -1 siblings, 0 replies; 120+ messages in thread
From: Lai Jiangshan @ 2022-05-10  2:38 UTC (permalink / raw)
  To: David Matlack
  Cc: Sean Christopherson, Paolo Bonzini, Marc Zyngier, Huacai Chen,
	Aleksandar Markovic, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Andrew Jones, Ben Gardon, Peter Xu,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Tue, May 10, 2022 at 6:10 AM David Matlack <dmatlack@google.com> wrote:

>
> +Lai Jiangshan
>
> It looks like only root pages can be passthrough, so
> kvm_mmu_child_role() can hard-code passthrough to 0. Lai do you agree?
>


Agree.

Thanks
Lai

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 03/20] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-05-09 21:04       ` David Matlack
@ 2022-05-10  2:58         ` Lai Jiangshan
  -1 siblings, 0 replies; 120+ messages in thread
From: Lai Jiangshan @ 2022-05-10  2:58 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, Peter Xu,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Tue, May 10, 2022 at 5:04 AM David Matlack <dmatlack@google.com> wrote:
>
> On Sat, May 7, 2022 at 1:28 AM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
> >
> >
> >
> > On 2022/4/23 05:05, David Matlack wrote:
> > > Instead of computing the shadow page role from scratch for every new
> > > page, derive most of the information from the parent shadow page.  This
> > > avoids redundant calculations and reduces the number of parameters to
> > > kvm_mmu_get_page().
> > >
> > > Preemptively split out the role calculation to a separate function for
> > > use in a following commit.
> > >
> > > No functional change intended.
> > >
> > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > ---
> > >   arch/x86/kvm/mmu/mmu.c         | 96 +++++++++++++++++++++++-----------
> > >   arch/x86/kvm/mmu/paging_tmpl.h |  9 ++--
> > >   2 files changed, 71 insertions(+), 34 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index dc20eccd6a77..4249a771818b 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -2021,31 +2021,15 @@ static void clear_sp_write_flooding_count(u64 *spte)
> > >       __clear_sp_write_flooding_count(sptep_to_sp(spte));
> > >   }
> > >
> > > -static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > -                                          gfn_t gfn,
> > > -                                          gva_t gaddr,
> > > -                                          unsigned level,
> > > -                                          bool direct,
> > > -                                          unsigned int access)
> > > +static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> > > +                                          union kvm_mmu_page_role role)
> > >   {
> > > -     union kvm_mmu_page_role role;
> > >       struct hlist_head *sp_list;
> > > -     unsigned quadrant;
> > >       struct kvm_mmu_page *sp;
> > >       int ret;
> > >       int collisions = 0;
> > >       LIST_HEAD(invalid_list);
> > >
> > > -     role = vcpu->arch.mmu->root_role;
> > > -     role.level = level;
> > > -     role.direct = direct;
> > > -     role.access = access;
> > > -     if (role.has_4_byte_gpte) {
> > > -             quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
> > > -             quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
> > > -             role.quadrant = quadrant;
> > > -     }
> >
> >
> > I don't think replacing it with kvm_mmu_child_role() can reduce any calculations.
> >
> > role.level, role.direct, role.access and role.quadrant still need to be
> > calculated.  And the old code is still in mmu_alloc_root().
>
> You are correct. Instead of saying "less calculations" I should have
> said "eliminates the dependency on vcpu->arch.mmu->root_role".
>
> >
> > I think kvm_mmu_get_shadow_page() can keep the those parameters and
> > kvm_mmu_child_role() is only introduced for nested_mmu_get_sp_for_split().
> >
> > Both kvm_mmu_get_shadow_page() and nested_mmu_get_sp_for_split() call
> > __kvm_mmu_get_shadow_page() which uses role as a parameter.
>
> I agree this would work, but I think the end result would be more
> difficult to read.
>
> The way I've implemented it there are two ways the SP roles are calculated:
>
> 1. For root SPs: Derive the role from vCPU root_role and caller-provided inputs.
> 2. For child SPs: Derive the role from parent SP and caller-provided inputs.
>
> Your proposal would still have two ways to calculate the SP role, but
> split along a different dimension:
>
> 1. For vCPUs creating SPs: Derive the role from vCPU root_role and
> caller-provided inputs.
> 2. For Eager Page Splitting creating SPs: Derive the role from parent
> SP and caller-provided inputs.
>
> With your proposal, it is less obvious that eager page splitting is
> going to end up with the correct role. Whereas if we use the same
> calculation for all child SPs, it is immediately obvious.


In this patchset, there are still two ways to calculate the SP role,
including the one in mmu_alloc_root(), which I dislike.

The old code is just moved into mmu_alloc_root() even though
kvm_mmu_child_role() is introduced.

>
> >
> >
> >
> > > -
> > >       sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> > >       for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> > >               if (sp->gfn != gfn) {
> > > @@ -2063,7 +2047,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >                        * Unsync pages must not be left as is, because the new
> > >                        * upper-level page will be write-protected.
> > >                        */
> > > -                     if (level > PG_LEVEL_4K && sp->unsync)
> > > +                     if (role.level > PG_LEVEL_4K && sp->unsync)
> > >                               kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
> > >                                                        &invalid_list);
> > >                       continue;
> > > @@ -2104,14 +2088,14 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >
> > >       ++vcpu->kvm->stat.mmu_cache_miss;
> > >
> > > -     sp = kvm_mmu_alloc_page(vcpu, direct);
> > > +     sp = kvm_mmu_alloc_page(vcpu, role.direct);
> > >
> > >       sp->gfn = gfn;
> > >       sp->role = role;
> > >       hlist_add_head(&sp->hash_link, sp_list);
> > > -     if (!direct) {
> > > +     if (!role.direct) {
> > >               account_shadowed(vcpu->kvm, sp);
> > > -             if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
> > > +             if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
> > >                       kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
> > >       }
> > >       trace_kvm_mmu_get_page(sp, true);
> > > @@ -2123,6 +2107,51 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >       return sp;
> > >   }
> > >
> > > +static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
> > > +{
> > > +     struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
> > > +     union kvm_mmu_page_role role;
> > > +
> > > +     role = parent_sp->role;
> > > +     role.level--;
> > > +     role.access = access;
> > > +     role.direct = direct;
> > > +
> > > +     /*
> > > +      * If the guest has 4-byte PTEs then that means it's using 32-bit,
> > > +      * 2-level, non-PAE paging. KVM shadows such guests using 4 PAE page
> > > +      * directories, each mapping 1/4 of the guest's linear address space
> > > +      * (1GiB). The shadow pages for those 4 page directories are
> > > +      * pre-allocated and assigned a separate quadrant in their role.
> >
> >
> > It is not going to be true in patchset:
> > [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots
> >
> > https://lore.kernel.org/lkml/20220503150735.32723-1-jiangshanlai@gmail.com/
> >
> > The shadow pages for those 4 page directories are also allocated on demand.
>
> Ack. I can even just drop this sentence in v5, it's just background information.

No. If one-off special shadow pages are used, kvm_mmu_child_role()
should be:

+       if (role.has_4_byte_gpte) {
+               if (role.level == PG_LEVEL_4K)
+                       role.quadrant = (sptep - parent_sp->spt) % 2;
+               if (role.level == PG_LEVEL_2M)
+                       role.quadrant = (sptep - parent_sp->spt) % 4;
+       }
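
To make the index arithmetic concrete, here is a standalone, throwaway
illustration (not kernel code) of why the "% 2" picks the quadrant for the
PG_LEVEL_4K case. It assumes the usual layout: a 32-bit guest page
directory entry covers 4MiB and is shadowed by two PAE page tables, so the
two shadow PDEs for guest PDE i sit at parent slots 2 * (i % 256) and
2 * (i % 256) + 1 of the shadowing PAE page directory.

    #include <stdio.h>

    int main(void)
    {
            unsigned int i;                          /* guest PDE index, 0..1023 */

            for (i = 0; i < 1024; i += 300) {
                    unsigned int pd = i / 256;       /* which of the 4 PAE page directories */
                    unsigned int lo = 2 * (i % 256); /* parent slot of the low 2MiB half  */
                    unsigned int hi = lo + 1;        /* parent slot of the high 2MiB half */

                    printf("guest PDE %4u -> PAE PD %u, slots %3u/%3u -> quadrants %u/%u\n",
                           i, pd, lo, hi, lo % 2, hi % 2);
            }
            return 0;
    }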


And if one-off special shadow pages are merged first, you don't
need any calculation in mmu_alloc_root(); you can just directly use
    sp = kvm_mmu_get_page(vcpu, gfn, vcpu->arch.mmu->root_role);
because vcpu->arch.mmu->root_role is always the real role of the root
sp no matter whether it is a normal root sp or a one-off special sp.

I hope you will pardon me for touting my patchset and asking
people to review it in your threads.

Thanks
Lai

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 03/20] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-05-10  2:58         ` Lai Jiangshan
@ 2022-05-10 13:31           ` Sean Christopherson
  -1 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2022-05-10 13:31 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Ben Gardon, Aleksandar Markovic, Palmer Dabbelt, Paul Walmsley,
	Marc Zyngier, David Matlack, Paolo Bonzini, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Tue, May 10, 2022, Lai Jiangshan wrote:
> 
> On Tue, May 10, 2022 at 5:04 AM David Matlack <dmatlack@google.com> wrote:
> >
> > On Sat, May 7, 2022 at 1:28 AM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
> > > > +static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
> > > > +{
> > > > +     struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
> > > > +     union kvm_mmu_page_role role;
> > > > +
> > > > +     role = parent_sp->role;
> > > > +     role.level--;
> > > > +     role.access = access;
> > > > +     role.direct = direct;
> > > > +
> > > > +     /*
> > > > +      * If the guest has 4-byte PTEs then that means it's using 32-bit,
> > > > +      * 2-level, non-PAE paging. KVM shadows such guests using 4 PAE page
> > > > +      * directories, each mapping 1/4 of the guest's linear address space
> > > > +      * (1GiB). The shadow pages for those 4 page directories are
> > > > +      * pre-allocated and assigned a separate quadrant in their role.
> > >
> > >
> > > It is not going to be true in patchset:
> > > [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots
> > >
> > > https://lore.kernel.org/lkml/20220503150735.32723-1-jiangshanlai@gmail.com/
> > >
> > > The shadow pages for those 4 page directories are also allocated on demand.
> >
> > Ack. I can even just drop this sentence in v5, it's just background information.
> 
> No, if one-off special shadow pages are used.
> 
> kvm_mmu_child_role() should be:
> 
> +       if (role.has_4_byte_gpte) {
> +               if (role.level == PG_LEVEL_4K)
> +                       role.quadrant = (sptep - parent_sp->spt) % 2;
> +               if (role.level == PG_LEVEL_2M)

If the code ends up looking anything like this, please use PT32_ROOT_LEVEL instead
of PG_LEVEL_2M.  PSE paging has 4M huge pages, so using PG_LEVEL_2M is confusing.

Or even better might be to do:

		if (role.level == PG_LEVEL_4K)
			...
		else
			...

Or arithmetic using role.level directly, a la the current code.
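
For illustration only, a sketch of that last option (assuming the usual KVM
x86 values PG_LEVEL_4K == 1 and PG_LEVEL_2M == 2, so 1 << role.level yields
2 and 4 respectively):

		if (role.has_4_byte_gpte)
			role.quadrant = (sptep - parent_sp->spt) % (1 << role.level);

which is the form the generalized diff later in the thread settles on.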

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 03/20] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-05-10  2:58         ` Lai Jiangshan
@ 2022-05-12 16:10           ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-05-12 16:10 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, Peter Xu,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Mon, May 9, 2022 at 7:58 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
>
>
> On Tue, May 10, 2022 at 5:04 AM David Matlack <dmatlack@google.com> wrote:
> >
> > On Sat, May 7, 2022 at 1:28 AM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
> > >
> > >
> > >
> > > On 2022/4/23 05:05, David Matlack wrote:
> > > > Instead of computing the shadow page role from scratch for every new
> > > > page, derive most of the information from the parent shadow page.  This
> > > > avoids redundant calculations and reduces the number of parameters to
> > > > kvm_mmu_get_page().
> > > >
> > > > Preemptively split out the role calculation to a separate function for
> > > > use in a following commit.
> > > >
> > > > No functional change intended.
> > > >
> > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > > ---
> > > >   arch/x86/kvm/mmu/mmu.c         | 96 +++++++++++++++++++++++-----------
> > > >   arch/x86/kvm/mmu/paging_tmpl.h |  9 ++--
> > > >   2 files changed, 71 insertions(+), 34 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index dc20eccd6a77..4249a771818b 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -2021,31 +2021,15 @@ static void clear_sp_write_flooding_count(u64 *spte)
> > > >       __clear_sp_write_flooding_count(sptep_to_sp(spte));
> > > >   }
> > > >
> > > > -static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > > -                                          gfn_t gfn,
> > > > -                                          gva_t gaddr,
> > > > -                                          unsigned level,
> > > > -                                          bool direct,
> > > > -                                          unsigned int access)
> > > > +static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> > > > +                                          union kvm_mmu_page_role role)
> > > >   {
> > > > -     union kvm_mmu_page_role role;
> > > >       struct hlist_head *sp_list;
> > > > -     unsigned quadrant;
> > > >       struct kvm_mmu_page *sp;
> > > >       int ret;
> > > >       int collisions = 0;
> > > >       LIST_HEAD(invalid_list);
> > > >
> > > > -     role = vcpu->arch.mmu->root_role;
> > > > -     role.level = level;
> > > > -     role.direct = direct;
> > > > -     role.access = access;
> > > > -     if (role.has_4_byte_gpte) {
> > > > -             quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
> > > > -             quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
> > > > -             role.quadrant = quadrant;
> > > > -     }
> > >
> > >
> > > I don't think replacing it with kvm_mmu_child_role() can reduce any calculations.
> > >
> > > role.level, role.direct, role.access and role.quadrant still need to be
> > > calculated.  And the old code is still in mmu_alloc_root().
> >
> > You are correct. Instead of saying "less calculations" I should have
> > said "eliminates the dependency on vcpu->arch.mmu->root_role".
> >
> > >
> > > I think kvm_mmu_get_shadow_page() can keep the those parameters and
> > > kvm_mmu_child_role() is only introduced for nested_mmu_get_sp_for_split().
> > >
> > > Both kvm_mmu_get_shadow_page() and nested_mmu_get_sp_for_split() call
> > > __kvm_mmu_get_shadow_page() which uses role as a parameter.
> >
> > I agree this would work, but I think the end result would be more
> > difficult to read.
> >
> > The way I've implemented it there are two ways the SP roles are calculated:
> >
> > 1. For root SPs: Derive the role from vCPU root_role and caller-provided inputs.
> > 2. For child SPs: Derive the role from parent SP and caller-provided inputs.
> >
> > Your proposal would still have two ways to calculate the SP role, but
> > split along a different dimension:
> >
> > 1. For vCPUs creating SPs: Derive the role from vCPU root_role and
> > caller-provided inputs.
> > 2. For Eager Page Splitting creating SPs: Derive the role from parent
> > SP and caller-provided inputs.
> >
> > With your proposal, it is less obvious that eager page splitting is
> > going to end up with the correct role. Whereas if we use the same
> > calculation for all child SPs, it is immediately obvious.
>
>
> In this patchset, there are still two ways to calculate the SP role
> including the one in mmu_alloc_root() which I dislike.

My point is there will be two ways to calculate the SP role either
way. Can you explain why you dislike calculating the role in
mmu_alloc_root()? As you point out later, that code will disappear
anyway as soon as your series is merged.

>
> The old code is just moved into mmu_alloc_root() even kvm_mmu_child_role()
> is introduced.
>
> >
> > >
> > >
> > >
> > > > -
> > > >       sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> > > >       for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> > > >               if (sp->gfn != gfn) {
> > > > @@ -2063,7 +2047,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > >                        * Unsync pages must not be left as is, because the new
> > > >                        * upper-level page will be write-protected.
> > > >                        */
> > > > -                     if (level > PG_LEVEL_4K && sp->unsync)
> > > > +                     if (role.level > PG_LEVEL_4K && sp->unsync)
> > > >                               kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
> > > >                                                        &invalid_list);
> > > >                       continue;
> > > > @@ -2104,14 +2088,14 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > >
> > > >       ++vcpu->kvm->stat.mmu_cache_miss;
> > > >
> > > > -     sp = kvm_mmu_alloc_page(vcpu, direct);
> > > > +     sp = kvm_mmu_alloc_page(vcpu, role.direct);
> > > >
> > > >       sp->gfn = gfn;
> > > >       sp->role = role;
> > > >       hlist_add_head(&sp->hash_link, sp_list);
> > > > -     if (!direct) {
> > > > +     if (!role.direct) {
> > > >               account_shadowed(vcpu->kvm, sp);
> > > > -             if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
> > > > +             if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
> > > >                       kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
> > > >       }
> > > >       trace_kvm_mmu_get_page(sp, true);
> > > > @@ -2123,6 +2107,51 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > >       return sp;
> > > >   }
> > > >
> > > > +static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
> > > > +{
> > > > +     struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
> > > > +     union kvm_mmu_page_role role;
> > > > +
> > > > +     role = parent_sp->role;
> > > > +     role.level--;
> > > > +     role.access = access;
> > > > +     role.direct = direct;
> > > > +
> > > > +     /*
> > > > +      * If the guest has 4-byte PTEs then that means it's using 32-bit,
> > > > +      * 2-level, non-PAE paging. KVM shadows such guests using 4 PAE page
> > > > +      * directories, each mapping 1/4 of the guest's linear address space
> > > > +      * (1GiB). The shadow pages for those 4 page directories are
> > > > +      * pre-allocated and assigned a separate quadrant in their role.
> > >
> > >
> > > It is not going to be true in patchset:
> > > [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots
> > >
> > > https://lore.kernel.org/lkml/20220503150735.32723-1-jiangshanlai@gmail.com/
> > >
> > > The shadow pages for those 4 page directories are also allocated on demand.
> >
> > Ack. I can even just drop this sentence in v5, it's just background information.
>
> No, if one-off special shadow pages are used.
>
> kvm_mmu_child_role() should be:
>
> +       if (role.has_4_byte_gpte) {
> +               if (role.level == PG_LEVEL_4K)
> +                       role.quadrant = (sptep - parent_sp->spt) % 2;
> +               if (role.level == PG_LEVEL_2M)
> +                       role.quadrant = (sptep - parent_sp->spt) % 4;
> +       }
>
>
> And if one-off special shadow pages are merged first.  You don't
> need any calculation in mmu_alloc_root(), you can just directly use
>     sp = kvm_mmu_get_page(vcpu, gfn, vcpu->arch.mmu->root_role);
> because vcpu->arch.mmu->root_role is always the real role of the root
> sp no matter if it is a normall root sp or an one-off special sp.
>
> I hope you will pardon me for my touting my patchset and asking
> people to review them in your threads.

I see what you mean now. If your series is queued I will rebase on top
with the appropriate changes. But for now I will continue to code
against kvm/queue.

>
> Thanks
> Lai

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 03/20] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-05-12 16:10           ` David Matlack
@ 2022-05-13 18:26             ` David Matlack
  -1 siblings, 0 replies; 120+ messages in thread
From: David Matlack @ 2022-05-13 18:26 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, Peter Xu,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Thu, May 12, 2022 at 09:10:59AM -0700, David Matlack wrote:
> On Mon, May 9, 2022 at 7:58 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
> > On Tue, May 10, 2022 at 5:04 AM David Matlack <dmatlack@google.com> wrote:
> > > On Sat, May 7, 2022 at 1:28 AM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
> > > > On 2022/4/23 05:05, David Matlack wrote:
> > > > > +     /*
> > > > > +      * If the guest has 4-byte PTEs then that means it's using 32-bit,
> > > > > +      * 2-level, non-PAE paging. KVM shadows such guests using 4 PAE page
> > > > > +      * directories, each mapping 1/4 of the guest's linear address space
> > > > > +      * (1GiB). The shadow pages for those 4 page directories are
> > > > > +      * pre-allocated and assigned a separate quadrant in their role.
> > > >
> > > >
> > > > It is not going to be true in patchset:
> > > > [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots
> > > >
> > > > https://lore.kernel.org/lkml/20220503150735.32723-1-jiangshanlai@gmail.com/
> > > >
> > > > The shadow pages for those 4 page directories are also allocated on demand.
> > >
> > > Ack. I can even just drop this sentence in v5, it's just background information.
> >
> > No, if one-off special shadow pages are used.
> >
> > kvm_mmu_child_role() should be:
> >
> > +       if (role.has_4_byte_gpte) {
> > +               if (role.level == PG_LEVEL_4K)
> > +                       role.quadrant = (sptep - parent_sp->spt) % 2;
> > +               if (role.level == PG_LEVEL_2M)
> > +                       role.quadrant = (sptep - parent_sp->spt) % 4;
> > +       }
> >
> >
> > And if one-off special shadow pages are merged first.  You don't
> > need any calculation in mmu_alloc_root(), you can just directly use
> >     sp = kvm_mmu_get_page(vcpu, gfn, vcpu->arch.mmu->root_role);
> > because vcpu->arch.mmu->root_role is always the real role of the root
> > sp no matter if it is a normall root sp or an one-off special sp.
> >
> > I hope you will pardon me for my touting my patchset and asking
> > people to review them in your threads.
> 
> I see what you mean now. If your series is queued I will rebase on top
> with the appropriate changes. But for now I will continue to code
> against kvm/queue.

Here is what I'm going with for v5:

        /*
         * If the guest has 4-byte PTEs then that means it's using 32-bit,
         * 2-level, non-PAE paging. KVM shadows such guests with PAE paging
         * (i.e. 8-byte PTEs). The difference in PTE size means that
         * KVM must shadow each guest page table with multiple shadow page
         * tables, which requires extra bookkeeping in the role.
         *
         * Specifically, to shadow the guest's page directory (which covers a
         * 4GiB address space), KVM uses 4 PAE page directories, each mapping
         * 1GiB of the address space. @role.quadrant encodes which quarter of
         * the address space each maps.
         *
         * To shadow the guest's page tables (which each map a 4MiB region),
         * KVM uses 2 PAE page tables, each mapping a 2MiB region. For these,
         * @role.quadrant encodes which half of the region they map.
         *
         * Note, the 4 PAE page directories are pre-allocated and the quadrant
         * assigned in mmu_alloc_root(). So only page tables need to be handled
         * here.
         */
        if (role.has_4_byte_gpte) {
                WARN_ON_ONCE(role.level != PG_LEVEL_4K);
                role.quadrant = (sptep - parent_sp->spt) % 2;
        }

Then to make it work with your series we can just apply this diff:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f7c4f08e8a69..0e0e2da2f37d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2131,14 +2131,10 @@ static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 a
         * To shadow the guest's page tables (which each map a 4MiB region),
         * KVM uses 2 PAE page tables, each mapping a 2MiB region. For these,
         * @role.quadrant encodes which half of the region they map.
-        *
-        * Note, the 4 PAE page directories are pre-allocated and the quadrant
-        * assigned in mmu_alloc_root(). So only page tables need to be handled
-        * here.
         */
        if (role.has_4_byte_gpte) {
-               WARN_ON_ONCE(role.level != PG_LEVEL_4K);
-               role.quadrant = (sptep - parent_sp->spt) % 2;
+               WARN_ON_ONCE(role.level > PG_LEVEL_2M);
+               role.quadrant = (sptep - parent_sp->spt) % (1 << role.level);
        }

        return role;
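
As a quick sanity check, a throwaway snippet, assuming the usual KVM x86
values PG_LEVEL_4K == 1 and PG_LEVEL_2M == 2, showing that the generalized
modulus reproduces the two quadrant counts discussed above:

    #include <assert.h>

    #define PG_LEVEL_4K 1
    #define PG_LEVEL_2M 2

    int main(void)
    {
            /* 1 << level is the number of quadrants at each level. */
            assert((1 << PG_LEVEL_4K) == 2); /* 2 halves of a 4MiB guest page table */
            assert((1 << PG_LEVEL_2M) == 4); /* 4 quarters of the 4GiB guest address space */
            return 0;
    }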

If your series is queued first, I can resend a v6 with this change or Paolo can
apply it. If mine is queued first then you can include this as part of your
series.

^ permalink raw reply related	[flat|nested] 120+ messages in thread

end of thread, other threads:[~2022-05-14 10:09 UTC | newest]

Thread overview: 120+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-22 21:05 [PATCH v4 00/20] KVM: Extend Eager Page Splitting to the shadow MMU David Matlack
2022-04-22 21:05 ` David Matlack
2022-04-22 21:05 ` [PATCH v4 01/20] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs David Matlack
2022-04-22 21:05   ` David Matlack
2022-05-07  7:46   ` Lai Jiangshan
2022-05-07  7:46     ` Lai Jiangshan
2022-04-22 21:05 ` [PATCH v4 02/20] KVM: x86/mmu: Use a bool for direct David Matlack
2022-04-22 21:05   ` David Matlack
2022-05-07  7:46   ` Lai Jiangshan
2022-05-07  7:46     ` Lai Jiangshan
2022-04-22 21:05 ` [PATCH v4 03/20] KVM: x86/mmu: Derive shadow MMU page role from parent David Matlack
2022-04-22 21:05   ` David Matlack
2022-05-05 21:50   ` Sean Christopherson
2022-05-05 21:50     ` Sean Christopherson
2022-05-09 22:10     ` David Matlack
2022-05-09 22:10       ` David Matlack
2022-05-10  2:38       ` Lai Jiangshan
2022-05-10  2:38         ` Lai Jiangshan
2022-05-07  8:28   ` Lai Jiangshan
2022-05-07  8:28     ` Lai Jiangshan
2022-05-09 21:04     ` David Matlack
2022-05-09 21:04       ` David Matlack
2022-05-10  2:58       ` Lai Jiangshan
2022-05-10  2:58         ` Lai Jiangshan
2022-05-10 13:31         ` Sean Christopherson
2022-05-10 13:31           ` Sean Christopherson
2022-05-12 16:10         ` David Matlack
2022-05-12 16:10           ` David Matlack
2022-05-13 18:26           ` David Matlack
2022-05-13 18:26             ` David Matlack
2022-04-22 21:05 ` [PATCH v4 04/20] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions David Matlack
2022-04-22 21:05   ` David Matlack
2022-05-05 21:58   ` Sean Christopherson
2022-05-05 21:58     ` Sean Christopherson
2022-04-22 21:05 ` [PATCH v4 05/20] KVM: x86/mmu: Consolidate shadow page allocation and initialization David Matlack
2022-04-22 21:05   ` David Matlack
2022-05-05 22:10   ` Sean Christopherson
2022-05-05 22:10     ` Sean Christopherson
2022-05-09 20:53     ` David Matlack
2022-05-09 20:53       ` David Matlack
2022-04-22 21:05 ` [PATCH v4 06/20] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages David Matlack
2022-04-22 21:05   ` David Matlack
2022-05-05 22:15   ` Sean Christopherson
2022-05-05 22:15     ` Sean Christopherson
2022-04-22 21:05 ` [PATCH v4 07/20] KVM: x86/mmu: Move guest PT write-protection to account_shadowed() David Matlack
2022-04-22 21:05   ` David Matlack
2022-05-05 22:51   ` Sean Christopherson
2022-05-05 22:51     ` Sean Christopherson
2022-05-09 21:18     ` David Matlack
2022-05-09 21:18       ` David Matlack
2022-04-22 21:05 ` [PATCH v4 08/20] KVM: x86/mmu: Pass memory caches to allocate SPs separately David Matlack
2022-04-22 21:05   ` David Matlack
2022-05-05 23:00   ` Sean Christopherson
2022-05-05 23:00     ` Sean Christopherson
2022-04-22 21:05 ` [PATCH v4 09/20] KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page() David Matlack
2022-04-22 21:05   ` David Matlack
2022-04-22 21:05 ` [PATCH v4 10/20] KVM: x86/mmu: Pass kvm pointer separately from vcpu to kvm_mmu_find_shadow_page() David Matlack
2022-04-22 21:05   ` David Matlack
2022-04-22 21:05 ` [PATCH v4 11/20] KVM: x86/mmu: Allow for NULL vcpu pointer in __kvm_mmu_get_shadow_page() David Matlack
2022-04-22 21:05   ` David Matlack
2022-05-05 23:33   ` Sean Christopherson
2022-05-05 23:33     ` Sean Christopherson
2022-05-09 21:26     ` David Matlack
2022-05-09 21:26       ` David Matlack
2022-05-09 22:56       ` Sean Christopherson
2022-05-09 22:56         ` Sean Christopherson
2022-05-09 23:59         ` David Matlack
2022-05-09 23:59           ` David Matlack
2022-04-22 21:05 ` [PATCH v4 12/20] KVM: x86/mmu: Pass const memslot to rmap_add() David Matlack
2022-04-22 21:05   ` David Matlack
2022-04-22 21:05 ` [PATCH v4 13/20] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu David Matlack
2022-04-22 21:05   ` David Matlack
2022-05-05 23:46   ` Sean Christopherson
2022-05-05 23:46     ` Sean Christopherson
2022-05-09 21:27     ` David Matlack
2022-05-09 21:27       ` David Matlack
2022-04-22 21:05 ` [PATCH v4 14/20] KVM: x86/mmu: Update page stats in __rmap_add() David Matlack
2022-04-22 21:05   ` David Matlack
2022-04-22 21:05 ` [PATCH v4 15/20] KVM: x86/mmu: Cache the access bits of shadowed translations David Matlack
2022-04-22 21:05   ` David Matlack
2022-05-06 19:47   ` Sean Christopherson
2022-05-06 19:47     ` Sean Christopherson
2022-05-09 16:10   ` Sean Christopherson
2022-05-09 16:10     ` Sean Christopherson
2022-05-09 21:29     ` David Matlack
2022-05-09 21:29       ` David Matlack
2022-04-22 21:05 ` [PATCH v4 16/20] KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU David Matlack
2022-04-22 21:05   ` David Matlack
2022-05-09 16:22   ` Sean Christopherson
2022-05-09 16:22     ` Sean Christopherson
2022-05-09 21:31     ` David Matlack
2022-05-09 21:31       ` David Matlack
2022-04-22 21:05 ` [PATCH v4 17/20] KVM: x86/mmu: Zap collapsible SPTEs at all levels in " David Matlack
2022-04-22 21:05   ` David Matlack
2022-05-09 16:31   ` Sean Christopherson
2022-05-09 16:31     ` Sean Christopherson
2022-05-09 21:34     ` David Matlack
2022-05-09 21:34       ` David Matlack
2022-04-22 21:05 ` [PATCH v4 18/20] KVM: x86/mmu: Refactor drop_large_spte() David Matlack
2022-04-22 21:05   ` David Matlack
2022-05-09 16:36   ` Sean Christopherson
2022-05-09 16:36     ` Sean Christopherson
2022-04-22 21:05 ` [PATCH v4 19/20] KVM: Allow for different capacities in kvm_mmu_memory_cache structs David Matlack
2022-04-22 21:05   ` David Matlack
2022-04-23  8:08   ` kernel test robot
2022-04-23  8:08     ` kernel test robot
2022-04-24 15:21   ` kernel test robot
2022-04-24 15:21     ` kernel test robot
2022-04-22 21:05 ` [PATCH v4 20/20] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs David Matlack
2022-04-22 21:05   ` David Matlack
2022-05-07  7:51   ` Lai Jiangshan
2022-05-07  7:51     ` Lai Jiangshan
2022-05-09 21:40     ` David Matlack
2022-05-09 21:40       ` David Matlack
2022-05-09 16:48   ` Sean Christopherson
2022-05-09 16:48     ` Sean Christopherson
2022-05-09 21:44     ` David Matlack
2022-05-09 21:44       ` David Matlack
2022-05-09 22:47       ` Sean Christopherson
2022-05-09 22:47         ` Sean Christopherson
