* [PATCH v6 00/22] KVM: Extend Eager Page Splitting to the shadow MMU
@ 2022-05-16 23:21 ` David Matlack
  0 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

[ This is a small incremental revision on top of v5. I'm sending this now
  rather than waiting longer, as I would typically do, since the 5.19 merge
  window is fast approaching. ]

This series extends KVM's Eager Page Splitting to also split huge pages
mapped by the shadow MMU, specifically **nested MMUs**.

For background on Eager Page Splitting, see:
 - Proposal: https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/
 - TDP MMU support: https://lore.kernel.org/kvm/20220119230739.2234394-1-dmatlack@google.com/

Splitting huge pages mapped by the shadow MMU is more complicated than
in the TDP MMU, but it is also more important for performance as the shadow
MMU handles huge page write-protection faults under the write lock.  See
the Performance section for more details.

The extra complexity of splitting huge pages mapped by the shadow MMU
comes from a few places:

(1) The shadow MMU has a limit on the number of shadow pages that are
    allowed to be allocated. So, as a policy, Eager Page Splitting
    refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
    pages available.

(2) Huge pages may be mapped by indirect shadow pages which may have access
    permission constraints from the guest (unlike the TDP MMU which is
    ACC_ALL by default).

(3) Splitting a huge page may end up re-using an existing lower-level
    shadow page table. This is unlike the TDP MMU, which always allocates
    new shadow page tables when splitting.

(4) When installing the lower level SPTEs, they must be added to the
    rmap which may require allocating additional pte_list_desc structs.

In Google's internal implementation of Eager Page Splitting, we do not
handle cases (3) and (4), and instead opt to skip splitting entirely
(case 3) or to split only partially (case 4). This series handles the
additional cases, which requires an additional 4KiB of memory per VM to
store the extra pte_list_desc cache. However, it also avoids the need
for TLB flushes in most cases and allows KVM to split more pages mapped
by shadow paging.
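
To make (1) and (4) concrete, the shape of the split-eligibility check
is sketched below. This is illustrative only and does not reproduce the
code in this series: kvm_mmu_available_pages() and
KVM_MIN_FREE_MMU_PAGES exist in KVM today, while
need_topup_split_caches() is a hypothetical stand-in for the cache
top-up logic added by the final patch.

  /* Illustrative sketch only, not code from this series. */
  static bool eager_split_possible(struct kvm *kvm)
  {
  	/* (1): leave headroom below the shadow page limit for vCPU faults. */
  	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
  		return false;

  	/* (4): the rmap may need new pte_list_desc entries for the new SPTEs. */
  	return !need_topup_split_caches(kvm);	/* hypothetical helper */
  }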

The bulk of this series is just refactoring the existing MMU code in
preparation for splitting, specifically to make it possible to operate
on the MMU outside of a vCPU context.

Motivation
----------

During dirty logging, VMs using the shadow MMU suffer from:

(1) Write-protection faults on huge pages that take the MMU lock to
    unmap the huge page, map a 4KiB page, and update the dirty log.

(2) Non-present faults caused by (1) that take the MMU lock to map in
    the missing page.

(3) Write-protection faults on 4KiB pages that take the MMU lock to
    make the page writable and update the dirty log. [Note: These faults
    only take the MMU lock during shadow paging.]

The lock contention from (1), (2) and (3) can severely degrade
application performance to the point of failure.  Eager page splitting
eliminates (1) by moving the splitting of huge pages off the vCPU
threads onto the thread invoking VM-ioctls to configure dirty logging,
and eliminates (2) by fully splitting each huge page into its
constituent small pages. (3) is still a concern for shadow paging
workloads (e.g. nested virtualization) but is not addressed by this
series.

Splitting in the VM-ioctl thread is useful because it can run in the
background without interrupting vCPU execution. However, it does take
the MMU lock so it may introduce some extra contention if vCPUs are
hammering the MMU lock. This is offset by the fact that eager page
splitting drops the MMU lock after splitting each SPTE if there is any
contention, and the fact that eager page splitting is reducing the MMU
lock contention from (1) and (2) above. Even workloads that only write
to 5% of their memory see massive MMU lock contention reduction during
dirty logging thanks to Eager Page Splitting (see Performance data
below).
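
The yield-on-contention behavior mentioned above follows the same
pattern KVM already uses when walking SPTEs under the write lock. The
loose sketch below is illustrative only: the loop and split helper are
hypothetical placeholders, while need_resched(), rwlock_needbreak(),
and cond_resched_rwlock_write() are the real primitives involved.

  /* Illustrative sketch only; the iterator and split helper are hypothetical. */
  for_each_huge_sptep_in_slot(kvm, slot, sptep) {		/* hypothetical */
  	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
  		cond_resched_rwlock_write(&kvm->mmu_lock);

  	split_one_huge_spte(kvm, sptep);			/* hypothetical */
  }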

A downside of Eager Page Splitting is that it splits all huge pages,
which may include ranges of memory that are never written to by the
guest and thus could theoretically stay huge. Workloads that write to
only a fraction of their memory may see higher TLB miss costs with Eager
Page Splitting enabled. However, that is secondary to the application
failure that otherwise may occur without Eager Page Splitting.

Further work is necessary to improve the TLB miss performance for
read-heavy workloads, such as dirty logging at 2M instead of 4K.

Performance
-----------

To measure the performance impact of Eager Page Splitting I ran
dirty_log_perf_test with support for a new flag, -n, that causes each vCPU
thread to run in L2 instead of L1. This support will be sent out in a
separate series.

To measure the impact on customer performance, we can look at the time
it takes all vCPUs to dirty memory after dirty logging has been enabled.
Without Eager Page Splitting enabled, such dirtying must take faults to
split huge pages and bottleneck on the MMU lock.

For write-heavy workloads, there is not as much benefit since nested MMUs
still have to take the write-lock when resolving 4K write-protection
faults (case (3) in the Motivation section). But read-heavy workloads
greatly benefit.

             | Config: tdp_mmu=Y, nested, 100% writes                  |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.367445635s               | 0.359880160s               |
4            | 0.503976497s               | 0.418760595s               |
8            | 1.328792652s               | 1.442455382s               |
16           | 4.609457301s               | 3.649754574s               |
32           | 8.751328485s               | 7.659014140s               |
64           | 20.438482174s              | 17.890019577s              |

             | Config: tdp_mmu=Y, nested, 50% writes                   |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.374082549s               | 0.189881327s               |
4            | 0.498175012s               | 0.216221200s               |
8            | 1.848155856s               | 0.525316794s               |
16           | 4.387725630s               | 1.844867390s               |
32           | 9.153260046s               | 4.061645844s               |
64           | 20.077600588s              | 8.825413269s               |

             | Config: tdp_mmu=Y, nested, 5% writes                    |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.386395635s               | 0.023315599s               |
4            | 0.495352933s               | 0.024971794s               |
8            | 1.568730321s               | 0.052010563s               |
16           | 4.258323166s               | 0.174402708s               |
32           | 9.260176347s               | 0.377929203s               |
64           | 19.861473882s              | 0.905998574s               |

Eager Page Splitting does increase the time it takes to enable dirty
logging when not using initially-all-set, since that's when KVM splits
huge pages. However, this runs in parallel with vCPU execution and drops
the MMU lock whenever there is contention.

             | Config: tdp_mmu=Y, nested, 100% writes                  |
             | Enabling dirty logging time                             |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.001330088s               | 0.018624938s               |
4            | 0.002763111s               | 0.037247815s               |
8            | 0.005220762s               | 0.074637543s               |
16           | 0.010381925s               | 0.149096917s               |
32           | 0.022109466s               | 0.307983859s               |
64           | 0.085547182s               | 0.854228170s               |
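
For reference, the "Enabling dirty logging time" above is time spent in
the standard memslot update that turns on dirty logging. The sketch
below shows that path using only the generic KVM userspace API, nothing
specific to this series; vm_fd and the slot geometry are assumed to
come from normal VM setup, and error handling is omitted.

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /* Minimal sketch of enabling dirty logging on an existing slot. */
  static void enable_dirty_logging(int vm_fd, __u64 gpa, __u64 hva, __u64 size)
  {
  	struct kvm_userspace_memory_region region = {
  		.slot		 = 0,
  		.flags		 = KVM_MEM_LOG_DIRTY_PAGES,
  		.guest_phys_addr = gpa,
  		.memory_size	 = size,
  		.userspace_addr	 = hva,
  	};

  	/*
  	 * With eager_page_split=Y and without initially-all-set, this is
  	 * the ioctl in which KVM eagerly splits the slot's huge pages.
  	 */
  	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
  }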

Similarly, Eager Page Splitting increases the time it takes to clear the
dirty log when using initially-all-set. The first time userspace
clears the dirty log, KVM will split huge pages:

             | Config: tdp_mmu=Y, nested, 100% writes initially-all-set |
             | Iteration 1 clear dirty log time                        |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.001947098s               | 0.019836052s               |
4            | 0.003817996s               | 0.039574178s               |
8            | 0.007673616s               | 0.079118964s               |
16           | 0.015733003s               | 0.158006697s               |
32           | 0.031728367s               | 0.330793049s               |
64           | 0.108699714s               | 0.891762988s               |
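
For context, initially-all-set is opted into via
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2, and the numbers above correspond to
the first KVM_CLEAR_DIRTY_LOG call on each slot. A minimal sketch of
that flow using the generic KVM userspace API follows; vm_fd,
slot_pages, and the bitmap allocation are assumed, and this continues
the setup shown earlier.

  /* Opt in to manual dirty-log protection with initially-all-set. */
  struct kvm_enable_cap cap = {
  	.cap	 = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2,
  	.args[0] = KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE |
  		   KVM_DIRTY_LOG_INITIALLY_SET,
  };
  ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

  /*
   * The first KVM_CLEAR_DIRTY_LOG on a slot is where KVM splits huge
   * pages when eager_page_split=Y and initially-all-set is in use.
   */
  struct kvm_clear_dirty_log clear = {
  	.slot	      = 0,
  	.first_page   = 0,
  	.num_pages    = slot_pages,
  	.dirty_bitmap = bitmap,		/* bits set for the pages to clear */
  };
  ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);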

Subsequent calls to clear the dirty log incur almost no additional cost
since KVM can very quickly determine there are no more huge pages to
split via the RMAP. This is unlike the TDP MMU which must re-traverse
the entire page table to check for huge pages.

             | Config: tdp_mmu=Y, nested, 100% writes initially-all-set |
             | Iteration 2 clear dirty log time                        |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.009585296s               | 0.009931437s               |
4            | 0.019188984s               | 0.019842738s               |
8            | 0.038568630s               | 0.039951832s               |
16           | 0.077188525s               | 0.079536780s               |
32           | 0.156728329s               | 0.163612725s               |
64           | 0.418679324s               | 0.337336844s               |

Testing
-------

 - Ran all kvm-unit-tests and KVM selftests.

 - Booted a 32-bit non-PAE kernel with shadow paging to verify the
   quadrant change.

 - Ran dirty_log_perf_test with support for a new flag, -n, that causes
   each vCPU thread to run in L2 instead of L1. This support will be
   sent out in a separate series.

 - Tested VM live migration with nested MMUs and huge pages. The live
   migration setup consisted of an 8 vCPU 8 GiB VM running on an Intel
   Cascade Lake host and backed by 1GiB HugeTLBFS memory.  The VM was
   running Debian 10.  Inside that VM was a 6 vCPU 4GiB nested VM, also
   running Debian 10 and backed by 2MiB HugeTLBFS. Inside the nested VM
   ran a workload that aggressively accessed memory across 6 threads.
   Tracepoints during the migration confirmed that eager page splitting
   occurred, both for the direct TDP MMU mappings and the nested MMU
   mappings.

Version Log
-----------

v6:
 - Collect R-b tag from Marc.
 - Initialize memory cache capacity during top-up [Sean]
 - Eliminate redundant role overrides in mmu_alloc_root() [Lai]

v5: https://lore.kernel.org/kvm/20220513202819.829591-1-dmatlack@google.com/
 - Rebase on top of latest kvm/queue.
 - Collected R-b tags from Sean and Lai.
 - Add another patch to stop passing non-zero quadrant [Sean]
 - Drop vcpu_or_null and __kvm_sync_page() [Sean]
 - Formatting and wording changes [Sean]
 - Pass role instead of sp when making huge split SPTEs [Sean]
 - Fix riscv compilation error [kernel test robot]
 - Document split caches protected by slots_lock [Lai]

v4: https://lore.kernel.org/kvm/20220422210546.458943-1-dmatlack@google.com/
 - Limit eager page splitting to nested MMUs [Sean]
 - Use memory caches for SP allocation [Sean]
 - Use kvm_mmu_get_page() with NULL vCPU for EPS [Sean]
 - Use u64 instead of bit field for shadow translation entry [Sean]
 - Add Sean's R-b to "Use a bool" patch.
 - Fix printf warning in "Cache access bits" patch.
 - Fix asymmetrical pr_err_ratelimited() + WARN() [Sean]
 - Drop unnecessary unsync check for huge pages [Sean]
 - Eliminate use of we in comments and change logs [Sean]
 - Allocate objects arrays dynamically [Ben]

v3: https://lore.kernel.org/kvm/20220401175554.1931568-1-dmatlack@google.com/
 - Add R-b tags from Peter.
 - Explain direct SPs in indirect MMUs in commit message [Peter]
 - Change BUG_ON() to WARN_ON_ONCE() in quadrant calculation [me]
 - Eliminate unnecessary gotos [Peter]
 - Drop mmu_alloc_pte_list_desc() [Peter]
 - Also update access cache in mmu_set_spte() if was_rmapped [Peter]
 - Fix number of gfn bits in shadowed_translation cache [Peter]
 - Pass sp to make_huge_page_split_spte() to derive level and exec [me]
 - Eliminate flush var in kvm_rmap_zap_collapsible_sptes() [Peter]
 - Drop NULL pte_list_desc cache fallback [Peter]
 - Fix get_access to return sp->role.access. [me]
 - Re-use split cache across calls to CLEAR_DIRTY_LOG for better perf [me]
 - Top-up the split cache outside of the MMU lock when possible [me]
 - Refactor prepare_to_split_huge_page() into try_split_huge_page() [me]
 - Collapse PATCH 20, 23, and 24 to avoid intermediate complexity [Peter]
 - Update the RISC-V function stage2_ioremap() [Anup]

v2: https://lore.kernel.org/kvm/20220311002528.2230172-1-dmatlack@google.com/
 - Add performance data for workloads that mix reads and writes [Peter]
 - Collect R-b tags from Ben and Sean.
 - Fix quadrant calculation when deriving role from parent [Sean]
 - Tweak new shadow page function names [Sean]
 - Move set_page_private() to allocation functions [Ben]
 - Only zap collapsible SPTEs up to MAX_LEVEL-1 [Ben]
 - Always top-up pte_list_desc cache to reduce complexity [Ben]
 - Require mmu cache capacity field to be initialized and add WARN()
   to reduce chance of programmer error [Marc]
 - Fix up kvm_mmu_memory_cache struct initialization in arm64 [Marc]

v1: https://lore.kernel.org/kvm/20220203010051.2813563-1-dmatlack@google.com/

David Matlack (22):
  KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
  KVM: x86/mmu: Use a bool for direct
  KVM: x86/mmu: Stop passing @direct to mmu_alloc_root()
  KVM: x86/mmu: Derive shadow MMU page role from parent
  KVM: x86/mmu: Always pass 0 for @quadrant when gptes are 8 bytes
  KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  KVM: x86/mmu: Consolidate shadow page allocation and initialization
  KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
  KVM: x86/mmu: Move guest PT write-protection to account_shadowed()
  KVM: x86/mmu: Pass memory caches to allocate SPs separately
  KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page()
  KVM: x86/mmu: Pass kvm pointer separately from vcpu to
    kvm_mmu_find_shadow_page()
  KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page()
  KVM: x86/mmu: Pass const memslot to rmap_add()
  KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  KVM: x86/mmu: Update page stats in __rmap_add()
  KVM: x86/mmu: Cache the access bits of shadowed translations
  KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU
  KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible
    levels
  KVM: x86/mmu: Refactor drop_large_spte()
  KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs

 .../admin-guide/kernel-parameters.txt         |   3 +-
 arch/arm64/kvm/mmu.c                          |   2 +-
 arch/riscv/kvm/mmu.c                          |   5 +-
 arch/x86/include/asm/kvm_host.h               |  26 +-
 arch/x86/kvm/mmu/mmu.c                        | 693 ++++++++++++++----
 arch/x86/kvm/mmu/mmu_internal.h               |  17 +-
 arch/x86/kvm/mmu/paging_tmpl.h                |  17 +-
 arch/x86/kvm/mmu/spte.c                       |  16 +-
 arch/x86/kvm/mmu/spte.h                       |   2 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    |   2 +-
 arch/x86/kvm/x86.c                            |   6 +
 include/linux/kvm_host.h                      |   1 +
 include/linux/kvm_types.h                     |   6 +-
 virt/kvm/kvm_main.c                           |  33 +-
 14 files changed, 658 insertions(+), 171 deletions(-)


base-commit: a3808d88461270c71d3fece5e51cc486ecdac7d0
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply	[flat|nested] 111+ messages in thread


* [PATCH v6 01/22] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

Commit fb58a9c345f6 ("KVM: x86/mmu: Optimize MMU page cache lookup for
fully direct MMUs") skipped the unsync checks and write flood clearing
for fully direct MMUs. We can extend this further to skip the checks for
all direct shadow pages. Direct shadow pages in indirect MMUs (i.e.
shadow paging) are used when shadowing a guest huge page with smaller
pages. Such direct shadow pages, like their counterparts in fully direct
MMUs, are never marked unsync and never have a non-zero write-flooding
count.

Checking sp->role.direct also generates better code than checking
direct_map because, due to register pressure, direct_map has to get
shoved onto the stack and then pulled back off.

No functional change intended.

Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index efe5a3dca1e0..774810d8a2ed 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2026,7 +2026,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     int direct,
 					     unsigned int access)
 {
-	bool direct_mmu = vcpu->arch.mmu->root_role.direct;
 	union kvm_mmu_page_role role;
 	struct hlist_head *sp_list;
 	unsigned quadrant;
@@ -2070,7 +2069,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			continue;
 		}
 
-		if (direct_mmu)
+		/* unsync and write-flooding only apply to indirect SPs. */
+		if (sp->role.direct)
 			goto trace_get_page;
 
 		if (sp->unsync) {
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 111+ messages in thread


* [PATCH v6 02/22] KVM: x86/mmu: Use a bool for direct
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

The parameter "direct" can either be true or false, and all of the
callers pass in a bool variable or true/false literal, so just use the
type bool.

No functional change intended.

Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 774810d8a2ed..34fb0cddff2b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1690,7 +1690,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 	mmu_spte_clear_no_track(parent_pte);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
+static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, bool direct)
 {
 	struct kvm_mmu_page *sp;
 
@@ -2023,7 +2023,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     gfn_t gfn,
 					     gva_t gaddr,
 					     unsigned level,
-					     int direct,
+					     bool direct,
 					     unsigned int access)
 {
 	union kvm_mmu_page_role role;
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 111+ messages in thread


* [PATCH v6 03/22] KVM: x86/mmu: Stop passing @direct to mmu_alloc_root()
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

The argument @direct is vcpu->arch.mmu->root_role.direct, so just use
that.

Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 34fb0cddff2b..a9d28bcabcbb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3370,8 +3370,9 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
 }
 
 static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
-			    u8 level, bool direct)
+			    u8 level)
 {
+	bool direct = vcpu->arch.mmu->root_role.direct;
 	struct kvm_mmu_page *sp;
 
 	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
@@ -3397,7 +3398,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 		root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
 		mmu->root.hpa = root;
 	} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
-		root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
+		root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
 		mmu->root.hpa = root;
 	} else if (shadow_root_level == PT32E_ROOT_LEVEL) {
 		if (WARN_ON_ONCE(!mmu->pae_root)) {
@@ -3409,7 +3410,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 			WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
 
 			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
-					      i << 30, PT32_ROOT_LEVEL, true);
+					      i << 30, PT32_ROOT_LEVEL);
 			mmu->pae_root[i] = root | PT_PRESENT_MASK |
 					   shadow_me_mask;
 		}
@@ -3533,7 +3534,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	 */
 	if (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
 		root = mmu_alloc_root(vcpu, root_gfn, 0,
-				      mmu->root_role.level, false);
+				      mmu->root_role.level);
 		mmu->root.hpa = root;
 		goto set_root_pgd;
 	}
@@ -3579,7 +3580,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 		}
 
 		root = mmu_alloc_root(vcpu, root_gfn, i << 30,
-				      PT32_ROOT_LEVEL, false);
+				      PT32_ROOT_LEVEL);
 		mmu->pae_root[i] = root | pm_mask;
 	}
 
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 111+ messages in thread


* [PATCH v6 04/22] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

Instead of computing the shadow page role from scratch for every new
page, derive most of the information from the parent shadow page.  This
eliminates the dependency on the vCPU root role to allocate shadow page
tables, and reduces the number of parameters to kvm_mmu_get_page().

Preemptively split out the role calculation to a separate function for
use in a following commit.

Note that when calculating the MMU root role, we can take
@role.passthrough, @role.direct, and @role.access directly from
@vcpu->arch.mmu->root_role. Only @role.level and @role.quadrant still
must be overridden for PAE page directories.

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c         | 98 +++++++++++++++++++++++-----------
 arch/x86/kvm/mmu/paging_tmpl.h |  9 ++--
 2 files changed, 71 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a9d28bcabcbb..515e0b33144a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2019,33 +2019,15 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
-					     gfn_t gfn,
-					     gva_t gaddr,
-					     unsigned level,
-					     bool direct,
-					     unsigned int access)
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+					     union kvm_mmu_page_role role)
 {
-	union kvm_mmu_page_role role;
 	struct hlist_head *sp_list;
-	unsigned quadrant;
 	struct kvm_mmu_page *sp;
 	int ret;
 	int collisions = 0;
 	LIST_HEAD(invalid_list);
 
-	role = vcpu->arch.mmu->root_role;
-	role.level = level;
-	role.direct = direct;
-	role.access = access;
-	if (role.has_4_byte_gpte) {
-		quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
-		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
-		role.quadrant = quadrant;
-	}
-	if (level <= vcpu->arch.mmu->cpu_role.base.level)
-		role.passthrough = 0;
-
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
 		if (sp->gfn != gfn) {
@@ -2063,7 +2045,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			 * Unsync pages must not be left as is, because the new
 			 * upper-level page will be write-protected.
 			 */
-			if (level > PG_LEVEL_4K && sp->unsync)
+			if (role.level > PG_LEVEL_4K && sp->unsync)
 				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 							 &invalid_list);
 			continue;
@@ -2104,14 +2086,14 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 
 	++vcpu->kvm->stat.mmu_cache_miss;
 
-	sp = kvm_mmu_alloc_page(vcpu, direct);
+	sp = kvm_mmu_alloc_page(vcpu, role.direct);
 
 	sp->gfn = gfn;
 	sp->role = role;
 	hlist_add_head(&sp->hash_link, sp_list);
 	if (sp_has_gptes(sp)) {
 		account_shadowed(vcpu->kvm, sp);
-		if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
+		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
 	}
 	trace_kvm_mmu_get_page(sp, true);
@@ -2123,6 +2105,55 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
+static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
+{
+	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
+	union kvm_mmu_page_role role;
+
+	role = parent_sp->role;
+	role.level--;
+	role.access = access;
+	role.direct = direct;
+	role.passthrough = 0;
+
+	/*
+	 * If the guest has 4-byte PTEs then that means it's using 32-bit,
+	 * 2-level, non-PAE paging. KVM shadows such guests with PAE paging
+	 * (i.e. 8-byte PTEs). The difference in PTE size means that KVM must
+	 * shadow each guest page table with multiple shadow page tables, which
+	 * requires extra bookkeeping in the role.
+	 *
+	 * Specifically, to shadow the guest's page directory (which covers a
+	 * 4GiB address space), KVM uses 4 PAE page directories, each mapping
+	 * 1GiB of the address space. @role.quadrant encodes which quarter of
+	 * the address space each maps.
+	 *
+	 * To shadow the guest's page tables (which each map a 4MiB region), KVM
+	 * uses 2 PAE page tables, each mapping a 2MiB region. For these,
+	 * @role.quadrant encodes which half of the region they map.
+	 *
+	 * Note, the 4 PAE page directories are pre-allocated and the quadrant
+	 * assigned in mmu_alloc_root(). So only page tables need to be handled
+	 * here.
+	 */
+	if (role.has_4_byte_gpte) {
+		WARN_ON_ONCE(role.level != PG_LEVEL_4K);
+		role.quadrant = (sptep - parent_sp->spt) % 2;
+	}
+
+	return role;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
+						 u64 *sptep, gfn_t gfn,
+						 bool direct, u32 access)
+{
+	union kvm_mmu_page_role role;
+
+	role = kvm_mmu_child_role(sptep, direct, access);
+	return kvm_mmu_get_page(vcpu, gfn, role);
+}
+
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
 					struct kvm_vcpu *vcpu, hpa_t root,
 					u64 addr)
@@ -2965,8 +2996,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (is_shadow_present_pte(*it.sptep))
 			continue;
 
-		sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
-				      it.level - 1, true, ACC_ALL);
+		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
 
 		link_shadow_page(vcpu, it.sptep, sp);
 		if (fault->is_tdp && fault->huge_page_disallowed &&
@@ -3369,13 +3399,18 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
 	return ret;
 }
 
-static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
+static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
 			    u8 level)
 {
-	bool direct = vcpu->arch.mmu->root_role.direct;
+	union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
 	struct kvm_mmu_page *sp;
 
-	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
+	role.level = level;
+
+	if (role.has_4_byte_gpte)
+		role.quadrant = quadrant;
+
+	sp = kvm_mmu_get_page(vcpu, gfn, role);
 	++sp->root_count;
 
 	return __pa(sp->spt);
@@ -3409,8 +3444,8 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 		for (i = 0; i < 4; ++i) {
 			WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
 
-			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
-					      i << 30, PT32_ROOT_LEVEL);
+			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), i,
+					      PT32_ROOT_LEVEL);
 			mmu->pae_root[i] = root | PT_PRESENT_MASK |
 					   shadow_me_mask;
 		}
@@ -3579,8 +3614,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 			root_gfn = pdptrs[i] >> PAGE_SHIFT;
 		}
 
-		root = mmu_alloc_root(vcpu, root_gfn, i << 30,
-				      PT32_ROOT_LEVEL);
+		root = mmu_alloc_root(vcpu, root_gfn, i, PT32_ROOT_LEVEL);
 		mmu->pae_root[i] = root | pm_mask;
 	}
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index db80f7ccaa4e..fd73c857af90 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -648,8 +648,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		if (!is_shadow_present_pte(*it.sptep)) {
 			table_gfn = gw->table_gfn[it.level - 2];
 			access = gw->pt_access[it.level - 2];
-			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
-					      it.level-1, false, access);
+			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
+						  false, access);
+
 			/*
 			 * We must synchronize the pagetable before linking it
 			 * because the guest doesn't need to flush tlb when
@@ -705,8 +706,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		drop_large_spte(vcpu, it.sptep);
 
 		if (!is_shadow_present_pte(*it.sptep)) {
-			sp = kvm_mmu_get_page(vcpu, base_gfn, fault->addr,
-					      it.level - 1, true, direct_access);
+			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
+						  true, direct_access);
 			link_shadow_page(vcpu, it.sptep, sp);
 			if (fault->huge_page_disallowed &&
 			    fault->req_level >= it.level)
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 111+ messages in thread
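
The quadrant bookkeeping described in the kvm_mmu_child_role() comment above
reduces to simple address arithmetic. Below is a minimal standalone sketch
(plain userspace C, not kernel code) of what the old gaddr-based computation
works out to when a 32-bit non-PAE guest is shadowed with PAE paging; the
constants 12, 9 and 10 stand in for PAGE_SHIFT, PT64_PT_BITS and PT32_PT_BITS,
and example_quadrant() is an invented name used only for illustration.

#include <stdio.h>

/*
 * Quadrant of a guest address for a 32-bit non-PAE guest (4-byte gptes)
 * shadowed with PAE paging (8-byte sptes). level 2 = page directory,
 * level 1 = page table.
 */
static unsigned int example_quadrant(unsigned long long gaddr, int level)
{
	unsigned int quadrant = gaddr >> (12 + 9 * level);

	/* Keep (PT32_PT_BITS - PT64_PT_BITS) * level low bits. */
	return quadrant & ((1u << (1 * level)) - 1);
}

int main(void)
{
	unsigned long long gaddr = 0xc0a00000ull;	/* just above 3 GiB */

	/* 4 PAE page directories shadow the 4 GiB guest PD: quarter 3. */
	printf("PD quadrant: %u\n", example_quadrant(gaddr, 2));

	/* 2 PAE page tables shadow each 4 MiB guest PT: second half. */
	printf("PT quadrant: %u\n", example_quadrant(gaddr, 1));
	return 0;
}

This lines up with the new derivation as well: the PAE page directory entry
covering 0xc0a00000 sits at an odd index (index 5) within its PAE page
directory, so (sptep - parent_sp->spt) % 2 also yields 1 for the page table.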

* [PATCH v6 05/22] KVM: x86/mmu: Always pass 0 for @quadrant when gptes are 8 bytes
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

The quadrant is only used when gptes are 4 bytes, but
mmu_alloc_{direct,shadow}_roots() pass in a non-zero quadrant for PAE
page directories regardless. Make this less confusing by only passing in
a non-zero quadrant when it is actually necessary.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 515e0b33144a..8508c4bfddb5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3406,9 +3406,10 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
 	struct kvm_mmu_page *sp;
 
 	role.level = level;
+	role.quadrant = quadrant;
 
-	if (role.has_4_byte_gpte)
-		role.quadrant = quadrant;
+	WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
+	WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
 
 	sp = kvm_mmu_get_page(vcpu, gfn, role);
 	++sp->root_count;
@@ -3444,7 +3445,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 		for (i = 0; i < 4; ++i) {
 			WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
 
-			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), i,
+			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), 0,
 					      PT32_ROOT_LEVEL);
 			mmu->pae_root[i] = root | PT_PRESENT_MASK |
 					   shadow_me_mask;
@@ -3529,6 +3530,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	u64 pdptrs[4], pm_mask;
 	gfn_t root_gfn, root_pgd;
+	unsigned int quadrant;
 	hpa_t root;
 	unsigned i;
 	int r;
@@ -3614,7 +3616,15 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 			root_gfn = pdptrs[i] >> PAGE_SHIFT;
 		}
 
-		root = mmu_alloc_root(vcpu, root_gfn, i, PT32_ROOT_LEVEL);
+		/*
+		 * If shadowing 32-bit non-PAE page tables, each PAE page
+		 * directory maps one quarter of the guest's non-PAE page
+		 * directory. Otherwise each PAE page directory shadows one
+		 * guest PAE page directory, so the quadrant should be 0.
+		 */
+		quadrant = (mmu->cpu_role.base.level == PT32_ROOT_LEVEL) ? i : 0;
+
+		root = mmu_alloc_root(vcpu, root_gfn, quadrant, PT32_ROOT_LEVEL);
 		mmu->pae_root[i] = root | pm_mask;
 	}
 
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 06/22] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

Decompose kvm_mmu_get_page() into separate helper functions to increase
readability and prepare for allocating shadow pages without a vcpu
pointer.

Specifically, pull the guts of kvm_mmu_get_page() into 2 helper
functions:

kvm_mmu_find_shadow_page() -
  Walks the page hash checking for any existing mmu pages that match the
  given gfn and role.

kvm_mmu_alloc_shadow_page() -
  Allocates and initializes an entirely new kvm_mmu_page. This currently
  requires a vcpu pointer for allocation and for looking up the memslot,
  but that will be removed in a future commit.

No functional change intended.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 52 +++++++++++++++++++++++++++++++-----------
 1 file changed, 39 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8508c4bfddb5..c8ee92e45e8b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2019,16 +2019,16 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-					     union kvm_mmu_page_role role)
+static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
+						     gfn_t gfn,
+						     struct hlist_head *sp_list,
+						     union kvm_mmu_page_role role)
 {
-	struct hlist_head *sp_list;
 	struct kvm_mmu_page *sp;
 	int ret;
 	int collisions = 0;
 	LIST_HEAD(invalid_list);
 
-	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
 		if (sp->gfn != gfn) {
 			collisions++;
@@ -2053,7 +2053,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 		/* unsync and write-flooding only apply to indirect SPs. */
 		if (sp->role.direct)
-			goto trace_get_page;
+			goto out;
 
 		if (sp->unsync) {
 			/*
@@ -2079,14 +2079,26 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 		__clear_sp_write_flooding_count(sp);
 
-trace_get_page:
-		trace_kvm_mmu_get_page(sp, false);
 		goto out;
 	}
 
+	sp = NULL;
 	++vcpu->kvm->stat.mmu_cache_miss;
 
-	sp = kvm_mmu_alloc_page(vcpu, role.direct);
+out:
+	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+
+	if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions)
+		vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions;
+	return sp;
+}
+
+static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
+						      gfn_t gfn,
+						      struct hlist_head *sp_list,
+						      union kvm_mmu_page_role role)
+{
+	struct kvm_mmu_page *sp = kvm_mmu_alloc_page(vcpu, role.direct);
 
 	sp->gfn = gfn;
 	sp->role = role;
@@ -2096,12 +2108,26 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
 	}
-	trace_kvm_mmu_get_page(sp, true);
-out:
-	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
 
-	if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions)
-		vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions;
+	return sp;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+					     union kvm_mmu_page_role role)
+{
+	struct hlist_head *sp_list;
+	struct kvm_mmu_page *sp;
+	bool created = false;
+
+	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+
+	sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role);
+	if (!sp) {
+		created = true;
+		sp = kvm_mmu_alloc_shadow_page(vcpu, gfn, sp_list, role);
+	}
+
+	trace_kvm_mmu_get_page(sp, created);
 	return sp;
 }
 
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 111+ messages in thread
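
The split above is the classic find-or-create pattern keyed by (gfn, role).
As a rough standalone model (plain userspace C with invented "toy_" names and
a toy hash table, not the kernel code), the control flow after this patch
looks like the sketch below, including the "created" flag that feeds the
tracepoint:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_sp {
	unsigned long gfn;
	unsigned int role;
	struct toy_sp *next;
};

static struct toy_sp *sp_hash[16];

static struct toy_sp *toy_find_shadow_page(unsigned long gfn, unsigned int role)
{
	struct toy_sp *sp;

	for (sp = sp_hash[gfn % 16]; sp; sp = sp->next)
		if (sp->gfn == gfn && sp->role == role)
			return sp;

	return NULL;	/* cache miss */
}

static struct toy_sp *toy_alloc_shadow_page(unsigned long gfn, unsigned int role)
{
	struct toy_sp *sp = calloc(1, sizeof(*sp));

	if (!sp)
		abort();

	sp->gfn = gfn;
	sp->role = role;
	sp->next = sp_hash[gfn % 16];
	sp_hash[gfn % 16] = sp;
	return sp;
}

static struct toy_sp *toy_get_shadow_page(unsigned long gfn, unsigned int role)
{
	bool created = false;
	struct toy_sp *sp = toy_find_shadow_page(gfn, role);

	if (!sp) {
		created = true;
		sp = toy_alloc_shadow_page(gfn, role);
	}

	printf("gfn=%lu role=%u created=%d\n", gfn, role, created);
	return sp;
}

int main(void)
{
	toy_get_shadow_page(5, 1);	/* created=1 */
	toy_get_shadow_page(5, 1);	/* created=0, found in the hash */
	return 0;
}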

* [PATCH v6 07/22] KVM: x86/mmu: Consolidate shadow page allocation and initialization
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

Consolidate kvm_mmu_alloc_page() and kvm_mmu_alloc_shadow_page() under
the latter so that all shadow page allocation and initialization happens
in one place.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 39 +++++++++++++++++----------------------
 1 file changed, 17 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c8ee92e45e8b..0b14097f8771 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1690,27 +1690,6 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 	mmu_spte_clear_no_track(parent_pte);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, bool direct)
-{
-	struct kvm_mmu_page *sp;
-
-	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
-	if (!direct)
-		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
-	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
-
-	/*
-	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
-	 * depends on valid pages being added to the head of the list.  See
-	 * comments in kvm_zap_obsolete_pages().
-	 */
-	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
-	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
-	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
-	return sp;
-}
-
 static void mark_unsync(u64 *spte);
 static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
 {
@@ -2098,7 +2077,23 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
 						      struct hlist_head *sp_list,
 						      union kvm_mmu_page_role role)
 {
-	struct kvm_mmu_page *sp = kvm_mmu_alloc_page(vcpu, role.direct);
+	struct kvm_mmu_page *sp;
+
+	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
+	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	if (!role.direct)
+		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
+
+	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+
+	/*
+	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
+	 * depends on valid pages being added to the head of the list.  See
+	 * comments in kvm_zap_obsolete_pages().
+	 */
+	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
+	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
+	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
 
 	sp->gfn = gfn;
 	sp->role = role;
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 08/22] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

Rename 2 functions:

  kvm_mmu_get_page() -> kvm_mmu_get_shadow_page()
  kvm_mmu_free_page() -> kvm_mmu_free_shadow_page()

This change makes it clear that these functions deal with shadow pages
rather than struct pages. It also aligns these functions with the naming
scheme for kvm_mmu_find_shadow_page() and kvm_mmu_alloc_shadow_page().

Prefer "shadow_page" over the shorter "sp" since these are core
functions and the line lengths aren't terrible.

No functional change intended.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0b14097f8771..d342fcc5813d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1652,7 +1652,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
 	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
-static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
+static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
 {
 	MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
 	hlist_del(&sp->hash_link);
@@ -2107,8 +2107,9 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-					     union kvm_mmu_page_role role)
+static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+						    gfn_t gfn,
+						    union kvm_mmu_page_role role)
 {
 	struct hlist_head *sp_list;
 	struct kvm_mmu_page *sp;
@@ -2172,7 +2173,7 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
 	union kvm_mmu_page_role role;
 
 	role = kvm_mmu_child_role(sptep, direct, access);
-	return kvm_mmu_get_page(vcpu, gfn, role);
+	return kvm_mmu_get_shadow_page(vcpu, gfn, role);
 }
 
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
@@ -2448,7 +2449,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 
 	list_for_each_entry_safe(sp, nsp, invalid_list, link) {
 		WARN_ON(!sp->role.invalid || sp->root_count);
-		kvm_mmu_free_page(sp);
+		kvm_mmu_free_shadow_page(sp);
 	}
 }
 
@@ -3432,7 +3433,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
 	WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
 	WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
 
-	sp = kvm_mmu_get_page(vcpu, gfn, role);
+	sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
 	++sp->root_count;
 
 	return __pa(sp->spt);
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 09/22] KVM: x86/mmu: Move guest PT write-protection to account_shadowed()
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

Move the code that write-protects newly-shadowed guest page tables into
account_shadowed(). This avoids an extra gfn-to-memslot lookup and is a
more logical place for this code to live. But most importantly, this
reduces kvm_mmu_alloc_shadow_page()'s reliance on having a struct
kvm_vcpu pointer, which will be necessary when creating new shadow pages
during VM ioctls for eager page splitting.

Note, it is safe to drop the role.level == PG_LEVEL_4K check since
account_shadowed() returns early if role.level > PG_LEVEL_4K.

No functional change intended.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d342fcc5813d..6a3b1b00f02b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -792,6 +792,9 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 						    KVM_PAGE_TRACK_WRITE);
 
 	kvm_mmu_gfn_disallow_lpage(slot, gfn);
+
+	if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
+		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
 }
 
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
@@ -2098,11 +2101,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
 	sp->gfn = gfn;
 	sp->role = role;
 	hlist_add_head(&sp->hash_link, sp_list);
-	if (sp_has_gptes(sp)) {
+	if (sp_has_gptes(sp))
 		account_shadowed(vcpu->kvm, sp);
-		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
-			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
-	}
 
 	return sp;
 }
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 10/22] KVM: x86/mmu: Pass memory caches to allocate SPs separately
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

Refactor kvm_mmu_alloc_shadow_page() to receive the caches from which it
will allocate the various pieces of memory for shadow pages as a
parameter, rather than deriving them from the vcpu pointer. This will be
useful in a future commit where shadow pages are allocated during VM
ioctls for eager page splitting, and thus will use a different set of
caches.

Preemptively pull the caches out all the way to
kvm_mmu_get_shadow_page() since eager page splitting will not be calling
kvm_mmu_alloc_shadow_page() directly.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 36 +++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6a3b1b00f02b..bad4dd5aa051 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2075,17 +2075,25 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
+/* Caches used when allocating a new shadow page. */
+struct shadow_page_caches {
+	struct kvm_mmu_memory_cache *page_header_cache;
+	struct kvm_mmu_memory_cache *shadow_page_cache;
+	struct kvm_mmu_memory_cache *gfn_array_cache;
+};
+
 static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
+						      struct shadow_page_caches *caches,
 						      gfn_t gfn,
 						      struct hlist_head *sp_list,
 						      union kvm_mmu_page_role role)
 {
 	struct kvm_mmu_page *sp;
 
-	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
+	sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
 	if (!role.direct)
-		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
+		sp->gfns = kvm_mmu_memory_cache_alloc(caches->gfn_array_cache);
 
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
@@ -2107,9 +2115,10 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
-						    gfn_t gfn,
-						    union kvm_mmu_page_role role)
+static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+						      struct shadow_page_caches *caches,
+						      gfn_t gfn,
+						      union kvm_mmu_page_role role)
 {
 	struct hlist_head *sp_list;
 	struct kvm_mmu_page *sp;
@@ -2120,13 +2129,26 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
 	sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role);
 	if (!sp) {
 		created = true;
-		sp = kvm_mmu_alloc_shadow_page(vcpu, gfn, sp_list, role);
+		sp = kvm_mmu_alloc_shadow_page(vcpu, caches, gfn, sp_list, role);
 	}
 
 	trace_kvm_mmu_get_page(sp, created);
 	return sp;
 }
 
+static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+						    gfn_t gfn,
+						    union kvm_mmu_page_role role)
+{
+	struct shadow_page_caches caches = {
+		.page_header_cache = &vcpu->arch.mmu_page_header_cache,
+		.shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
+		.gfn_array_cache = &vcpu->arch.mmu_gfn_array_cache,
+	};
+
+	return __kvm_mmu_get_shadow_page(vcpu, &caches, gfn, role);
+}
+
 static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
 {
 	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 111+ messages in thread
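
The point of threading a shadow_page_caches struct through, rather than having
the allocator reach into vcpu->arch, is that a future caller can hand in an
entirely different set of caches. A toy standalone sketch of that design
(plain userspace C with invented "toy_" names; the "eager-split" caches below
are hypothetical stand-ins for what later patches in this series wire up, not
actual kernel fields):

#include <stdio.h>

/* Toy stand-ins for kvm_mmu_memory_cache and shadow_page_caches. */
struct toy_cache {
	const char *name;
};

struct toy_page_caches {
	struct toy_cache *header_cache;
	struct toy_cache *page_cache;
};

static void toy_alloc_from(struct toy_cache *cache)
{
	printf("allocating from %s\n", cache->name);
}

/* Mirrors kvm_mmu_alloc_shadow_page(): it only ever sees the caches. */
static void toy_alloc_shadow_page(struct toy_page_caches *caches)
{
	toy_alloc_from(caches->header_cache);
	toy_alloc_from(caches->page_cache);
}

int main(void)
{
	struct toy_cache vcpu_hdr = { "vCPU page header cache" };
	struct toy_cache vcpu_pg = { "vCPU shadow page cache" };
	struct toy_cache split_hdr = { "eager-split page header cache" };
	struct toy_cache split_pg = { "eager-split shadow page cache" };

	struct toy_page_caches vcpu_caches = { &vcpu_hdr, &vcpu_pg };
	struct toy_page_caches split_caches = { &split_hdr, &split_pg };

	toy_alloc_shadow_page(&vcpu_caches);	/* normal vCPU fault path */
	toy_alloc_shadow_page(&split_caches);	/* hypothetical VM-ioctl path */
	return 0;
}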

* [PATCH v6 11/22] KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page()
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

The vcpu pointer in kvm_mmu_alloc_shadow_page() is only used to get the
kvm pointer. So drop the vcpu pointer and just pass in the kvm pointer.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bad4dd5aa051..8031b799ca77 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2082,7 +2082,7 @@ struct shadow_page_caches {
 	struct kvm_mmu_memory_cache *gfn_array_cache;
 };
 
-static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
+static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
 						      struct shadow_page_caches *caches,
 						      gfn_t gfn,
 						      struct hlist_head *sp_list,
@@ -2102,15 +2102,15 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
 	 * depends on valid pages being added to the head of the list.  See
 	 * comments in kvm_zap_obsolete_pages().
 	 */
-	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
-	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
-	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
+	sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;
+	list_add(&sp->link, &kvm->arch.active_mmu_pages);
+	kvm_mod_used_mmu_pages(kvm, +1);
 
 	sp->gfn = gfn;
 	sp->role = role;
 	hlist_add_head(&sp->hash_link, sp_list);
 	if (sp_has_gptes(sp))
-		account_shadowed(vcpu->kvm, sp);
+		account_shadowed(kvm, sp);
 
 	return sp;
 }
@@ -2129,7 +2129,7 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
 	sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role);
 	if (!sp) {
 		created = true;
-		sp = kvm_mmu_alloc_shadow_page(vcpu, caches, gfn, sp_list, role);
+		sp = kvm_mmu_alloc_shadow_page(vcpu->kvm, caches, gfn, sp_list, role);
 	}
 
 	trace_kvm_mmu_get_page(sp, created);
-- 
2.36.0.550.gb090851708-goog



* [PATCH v6 12/22] KVM: x86/mmu: Pass kvm pointer separately from vcpu to kvm_mmu_find_shadow_page()
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

Get the kvm pointer from the caller, rather than deriving it from
vcpu->kvm, and plumb the kvm pointer all the way from
kvm_mmu_get_shadow_page(). With this change in place, the vcpu pointer
is only needed to sync indirect shadow pages. In other words,
__kvm_mmu_get_shadow_page() can now be used to get *direct* shadow pages
without a vcpu pointer. This enables eager page splitting, which needs
to allocate direct shadow pages during VM ioctls.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8031b799ca77..4fbc2da47428 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2001,7 +2001,8 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
+static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
+						     struct kvm_vcpu *vcpu,
 						     gfn_t gfn,
 						     struct hlist_head *sp_list,
 						     union kvm_mmu_page_role role)
@@ -2011,7 +2012,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
 	int collisions = 0;
 	LIST_HEAD(invalid_list);
 
-	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
+	for_each_valid_sp(kvm, sp, sp_list) {
 		if (sp->gfn != gfn) {
 			collisions++;
 			continue;
@@ -2028,7 +2029,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
 			 * upper-level page will be write-protected.
 			 */
 			if (role.level > PG_LEVEL_4K && sp->unsync)
-				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
+				kvm_mmu_prepare_zap_page(kvm, sp,
 							 &invalid_list);
 			continue;
 		}
@@ -2056,7 +2057,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
 
 			WARN_ON(!list_empty(&invalid_list));
 			if (ret > 0)
-				kvm_flush_remote_tlbs(vcpu->kvm);
+				kvm_flush_remote_tlbs(kvm);
 		}
 
 		__clear_sp_write_flooding_count(sp);
@@ -2065,13 +2066,13 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
 	}
 
 	sp = NULL;
-	++vcpu->kvm->stat.mmu_cache_miss;
+	++kvm->stat.mmu_cache_miss;
 
 out:
-	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+	kvm_mmu_commit_zap_page(kvm, &invalid_list);
 
-	if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions)
-		vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions;
+	if (collisions > kvm->stat.max_mmu_page_hash_collisions)
+		kvm->stat.max_mmu_page_hash_collisions = collisions;
 	return sp;
 }
 
@@ -2115,7 +2116,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
 	return sp;
 }
 
-static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
+						      struct kvm_vcpu *vcpu,
 						      struct shadow_page_caches *caches,
 						      gfn_t gfn,
 						      union kvm_mmu_page_role role)
@@ -2124,12 +2126,12 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
 	struct kvm_mmu_page *sp;
 	bool created = false;
 
-	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+	sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 
-	sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role);
+	sp = kvm_mmu_find_shadow_page(kvm, vcpu, gfn, sp_list, role);
 	if (!sp) {
 		created = true;
-		sp = kvm_mmu_alloc_shadow_page(vcpu->kvm, caches, gfn, sp_list, role);
+		sp = kvm_mmu_alloc_shadow_page(kvm, caches, gfn, sp_list, role);
 	}
 
 	trace_kvm_mmu_get_page(sp, created);
@@ -2146,7 +2148,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
 		.gfn_array_cache = &vcpu->arch.mmu_gfn_array_cache,
 	};
 
-	return __kvm_mmu_get_shadow_page(vcpu, &caches, gfn, role);
+	return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
 }
 
 static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
-- 
2.36.0.550.gb090851708-goog



* [PATCH v6 13/22] KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page()
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

Allow @vcpu to be NULL in kvm_mmu_find_shadow_page() (and its only
caller __kvm_mmu_get_shadow_page()). @vcpu is only required to sync
indirect shadow pages, so it's safe to pass in NULL when looking up
direct shadow pages.

This will be used for doing eager page splitting, which allocates direct
shadow pages from the context of a VM ioctl without access to a vCPU
pointer.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4fbc2da47428..acb54d6e0ea5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1850,6 +1850,7 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 
 	if (ret < 0)
 		kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
+
 	return ret;
 }
 
@@ -2001,6 +2002,7 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
+/* Note, @vcpu may be NULL if @role.direct is true. */
 static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
 						     struct kvm_vcpu *vcpu,
 						     gfn_t gfn,
@@ -2039,6 +2041,16 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
 			goto out;
 
 		if (sp->unsync) {
+			/*
+			 * A vCPU pointer should always be provided when finding
+			 * indirect shadow pages, as that shadow page may
+			 * already exist and need to be synced using the vCPU
+			 * pointer. Direct shadow pages are never unsync and
+			 * thus do not require a vCPU pointer.
+			 */
+			if (KVM_BUG_ON(!vcpu, kvm))
+				break;
+
 			/*
 			 * The page is good, but is stale.  kvm_sync_page does
 			 * get the latest guest state, but (unlike mmu_unsync_children)
@@ -2116,6 +2128,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
 	return sp;
 }
 
+/* Note, @vcpu may be NULL if @role.direct is true. */
 static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
 						      struct kvm_vcpu *vcpu,
 						      struct shadow_page_caches *caches,
-- 
2.36.0.550.gb090851708-goog
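
[Editor's illustration, not part of the patch: a rough sketch of how a later
eager-page-splitting path, running under a VM ioctl with no vCPU, might use
this. The wrapper name is hypothetical; only __kvm_mmu_get_shadow_page() and
the role fields come from the series.]

	/* Hypothetical caller in a VM-ioctl context (no vCPU available). */
	static struct kvm_mmu_page *get_direct_sp_for_split(struct kvm *kvm,
							    struct shadow_page_caches *caches,
							    gfn_t gfn,
							    union kvm_mmu_page_role role)
	{
		/*
		 * Direct SPs are never unsync, so a NULL vcpu can never reach
		 * the KVM_BUG_ON() in kvm_mmu_find_shadow_page().
		 */
		WARN_ON_ONCE(!role.direct);

		return __kvm_mmu_get_shadow_page(kvm, NULL, caches, gfn, role);
	}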



* [PATCH v6 14/22] KVM: x86/mmu: Pass const memslot to rmap_add()
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

rmap_add() only uses the slot to call gfn_to_rmap() which takes a const
memslot.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index acb54d6e0ea5..1c0c1f82067d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1582,7 +1582,7 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 
 #define RMAP_RECYCLE_THRESHOLD 1000
 
-static void rmap_add(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
+static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
 		     u64 *spte, gfn_t gfn)
 {
 	struct kvm_mmu_page *sp;
-- 
2.36.0.550.gb090851708-goog



* [PATCH v6 15/22] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

Allow adding new entries to the rmap and linking shadow pages without a
struct kvm_vcpu pointer by moving the implementation of rmap_add() and
link_shadow_page() into inner helper functions.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 45 +++++++++++++++++++++++++-----------------
 1 file changed, 27 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1c0c1f82067d..15c0f03848d3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -699,11 +699,6 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
-static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
-{
-	return kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_list_desc_cache);
-}
-
 static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 {
 	kmem_cache_free(pte_list_desc_cache, pte_list_desc);
@@ -858,7 +853,7 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
 /*
  * Returns the number of pointers in the rmap chain, not counting the new one.
  */
-static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
+static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 			struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
@@ -869,7 +864,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 		rmap_head->val = (unsigned long)spte;
 	} else if (!(rmap_head->val & 1)) {
 		rmap_printk("%p %llx 1->many\n", spte, *spte);
-		desc = mmu_alloc_pte_list_desc(vcpu);
+		desc = kvm_mmu_memory_cache_alloc(cache);
 		desc->sptes[0] = (u64 *)rmap_head->val;
 		desc->sptes[1] = spte;
 		desc->spte_count = 2;
@@ -881,7 +876,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 		while (desc->spte_count == PTE_LIST_EXT) {
 			count += PTE_LIST_EXT;
 			if (!desc->more) {
-				desc->more = mmu_alloc_pte_list_desc(vcpu);
+				desc->more = kvm_mmu_memory_cache_alloc(cache);
 				desc = desc->more;
 				desc->spte_count = 0;
 				break;
@@ -1582,8 +1577,10 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 
 #define RMAP_RECYCLE_THRESHOLD 1000
 
-static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
-		     u64 *spte, gfn_t gfn)
+static void __rmap_add(struct kvm *kvm,
+		       struct kvm_mmu_memory_cache *cache,
+		       const struct kvm_memory_slot *slot,
+		       u64 *spte, gfn_t gfn)
 {
 	struct kvm_mmu_page *sp;
 	struct kvm_rmap_head *rmap_head;
@@ -1592,15 +1589,21 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
 	sp = sptep_to_sp(spte);
 	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
-	rmap_count = pte_list_add(vcpu, spte, rmap_head);
+	rmap_count = pte_list_add(cache, spte, rmap_head);
 
 	if (rmap_count > RMAP_RECYCLE_THRESHOLD) {
-		kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
+		kvm_unmap_rmapp(kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
 		kvm_flush_remote_tlbs_with_address(
-				vcpu->kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
+				kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
 	}
 }
 
+static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
+		     u64 *spte, gfn_t gfn)
+{
+	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn);
+}
+
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
@@ -1671,13 +1674,13 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn)
 	return hash_64(gfn, KVM_MMU_HASH_SHIFT);
 }
 
-static void mmu_page_add_parent_pte(struct kvm_vcpu *vcpu,
+static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache,
 				    struct kvm_mmu_page *sp, u64 *parent_pte)
 {
 	if (!parent_pte)
 		return;
 
-	pte_list_add(vcpu, parent_pte, &sp->parent_ptes);
+	pte_list_add(cache, parent_pte, &sp->parent_ptes);
 }
 
 static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp,
@@ -2276,8 +2279,8 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 	__shadow_walk_next(iterator, *iterator->sptep);
 }
 
-static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
-			     struct kvm_mmu_page *sp)
+static void __link_shadow_page(struct kvm_mmu_memory_cache *cache, u64 *sptep,
+			       struct kvm_mmu_page *sp)
 {
 	u64 spte;
 
@@ -2287,12 +2290,18 @@ static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
 
 	mmu_spte_set(sptep, spte);
 
-	mmu_page_add_parent_pte(vcpu, sp, sptep);
+	mmu_page_add_parent_pte(cache, sp, sptep);
 
 	if (sp->unsync_children || sp->unsync)
 		mark_unsync(sptep);
 }
 
+static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
+			     struct kvm_mmu_page *sp)
+{
+	__link_shadow_page(&vcpu->arch.mmu_pte_list_desc_cache, sptep, sp);
+}
+
 static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 				   unsigned direct_access)
 {
-- 
2.36.0.550.gb090851708-goog
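
[Editor's illustration, not part of the patch: a sketch of how a non-vCPU
context could use the new inner helpers with its own pte_list_desc cache.
The function name is hypothetical, and it assumes the huge SPTE has already
been dropped and @cache has been topped up beforehand.]

	/* Hypothetical caller that owns @cache. */
	static void split_link_and_rmap(struct kvm *kvm,
					struct kvm_mmu_memory_cache *cache,
					const struct kvm_memory_slot *slot,
					u64 *huge_sptep, struct kvm_mmu_page *sp,
					u64 *leaf_sptep, gfn_t gfn)
	{
		/* Point the (already dropped) huge SPTE at the new child SP. */
		__link_shadow_page(cache, huge_sptep, sp);

		/* Track a populated leaf SPTE in the rmap, no vCPU required. */
		__rmap_add(kvm, cache, slot, leaf_sptep, gfn);
	}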



* [PATCH v6 16/22] KVM: x86/mmu: Update page stats in __rmap_add()
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

Update the page stats in __rmap_add() rather than at the call site. This
will avoid having to manually update page stats when splitting huge
pages in a subsequent commit.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 15c0f03848d3..6aef85dac1e2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1588,6 +1588,8 @@ static void __rmap_add(struct kvm *kvm,
 
 	sp = sptep_to_sp(spte);
 	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
+	kvm_update_page_stats(kvm, sp->role.level, 1);
+
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
 	rmap_count = pte_list_add(cache, spte, rmap_head);
 
@@ -2810,7 +2812,6 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 
 	if (!was_rmapped) {
 		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
-		kvm_update_page_stats(vcpu->kvm, level, 1);
 		rmap_add(vcpu, slot, sptep, gfn);
 	}
 
-- 
2.36.0.550.gb090851708-goog



* [PATCH v6 17/22] KVM: x86/mmu: Cache the access bits of shadowed translations
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

Splitting huge pages requires allocating/finding shadow pages to replace
the huge page. Shadow pages are keyed, in part, off the guest access
permissions they are shadowing. For fully direct MMUs, there is no
shadowing so the access bits in the shadow page role are always ACC_ALL.
But during shadow paging, the guest can enforce whatever access
permissions it wants.

When KVM is resolving a fault, it walks the guest page tables to
determine the guest access permissions. But that is difficult to plumb
when splitting huge pages outside of a fault context, e.g. for eager
page splitting.

To enable eager page splitting, KVM can cache the shadowed (guest)
access permissions whenever it updates the shadow page tables (e.g.
during a fault or in FNAME(sync_page)). In fact, KVM already does this
to cache the shadowed GFN using the gfns array in the shadow page. The
access permissions take up only 3 bits, which leaves 61 bits for the
GFN, which is more than enough. So this change does not require any
additional memory.

Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.

While here, preemptively fix up the WARN_ON() that detects gfn
mismatches in direct SPs. The WARN_ON() was paired with a
pr_err_ratelimited(), which means that users could sometimes see the
WARN without the accompanying error message. Fix this by outputting the
error message as part of the WARN splat.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu/mmu.c          | 85 +++++++++++++++++++++++----------
 arch/x86/kvm/mmu/mmu_internal.h | 17 ++++++-
 arch/x86/kvm/mmu/paging_tmpl.h  |  8 +++-
 4 files changed, 83 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9cdc5bbd721f..9193a700fe2d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -710,7 +710,7 @@ struct kvm_vcpu_arch {
 
 	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
-	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
+	struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
 	/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6aef85dac1e2..f83de72feeac 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -682,7 +682,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 	if (r)
 		return r;
 	if (maybe_indirect) {
-		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_gfn_array_cache,
+		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
 					       PT64_ROOT_MAX_LEVEL);
 		if (r)
 			return r;
@@ -695,7 +695,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
-	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
@@ -704,34 +704,68 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 	kmem_cache_free(pte_list_desc_cache, pte_list_desc);
 }
 
+static bool sp_has_gptes(struct kvm_mmu_page *sp);
+
 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 {
 	if (sp->role.passthrough)
 		return sp->gfn;
 
 	if (!sp->role.direct)
-		return sp->gfns[index];
+		return sp->shadowed_translation[index] >> PAGE_SHIFT;
 
 	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
 }
 
-static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
+/*
+ * For leaf SPTEs, fetch the *guest* access permissions being shadowed. Note
+ * that the SPTE itself may have a more constrained access permissions that
+ * what the guest enforces. For example, a guest may create an executable
+ * huge PTE but KVM may disallow execution to mitigate iTLB multihit.
+ */
+static u32 kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
 {
-	if (sp->role.passthrough) {
-		WARN_ON_ONCE(gfn != sp->gfn);
-		return;
-	}
+	if (sp_has_gptes(sp))
+		return sp->shadowed_translation[index] & ACC_ALL;
 
-	if (!sp->role.direct) {
-		sp->gfns[index] = gfn;
+	/*
+	 * For direct MMUs (e.g. TDP or non-paging guests) or passthrough SPs,
+	 * KVM is not shadowing any guest page tables, so the "guest access
+	 * permissions" are just ACC_ALL.
+	 *
+	 * For direct SPs in indirect MMUs (shadow paging), i.e. when KVM
+	 * is shadowing a guest huge page with small pages, the guest access
+	 * permissions being shadowed are the access permissions of the huge
+	 * page.
+	 *
+	 * In both cases, sp->role.access contains the correct access bits.
+	 */
+	return sp->role.access;
+}
+
+static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index, gfn_t gfn, u32 access)
+{
+	if (sp_has_gptes(sp)) {
+		sp->shadowed_translation[index] = (gfn << PAGE_SHIFT) | access;
 		return;
 	}
 
-	if (WARN_ON(gfn != kvm_mmu_page_get_gfn(sp, index)))
-		pr_err_ratelimited("gfn mismatch under direct page %llx "
-				   "(expected %llx, got %llx)\n",
-				   sp->gfn,
-				   kvm_mmu_page_get_gfn(sp, index), gfn);
+	WARN(access != kvm_mmu_page_get_access(sp, index),
+	     "access mismatch under %s page %llx (expected %u, got %u)\n",
+	     sp->role.passthrough ? "passthrough" : "direct",
+	     sp->gfn, kvm_mmu_page_get_access(sp, index), access);
+
+	WARN(gfn != kvm_mmu_page_get_gfn(sp, index),
+	     "gfn mismatch under %s page %llx (expected %llx, got %llx)\n",
+	     sp->role.passthrough ? "passthrough" : "direct",
+	     sp->gfn, kvm_mmu_page_get_gfn(sp, index), gfn);
+}
+
+static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index, u32 access)
+{
+	gfn_t gfn = kvm_mmu_page_get_gfn(sp, index);
+
+	kvm_mmu_page_set_translation(sp, index, gfn, access);
 }
 
 /*
@@ -1580,14 +1614,14 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 static void __rmap_add(struct kvm *kvm,
 		       struct kvm_mmu_memory_cache *cache,
 		       const struct kvm_memory_slot *slot,
-		       u64 *spte, gfn_t gfn)
+		       u64 *spte, gfn_t gfn, u32 access)
 {
 	struct kvm_mmu_page *sp;
 	struct kvm_rmap_head *rmap_head;
 	int rmap_count;
 
 	sp = sptep_to_sp(spte);
-	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
+	kvm_mmu_page_set_translation(sp, spte - sp->spt, gfn, access);
 	kvm_update_page_stats(kvm, sp->role.level, 1);
 
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
@@ -1601,9 +1635,9 @@ static void __rmap_add(struct kvm *kvm,
 }
 
 static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
-		     u64 *spte, gfn_t gfn)
+		     u64 *spte, gfn_t gfn, u32 access)
 {
-	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn);
+	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn, access);
 }
 
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
@@ -1667,7 +1701,7 @@ static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
 	list_del(&sp->link);
 	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct)
-		free_page((unsigned long)sp->gfns);
+		free_page((unsigned long)sp->shadowed_translation);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -2097,7 +2131,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
 struct shadow_page_caches {
 	struct kvm_mmu_memory_cache *page_header_cache;
 	struct kvm_mmu_memory_cache *shadow_page_cache;
-	struct kvm_mmu_memory_cache *gfn_array_cache;
+	struct kvm_mmu_memory_cache *shadowed_info_cache;
 };
 
 static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
@@ -2111,7 +2145,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
 	sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
 	if (!role.direct)
-		sp->gfns = kvm_mmu_memory_cache_alloc(caches->gfn_array_cache);
+		sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
 
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
@@ -2163,7 +2197,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
 	struct shadow_page_caches caches = {
 		.page_header_cache = &vcpu->arch.mmu_page_header_cache,
 		.shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
-		.gfn_array_cache = &vcpu->arch.mmu_gfn_array_cache,
+		.shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
 	};
 
 	return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
@@ -2812,7 +2846,10 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 
 	if (!was_rmapped) {
 		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
-		rmap_add(vcpu, slot, sptep, gfn);
+		rmap_add(vcpu, slot, sptep, gfn, pte_access);
+	} else {
+		/* Already rmapped but the pte_access bits may have changed. */
+		kvm_mmu_page_set_access(sp, sptep - sp->spt, pte_access);
 	}
 
 	return ret;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index bd2a26897b97..0395950045d1 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -53,8 +53,21 @@ struct kvm_mmu_page {
 	gfn_t gfn;
 
 	u64 *spt;
-	/* hold the gfn of each spte inside spt */
-	gfn_t *gfns;
+
+	/*
+	 * Stores the result of the guest translation being shadowed by each
+	 * SPTE.  KVM shadows two types of guest translations: nGPA -> GPA
+	 * (shadow EPT/NPT) and GVA -> GPA (traditional shadow paging). In both
+	 * cases the result of the translation is a GPA and a set of access
+	 * constraints.
+	 *
+	 * The GFN is stored in the upper bits (PAGE_SHIFT) and the shadowed
+	 * access permissions are stored in the lower bits. Note, for
+	 * convenience and uniformity across guests, the access permissions are
+	 * stored in KVM format (e.g.  ACC_EXEC_MASK) not the raw guest format.
+	 */
+	u64 *shadowed_translation;
+
 	/* Currently serving as active root */
 	union {
 		int root_count;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index fd73c857af90..37ceb6e452e6 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -979,7 +979,8 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 }
 
 /*
- * Using the cached information from sp->gfns is safe because:
+ * Using the information in sp->shadowed_translation (kvm_mmu_page_get_gfn()) is
+ * safe because:
  * - The spte has a reference to the struct page, so the pfn for a given gfn
  *   can't change unless all sptes pointing to it are nuked first.
  *
@@ -1054,12 +1055,15 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		if (sync_mmio_spte(vcpu, &sp->spt[i], gfn, pte_access))
 			continue;
 
-		if (gfn != sp->gfns[i]) {
+		if (gfn != kvm_mmu_page_get_gfn(sp, i)) {
 			drop_spte(vcpu->kvm, &sp->spt[i]);
 			flush = true;
 			continue;
 		}
 
+		/* Update the shadowed access bits in case they changed. */
+		kvm_mmu_page_set_access(sp, i, pte_access);
+
 		sptep = &sp->spt[i];
 		spte = *sptep;
 		host_writable = spte & shadow_host_writable_mask;
-- 
2.36.0.550.gb090851708-goog
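
[Editor's illustration, not part of the patch: the packing used by
kvm_mmu_page_set_translation() and its getters, written out as standalone
helpers. On x86, PAGE_SHIFT is 12 and ACC_ALL occupies 3 bits, so the GFN
lands in bits 63:12 with the KVM-format access bits below it.]

	/* Mirrors kvm_mmu_page_set_translation()/get_gfn()/get_access(). */
	static u64 pack_shadowed_translation(gfn_t gfn, u32 access)
	{
		/* GFN in bits 63:12, ACC_* bits (<= ACC_ALL) in bits 2:0. */
		return (gfn << PAGE_SHIFT) | access;
	}

	static gfn_t shadowed_gfn(u64 entry)
	{
		return entry >> PAGE_SHIFT;
	}

	static u32 shadowed_access(u64 entry)
	{
		return entry & ACC_ALL;
	}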



* [PATCH v6 17/22] KVM: x86/mmu: Cache the access bits of shadowed translations
@ 2022-05-16 23:21   ` David Matlack
  0 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen, Lai Jiangshan,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	David Matlack, Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

Splitting huge pages requires allocating/finding shadow pages to replace
the huge page. Shadow pages are keyed, in part, off the guest access
permissions they are shadowing. For fully direct MMUs, there is no
shadowing so the access bits in the shadow page role are always ACC_ALL.
But during shadow paging, the guest can enforce whatever access
permissions it wants.

When KVM is resolving a fault, it walks the guest page tables to
determine the guest access permissions. But that is difficult to plumb
when splitting huge pages outside of a fault context, e.g. for eager
page splitting.

To enable eager page splitting, KVM can cache the shadowed (guest)
access permissions whenever it updates the shadow page tables (e.g.
during a fault or in FNAME(sync_page)). In fact, KVM already does this
to cache the shadowed GFN using the gfns array in the shadow page. The
access permissions take up only 3 bits, which leaves 61 bits for the
GFN, which is more than enough. So this change does not require any
additional memory.

Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.

While here, preemptively fix up the WARN_ON() that detects gfn
mismatches in direct SPs. The WARN_ON() was paired with a
pr_err_ratelimited(), which means that users could sometimes see the
WARN without the accompanying error message. Fix this by outputting the
error message as part of the WARN splat.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu/mmu.c          | 85 +++++++++++++++++++++++----------
 arch/x86/kvm/mmu/mmu_internal.h | 17 ++++++-
 arch/x86/kvm/mmu/paging_tmpl.h  |  8 +++-
 4 files changed, 83 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9cdc5bbd721f..9193a700fe2d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -710,7 +710,7 @@ struct kvm_vcpu_arch {
 
 	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
-	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
+	struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
 	/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6aef85dac1e2..f83de72feeac 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -682,7 +682,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 	if (r)
 		return r;
 	if (maybe_indirect) {
-		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_gfn_array_cache,
+		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
 					       PT64_ROOT_MAX_LEVEL);
 		if (r)
 			return r;
@@ -695,7 +695,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
-	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
@@ -704,34 +704,68 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 	kmem_cache_free(pte_list_desc_cache, pte_list_desc);
 }
 
+static bool sp_has_gptes(struct kvm_mmu_page *sp);
+
 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 {
 	if (sp->role.passthrough)
 		return sp->gfn;
 
 	if (!sp->role.direct)
-		return sp->gfns[index];
+		return sp->shadowed_translation[index] >> PAGE_SHIFT;
 
 	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
 }
 
-static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
+/*
+ * For leaf SPTEs, fetch the *guest* access permissions being shadowed. Note
+ * that the SPTE itself may have a more constrained access permissions that
+ * what the guest enforces. For example, a guest may create an executable
+ * huge PTE but KVM may disallow execution to mitigate iTLB multihit.
+ */
+static u32 kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
 {
-	if (sp->role.passthrough) {
-		WARN_ON_ONCE(gfn != sp->gfn);
-		return;
-	}
+	if (sp_has_gptes(sp))
+		return sp->shadowed_translation[index] & ACC_ALL;
 
-	if (!sp->role.direct) {
-		sp->gfns[index] = gfn;
+	/*
+	 * For direct MMUs (e.g. TDP or non-paging guests) or passthrough SPs,
+	 * KVM is not shadowing any guest page tables, so the "guest access
+	 * permissions" are just ACC_ALL.
+	 *
+	 * For direct SPs in indirect MMUs (shadow paging), i.e. when KVM
+	 * is shadowing a guest huge page with small pages, the guest access
+	 * permissions being shadowed are the access permissions of the huge
+	 * page.
+	 *
+	 * In both cases, sp->role.access contains the correct access bits.
+	 */
+	return sp->role.access;
+}
+
+static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index, gfn_t gfn, u32 access)
+{
+	if (sp_has_gptes(sp)) {
+		sp->shadowed_translation[index] = (gfn << PAGE_SHIFT) | access;
 		return;
 	}
 
-	if (WARN_ON(gfn != kvm_mmu_page_get_gfn(sp, index)))
-		pr_err_ratelimited("gfn mismatch under direct page %llx "
-				   "(expected %llx, got %llx)\n",
-				   sp->gfn,
-				   kvm_mmu_page_get_gfn(sp, index), gfn);
+	WARN(access != kvm_mmu_page_get_access(sp, index),
+	     "access mismatch under %s page %llx (expected %u, got %u)\n",
+	     sp->role.passthrough ? "passthrough" : "direct",
+	     sp->gfn, kvm_mmu_page_get_access(sp, index), access);
+
+	WARN(gfn != kvm_mmu_page_get_gfn(sp, index),
+	     "gfn mismatch under %s page %llx (expected %llx, got %llx)\n",
+	     sp->role.passthrough ? "passthrough" : "direct",
+	     sp->gfn, kvm_mmu_page_get_gfn(sp, index), gfn);
+}
+
+static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index, u32 access)
+{
+	gfn_t gfn = kvm_mmu_page_get_gfn(sp, index);
+
+	kvm_mmu_page_set_translation(sp, index, gfn, access);
 }
 
 /*
@@ -1580,14 +1614,14 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 static void __rmap_add(struct kvm *kvm,
 		       struct kvm_mmu_memory_cache *cache,
 		       const struct kvm_memory_slot *slot,
-		       u64 *spte, gfn_t gfn)
+		       u64 *spte, gfn_t gfn, u32 access)
 {
 	struct kvm_mmu_page *sp;
 	struct kvm_rmap_head *rmap_head;
 	int rmap_count;
 
 	sp = sptep_to_sp(spte);
-	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
+	kvm_mmu_page_set_translation(sp, spte - sp->spt, gfn, access);
 	kvm_update_page_stats(kvm, sp->role.level, 1);
 
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
@@ -1601,9 +1635,9 @@ static void __rmap_add(struct kvm *kvm,
 }
 
 static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
-		     u64 *spte, gfn_t gfn)
+		     u64 *spte, gfn_t gfn, u32 access)
 {
-	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn);
+	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn, access);
 }
 
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
@@ -1667,7 +1701,7 @@ static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
 	list_del(&sp->link);
 	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct)
-		free_page((unsigned long)sp->gfns);
+		free_page((unsigned long)sp->shadowed_translation);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -2097,7 +2131,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
 struct shadow_page_caches {
 	struct kvm_mmu_memory_cache *page_header_cache;
 	struct kvm_mmu_memory_cache *shadow_page_cache;
-	struct kvm_mmu_memory_cache *gfn_array_cache;
+	struct kvm_mmu_memory_cache *shadowed_info_cache;
 };
 
 static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
@@ -2111,7 +2145,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
 	sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
 	if (!role.direct)
-		sp->gfns = kvm_mmu_memory_cache_alloc(caches->gfn_array_cache);
+		sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
 
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
@@ -2163,7 +2197,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
 	struct shadow_page_caches caches = {
 		.page_header_cache = &vcpu->arch.mmu_page_header_cache,
 		.shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
-		.gfn_array_cache = &vcpu->arch.mmu_gfn_array_cache,
+		.shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
 	};
 
 	return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
@@ -2812,7 +2846,10 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 
 	if (!was_rmapped) {
 		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
-		rmap_add(vcpu, slot, sptep, gfn);
+		rmap_add(vcpu, slot, sptep, gfn, pte_access);
+	} else {
+		/* Already rmapped but the pte_access bits may have changed. */
+		kvm_mmu_page_set_access(sp, sptep - sp->spt, pte_access);
 	}
 
 	return ret;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index bd2a26897b97..0395950045d1 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -53,8 +53,21 @@ struct kvm_mmu_page {
 	gfn_t gfn;
 
 	u64 *spt;
-	/* hold the gfn of each spte inside spt */
-	gfn_t *gfns;
+
+	/*
+	 * Stores the result of the guest translation being shadowed by each
+	 * SPTE.  KVM shadows two types of guest translations: nGPA -> GPA
+	 * (shadow EPT/NPT) and GVA -> GPA (traditional shadow paging). In both
+	 * cases the result of the translation is a GPA and a set of access
+	 * constraints.
+	 *
+	 * The GFN is stored in the upper bits (i.e. gfn << PAGE_SHIFT) and the
+	 * shadowed access permissions are stored in the lower bits. Note, for
+	 * convenience and uniformity across guests, the access permissions are
+	 * stored in KVM format (e.g. ACC_EXEC_MASK) not the raw guest format.
+	 */
+	u64 *shadowed_translation;
+
 	/* Currently serving as active root */
 	union {
 		int root_count;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index fd73c857af90..37ceb6e452e6 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -979,7 +979,8 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 }
 
 /*
- * Using the cached information from sp->gfns is safe because:
+ * Using the information in sp->shadowed_translation (kvm_mmu_page_get_gfn()) is
+ * safe because:
  * - The spte has a reference to the struct page, so the pfn for a given gfn
  *   can't change unless all sptes pointing to it are nuked first.
  *
@@ -1054,12 +1055,15 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		if (sync_mmio_spte(vcpu, &sp->spt[i], gfn, pte_access))
 			continue;
 
-		if (gfn != sp->gfns[i]) {
+		if (gfn != kvm_mmu_page_get_gfn(sp, i)) {
 			drop_spte(vcpu->kvm, &sp->spt[i]);
 			flush = true;
 			continue;
 		}
 
+		/* Update the shadowed access bits in case they changed. */
+		kvm_mmu_page_set_access(sp, i, pte_access);
+
 		sptep = &sp->spt[i];
 		spte = *sptep;
 		host_writable = spte & shadow_host_writable_mask;
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 18/22] KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

Currently make_huge_page_split_spte() assumes execute permissions can be
granted to any 4K SPTE when splitting huge pages. This is true for the
TDP MMU but is not necessarily true for the shadow MMU, since KVM may be
shadowing a non-executable huge page.

To fix this, pass in the role of the child shadow page where the huge
page will be split and derive the execution permission from that.  This
is correct because huge pages are always split with direct shadow pages
and thus the shadow page role contains the correct access permissions.
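
As a rough sketch (not part of this patch), a shadow MMU caller is expected
to build the child role and pass it through the new signature along these
lines, mirroring how the nested MMU split path later in this series uses it:

	role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);

	for (i = 0; i < PT64_ENT_PER_PAGE; i++)
		sp->spt[i] = make_huge_page_split_spte(huge_spte, role, i);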

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/spte.c    | 16 ++++++++--------
 arch/x86/kvm/mmu/spte.h    |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.c |  2 +-
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index b5960bbde7f7..237e8dc12993 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -244,10 +244,10 @@ static u64 make_spte_executable(u64 spte)
  * This is used during huge page splitting to build the SPTEs that make up the
  * new page table.
  */
-u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
+u64 make_huge_page_split_spte(u64 huge_spte, union kvm_mmu_page_role role,
+			      int index)
 {
 	u64 child_spte;
-	int child_level;
 
 	if (WARN_ON_ONCE(!is_shadow_present_pte(huge_spte)))
 		return 0;
@@ -256,23 +256,23 @@ u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
 		return 0;
 
 	child_spte = huge_spte;
-	child_level = huge_level - 1;
 
 	/*
 	 * The child_spte already has the base address of the huge page being
 	 * split. So we just have to OR in the offset to the page at the next
 	 * lower level for the given index.
 	 */
-	child_spte |= (index * KVM_PAGES_PER_HPAGE(child_level)) << PAGE_SHIFT;
+	child_spte |= (index * KVM_PAGES_PER_HPAGE(role.level)) << PAGE_SHIFT;
 
-	if (child_level == PG_LEVEL_4K) {
+	if (role.level == PG_LEVEL_4K) {
 		child_spte &= ~PT_PAGE_SIZE_MASK;
 
 		/*
-		 * When splitting to a 4K page, mark the page executable as the
-		 * NX hugepage mitigation no longer applies.
+		 * When splitting to a 4K page where execution is allowed, mark
+		 * the page executable as the NX hugepage mitigation no longer
+		 * applies.
 		 */
-		if (is_nx_huge_page_enabled())
+		if ((role.access & ACC_EXEC_MASK) && is_nx_huge_page_enabled())
 			child_spte = make_spte_executable(child_spte);
 	}
 
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 0127bb6e3c7d..3dada44cc066 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -425,7 +425,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
 	       bool host_writable, u64 *new_spte);
-u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index);
+u64 make_huge_page_split_spte(u64 huge_spte, union kvm_mmu_page_role role, int index);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
 u64 mark_spte_for_access_track(u64 spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 841feaa48be5..a5472ee56080 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1488,7 +1488,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	 * not been linked in yet and thus is not reachable from any other CPU.
 	 */
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++)
-		sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i);
+		sp->spt[i] = make_huge_page_split_spte(huge_spte, sp->role, i);
 
 	/*
 	 * Replace the huge spte with a pointer to the populated lower level
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 19/22] KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible levels
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU. This
is fine for now since KVM never creates intermediate huge pages during
dirty logging. In other words, KVM always replaces 1GiB pages directly
with 4KiB pages, so there is no reason to look for collapsible 2MiB
pages.

However, this will stop being true once the shadow MMU participates in
eager page splitting. During eager page splitting, each 1GiB page is first
split into 2MiB pages and then those are split into 4KiB pages. The
intermediate 2MiB pages may be left behind if an error condition causes
eager page splitting to bail early.

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f83de72feeac..a5d96d452f42 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6177,18 +6177,25 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 	return need_tlb_flush;
 }
 
+static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
+					   const struct kvm_memory_slot *slot)
+{
+	/*
+	 * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
+	 * pages that are already mapped at the maximum possible level.
+	 */
+	if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
+			      PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1,
+			      true))
+		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+}
+
 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *slot)
 {
 	if (kvm_memslots_have_rmaps(kvm)) {
 		write_lock(&kvm->mmu_lock);
-		/*
-		 * Zap only 4k SPTEs since the legacy MMU only supports dirty
-		 * logging at a 4k granularity and never creates collapsible
-		 * 2m SPTEs during dirty logging.
-		 */
-		if (slot_handle_level_4k(kvm, slot, kvm_mmu_zap_collapsible_spte, true))
-			kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+		kvm_rmap_zap_collapsible_sptes(kvm, slot);
 		write_unlock(&kvm->mmu_lock);
 	}
 
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 20/22] KVM: x86/mmu: Refactor drop_large_spte()
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

drop_large_spte() drops a large SPTE if it exists and then flushes TLBs.
Its helper function, __drop_large_spte(), does the drop without the
flush.

In preparation for eager page splitting, which will need to sometimes
flush when dropping large SPTEs (and sometimes not), push the flushing
logic down into __drop_large_spte() and add a bool parameter to control
it.
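
For reference, the eager page splitting path added at the end of this series
is expected to use the helper roughly as follows (sketch only; the flush flag
is computed from whether replacing the huge SPTE drops any existing
lower-level mappings):

	/* Flush only if replacing the huge SPTE unmaps guest memory. */
	__drop_large_spte(kvm, huge_sptep, flush);
	__link_shadow_page(cache, huge_sptep, sp);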

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a5d96d452f42..964a8fa63e1b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1161,26 +1161,26 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
 		rmap_remove(kvm, sptep);
 }
 
-
-static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
+static void __drop_large_spte(struct kvm *kvm, u64 *sptep, bool flush)
 {
-	if (is_large_pte(*sptep)) {
-		WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K);
-		drop_spte(kvm, sptep);
-		return true;
-	}
+	struct kvm_mmu_page *sp;
 
-	return false;
+	if (!is_large_pte(*sptep))
+		return;
+
+	sp = sptep_to_sp(sptep);
+	WARN_ON(sp->role.level == PG_LEVEL_4K);
+
+	drop_spte(kvm, sptep);
+
+	if (flush)
+		kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
+			KVM_PAGES_PER_HPAGE(sp->role.level));
 }
 
 static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
 {
-	if (__drop_large_spte(vcpu->kvm, sptep)) {
-		struct kvm_mmu_page *sp = sptep_to_sp(sptep);
-
-		kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
-			KVM_PAGES_PER_HPAGE(sp->role.level));
-	}
+	return __drop_large_spte(vcpu->kvm, sptep, true);
 }
 
 /*
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 21/22] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
declaration time rather than being fixed for all declarations. This will
be used in a follow-up commit to declare a cache in x86 with a capacity
of 512+ objects without having to increase the capacity of all caches in
KVM.

This change requires that each cache now specify its capacity at runtime,
since the cache struct itself no longer has a fixed capacity known at
compile time. To protect against someone accidentally defining a
kvm_mmu_memory_cache struct directly (without the extra storage), this
commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().

In order to support different capacities, this commit changes the
objects pointer array to be dynamically allocated the first time the
cache is topped-up.
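
As a usage sketch, a cache with a non-default capacity keeps the same
declaration as before and is simply sized on its first topup via the new
double-underscore helper (made usable outside kvm_main.c by the final patch
in this series):

	struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };

	/* Lazily allocates cache.objects[] with room for 'capacity' entries. */
	r = __kvm_mmu_topup_memory_cache(&cache, capacity, min);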

While here, opportunistically clean up the stack-allocated
kvm_mmu_memory_cache structs in riscv and arm64 to use designated
initializers.

No functional change intended.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/arm64/kvm/mmu.c      |  2 +-
 arch/riscv/kvm/mmu.c      |  5 +----
 include/linux/kvm_types.h |  6 +++++-
 virt/kvm/kvm_main.c       | 33 ++++++++++++++++++++++++++++++---
 4 files changed, 37 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 53ae2c0640bc..f443ed845f85 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -764,7 +764,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 {
 	phys_addr_t addr;
 	int ret = 0;
-	struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
+	struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
 	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
 				     KVM_PGTABLE_PROT_R |
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index f80a34fbf102..4d95ebe4114f 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -347,10 +347,7 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
 	int ret = 0;
 	unsigned long pfn;
 	phys_addr_t addr, end;
-	struct kvm_mmu_memory_cache pcache;
-
-	memset(&pcache, 0, sizeof(pcache));
-	pcache.gfp_zero = __GFP_ZERO;
+	struct kvm_mmu_memory_cache pcache = { .gfp_zero = __GFP_ZERO };
 
 	end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
 	pfn = __phys_to_pfn(hpa);
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index ac1ebb37a0ff..68529884eaf8 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -83,12 +83,16 @@ struct gfn_to_pfn_cache {
  * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
  * holding MMU locks.  Note, these caches act more like prefetch buffers than
  * classical caches, i.e. objects are not returned to the cache on being freed.
+ *
+ * The @capacity field and @objects array are lazily initialized when the cache
+ * is topped up (__kvm_mmu_topup_memory_cache()).
  */
 struct kvm_mmu_memory_cache {
 	int nobjs;
 	gfp_t gfp_zero;
 	struct kmem_cache *kmem_cache;
-	void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
+	int capacity;
+	void **objects;
 };
 #endif
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e089db822c12..5e2e75014256 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -369,14 +369,31 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
 		return (void *)__get_free_page(gfp_flags);
 }
 
-int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
+static int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
 {
+	gfp_t gfp = GFP_KERNEL_ACCOUNT;
 	void *obj;
 
 	if (mc->nobjs >= min)
 		return 0;
-	while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
-		obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
+
+	if (unlikely(!mc->objects)) {
+		if (WARN_ON_ONCE(!capacity))
+			return -EIO;
+
+		mc->objects = kvmalloc_array(capacity, sizeof(void *), gfp);
+		if (!mc->objects)
+			return -ENOMEM;
+
+		mc->capacity = capacity;
+	}
+
+	/* It is illegal to request a different capacity across topups. */
+	if (WARN_ON_ONCE(mc->capacity != capacity))
+		return -EIO;
+
+	while (mc->nobjs < mc->capacity) {
+		obj = mmu_memory_cache_alloc_obj(mc, gfp);
 		if (!obj)
 			return mc->nobjs >= min ? 0 : -ENOMEM;
 		mc->objects[mc->nobjs++] = obj;
@@ -384,6 +401,11 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
 	return 0;
 }
 
+int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
+{
+	return __kvm_mmu_topup_memory_cache(mc, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE, min);
+}
+
 int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
 {
 	return mc->nobjs;
@@ -397,6 +419,11 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
 		else
 			free_page((unsigned long)mc->objects[--mc->nobjs]);
 	}
+
+	kvfree(mc->objects);
+
+	mc->objects = NULL;
+	mc->capacity = 0;
 }
 
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 22/22] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
  2022-05-16 23:21 ` David Matlack
@ 2022-05-16 23:21   ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan, David Matlack

Add support for Eager Page Splitting of pages that are mapped by nested
MMUs. Walk through the rmap, first splitting all 1GiB pages to 2MiB
pages and then splitting all 2MiB pages to 4KiB pages.
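
Concretely, the new slot-wide walk drives the rmap-based split helper one
huge page level at a time, from the largest level down to the target level
(excerpted from the hunk below):

	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
		slot_handle_level_range(kvm, slot, nested_mmu_try_split_huge_pages,
					level, level, start, end - 1, true, false);
	}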

Note, Eager Page Splitting is limited to nested MMUs as a policy rather
than due to any technical reason (the sp->role.guest_mode check could
just be deleted and Eager Page Splitting would work correctly for all
shadow MMU pages). There is really no reason to support Eager Page
Splitting for tdp_mmu=N, since such support will eventually be phased
out, and there is no current use case for Eager Page Splitting on
hosts where TDP is either disabled or unavailable in hardware.
Furthermore, future improvements to nested MMU scalability may diverge
the code from the legacy shadow paging implementation. These
improvements will be simpler to make if Eager Page Splitting does not
have to worry about legacy shadow paging.

Splitting huge pages mapped by nested MMUs requires dealing with some
extra complexity beyond that of the TDP MMU:

(1) The shadow MMU has a limit on the number of shadow pages that are
    allowed to be allocated. So, as a policy, Eager Page Splitting
    refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
    pages available.

(2) Splitting a huge page may end up re-using an existing lower level
    shadow page table. This is unlike the TDP MMU, which always allocates
    new shadow page tables when splitting.

(3) When installing the lower level SPTEs, they must be added to the
    rmap which may require allocating additional pte_list_desc structs.

Case (2) is especially interesting since it may require a TLB flush,
unlike the TDP MMU which can fully split huge pages without any TLB
flushes. Specifically, an existing lower level page table may point to
even lower level page tables that are not fully populated, effectively
unmapping a portion of the huge page, which requires a flush.
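
In the split loop this becomes a per-SPTE check (excerpted from the hunk
below): any already-present entry that is not a last-level SPTE forces a
flush before the huge SPTE is replaced.

	if (is_shadow_present_pte(*sptep)) {
		flush |= !is_last_spte(*sptep, sp->role.level);
		continue;
	}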

This commit performs such flushes after dropping the huge page and
before installing the lower level page table. This TLB flush could
instead be delayed until the MMU lock is about to be dropped, which
would batch flushes for multiple splits.  However these flushes should
be rare in practice (a huge page must be aliased in multiple SPTEs and
have been split for NX Huge Pages in only some of them). Flushing
immediately is simpler to plumb and also reduces the chances of tripping
over a CPU bug (e.g. see iTLB multihit).

Suggested-by: Peter Feiner <pfeiner@google.com>
[ This commit is based off of the original implementation of Eager Page
  Splitting from Peter in Google's kernel from 2016. ]
Signed-off-by: David Matlack <dmatlack@google.com>
---
 .../admin-guide/kernel-parameters.txt         |   3 +-
 arch/x86/include/asm/kvm_host.h               |  24 ++
 arch/x86/kvm/mmu/mmu.c                        | 267 +++++++++++++++++-
 arch/x86/kvm/x86.c                            |   6 +
 include/linux/kvm_host.h                      |   1 +
 virt/kvm/kvm_main.c                           |   2 +-
 6 files changed, 293 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3f1cc5e317ed..bc3ad3d4df0b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2387,8 +2387,7 @@
 			the KVM_CLEAR_DIRTY ioctl, and only for the pages being
 			cleared.
 
-			Eager page splitting currently only supports splitting
-			huge pages mapped by the TDP MMU.
+			Eager page splitting is only supported when kvm.tdp_mmu=Y.
 
 			Default is Y (on).
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9193a700fe2d..ea99e61cc556 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1265,6 +1265,28 @@ struct kvm_arch {
 	 * the global KVM_MAX_VCPU_IDS may lead to significant memory waste.
 	 */
 	u32 max_vcpu_ids;
+
+	/*
+	 * Memory caches used to allocate shadow pages when performing eager
+	 * page splitting. No need for a shadowed_info_cache since eager page
+	 * splitting only allocates direct shadow pages.
+	 *
+	 * Protected by kvm->slots_lock.
+	 */
+	struct kvm_mmu_memory_cache split_shadow_page_cache;
+	struct kvm_mmu_memory_cache split_page_header_cache;
+
+	/*
+	 * Memory cache used to allocate pte_list_desc structs while splitting
+	 * huge pages. In the worst case, to split one huge page, 512
+	 * pte_list_desc structs are needed to add each lower level leaf sptep
+	 * to the rmap plus 1 to extend the parent_ptes rmap of the lower level
+	 * page table.
+	 *
+	 * Protected by kvm->slots_lock.
+	 */
+#define SPLIT_DESC_CACHE_CAPACITY 513
+	struct kvm_mmu_memory_cache split_desc_cache;
 };
 
 struct kvm_vm_stat {
@@ -1639,6 +1661,8 @@ void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
 
+void free_split_caches(struct kvm *kvm);
+
 int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
 
 int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 964a8fa63e1b..7c5eab61c4ea 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5966,6 +5966,15 @@ int kvm_mmu_init_vm(struct kvm *kvm)
 	node->track_write = kvm_mmu_pte_write;
 	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
 	kvm_page_track_register_notifier(kvm, node);
+
+	kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
+	kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
+
+	kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
+
+	kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
+	kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
+
 	return 0;
 }
 
@@ -6097,15 +6106,252 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
+void free_split_caches(struct kvm *kvm)
+{
+	lockdep_assert_held(&kvm->slots_lock);
+
+	kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
+	kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
+	kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
+}
+
+static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
+{
+	return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
+}
+
+static bool need_topup_split_caches_or_resched(struct kvm *kvm)
+{
+	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
+		return true;
+
+	/*
+	 * In the worst case, SPLIT_DESC_CACHE_CAPACITY descriptors are needed
+	 * to split a single huge page. Calculating how many are actually needed
+	 * is possible but not worth the complexity.
+	 */
+	return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_CAPACITY) ||
+	       need_topup(&kvm->arch.split_page_header_cache, 1) ||
+	       need_topup(&kvm->arch.split_shadow_page_cache, 1);
+}
+
+static int topup_split_caches(struct kvm *kvm)
+{
+	int r;
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	r = __kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache,
+					 SPLIT_DESC_CACHE_CAPACITY,
+					 SPLIT_DESC_CACHE_CAPACITY);
+	if (r)
+		return r;
+
+	r = kvm_mmu_topup_memory_cache(&kvm->arch.split_page_header_cache, 1);
+	if (r)
+		return r;
+
+	return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
+}
+
+static struct kvm_mmu_page *nested_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
+{
+	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+	struct shadow_page_caches caches = {};
+	union kvm_mmu_page_role role;
+	unsigned int access;
+	gfn_t gfn;
+
+	gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
+	access = kvm_mmu_page_get_access(huge_sp, huge_sptep - huge_sp->spt);
+
+	/*
+	 * Note, huge page splitting always uses direct shadow pages, regardless
+	 * of whether the huge page itself is mapped by a direct or indirect
+	 * shadow page, since the huge page region itself is being directly
+	 * mapped with smaller pages.
+	 */
+	role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);
+
+	/* Direct SPs do not require a shadowed_info_cache. */
+	caches.page_header_cache = &kvm->arch.split_page_header_cache;
+	caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
+
+	/* Safe to pass NULL for vCPU since requesting a direct SP. */
+	return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
+}
+
+static void nested_mmu_split_huge_page(struct kvm *kvm,
+				       const struct kvm_memory_slot *slot,
+				       u64 *huge_sptep)
+
+{
+	struct kvm_mmu_memory_cache *cache = &kvm->arch.split_desc_cache;
+	u64 huge_spte = READ_ONCE(*huge_sptep);
+	struct kvm_mmu_page *sp;
+	bool flush = false;
+	u64 *sptep, spte;
+	gfn_t gfn;
+	int index;
+
+	sp = nested_mmu_get_sp_for_split(kvm, huge_sptep);
+
+	for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
+		sptep = &sp->spt[index];
+		gfn = kvm_mmu_page_get_gfn(sp, index);
+
+		/*
+		 * The SP may already have populated SPTEs, e.g. if this huge
+		 * page is aliased by multiple sptes with the same access
+		 * permissions. These entries are guaranteed to map the same
+		 * gfn-to-pfn translation since the SP is direct, so no need to
+		 * modify them.
+		 *
+		 * However, if a given SPTE points to a lower level page table,
+		 * that lower level page table may only be partially populated.
+		 * Installing such SPTEs would effectively unmap a portion of the
+		 * huge page. Unmapping guest memory always requires a TLB flush
+		 * since a subsequent operation on the unmapped regions would
+		 * fail to detect the need to flush.
+		 */
+		if (is_shadow_present_pte(*sptep)) {
+			flush |= !is_last_spte(*sptep, sp->role.level);
+			continue;
+		}
+
+		spte = make_huge_page_split_spte(huge_spte, sp->role, index);
+		mmu_spte_set(sptep, spte);
+		__rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access);
+	}
+
+	/*
+	 * Replace the huge spte with a pointer to the populated lower level
+	 * page table. If the lower-level page table identically maps the huge
+	 * page (i.e. no memory is unmapped), there's no need for a TLB flush.
+	 * Otherwise, flush TLBs after dropping the huge page and before
+	 * installing the shadow page table.
+	 */
+	__drop_large_spte(kvm, huge_sptep, flush);
+	__link_shadow_page(cache, huge_sptep, sp);
+}
+
+static int nested_mmu_try_split_huge_page(struct kvm *kvm,
+					  const struct kvm_memory_slot *slot,
+					  u64 *huge_sptep)
+{
+	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+	int level, r = 0;
+	gfn_t gfn;
+	u64 spte;
+
+	/* Grab information for the tracepoint before dropping the MMU lock. */
+	gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
+	level = huge_sp->role.level;
+	spte = *huge_sptep;
+
+	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) {
+		r = -ENOSPC;
+		goto out;
+	}
+
+	if (need_topup_split_caches_or_resched(kvm)) {
+		write_unlock(&kvm->mmu_lock);
+		cond_resched();
+		/*
+		 * If the topup succeeds, return -EAGAIN to indicate that the
+		 * rmap iterator should be restarted because the MMU lock was
+		 * dropped.
+		 */
+		r = topup_split_caches(kvm) ?: -EAGAIN;
+		write_lock(&kvm->mmu_lock);
+		goto out;
+	}
+
+	nested_mmu_split_huge_page(kvm, slot, huge_sptep);
+
+out:
+	trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
+	return r;
+}
+
+static bool nested_mmu_try_split_huge_pages(struct kvm *kvm,
+					    struct kvm_rmap_head *rmap_head,
+					    const struct kvm_memory_slot *slot)
+{
+	struct rmap_iterator iter;
+	struct kvm_mmu_page *sp;
+	u64 *huge_sptep;
+	int r;
+
+restart:
+	for_each_rmap_spte(rmap_head, &iter, huge_sptep) {
+		sp = sptep_to_sp(huge_sptep);
+
+		/* TDP MMU is enabled, so rmap only contains nested MMU SPs. */
+		if (WARN_ON_ONCE(!sp->role.guest_mode))
+			continue;
+
+		/* The rmaps should never contain non-leaf SPTEs. */
+		if (WARN_ON_ONCE(!is_large_pte(*huge_sptep)))
+			continue;
+
+		/* SPs with level >PG_LEVEL_4K should never be unsync. */
+		if (WARN_ON_ONCE(sp->unsync))
+			continue;
+
+		/* Don't bother splitting huge pages on invalid SPs. */
+		if (sp->role.invalid)
+			continue;
+
+		r = nested_mmu_try_split_huge_page(kvm, slot, huge_sptep);
+
+		/*
+		 * The split succeeded or needs to be retried because the MMU
+		 * lock was dropped. Either way, restart the iterator to get it
+		 * back into a consistent state.
+		 */
+		if (!r || r == -EAGAIN)
+			goto restart;
+
+		/* The split failed and shouldn't be retried (e.g. -ENOMEM). */
+		break;
+	}
+
+	return false;
+}
+
+static void kvm_nested_mmu_try_split_huge_pages(struct kvm *kvm,
+						const struct kvm_memory_slot *slot,
+						gfn_t start, gfn_t end,
+						int target_level)
+{
+	int level;
+
+	/*
+	 * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
+	 * down to the target level. This ensures pages are recursively split
+	 * all the way to the target level. There's no need to split pages
+	 * already at the target level.
+	 */
+	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
+		slot_handle_level_range(kvm, slot, nested_mmu_try_split_huge_pages,
+					level, level, start, end - 1, true, false);
+	}
+}
+
 /* Must be called with the mmu_lock held in write-mode. */
 void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot,
 				   u64 start, u64 end,
 				   int target_level)
 {
-	if (is_tdp_mmu_enabled(kvm))
-		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end,
-						 target_level, false);
+	if (!is_tdp_mmu_enabled(kvm))
+		return;
+
+	if (kvm_memslots_have_rmaps(kvm))
+		kvm_nested_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
+
+	kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, false);
 
 	/*
 	 * A TLB flush is unnecessary at this point for the same reasons as in
@@ -6120,12 +6366,19 @@ void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
 	u64 start = memslot->base_gfn;
 	u64 end = start + memslot->npages;
 
-	if (is_tdp_mmu_enabled(kvm)) {
-		read_lock(&kvm->mmu_lock);
-		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
-		read_unlock(&kvm->mmu_lock);
+	if (!is_tdp_mmu_enabled(kvm))
+		return;
+
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
+		kvm_nested_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
+		write_unlock(&kvm->mmu_lock);
 	}
 
+	read_lock(&kvm->mmu_lock);
+	kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
+	read_unlock(&kvm->mmu_lock);
+
 	/*
 	 * No TLB flush is necessary here. KVM will flush TLBs after
 	 * write-protecting and/or clearing dirty on the newly split SPTEs to
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 04812eaaf61b..4fe018ddd1cd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12197,6 +12197,12 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 		 * page faults will create the large-page sptes.
 		 */
 		kvm_mmu_zap_collapsible_sptes(kvm, new);
+
+		/*
+		 * Free any memory left behind by eager page splitting. Ignore
+		 * the module parameter since userspace might have changed it.
+		 */
+		free_split_caches(kvm);
 	} else {
 		/*
 		 * Initially-all-set does not require write protecting any page,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f94f72bbd2d3..17fc9247504d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1336,6 +1336,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
 
 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
 int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
+int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
 int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
 void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5e2e75014256..b9573e958a03 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -369,7 +369,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
 		return (void *)__get_free_page(gfp_flags);
 }
 
-static int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
+int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
 {
 	gfp_t gfp = GFP_KERNEL_ACCOUNT;
 	void *obj;
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 22/22] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
@ 2022-05-16 23:21   ` David Matlack
  0 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-16 23:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen, Lai Jiangshan,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	David Matlack, Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

Add support for Eager Page Splitting pages that are mapped by nested
MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB
pages, and then splitting all 2MiB pages to 4KiB pages.

Note, Eager Page Splitting is limited to nested MMUs as a policy rather
than due to any technical reason (the sp->role.guest_mode check could
just be deleted and Eager Page Splitting would work correctly for all
shadow MMU pages). There is really no reason to support Eager Page
Splitting for tdp_mmu=N, since such support will eventually be phased
out, and there is no current use case supporting Eager Page Splitting on
hosts where TDP is either disabled or unavailable in hardware.
Furthermore, future improvements to nested MMU scalability may diverge
the code from the legacy shadow paging implementation. These
improvements will be simpler to make if Eager Page Splitting does not
have to worry about legacy shadow paging.

Splitting huge pages mapped by nested MMUs requires dealing with some
extra complexity beyond that of the TDP MMU:

(1) The shadow MMU has a limit on the number of shadow pages that are
    allowed to be allocated. So, as a policy, Eager Page Splitting
    refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
    pages available.

(2) Splitting a huge page may end up re-using an existing lower level
    shadow page table. This is unlike the TDP MMU which always allocates
    new shadow page tables when splitting.

(3) When installing the lower level SPTEs, they must be added to the
    rmap which may require allocating additional pte_list_desc structs.

Case (2) is especially interesting since it may require a TLB flush,
unlike the TDP MMU which can fully split huge pages without any TLB
flushes. Specifically, an existing lower level page table may point to
even lower level page tables that are not fully populated, effectively
unmapping a portion of the huge page, which requires a flush.

This commit performs such flushes after dropping the huge page and
before installing the lower level page table. This TLB flush could
instead be delayed until the MMU lock is about to be dropped, which
would batch flushes for multiple splits. However, these flushes should
be rare in practice (a huge page must be aliased in multiple SPTEs and
have been split for NX Huge Pages in only some of them). Flushing
immediately is simpler to plumb and also reduces the chances of tripping
over a CPU bug (e.g. see iTLB multihit).

Suggested-by: Peter Feiner <pfeiner@google.com>
[ This commit is based off of the original implementation of Eager Page
  Splitting from Peter in Google's kernel from 2016. ]
Signed-off-by: David Matlack <dmatlack@google.com>
---
 .../admin-guide/kernel-parameters.txt         |   3 +-
 arch/x86/include/asm/kvm_host.h               |  24 ++
 arch/x86/kvm/mmu/mmu.c                        | 267 +++++++++++++++++-
 arch/x86/kvm/x86.c                            |   6 +
 include/linux/kvm_host.h                      |   1 +
 virt/kvm/kvm_main.c                           |   2 +-
 6 files changed, 293 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3f1cc5e317ed..bc3ad3d4df0b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2387,8 +2387,7 @@
 			the KVM_CLEAR_DIRTY ioctl, and only for the pages being
 			cleared.
 
-			Eager page splitting currently only supports splitting
-			huge pages mapped by the TDP MMU.
+			Eager page splitting is only supported when kvm.tdp_mmu=Y.
 
 			Default is Y (on).
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9193a700fe2d..ea99e61cc556 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1265,6 +1265,28 @@ struct kvm_arch {
 	 * the global KVM_MAX_VCPU_IDS may lead to significant memory waste.
 	 */
 	u32 max_vcpu_ids;
+
+	/*
+	 * Memory caches used to allocate shadow pages when performing eager
+	 * page splitting. No need for a shadowed_info_cache since eager page
+	 * splitting only allocates direct shadow pages.
+	 *
+	 * Protected by kvm->slots_lock.
+	 */
+	struct kvm_mmu_memory_cache split_shadow_page_cache;
+	struct kvm_mmu_memory_cache split_page_header_cache;
+
+	/*
+	 * Memory cache used to allocate pte_list_desc structs while splitting
+	 * huge pages. In the worst case, to split one huge page, 512
+	 * pte_list_desc structs are needed to add each lower level leaf sptep
+	 * to the rmap plus 1 to extend the parent_ptes rmap of the lower level
+	 * page table.
+	 *
+	 * Protected by kvm->slots_lock.
+	 */
+#define SPLIT_DESC_CACHE_CAPACITY 513
+	struct kvm_mmu_memory_cache split_desc_cache;
 };
 
 struct kvm_vm_stat {
@@ -1639,6 +1661,8 @@ void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
 
+void free_split_caches(struct kvm *kvm);
+
 int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
 
 int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 964a8fa63e1b..7c5eab61c4ea 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5966,6 +5966,15 @@ int kvm_mmu_init_vm(struct kvm *kvm)
 	node->track_write = kvm_mmu_pte_write;
 	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
 	kvm_page_track_register_notifier(kvm, node);
+
+	kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
+	kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
+
+	kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
+
+	kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
+	kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
+
 	return 0;
 }
 
@@ -6097,15 +6106,252 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
+void free_split_caches(struct kvm *kvm)
+{
+	lockdep_assert_held(&kvm->slots_lock);
+
+	kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
+	kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
+	kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
+}
+
+static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
+{
+	return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
+}
+
+static bool need_topup_split_caches_or_resched(struct kvm *kvm)
+{
+	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
+		return true;
+
+	/*
+	 * In the worst case, SPLIT_DESC_CACHE_CAPACITY descriptors are needed
+	 * to split a single huge page. Calculating how many are actually needed
+	 * is possible but not worth the complexity.
+	 */
+	return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_CAPACITY) ||
+	       need_topup(&kvm->arch.split_page_header_cache, 1) ||
+	       need_topup(&kvm->arch.split_shadow_page_cache, 1);
+}
+
+static int topup_split_caches(struct kvm *kvm)
+{
+	int r;
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	r = __kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache,
+					 SPLIT_DESC_CACHE_CAPACITY,
+					 SPLIT_DESC_CACHE_CAPACITY);
+	if (r)
+		return r;
+
+	r = kvm_mmu_topup_memory_cache(&kvm->arch.split_page_header_cache, 1);
+	if (r)
+		return r;
+
+	return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
+}
+
+static struct kvm_mmu_page *nested_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
+{
+	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+	struct shadow_page_caches caches = {};
+	union kvm_mmu_page_role role;
+	unsigned int access;
+	gfn_t gfn;
+
+	gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
+	access = kvm_mmu_page_get_access(huge_sp, huge_sptep - huge_sp->spt);
+
+	/*
+	 * Note, huge page splitting always uses direct shadow pages, regardless
+	 * of whether the huge page itself is mapped by a direct or indirect
+	 * shadow page, since the huge page region itself is being directly
+	 * mapped with smaller pages.
+	 */
+	role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);
+
+	/* Direct SPs do not require a shadowed_info_cache. */
+	caches.page_header_cache = &kvm->arch.split_page_header_cache;
+	caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
+
+	/* Safe to pass NULL for vCPU since requesting a direct SP. */
+	return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
+}
+
+static void nested_mmu_split_huge_page(struct kvm *kvm,
+				       const struct kvm_memory_slot *slot,
+				       u64 *huge_sptep)
+
+{
+	struct kvm_mmu_memory_cache *cache = &kvm->arch.split_desc_cache;
+	u64 huge_spte = READ_ONCE(*huge_sptep);
+	struct kvm_mmu_page *sp;
+	bool flush = false;
+	u64 *sptep, spte;
+	gfn_t gfn;
+	int index;
+
+	sp = nested_mmu_get_sp_for_split(kvm, huge_sptep);
+
+	for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
+		sptep = &sp->spt[index];
+		gfn = kvm_mmu_page_get_gfn(sp, index);
+
+		/*
+		 * The SP may already have populated SPTEs, e.g. if this huge
+		 * page is aliased by multiple sptes with the same access
+		 * permissions. These entries are guaranteed to map the same
+		 * gfn-to-pfn translation since the SP is direct, so no need to
+		 * modify them.
+		 *
+		 * However, if a given SPTE points to a lower level page table,
+		 * that lower level page table may only be partially populated.
+		 * Installing such SPTEs would effectively unmap a portion of the
+		 * huge page. Unmapping guest memory always requires a TLB flush
+		 * since a subsequent operation on the unmapped regions would
+		 * fail to detect the need to flush.
+		 */
+		if (is_shadow_present_pte(*sptep)) {
+			flush |= !is_last_spte(*sptep, sp->role.level);
+			continue;
+		}
+
+		spte = make_huge_page_split_spte(huge_spte, sp->role, index);
+		mmu_spte_set(sptep, spte);
+		__rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access);
+	}
+
+	/*
+	 * Replace the huge spte with a pointer to the populated lower level
+	 * page table. If the lower-level page table identically maps the huge
+	 * page (i.e. no memory is unmapped), there's no need for a TLB flush.
+	 * Otherwise, flush TLBs after dropping the huge page and before
+	 * installing the shadow page table.
+	 */
+	__drop_large_spte(kvm, huge_sptep, flush);
+	__link_shadow_page(cache, huge_sptep, sp);
+}
+
+static int nested_mmu_try_split_huge_page(struct kvm *kvm,
+					  const struct kvm_memory_slot *slot,
+					  u64 *huge_sptep)
+{
+	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+	int level, r = 0;
+	gfn_t gfn;
+	u64 spte;
+
+	/* Grab information for the tracepoint before dropping the MMU lock. */
+	gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
+	level = huge_sp->role.level;
+	spte = *huge_sptep;
+
+	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) {
+		r = -ENOSPC;
+		goto out;
+	}
+
+	if (need_topup_split_caches_or_resched(kvm)) {
+		write_unlock(&kvm->mmu_lock);
+		cond_resched();
+		/*
+		 * If the topup succeeds, return -EAGAIN to indicate that the
+		 * rmap iterator should be restarted because the MMU lock was
+		 * dropped.
+		 */
+		r = topup_split_caches(kvm) ?: -EAGAIN;
+		write_lock(&kvm->mmu_lock);
+		goto out;
+	}
+
+	nested_mmu_split_huge_page(kvm, slot, huge_sptep);
+
+out:
+	trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
+	return r;
+}
+
+static bool nested_mmu_try_split_huge_pages(struct kvm *kvm,
+					    struct kvm_rmap_head *rmap_head,
+					    const struct kvm_memory_slot *slot)
+{
+	struct rmap_iterator iter;
+	struct kvm_mmu_page *sp;
+	u64 *huge_sptep;
+	int r;
+
+restart:
+	for_each_rmap_spte(rmap_head, &iter, huge_sptep) {
+		sp = sptep_to_sp(huge_sptep);
+
+		/* TDP MMU is enabled, so rmap only contains nested MMU SPs. */
+		if (WARN_ON_ONCE(!sp->role.guest_mode))
+			continue;
+
+		/* The rmaps should never contain non-leaf SPTEs. */
+		if (WARN_ON_ONCE(!is_large_pte(*huge_sptep)))
+			continue;
+
+		/* SPs with level >PG_LEVEL_4K should never be unsync. */
+		if (WARN_ON_ONCE(sp->unsync))
+			continue;
+
+		/* Don't bother splitting huge pages on invalid SPs. */
+		if (sp->role.invalid)
+			continue;
+
+		r = nested_mmu_try_split_huge_page(kvm, slot, huge_sptep);
+
+		/*
+		 * The split succeeded or needs to be retried because the MMU
+		 * lock was dropped. Either way, restart the iterator to get it
+		 * back into a consistent state.
+		 */
+		if (!r || r == -EAGAIN)
+			goto restart;
+
+		/* The split failed and shouldn't be retried (e.g. -ENOMEM). */
+		break;
+	}
+
+	return false;
+}
+
+static void kvm_nested_mmu_try_split_huge_pages(struct kvm *kvm,
+						const struct kvm_memory_slot *slot,
+						gfn_t start, gfn_t end,
+						int target_level)
+{
+	int level;
+
+	/*
+	 * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
+	 * down to the target level. This ensures pages are recursively split
+	 * all the way to the target level. There's no need to split pages
+	 * already at the target level.
+	 */
+	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
+		slot_handle_level_range(kvm, slot, nested_mmu_try_split_huge_pages,
+					level, level, start, end - 1, true, false);
+	}
+}
+
 /* Must be called with the mmu_lock held in write-mode. */
 void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot,
 				   u64 start, u64 end,
 				   int target_level)
 {
-	if (is_tdp_mmu_enabled(kvm))
-		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end,
-						 target_level, false);
+	if (!is_tdp_mmu_enabled(kvm))
+		return;
+
+	if (kvm_memslots_have_rmaps(kvm))
+		kvm_nested_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
+
+	kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, false);
 
 	/*
 	 * A TLB flush is unnecessary at this point for the same reasons as in
@@ -6120,12 +6366,19 @@ void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
 	u64 start = memslot->base_gfn;
 	u64 end = start + memslot->npages;
 
-	if (is_tdp_mmu_enabled(kvm)) {
-		read_lock(&kvm->mmu_lock);
-		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
-		read_unlock(&kvm->mmu_lock);
+	if (!is_tdp_mmu_enabled(kvm))
+		return;
+
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
+		kvm_nested_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
+		write_unlock(&kvm->mmu_lock);
 	}
 
+	read_lock(&kvm->mmu_lock);
+	kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
+	read_unlock(&kvm->mmu_lock);
+
 	/*
 	 * No TLB flush is necessary here. KVM will flush TLBs after
 	 * write-protecting and/or clearing dirty on the newly split SPTEs to
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 04812eaaf61b..4fe018ddd1cd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12197,6 +12197,12 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 		 * page faults will create the large-page sptes.
 		 */
 		kvm_mmu_zap_collapsible_sptes(kvm, new);
+
+		/*
+		 * Free any memory left behind by eager page splitting. Ignore
+		 * the module parameter since userspace might have changed it.
+		 */
+		free_split_caches(kvm);
 	} else {
 		/*
 		 * Initially-all-set does not require write protecting any page,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f94f72bbd2d3..17fc9247504d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1336,6 +1336,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
 
 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
 int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
+int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
 int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
 void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5e2e75014256..b9573e958a03 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -369,7 +369,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
 		return (void *)__get_free_page(gfp_flags);
 }
 
-static int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
+int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
 {
 	gfp_t gfp = GFP_KERNEL_ACCOUNT;
 	void *obj;
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 21/22] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-05-16 23:21   ` David Matlack
@ 2022-05-19 15:33     ` Anup Patel
  -1 siblings, 0 replies; 111+ messages in thread
From: Anup Patel @ 2022-05-19 15:33 UTC (permalink / raw)
  To: David Matlack
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen, Lai Jiangshan,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Tue, May 17, 2022 at 4:52 AM David Matlack <dmatlack@google.com> wrote:
>
> Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
> declaration time rather than being fixed for all declarations. This will
> be used in a follow-up commit to declare a cache in x86 with a capacity
> of 512+ objects without having to increase the capacity of all caches in
> KVM.
>
> This change requires each cache now specify its capacity at runtime,
> since the cache struct itself no longer has a fixed capacity known at
> compile time. To protect against someone accidentally defining a
> kvm_mmu_memory_cache struct directly (without the extra storage), this
> commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().
>
> In order to support different capacities, this commit changes the
> objects pointer array to be dynamically allocated the first time the
> cache is topped-up.
>
> While here, opportunistically clean up the stack-allocated
> kvm_mmu_memory_cache structs in riscv and arm64 to use designated
> initializers.
>
> No functional change intended.
>
> Reviewed-by: Marc Zyngier <maz@kernel.org>
> Signed-off-by: David Matlack <dmatlack@google.com>

Looks good to me for KVM RISC-V.

Reviewed-by: Anup Patel <anup@brainfault.org>

A small heads-up that function stage2_ioremap() is going to be
renamed for Linux-5.19 so you might have to rebase one more time.

Thanks,
Anup

> ---
>  arch/arm64/kvm/mmu.c      |  2 +-
>  arch/riscv/kvm/mmu.c      |  5 +----
>  include/linux/kvm_types.h |  6 +++++-
>  virt/kvm/kvm_main.c       | 33 ++++++++++++++++++++++++++++++---
>  4 files changed, 37 insertions(+), 9 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 53ae2c0640bc..f443ed845f85 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -764,7 +764,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>  {
>         phys_addr_t addr;
>         int ret = 0;
> -       struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> +       struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
>         struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
>         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
>                                      KVM_PGTABLE_PROT_R |
> diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> index f80a34fbf102..4d95ebe4114f 100644
> --- a/arch/riscv/kvm/mmu.c
> +++ b/arch/riscv/kvm/mmu.c
> @@ -347,10 +347,7 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
>         int ret = 0;
>         unsigned long pfn;
>         phys_addr_t addr, end;
> -       struct kvm_mmu_memory_cache pcache;
> -
> -       memset(&pcache, 0, sizeof(pcache));
> -       pcache.gfp_zero = __GFP_ZERO;
> +       struct kvm_mmu_memory_cache pcache = { .gfp_zero = __GFP_ZERO };
>
>         end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
>         pfn = __phys_to_pfn(hpa);
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index ac1ebb37a0ff..68529884eaf8 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -83,12 +83,16 @@ struct gfn_to_pfn_cache {
>   * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
>   * holding MMU locks.  Note, these caches act more like prefetch buffers than
>   * classical caches, i.e. objects are not returned to the cache on being freed.
> + *
> + * The @capacity field and @objects array are lazily initialized when the cache
> + * is topped up (__kvm_mmu_topup_memory_cache()).
>   */
>  struct kvm_mmu_memory_cache {
>         int nobjs;
>         gfp_t gfp_zero;
>         struct kmem_cache *kmem_cache;
> -       void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
> +       int capacity;
> +       void **objects;
>  };
>  #endif
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e089db822c12..5e2e75014256 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -369,14 +369,31 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
>                 return (void *)__get_free_page(gfp_flags);
>  }
>
> -int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> +static int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
>  {
> +       gfp_t gfp = GFP_KERNEL_ACCOUNT;
>         void *obj;
>
>         if (mc->nobjs >= min)
>                 return 0;
> -       while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
> -               obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
> +
> +       if (unlikely(!mc->objects)) {
> +               if (WARN_ON_ONCE(!capacity))
> +                       return -EIO;
> +
> +               mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
> +               if (!mc->objects)
> +                       return -ENOMEM;
> +
> +               mc->capacity = capacity;
> +       }
> +
> +       /* It is illegal to request a different capacity across topups. */
> +       if (WARN_ON_ONCE(mc->capacity != capacity))
> +               return -EIO;
> +
> +       while (mc->nobjs < mc->capacity) {
> +               obj = mmu_memory_cache_alloc_obj(mc, gfp);
>                 if (!obj)
>                         return mc->nobjs >= min ? 0 : -ENOMEM;
>                 mc->objects[mc->nobjs++] = obj;
> @@ -384,6 +401,11 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
>         return 0;
>  }
>
> +int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> +{
> +       return __kvm_mmu_topup_memory_cache(mc, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE, min);
> +}
> +
>  int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
>  {
>         return mc->nobjs;
> @@ -397,6 +419,11 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
>                 else
>                         free_page((unsigned long)mc->objects[--mc->nobjs]);
>         }
> +
> +       kvfree(mc->objects);
> +
> +       mc->objects = NULL;
> +       mc->capacity = 0;
>  }
>
>  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
> --
> 2.36.0.550.gb090851708-goog
>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 21/22] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-05-16 23:21   ` David Matlack
  (?)
  (?)
@ 2022-05-20 23:21   ` Mingwei Zhang
  2022-05-23 17:37       ` Sean Christopherson
  -1 siblings, 1 reply; 111+ messages in thread
From: Mingwei Zhang @ 2022-05-20 23:21 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, Peter Xu,
	maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 16, 2022 at 4:24 PM David Matlack <dmatlack@google.com> wrote:
>
> Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
> declaration time rather than being fixed for all declarations. This will
> be used in a follow-up commit to declare a cache in x86 with a capacity
> of 512+ objects without having to increase the capacity of all caches in
> KVM.
>
> This change requires each cache now specify its capacity at runtime,
> since the cache struct itself no longer has a fixed capacity known at
> compile time. To protect against someone accidentally defining a
> kvm_mmu_memory_cache struct directly (without the extra storage), this
> commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().
>
> In order to support different capacities, this commit changes the
> objects pointer array to be dynamically allocated the first time the
> cache is topped-up.
>
> While here, opportunistically clean up the stack-allocated
> kvm_mmu_memory_cache structs in riscv and arm64 to use designated
> initializers.
>
> No functional change intended.
>
> Reviewed-by: Marc Zyngier <maz@kernel.org>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/arm64/kvm/mmu.c      |  2 +-
>  arch/riscv/kvm/mmu.c      |  5 +----
>  include/linux/kvm_types.h |  6 +++++-
>  virt/kvm/kvm_main.c       | 33 ++++++++++++++++++++++++++++++---
>  4 files changed, 37 insertions(+), 9 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 53ae2c0640bc..f443ed845f85 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -764,7 +764,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>  {
>         phys_addr_t addr;
>         int ret = 0;
> -       struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> +       struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
>         struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
>         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
>                                      KVM_PGTABLE_PROT_R |
> diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> index f80a34fbf102..4d95ebe4114f 100644
> --- a/arch/riscv/kvm/mmu.c
> +++ b/arch/riscv/kvm/mmu.c
> @@ -347,10 +347,7 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
>         int ret = 0;
>         unsigned long pfn;
>         phys_addr_t addr, end;
> -       struct kvm_mmu_memory_cache pcache;
> -
> -       memset(&pcache, 0, sizeof(pcache));
> -       pcache.gfp_zero = __GFP_ZERO;
> +       struct kvm_mmu_memory_cache pcache = { .gfp_zero = __GFP_ZERO };
>
>         end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
>         pfn = __phys_to_pfn(hpa);
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index ac1ebb37a0ff..68529884eaf8 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -83,12 +83,16 @@ struct gfn_to_pfn_cache {
>   * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
>   * holding MMU locks.  Note, these caches act more like prefetch buffers than
>   * classical caches, i.e. objects are not returned to the cache on being freed.
> + *
> + * The @capacity field and @objects array are lazily initialized when the cache
> + * is topped up (__kvm_mmu_topup_memory_cache()).
>   */
>  struct kvm_mmu_memory_cache {
>         int nobjs;
>         gfp_t gfp_zero;
>         struct kmem_cache *kmem_cache;
> -       void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
> +       int capacity;
> +       void **objects;
>  };
>  #endif
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e089db822c12..5e2e75014256 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -369,14 +369,31 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
>                 return (void *)__get_free_page(gfp_flags);
>  }
>
> -int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> +static int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
>  {
> +       gfp_t gfp = GFP_KERNEL_ACCOUNT;
>         void *obj;
>
>         if (mc->nobjs >= min)
>                 return 0;
> -       while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
> -               obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
> +
> +       if (unlikely(!mc->objects)) {
> +               if (WARN_ON_ONCE(!capacity))
> +                       return -EIO;
> +
> +               mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
> +               if (!mc->objects)
> +                       return -ENOMEM;
> +
> +               mc->capacity = capacity;

Do we want to ensure the minimum value of the capacity? I think
otherwise, we may more likely start using memory from GFP_ATOMIC if
the capacity is less than, say 5? But the minimum value seems related
to each cache type.

> +       }
> +
> +       /* It is illegal to request a different capacity across topups. */
> +       if (WARN_ON_ONCE(mc->capacity != capacity))
> +               return -EIO;
> +
> +       while (mc->nobjs < mc->capacity) {
> +               obj = mmu_memory_cache_alloc_obj(mc, gfp);
>                 if (!obj)
>                         return mc->nobjs >= min ? 0 : -ENOMEM;
>                 mc->objects[mc->nobjs++] = obj;
> @@ -384,6 +401,11 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
>         return 0;
>  }
>
> +int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> +{
> +       return __kvm_mmu_topup_memory_cache(mc, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE, min);
> +}
> +
>  int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
>  {
>         return mc->nobjs;
> @@ -397,6 +419,11 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
>                 else
>                         free_page((unsigned long)mc->objects[--mc->nobjs]);
>         }
> +
> +       kvfree(mc->objects);
> +
> +       mc->objects = NULL;
> +       mc->capacity = 0;
>  }
>
>  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
> --
> 2.36.0.550.gb090851708-goog
>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 21/22] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-05-20 23:21   ` Mingwei Zhang
@ 2022-05-23 17:37       ` Sean Christopherson
  0 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-05-23 17:37 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, Lai Jiangshan,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Ben Gardon, Aleksandar Markovic, Palmer Dabbelt, Paul Walmsley,
	Marc Zyngier, David Matlack, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Fri, May 20, 2022, Mingwei Zhang wrote:
> On Mon, May 16, 2022 at 4:24 PM David Matlack <dmatlack@google.com> wrote:
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index e089db822c12..5e2e75014256 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -369,14 +369,31 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> >                 return (void *)__get_free_page(gfp_flags);
> >  }
> >
> > -int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> > +static int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> >  {
> > +       gfp_t gfp = GFP_KERNEL_ACCOUNT;
> >         void *obj;
> >
> >         if (mc->nobjs >= min)
> >                 return 0;
> > -       while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
> > -               obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
> > +
> > +       if (unlikely(!mc->objects)) {
> > +               if (WARN_ON_ONCE(!capacity))
> > +                       return -EIO;
> > +
> > +               mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
> > +               if (!mc->objects)
> > +                       return -ENOMEM;
> > +
> > +               mc->capacity = capacity;
> 
> Do we want to ensure the minimum value of the capacity? I think
> otherwise, we may more likely start using memory from GFP_ATOMIC if
> the capacity is less than, say 5? But the minimum value seems related
> to each cache type.

Eh, if we specify a minimum, just make the arch default the minimum.  That way we
avoid adding even more magic/arbitrary numbers.  E.g. for whatever reason, MIPS's
default is '4'.
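
One way to read that suggestion, sketched here purely as an illustration
(this is not code from the series), is to clamp the caller-supplied
capacity to the existing arch default instead of adding a new constant:

	/*
	 * Hypothetical sketch: reuse the per-arch default as the floor for
	 * caller-supplied capacities rather than a new magic number.
	 */
	capacity = max(capacity, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE);

With that, SPLIT_DESC_CACHE_CAPACITY (513) would be unaffected, while a
tiny ad-hoc capacity would simply be raised to the arch default.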

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 21/22] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-05-23 17:37       ` Sean Christopherson
@ 2022-05-23 17:44         ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-23 17:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Mingwei Zhang, Paolo Bonzini, Marc Zyngier, Huacai Chen,
	Aleksandar Markovic, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Andrew Jones, Ben Gardon, Peter Xu,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 23, 2022 at 10:37 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, May 20, 2022, Mingwei Zhang wrote:
> > On Mon, May 16, 2022 at 4:24 PM David Matlack <dmatlack@google.com> wrote:
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index e089db822c12..5e2e75014256 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -369,14 +369,31 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> > >                 return (void *)__get_free_page(gfp_flags);
> > >  }
> > >
> > > -int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> > > +static int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> > >  {
> > > +       gfp_t gfp = GFP_KERNEL_ACCOUNT;
> > >         void *obj;
> > >
> > >         if (mc->nobjs >= min)
> > >                 return 0;
> > > -       while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
> > > -               obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
> > > +
> > > +       if (unlikely(!mc->objects)) {
> > > +               if (WARN_ON_ONCE(!capacity))
> > > +                       return -EIO;
> > > +
> > > +               mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
> > > +               if (!mc->objects)
> > > +                       return -ENOMEM;
> > > +
> > > +               mc->capacity = capacity;
> >
> > Do we want to ensure the minimum value of the capacity? I think
> > otherwise, we may more likely start using memory from GFP_ATOMIC if
> > the capacity is less than, say 5? But the minimum value seems related
> > to each cache type.
>
> Eh, if we specify a minimum, just make the arch default the minimum.  That way we
> avoid adding even more magic/arbitrary numbers.  E.g. for whatever reason, MIPS's
> default is '4'.

I'm not exactly sure what you had in mind Mingwei. But there is a bug
in this code if min > capacity. This function will happily return 0
after filling up the cache, even though it did not allocate min
objects. The same bug existed before this patch if min >
ARRAY_SIZE(mc->objects). I can include a separate patch to fix this
bug (e.g. WARN and return -ENOMEM if min > capacity).
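
A minimal sketch of that guard, assuming it would sit near the top of
__kvm_mmu_topup_memory_cache() (illustrative only, not part of the
posted series):

	/*
	 * Hypothetical guard: a top-up that can never reach @min is a
	 * caller bug, so warn and fail instead of silently returning 0.
	 */
	if (WARN_ON_ONCE(min > capacity))
		return -ENOMEM;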

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 21/22] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-05-23 17:44         ` David Matlack
@ 2022-05-23 18:13           ` Mingwei Zhang
  -1 siblings, 0 replies; 111+ messages in thread
From: Mingwei Zhang @ 2022-05-23 18:13 UTC (permalink / raw)
  To: David Matlack
  Cc: Sean Christopherson, Paolo Bonzini, Marc Zyngier, Huacai Chen,
	Aleksandar Markovic, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Andrew Jones, Ben Gardon, Peter Xu,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 23, 2022 at 10:44 AM David Matlack <dmatlack@google.com> wrote:
>
> On Mon, May 23, 2022 at 10:37 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, May 20, 2022, Mingwei Zhang wrote:
> > > On Mon, May 16, 2022 at 4:24 PM David Matlack <dmatlack@google.com> wrote:
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index e089db822c12..5e2e75014256 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -369,14 +369,31 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> > > >                 return (void *)__get_free_page(gfp_flags);
> > > >  }
> > > >
> > > > -int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> > > > +static int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> > > >  {
> > > > +       gfp_t gfp = GFP_KERNEL_ACCOUNT;
> > > >         void *obj;
> > > >
> > > >         if (mc->nobjs >= min)
> > > >                 return 0;
> > > > -       while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
> > > > -               obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
> > > > +
> > > > +       if (unlikely(!mc->objects)) {
> > > > +               if (WARN_ON_ONCE(!capacity))
> > > > +                       return -EIO;
> > > > +
> > > > +               mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
> > > > +               if (!mc->objects)
> > > > +                       return -ENOMEM;
> > > > +
> > > > +               mc->capacity = capacity;
> > >
> > > Do we want to ensure the minimum value of the capacity? I think
> > > otherwise, we may more likely start using memory from GFP_ATOMIC if
> > > the capacity is less than, say 5? But the minimum value seems related
> > > to each cache type.
> >
> > Eh, if we specify a minimum, just make the arch default the minimum.  That way we
> > avoid adding even more magic/arbitrary numbers.  E.g. for whatever reason, MIPS's
> > default is '4'.
>
> I'm not exactly sure what you had in mind Mingwei. But there is a bug
> in this code if min > capacity. This function will happily return 0
> after filling up the cache, even though it did not allocate min
> objects. The same bug existed before this patch if min >
> ARRAY_SIZE(mc->objects). I can include a separate patch to fix this
> bug (e.g. WARN and return -ENOMEM if min > capacity).

oh, what I am saying is this one:
https://elixir.bootlin.com/linux/latest/source/virt/kvm/kvm_main.c#L417

If we are running out of kmem cache, then we start to use
__GFP_ATOMIC, which should be avoided as much as we can? Since this
patch parameterized the 'capacity', then to avoid the future usage
where caller provides a too small value, maybe we could add a warning
if the 'capacity' is too small, say, smaller than 40 (the default
value)?

The case of  'capacity' < min would be a more serious issue, that
situation probably should never be allowed.
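
For reference, the fallback being pointed at here is the empty-cache path
in kvm_mmu_memory_cache_alloc(), which at the time looked roughly like
this (see the link above for the exact code):

	void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
	{
		void *p;

		/* Cache is empty: fall back to an atomic allocation. */
		if (WARN_ON(!mc->nobjs))
			p = mmu_memory_cache_alloc_obj(mc, GFP_ATOMIC | __GFP_ACCOUNT);
		else
			p = mc->objects[--mc->nobjs];
		BUG_ON(!p);
		return p;
	}

A very small capacity makes it more likely that the cache drains and
allocations hit that GFP_ATOMIC path while the MMU lock is held.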

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 21/22] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-05-23 18:13           ` Mingwei Zhang
@ 2022-05-23 18:22             ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-23 18:22 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Marc Zyngier, Huacai Chen,
	Aleksandar Markovic, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Andrew Jones, Ben Gardon, Peter Xu,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 23, 2022 at 11:13 AM Mingwei Zhang <mizhang@google.com> wrote:
>
> On Mon, May 23, 2022 at 10:44 AM David Matlack <dmatlack@google.com> wrote:
> >
> > On Mon, May 23, 2022 at 10:37 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Fri, May 20, 2022, Mingwei Zhang wrote:
> > > > On Mon, May 16, 2022 at 4:24 PM David Matlack <dmatlack@google.com> wrote:
> > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > > index e089db822c12..5e2e75014256 100644
> > > > > --- a/virt/kvm/kvm_main.c
> > > > > +++ b/virt/kvm/kvm_main.c
> > > > > @@ -369,14 +369,31 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> > > > >                 return (void *)__get_free_page(gfp_flags);
> > > > >  }
> > > > >
> > > > > -int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> > > > > +static int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> > > > >  {
> > > > > +       gfp_t gfp = GFP_KERNEL_ACCOUNT;
> > > > >         void *obj;
> > > > >
> > > > >         if (mc->nobjs >= min)
> > > > >                 return 0;
> > > > > -       while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
> > > > > -               obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
> > > > > +
> > > > > +       if (unlikely(!mc->objects)) {
> > > > > +               if (WARN_ON_ONCE(!capacity))
> > > > > +                       return -EIO;
> > > > > +
> > > > > +               mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
> > > > > +               if (!mc->objects)
> > > > > +                       return -ENOMEM;
> > > > > +
> > > > > +               mc->capacity = capacity;
> > > >
> > > > Do we want to ensure the minimum value of the capacity? I think
> > > > otherwise, we may more likely start using memory from GFP_ATOMIC if
> > > > the capacity is less than, say 5? But the minimum value seems related
> > > > to each cache type.
> > >
> > > Eh, if we specify a minimum, just make the arch default the minimum.  That way we
> > > avoid adding even more magic/arbitrary numbers.  E.g. for whatever reason, MIPS's
> > > default is '4'.
> >
> > I'm not exactly sure what you had in mind Mingwei. But there is a bug
> > in this code if min > capacity. This function will happily return 0
> > after filling up the cache, even though it did not allocate min
> > objects. The same bug existed before this patch if min >
> > ARRAY_SIZE(mc->objects). I can include a separate patch to fix this
> > bug (e.g. WARN and return -ENOMEM if min > capacity).
>
> oh, what I am saying is this one:
> https://elixir.bootlin.com/linux/latest/source/virt/kvm/kvm_main.c#L417
>
> If we are running out of kmem cache, then we start to use
> __GFP_ATOMIC, which should be avoided as much as we can? Since this
> patch parameterized the 'capacity', then to avoid the future usage
> where caller provides a too small value, maybe we could add a warning
> if the 'capacity' is too small, say, smaller than 40 (the default
> value)?

I'm not too worried about that. Callers of
kvm_mmu_topup_memory_cache() are responsible for passing in a min
value. It doesn't matter if capacity is a number lower than 40, as
long as kvm_mmu_topup_memory_cache() is able to allocate min objects,
the call is a success (and the GFP_ATOMIC fallback should never
trigger, and if it does, we'll get a WARN splat).

The only actual loophole I can spot is if capacity is less than min.
In that case topup will return 0 despite allocating less than min
objects. Again we'll still hit the GFP_ATOMIC and get a WARN splat,
but we can detect the problem in kvm_mmu_topup_memory_cache() which
will include the buggy callsite in the backtrace.
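
Roughly, the check being suggested is just the following (a sketch, not part
of the posted patch; the exact placement and error code would be up to that
separate fix):

	/*
	 * Near the top of the top-up path: a caller whose 'min' exceeds the
	 * requested 'capacity' can never be satisfied from the cache.
	 */
	if (WARN_ON_ONCE(min > capacity))
		return -ENOMEM;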

>
> The case of  'capacity' < min would be a more serious issue, that
> situation probably should never be allowed.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 21/22] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-05-23 18:22             ` David Matlack
@ 2022-05-23 23:53               ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-05-23 23:53 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Marc Zyngier, Huacai Chen,
	Aleksandar Markovic, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Andrew Jones, Ben Gardon, Peter Xu,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 23, 2022 at 11:22 AM David Matlack <dmatlack@google.com> wrote:
>
> On Mon, May 23, 2022 at 11:13 AM Mingwei Zhang <mizhang@google.com> wrote:
> >
> > On Mon, May 23, 2022 at 10:44 AM David Matlack <dmatlack@google.com> wrote:
> > >
> > > On Mon, May 23, 2022 at 10:37 AM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Fri, May 20, 2022, Mingwei Zhang wrote:
> > > > > On Mon, May 16, 2022 at 4:24 PM David Matlack <dmatlack@google.com> wrote:
> > > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > > > index e089db822c12..5e2e75014256 100644
> > > > > > --- a/virt/kvm/kvm_main.c
> > > > > > +++ b/virt/kvm/kvm_main.c
> > > > > > @@ -369,14 +369,31 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> > > > > >                 return (void *)__get_free_page(gfp_flags);
> > > > > >  }
> > > > > >
> > > > > > -int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> > > > > > +static int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> > > > > >  {
> > > > > > +       gfp_t gfp = GFP_KERNEL_ACCOUNT;
> > > > > >         void *obj;
> > > > > >
> > > > > >         if (mc->nobjs >= min)
> > > > > >                 return 0;
> > > > > > -       while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
> > > > > > -               obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
> > > > > > +
> > > > > > +       if (unlikely(!mc->objects)) {
> > > > > > +               if (WARN_ON_ONCE(!capacity))
> > > > > > +                       return -EIO;
> > > > > > +
> > > > > > +               mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
> > > > > > +               if (!mc->objects)
> > > > > > +                       return -ENOMEM;
> > > > > > +
> > > > > > +               mc->capacity = capacity;
> > > > >
> > > > > Do we want to ensure the minimum value of the capacity? I think
> > > > > otherwise, we may more likely start using memory from GFP_ATOMIC if
> > > > > the capacity is less than, say 5? But the minimum value seems related
> > > > > to each cache type.
> > > >
> > > > Eh, if we specify a minimum, just make the arch default the minimum.  That way we
> > > > avoid adding even more magic/arbitrary numbers.  E.g. for whatever reason, MIPS's
> > > > default is '4'.
> > >
> > > I'm not exactly sure what you had in mind Mingwei. But there is a bug
> > > in this code if min > capacity. This function will happily return 0
> > > after filling up the cache, even though it did not allocate min
> > > objects. The same bug existed before this patch if min >
> > > ARRAY_SIZE(mc->objects). I can include a separate patch to fix this
> > > bug (e.g. WARN and return -ENOMEM if min > capacity).
> >
> > oh, what I am saying is this one:
> > https://elixir.bootlin.com/linux/latest/source/virt/kvm/kvm_main.c#L417
> >
> > If we are running out of kmem cache, then we start to use
> > __GFP_ATOMIC, which should be avoided as much as we can? Since this
> > patch parameterized the 'capacity', then to avoid the future usage
> > where caller provides a too small value, maybe we could add a warning
> > if the 'capacity' is too small, say, smaller than 40 (the default
> > value)?
>
> I'm not too worried about that. Callers of
> kvm_mmu_topup_memory_cache() are responsible for passing in a min
> value. It doesn't matter if capacity is a number lower than 40, as
> long as kvm_mmu_topup_memory_cache() is able to allocate min objects,
> the call is a success (and the GFP_ATOMIC fallback should never
> trigger, and if it does, we'll get a WARN splat).

Ah and I forgot to add: In this situation, the bug is that *min* is
too small, not capacity. So adding a restriction on capacity would not
help.
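
To make that concrete, a contrived example (the caller, object counts, and
variable names below are invented purely for illustration):

	void *a, *b, *c;
	int r;

	/*
	 * Suppose the cache already holds 2 objects from an earlier top-up,
	 * but this path actually consumes 3 objects under the MMU lock.
	 */
	r = kvm_mmu_topup_memory_cache(&cache, /*min=*/2);  /* nobjs >= min, returns 0 */
	if (r)
		return r;

	write_lock(&kvm->mmu_lock);
	a = kvm_mmu_memory_cache_alloc(&cache);  /* fine */
	b = kvm_mmu_memory_cache_alloc(&cache);  /* fine, cache now empty */
	c = kvm_mmu_memory_cache_alloc(&cache);  /* WARN + GFP_ATOMIC fallback: 'min' was too small */
	write_unlock(&kvm->mmu_lock);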

>
> The only actual loophole I can spot is if capacity is less than min.
> In that case topup will return 0 despite allocating less than min
> objects. Again we'll still hit the GFP_ATOMIC and get a WARN splat,
> but we can detect the problem in kvm_mmu_topup_memory_cache() which
> will include the buggy callsite in the backtrace.
>
> >
> > The case of  'capacity' < min would be a more serious issue, that
> > situation probably should never be allowed.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 22/22] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
  2022-05-16 23:21   ` David Matlack
@ 2022-06-01 21:50     ` Ricardo Koller
  -1 siblings, 0 replies; 111+ messages in thread
From: Ricardo Koller @ 2022-06-01 21:50 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, Peter Xu,
	maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

Hi David,

On Mon, May 16, 2022 at 11:21:38PM +0000, David Matlack wrote:
> Add support for Eager Page Splitting pages that are mapped by nested
> MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB
> pages, and then splitting all 2MiB pages to 4KiB pages.
> 
> Note, Eager Page Splitting is limited to nested MMUs as a policy rather
> than due to any technical reason (the sp->role.guest_mode check could
> just be deleted and Eager Page Splitting would work correctly for all
> shadow MMU pages). There is really no reason to support Eager Page
> Splitting for tdp_mmu=N, since such support will eventually be phased
> out, and there is no current use case supporting Eager Page Splitting on
> hosts where TDP is either disabled or unavailable in hardware.
> Furthermore, future improvements to nested MMU scalability may diverge
> the code from the legacy shadow paging implementation. These
> improvements will be simpler to make if Eager Page Splitting does not
> have to worry about legacy shadow paging.
> 
> Splitting huge pages mapped by nested MMUs requires dealing with some
> extra complexity beyond that of the TDP MMU:
> 
> (1) The shadow MMU has a limit on the number of shadow pages that are
>     allowed to be allocated. So, as a policy, Eager Page Splitting
>     refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
>     pages available.
> 
> (2) Splitting a huge page may end up re-using existing lower level
>     shadow page tables. This is unlike the TDP MMU which always allocates
>     new shadow page tables when splitting.
> 
> (3) When installing the lower level SPTEs, they must be added to the
>     rmap which may require allocating additional pte_list_desc structs.
> 
> Case (2) is especially interesting since it may require a TLB flush,
> unlike the TDP MMU which can fully split huge pages without any TLB
> flushes. Specifically, an existing lower level page table may point to
> even lower level page tables that are not fully populated, effectively
> unmapping a portion of the huge page, which requires a flush.
> 
> This commit performs such flushes after dropping the huge page and
> before installing the lower level page table. This TLB flush could
> instead be delayed until the MMU lock is about to be dropped, which
> would batch flushes for multiple splits.  However these flushes should
> be rare in practice (a huge page must be aliased in multiple SPTEs and
> have been split for NX Huge Pages in only some of them). Flushing
> immediately is simpler to plumb and also reduces the chances of tripping
> over a CPU bug (e.g. see iTLB multihit).
> 
> Suggested-by: Peter Feiner <pfeiner@google.com>
> [ This commit is based off of the original implementation of Eager Page
>   Splitting from Peter in Google's kernel from 2016. ]
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |   3 +-
>  arch/x86/include/asm/kvm_host.h               |  24 ++
>  arch/x86/kvm/mmu/mmu.c                        | 267 +++++++++++++++++-
>  arch/x86/kvm/x86.c                            |   6 +
>  include/linux/kvm_host.h                      |   1 +
>  virt/kvm/kvm_main.c                           |   2 +-
>  6 files changed, 293 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 3f1cc5e317ed..bc3ad3d4df0b 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2387,8 +2387,7 @@
>  			the KVM_CLEAR_DIRTY ioctl, and only for the pages being
>  			cleared.
>  
> -			Eager page splitting currently only supports splitting
> -			huge pages mapped by the TDP MMU.
> +			Eager page splitting is only supported when kvm.tdp_mmu=Y.
>  
>  			Default is Y (on).
>  
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9193a700fe2d..ea99e61cc556 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1265,6 +1265,28 @@ struct kvm_arch {
>  	 * the global KVM_MAX_VCPU_IDS may lead to significant memory waste.
>  	 */
>  	u32 max_vcpu_ids;
> +
> +	/*
> +	 * Memory caches used to allocate shadow pages when performing eager
> +	 * page splitting. No need for a shadowed_info_cache since eager page
> +	 * splitting only allocates direct shadow pages.
> +	 *
> +	 * Protected by kvm->slots_lock.
> +	 */
> +	struct kvm_mmu_memory_cache split_shadow_page_cache;
> +	struct kvm_mmu_memory_cache split_page_header_cache;
> +
> +	/*
> +	 * Memory cache used to allocate pte_list_desc structs while splitting
> +	 * huge pages. In the worst case, to split one huge page, 512
> +	 * pte_list_desc structs are needed to add each lower level leaf sptep
> +	 * to the rmap plus 1 to extend the parent_ptes rmap of the lower level
> +	 * page table.
> +	 *
> +	 * Protected by kvm->slots_lock.
> +	 */
> +#define SPLIT_DESC_CACHE_CAPACITY 513
> +	struct kvm_mmu_memory_cache split_desc_cache;
>  };
>  
>  struct kvm_vm_stat {
> @@ -1639,6 +1661,8 @@ void kvm_mmu_zap_all(struct kvm *kvm);
>  void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
>  void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
>  
> +void free_split_caches(struct kvm *kvm);
> +
>  int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
>  
>  int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 964a8fa63e1b..7c5eab61c4ea 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5966,6 +5966,15 @@ int kvm_mmu_init_vm(struct kvm *kvm)
>  	node->track_write = kvm_mmu_pte_write;
>  	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
>  	kvm_page_track_register_notifier(kvm, node);
> +
> +	kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
> +	kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
> +
> +	kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
> +
> +	kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
> +	kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
> +
>  	return 0;
>  }
>  
> @@ -6097,15 +6106,252 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>  		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
>  }
>  
> +void free_split_caches(struct kvm *kvm)
> +{
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
> +	kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
> +	kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
> +}
> +
> +static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
> +{
> +	return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
> +}
> +
> +static bool need_topup_split_caches_or_resched(struct kvm *kvm)
> +{
> +	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> +		return true;
> +
> +	/*
> +	 * In the worst case, SPLIT_DESC_CACHE_CAPACITY descriptors are needed
> +	 * to split a single huge page. Calculating how many are actually needed
> +	 * is possible but not worth the complexity.
> +	 */
> +	return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_CAPACITY) ||
> +	       need_topup(&kvm->arch.split_page_header_cache, 1) ||
> +	       need_topup(&kvm->arch.split_shadow_page_cache, 1);
> +}
> +
> +static int topup_split_caches(struct kvm *kvm)
> +{
> +	int r;
> +
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	r = __kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache,
> +					 SPLIT_DESC_CACHE_CAPACITY,
> +					 SPLIT_DESC_CACHE_CAPACITY);
> +	if (r)
> +		return r;
> +
> +	r = kvm_mmu_topup_memory_cache(&kvm->arch.split_page_header_cache, 1);
> +	if (r)
> +		return r;
> +
> +	return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
> +}
> +
> +static struct kvm_mmu_page *nested_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
> +{
> +	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
> +	struct shadow_page_caches caches = {};
> +	union kvm_mmu_page_role role;
> +	unsigned int access;
> +	gfn_t gfn;
> +
> +	gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
> +	access = kvm_mmu_page_get_access(huge_sp, huge_sptep - huge_sp->spt);
> +
> +	/*
> +	 * Note, huge page splitting always uses direct shadow pages, regardless
> +	 * of whether the huge page itself is mapped by a direct or indirect
> +	 * shadow page, since the huge page region itself is being directly
> +	 * mapped with smaller pages.
> +	 */
> +	role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);
> +
> +	/* Direct SPs do not require a shadowed_info_cache. */
> +	caches.page_header_cache = &kvm->arch.split_page_header_cache;
> +	caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> +
> +	/* Safe to pass NULL for vCPU since requesting a direct SP. */
> +	return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
> +}
> +
> +static void nested_mmu_split_huge_page(struct kvm *kvm,
> +				       const struct kvm_memory_slot *slot,
> +				       u64 *huge_sptep)
> +
> +{
> +	struct kvm_mmu_memory_cache *cache = &kvm->arch.split_desc_cache;
> +	u64 huge_spte = READ_ONCE(*huge_sptep);
> +	struct kvm_mmu_page *sp;
> +	bool flush = false;
> +	u64 *sptep, spte;
> +	gfn_t gfn;
> +	int index;
> +
> +	sp = nested_mmu_get_sp_for_split(kvm, huge_sptep);
> +
> +	for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
> +		sptep = &sp->spt[index];
> +		gfn = kvm_mmu_page_get_gfn(sp, index);
> +
> +		/*
> +		 * The SP may already have populated SPTEs, e.g. if this huge
> +		 * page is aliased by multiple sptes with the same access
> +		 * permissions. These entries are guaranteed to map the same
> +		 * gfn-to-pfn translation since the SP is direct, so no need to
> +		 * modify them.
> +		 *
> +		 * However, if a given SPTE points to a lower level page table,
> +		 * that lower level page table may only be partially populated.
> +		 * Installing such SPTEs would effectively unmap a portion of the
> +		 * huge page. Unmapping guest memory always requires a TLB flush
> +		 * since a subsequent operation on the unmapped regions would
> +		 * fail to detect the need to flush.
> +		 */
> +		if (is_shadow_present_pte(*sptep)) {
> +			flush |= !is_last_spte(*sptep, sp->role.level);
> +			continue;
> +		}
> +
> +		spte = make_huge_page_split_spte(huge_spte, sp->role, index);
> +		mmu_spte_set(sptep, spte);
> +		__rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access);
> +	}
> +
> +	/*
> +	 * Replace the huge spte with a pointer to the populated lower level
> +	 * page table. If the lower-level page table identically maps the huge
> +	 * page (i.e. no memory is unmapped), there's no need for a TLB flush.
> +	 * Otherwise, flush TLBs after dropping the huge page and before
> +	 * installing the shadow page table.
> +	 */
> +	__drop_large_spte(kvm, huge_sptep, flush);
> +	__link_shadow_page(cache, huge_sptep, sp);
> +}
> +
> +static int nested_mmu_try_split_huge_page(struct kvm *kvm,
> +					  const struct kvm_memory_slot *slot,
> +					  u64 *huge_sptep)
> +{
> +	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
> +	int level, r = 0;
> +	gfn_t gfn;
> +	u64 spte;
> +
> +	/* Grab information for the tracepoint before dropping the MMU lock. */
> +	gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
> +	level = huge_sp->role.level;
> +	spte = *huge_sptep;
> +
> +	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) {
> +		r = -ENOSPC;
> +		goto out;
> +	}
> +
> +	if (need_topup_split_caches_or_resched(kvm)) {
> +		write_unlock(&kvm->mmu_lock);
> +		cond_resched();
> +		/*
> +		 * If the topup succeeds, return -EAGAIN to indicate that the
> +		 * rmap iterator should be restarted because the MMU lock was
> +		 * dropped.
> +		 */
> +		r = topup_split_caches(kvm) ?: -EAGAIN;
> +		write_lock(&kvm->mmu_lock);
> +		goto out;
> +	}
> +
> +	nested_mmu_split_huge_page(kvm, slot, huge_sptep);
> +
> +out:
> +	trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
> +	return r;
> +}
> +
> +static bool nested_mmu_try_split_huge_pages(struct kvm *kvm,
> +					    struct kvm_rmap_head *rmap_head,
> +					    const struct kvm_memory_slot *slot)
> +{
> +	struct rmap_iterator iter;
> +	struct kvm_mmu_page *sp;
> +	u64 *huge_sptep;
> +	int r;
> +
> +restart:
> +	for_each_rmap_spte(rmap_head, &iter, huge_sptep) {
> +		sp = sptep_to_sp(huge_sptep);
> +
> +		/* TDP MMU is enabled, so rmap only contains nested MMU SPs. */
> +		if (WARN_ON_ONCE(!sp->role.guest_mode))
> +			continue;
> +
> +		/* The rmaps should never contain non-leaf SPTEs. */
> +		if (WARN_ON_ONCE(!is_large_pte(*huge_sptep)))
> +			continue;
> +
> +		/* SPs with level >PG_LEVEL_4K should never be unsync. */
> +		if (WARN_ON_ONCE(sp->unsync))
> +			continue;
> +
> +		/* Don't bother splitting huge pages on invalid SPs. */
> +		if (sp->role.invalid)
> +			continue;
> +
> +		r = nested_mmu_try_split_huge_page(kvm, slot, huge_sptep);
> +
> +		/*
> +		 * The split succeeded or needs to be retried because the MMU
> +		 * lock was dropped. Either way, restart the iterator to get it
> +		 * back into a consistent state.
> +		 */
> +		if (!r || r == -EAGAIN)
> +			goto restart;
> +
> +		/* The split failed and shouldn't be retried (e.g. -ENOMEM). */
> +		break;
> +	}
> +
> +	return false;
> +}
> +
> +static void kvm_nested_mmu_try_split_huge_pages(struct kvm *kvm,
> +						const struct kvm_memory_slot *slot,
> +						gfn_t start, gfn_t end,
> +						int target_level)
> +{
> +	int level;
> +
> +	/*
> +	 * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
> +	 * down to the target level. This ensures pages are recursively split
> +	 * all the way to the target level. There's no need to split pages
> +	 * already at the target level.
> +	 */
> +	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
> +		slot_handle_level_range(kvm, slot, nested_mmu_try_split_huge_pages,
> +					level, level, start, end - 1, true, false);
> +	}
> +}
> +
>  /* Must be called with the mmu_lock held in write-mode. */
>  void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
>  				   const struct kvm_memory_slot *memslot,
>  				   u64 start, u64 end,
>  				   int target_level)
>  {
> -	if (is_tdp_mmu_enabled(kvm))
> -		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end,
> -						 target_level, false);
> +	if (!is_tdp_mmu_enabled(kvm))
> +		return;
> +
> +	if (kvm_memslots_have_rmaps(kvm))
> +		kvm_nested_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
> +
> +	kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, false);
>  
>  	/*
>  	 * A TLB flush is unnecessary at this point for the same resons as in
> @@ -6120,12 +6366,19 @@ void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
>  	u64 start = memslot->base_gfn;
>  	u64 end = start + memslot->npages;
>  
> -	if (is_tdp_mmu_enabled(kvm)) {
> -		read_lock(&kvm->mmu_lock);
> -		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
> -		read_unlock(&kvm->mmu_lock);
> +	if (!is_tdp_mmu_enabled(kvm))
> +		return;
> +
> +	if (kvm_memslots_have_rmaps(kvm)) {
> +		write_lock(&kvm->mmu_lock);
> +		kvm_nested_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
> +		write_unlock(&kvm->mmu_lock);
>  	}
>  
> +	read_lock(&kvm->mmu_lock);
> +	kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
> +	read_unlock(&kvm->mmu_lock);
> +
>  	/*
>  	 * No TLB flush is necessary here. KVM will flush TLBs after
>  	 * write-protecting and/or clearing dirty on the newly split SPTEs to
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 04812eaaf61b..4fe018ddd1cd 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12197,6 +12197,12 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
>  		 * page faults will create the large-page sptes.
>  		 */
>  		kvm_mmu_zap_collapsible_sptes(kvm, new);
> +
> +		/*
> +		 * Free any memory left behind by eager page splitting. Ignore
> +		 * the module parameter since userspace might have changed it.
> +		 */
> +		free_split_caches(kvm);
>  	} else {
>  		/*
>  		 * Initially-all-set does not require write protecting any page,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f94f72bbd2d3..17fc9247504d 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1336,6 +1336,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
>  
>  #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
>  int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> +int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);

If you end up with a v7, could you move this to the previous commit,
please. In that case this would include not making
__kvm_mmu_topup_memory_cache a static in the previous one as well.
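
In other words, the end state would simply be introduced one patch earlier,
roughly as follows (sketch only, not a verbatim diff):

	/* include/linux/kvm_host.h -- declared in the patch that adds 'capacity' */
	int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc,
					 int capacity, int min);

	/* virt/kvm/kvm_main.c -- defined without 'static' in that same patch */
	int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc,
					 int capacity, int min)
	{
		...
	}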

Thanks,
Ricardo

>  int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
>  void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
>  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5e2e75014256..b9573e958a03 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -369,7 +369,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
>  		return (void *)__get_free_page(gfp_flags);
>  }
>  
> -static int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> +int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
>  {
>  	gfp_t gfp = GFP_KERNEL_ACCOUNT;
>  	void *obj;
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 03/22] KVM: x86/mmu: Stop passing @direct to mmu_alloc_root()
  2022-05-16 23:21   ` David Matlack
@ 2022-06-16 18:47     ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-16 18:47 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 16, 2022, David Matlack wrote:
> The argument @direct is vcpu->arch.mmu->root_role.direct, so just use
> that.

It's worth calling out that, unlike non-root page tables, it's impossible to have
a direct root in an indirect MMU.  I.e. provide a hint as to why there's a need to
pass @direct in the first place.

> Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 03/22] KVM: x86/mmu: Stop passing @direct to mmu_alloc_root()
@ 2022-06-16 18:47     ` Sean Christopherson
  0 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-16 18:47 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, Lai Jiangshan,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Mon, May 16, 2022, David Matlack wrote:
> The argument @direct is vcpu->arch.mmu->root_role.direct, so just use
> that.

It's worth calling out that, unlike non-root page tables, it's impossible to have
a direct root in an indirect MMU.  I.e. provide a hint as to why there's a need to
pass @direct in the first place.

> Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 04/22] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-05-16 23:21   ` David Matlack
@ 2022-06-17  1:19     ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17  1:19 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 16, 2022, David Matlack wrote:
> Instead of computing the shadow page role from scratch for every new
> page, derive most of the information from the parent shadow page.  This
> eliminates the dependency on the vCPU root role to allocate shadow page
> tables, and reduces the number of parameters to kvm_mmu_get_page().
> 
> Preemptively split out the role calculation to a separate function for
> use in a following commit.
> 
> Note that when calculating the MMU root role, we can take
> @role.passthrough, @role.direct, and @role.access directly from
> @vcpu->arch.mmu->root_role. Only @role.level and @role.quadrant still
> must be overridden for PAE page directories.

Nit, instead of "for PAE page directories", something like "when shadowing 32-bit
guest page tables with PAE page tables".  Not all PAE PDEs need to be overridden.

> No functional change intended.
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c         | 98 +++++++++++++++++++++++-----------
>  arch/x86/kvm/mmu/paging_tmpl.h |  9 ++--
>  2 files changed, 71 insertions(+), 36 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a9d28bcabcbb..515e0b33144a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c

...

> -	if (level <= vcpu->arch.mmu->cpu_role.base.level)
> -		role.passthrough = 0;
> -
>  	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>  	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
>  		if (sp->gfn != gfn) {

...

> +static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
> +{
> +	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
> +	union kvm_mmu_page_role role;
> +
> +	role = parent_sp->role;
> +	role.level--;
> +	role.access = access;
> +	role.direct = direct;
> +	role.passthrough = 0;

I don't love that this subtly relies on passthrough being limited to 5-level nNPT
with 4-level L1 NPT.  That's really just an implementation oddity, e.g. KVM can
and (hopefully) will eventually use passthrough pages for at least level=4 when
shadowing 3-level or 2-level NPT.

The easiest thing would be to add a WARN so that we don't forget to handle this
when this collides with Lai's series, and to document why KVM never sets "passthrough"
for child shadow pages.  The latter is especially confusing because KVM does have
other passthrough pages; they just don't happen to have an associated "struct kvm_mmu_page".

	/*
	 * KVM currently doesn't use "struct kvm_mmu_page" to track passthrough
	 * pages when the guest is using 3-level or 2-level NPT, and instead
	 * uses bare page allocations (see pml4/5_root and pae_root).  The only
	 * scenario where KVM uses a passthrough "struct kvm_mmu_page" is when
	 * shadowing 4-level NPT with 5-level nNPT.  So even though passthrough
	 * child pages do exist, such pages aren't tracked in the list of shadow
	 * pages and so don't need to compute a role.
	 */
	WARN_ON_ONCE(role.passthrough && role.level != PT64_ROOT_4LEVEL);
	role.passthrough = 0;

> +
> +	/*
> +	 * If the guest has 4-byte PTEs then that means it's using 32-bit,
> +	 * 2-level, non-PAE paging. KVM shadows such guests with PAE paging
> +	 * (i.e. 8-byte PTEs). The difference in PTE size means that KVM must
> +	 * shadow each guest page table with multiple shadow page tables, which
> +	 * requires extra bookkeeping in the role.
> +	 *
> +	 * Specifically, to shadow the guest's page directory (which covers a
> +	 * 4GiB address space), KVM uses 4 PAE page directories, each mapping

Nit, it's worth explicitly saying "virtual address space" at least once.

> +	 * 1GiB of the address space. @role.quadrant encodes which quarter of
> +	 * the address space each maps.
> +	 *
> +	 * To shadow the guest's page tables (which each map a 4MiB region), KVM
> +	 * uses 2 PAE page tables, each mapping a 2MiB region. For these,
> +	 * @role.quadrant encodes which half of the region they map.

Oof, so I really like this comment because it simplifies the concept, but it glosses
over one very crucial detail.  The 32-bit GPTE consumes bits 21:12, and the 64-bit PTE
consumes bits 20:12.  So while it's absolutely correct to state that the quadrant
encodes which half, bit 21 is consumed when doing a lookup in the _parent_, which
is the _least_ significant bit when indexing PDEs, hence the quadrant essentially
becomes evens and odds.  Specifically, it does NOT split the parent PD down the middle.

Paolo's more concrete comment about bits helps map things out explicitly.  Paolo is
going to snag the above, so for your looming rebase, how about replacing the paragraph
below with a version of Paolo's concrete example to pair with your abstract definition?

	 *
	 * Concretely, a 4-byte PDE consumes bits 31:22, while an 8-byte PDE
	 * consumes bits 29:21.  To consume bits 31:30, KVM uses 4 shadow
	 * PDPTEs; those 4 PAE page directories are pre-allocated and their
	 * quadrant is assigned in mmu_alloc_root().  To consume bit 21, KVM
	 * uses an additional PDE in every PD; the page table being configured
	 * here is what's pointed at by the PDE.  Thus, bit 21 is the _least_
	 * significant bit of the PDE index pointing at the shadow PT.
	 */

[*] https://lore.kernel.org/all/090e701d-6893-ea25-1237-233ff3dd01ee@redhat.com
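
For concreteness, a minimal userspace sketch (illustrative only, not KVM code;
the constants simply follow the bit ranges above) of how a 32-bit guest virtual
address indexes the guest's 4-byte PDEs versus the PAE shadow structures, with
bit 21 landing in the _low_ bit of the 8-byte PDE index:

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint32_t gva = 0x12345678;

		unsigned int guest_pde  = gva >> 22;           /* 4-byte PDE index: bits 31:22 */
		unsigned int shadow_pd  = gva >> 30;           /* which of the 4 PAE PDs: bits 31:30 */
		unsigned int shadow_pde = (gva >> 21) & 0x1ff; /* 8-byte PDE index: bits 29:21 */
		unsigned int pt_half    = (gva >> 21) & 1;     /* bit 21: low bit of the PDE index */

		printf("guest PDE %u -> PAE PD %u, PDE %u (PT half %u)\n",
		       guest_pde, shadow_pd, shadow_pde, pt_half);
		return 0;
	}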

> +	 *
> +	 * Note, the 4 PAE page directories are pre-allocated and the quadrant
> +	 * assigned in mmu_alloc_root(). So only page tables need to be handled
> +	 * here.
> +	 */
> +	if (role.has_4_byte_gpte) {
> +		WARN_ON_ONCE(role.level != PG_LEVEL_4K);
> +		role.quadrant = (sptep - parent_sp->spt) % 2;

Oh hell no.  LOL.  It took me a _long_ time to realize you're doing pointer arithmetic
on "u64 *".  I actually booted a 32-bit VM with printks and even then it still took
me a good 20 seconds wondering if I was having a brain fart and simply forgot how mod
works.

The calculation is also unnecessarily costly; not that anyone is likely to notice,
but still.  The compiler doesn't know that sptep and parent_sp->spt are intertwined
and so can't optimize, i.e. is forced to do the subtraction.

A more efficient equivalent that doesn't require pointer arithmetic:

	role.quadrant = ((unsigned long)sptep / sizeof(*sptep)) & 1;
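
A minimal userspace sketch (purely illustrative, assuming only that the SPT is
a page-aligned array of 512 u64s, like sp->spt) showing that the pointer-arithmetic
form from the patch and the address-based form above agree for every entry:

	#include <assert.h>
	#include <stdint.h>
	#include <stdlib.h>

	#define PAGE_SIZE 4096

	int main(void)
	{
		/* Stand-in for sp->spt: one page-aligned page of 512 u64 SPTEs. */
		uint64_t *spt = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
		size_t i;

		if (!spt)
			return 1;

		for (i = 0; i < PAGE_SIZE / sizeof(*spt); i++) {
			uint64_t *sptep = &spt[i];

			/* Pointer-arithmetic form from the patch... */
			int quad_sub  = (sptep - spt) % 2;
			/* ...and the address-based form suggested above. */
			int quad_addr = ((uintptr_t)sptep / sizeof(*sptep)) & 1;

			assert(quad_sub == quad_addr);
		}
		free(spt);
		return 0;
	}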

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 04/22] KVM: x86/mmu: Derive shadow MMU page role from parent
@ 2022-06-17  1:19     ` Sean Christopherson
  0 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17  1:19 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, Lai Jiangshan,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Mon, May 16, 2022, David Matlack wrote:
> Instead of computing the shadow page role from scratch for every new
> page, derive most of the information from the parent shadow page.  This
> eliminates the dependency on the vCPU root role to allocate shadow page
> tables, and reduces the number of parameters to kvm_mmu_get_page().
> 
> Preemptively split out the role calculation to a separate function for
> use in a following commit.
> 
> Note that when calculating the MMU root role, we can take
> @role.passthrough, @role.direct, and @role.access directly from
> @vcpu->arch.mmu->root_role. Only @role.level and @role.quadrant still
> must be overridden for PAE page directories.

Nit, instead of "for PAE page directories", something like "when shadowing 32-bit
guest page tables with PAE page tables".  Not all PAE PDEs need to be overridden.

> No functional change intended.
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c         | 98 +++++++++++++++++++++++-----------
>  arch/x86/kvm/mmu/paging_tmpl.h |  9 ++--
>  2 files changed, 71 insertions(+), 36 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a9d28bcabcbb..515e0b33144a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c

...

> -	if (level <= vcpu->arch.mmu->cpu_role.base.level)
> -		role.passthrough = 0;
> -
>  	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>  	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
>  		if (sp->gfn != gfn) {

...

> +static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
> +{
> +	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
> +	union kvm_mmu_page_role role;
> +
> +	role = parent_sp->role;
> +	role.level--;
> +	role.access = access;
> +	role.direct = direct;
> +	role.passthrough = 0;

I don't love that this subtly relies on passthrough being limited to 5-level nNPT
with 4-level L1 NPT.  That's really just an implementation oddity, e.g. KVM can
and (hopefully) will eventually use passthrough pages for at least level=4 when
shadowing 3-level or 2-level NPT.

The easiest thing would be to add a WARN so that we don't forget to handle this
when this collides with Lai's series, and to document why KVM never sets "passthrough"
for child shadow pages.  The latter is especially confusing because KVM does have
other passthrough pages; they just don't happen to have an associated "struct kvm_mmu_page".

	/*
	 * KVM currently doesn't use "struct kvm_mmu_page" to track passthrough
	 * pages when the guest is using 3-level or 2-level NPT, and instead
	 * uses bare page allocations (see pml4/5_root and pae_root).  The only
	 * scenario where KVM uses a passthrough "struct kvm_mmu_page" is when
	 * shadowing 4-level NPT with 5-level nNPT.  So even though passthrough
	 * child pages do exist, such pages aren't tracked in the list of shadow
	 * pages and so don't need to compute a role.
	 */
	WARN_ON_ONCE(role.passthrough && role.level != PT64_ROOT_4LEVEL);
	role.passthrough = 0;

> +
> +	/*
> +	 * If the guest has 4-byte PTEs then that means it's using 32-bit,
> +	 * 2-level, non-PAE paging. KVM shadows such guests with PAE paging
> +	 * (i.e. 8-byte PTEs). The difference in PTE size means that KVM must
> +	 * shadow each guest page table with multiple shadow page tables, which
> +	 * requires extra bookkeeping in the role.
> +	 *
> +	 * Specifically, to shadow the guest's page directory (which covers a
> +	 * 4GiB address space), KVM uses 4 PAE page directories, each mapping

Nit, it's worth explicitly saying "virtual address space" at least once.

> +	 * 1GiB of the address space. @role.quadrant encodes which quarter of
> +	 * the address space each maps.
> +	 *
> +	 * To shadow the guest's page tables (which each map a 4MiB region), KVM
> +	 * uses 2 PAE page tables, each mapping a 2MiB region. For these,
> +	 * @role.quadrant encodes which half of the region they map.

Oof, so I really like this comment because it simplifies the concept, but it glosses
over one very crucial detail.  The 32-bit GPTE consumes bits 21:12, and the 64-bit PTE
consumes bits 20:12.  So while it's absolutely correct to state that the quadrant
encodes which half, bit 21 is consumed when doing a lookup in the _parent_, which
is the _least_ significant bit when indexing PDEs, hence the quadrant essentially
becomes evens and odds.  Specifically, it does NOT split the parent PD down the middle.

Paolo's more concrete comment about bits helps map things out explicitly.  Paolo is
going to snag the above, so for your looming rebase, how about replacing the paragraph
below with a version of Paolo's concrete example to pair with your abstract definition?

	 *
	 * Concretely, a 4-byte PDE consumes bits 31:22, while an 8-byte PDE
	 * consumes bits 29:21.  To consume bits 31:30, KVM uses 4 shadow
	 * PDPTEs; those 4 PAE page directories are pre-allocated and their
	 * quadrant is assigned in mmu_alloc_root().  To consume bit 21, KVM
	 * uses an additional PDE in every PD; the page table being configured
	 * here is what's pointed at by the PDE.  Thus, bit 21 is the _least_
	 * significant bit of the PDE index pointing at the shadow PT.
	 */

[*] https://lore.kernel.org/all/090e701d-6893-ea25-1237-233ff3dd01ee@redhat.com

> +	 *
> +	 * Note, the 4 PAE page directories are pre-allocated and the quadrant
> +	 * assigned in mmu_alloc_root(). So only page tables need to be handled
> +	 * here.
> +	 */
> +	if (role.has_4_byte_gpte) {
> +		WARN_ON_ONCE(role.level != PG_LEVEL_4K);
> +		role.quadrant = (sptep - parent_sp->spt) % 2;

Oh hell no.  LOL.  It took me a _long_ time to realize you're doing pointer arithmetic
on "u64 *".  I actually booted a 32-bit VM with printks and even then it still took
me a good 20 seconds wondering if I was having a brain fart and simply forgot how mod
works.

The calculation is also unnecessarily costly; not that anyone is likely to notice,
but still.  The compiler doesn't know that sptep and parent_sp->spt are intertwined
and so can't optimize, i.e. is forced to do the subtraction.

A more efficient equivalent that doesn't require pointer arithmetic:

	role.quadrant = ((unsigned long)sptep / sizeof(*sptep)) & 1;

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 10/22] KVM: x86/mmu: Pass memory caches to allocate SPs separately
  2022-05-16 23:21   ` David Matlack
@ 2022-06-17 15:01     ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 15:01 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 16, 2022, David Matlack wrote:
> Refactor kvm_mmu_alloc_shadow_page() to receive the caches from which it
> will allocate the various pieces of memory for shadow pages as a
> parameter, rather than deriving them from the vcpu pointer. This will be
> useful in a future commit where shadow pages are allocated during VM
> ioctls for eager page splitting, and thus will use a different set of
> caches.
> 
> Preemptively pull the caches out all the way to
> kvm_mmu_get_shadow_page() since eager page splitting will not be calling

Uber nit, "eager hugepage splitting" to provide a mental cue/reminder for why
those pages are direct.

> kvm_mmu_alloc_shadow_page() directly.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 10/22] KVM: x86/mmu: Pass memory caches to allocate SPs separately
@ 2022-06-17 15:01     ` Sean Christopherson
  0 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 15:01 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, Lai Jiangshan,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Mon, May 16, 2022, David Matlack wrote:
> Refactor kvm_mmu_alloc_shadow_page() to receive the caches from which it
> will allocate the various pieces of memory for shadow pages as a
> parameter, rather than deriving them from the vcpu pointer. This will be
> useful in a future commit where shadow pages are allocated during VM
> ioctls for eager page splitting, and thus will use a different set of
> caches.
> 
> Preemptively pull the caches out all the way to
> kvm_mmu_get_shadow_page() since eager page splitting will not be calling

Uber nit, "eager hugepage splitting" to provide a mental cue/reminder for why
those pages are direct.

> kvm_mmu_alloc_shadow_page() directly.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 04/22] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-05-16 23:21   ` David Matlack
@ 2022-06-17 15:12     ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 15:12 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 16, 2022, David Matlack wrote:
> +static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
> +						 u64 *sptep, gfn_t gfn,
> +						 bool direct, u32 access)

Please use "unsigned int" for @access, here and everywhere else, so that KVM is
consistent in how it refers to access.  @access can actually squeeze into a u8,
but it's referenced as an "unsigned int" because sp->role.access is an unsigned int.
For me at least, when I see "u<size>" I always assume there is a specific reason
for using an exact size, e.g. variables/fields that track hardware state.  Whereas
"int" and "unsigned int" give the hint that the variable is KVM metadata.

@@ -2201,7 +2201,8 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
        return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
 }

-static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
+static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
+                                                 unsigned int access)
 {
        struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
        union kvm_mmu_page_role role;
@@ -2242,7 +2243,7 @@ static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 a

 static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
                                                 u64 *sptep, gfn_t gfn,
-                                                bool direct, u32 access)
+                                                bool direct, unsigned int access)
 {
        union kvm_mmu_page_role role;

> +{
> +	union kvm_mmu_page_role role;
> +
> +	role = kvm_mmu_child_role(sptep, direct, access);
> +	return kvm_mmu_get_page(vcpu, gfn, role);
> +}
> +

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 04/22] KVM: x86/mmu: Derive shadow MMU page role from parent
@ 2022-06-17 15:12     ` Sean Christopherson
  0 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 15:12 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, Lai Jiangshan,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Mon, May 16, 2022, David Matlack wrote:
> +static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
> +						 u64 *sptep, gfn_t gfn,
> +						 bool direct, u32 access)

Please use "unsigned int" for @access, here and everywhere else, so that KVM is
consistent in how it refers to access.  @access can actually squeeze into a u8,
but it's referenced as an "unsigned int" because sp->role.access is an unsigned int.
For me at least, when I see "u<size>" I always assume there is a specific reason
for using an exact size, e.g. variables/fields that track hardware state.  Whereas
"int" and "unsigned int" give the hint that the variable is KVM metadata.

@@ -2201,7 +2201,8 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
        return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
 }

-static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
+static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
+                                                 unsigned int access)
 {
        struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
        union kvm_mmu_page_role role;
@@ -2242,7 +2243,7 @@ static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 a

 static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
                                                 u64 *sptep, gfn_t gfn,
-                                                bool direct, u32 access)
+                                                bool direct, unsigned int access)
 {
        union kvm_mmu_page_role role;

> +{
> +	union kvm_mmu_page_role role;
> +
> +	role = kvm_mmu_child_role(sptep, direct, access);
> +	return kvm_mmu_get_page(vcpu, gfn, role);
> +}
> +

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 05/22] KVM: x86/mmu: Always pass 0 for @quadrant when gptes are 8 bytes
  2022-05-16 23:21   ` David Matlack
@ 2022-06-17 15:20     ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 15:20 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 16, 2022, David Matlack wrote:
> The quadrant is only used when gptes are 4 bytes, but
> mmu_alloc_{direct,shadow}_roots() pass in a non-zero quadrant for PAE
> page directories regardless. Make this less confusing by only passing in
> a non-zero quadrant when it is actually necessary.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

One nit, otherwise

Reviewed-by: Sean Christopherson <seanjc@google.com>

>  arch/x86/kvm/mmu/mmu.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 515e0b33144a..8508c4bfddb5 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3406,9 +3406,10 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
>  	struct kvm_mmu_page *sp;
>  
>  	role.level = level;
> +	role.quadrant = quadrant;
>  
> -	if (role.has_4_byte_gpte)
> -		role.quadrant = quadrant;
> +	WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
> +	WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
>  
>  	sp = kvm_mmu_get_page(vcpu, gfn, role);
>  	++sp->root_count;
> @@ -3444,7 +3445,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>  		for (i = 0; i < 4; ++i) {
>  			WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
>  
> -			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), i,
> +			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), 0,
>  					      PT32_ROOT_LEVEL);
>  			mmu->pae_root[i] = root | PT_PRESENT_MASK |
>  					   shadow_me_mask;
> @@ -3529,6 +3530,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	struct kvm_mmu *mmu = vcpu->arch.mmu;
>  	u64 pdptrs[4], pm_mask;
>  	gfn_t root_gfn, root_pgd;
> +	unsigned int quadrant;
>  	hpa_t root;
>  	unsigned i;

Not really your fault, but this manages to use three different type declarations
for quadrant.  i is a bare "unsigned", quadrant an "unsigned int" here, and then
@quadrant in mmu_alloc_root() is an "int".

I suspect the "unsigned i" originated with the "i << (30 - PAGE_SHIFT)" in
mmu_alloc_direct_roots(), though even that can't create a negative value.

Given that quadrant is tiny and "int i" is a de facto standard for iterator values,
my preference would be to opportunistically consolidate this to

	int quadrant, i, r;

>  	int r;
> @@ -3614,7 +3616,15 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  			root_gfn = pdptrs[i] >> PAGE_SHIFT;
>  		}
>  
> -		root = mmu_alloc_root(vcpu, root_gfn, i, PT32_ROOT_LEVEL);
> +		/*
> +		 * If shadowing 32-bit non-PAE page tables, each PAE page
> +		 * directory maps one quarter of the guest's non-PAE page
> +		 * directory. Othwerise each PAE page direct shadows one guest
> +		 * PAE page directory so that quadrant should be 0.
> +		 */
> +		quadrant = (mmu->cpu_role.base.level == PT32_ROOT_LEVEL) ? i : 0;
> +
> +		root = mmu_alloc_root(vcpu, root_gfn, quadrant, PT32_ROOT_LEVEL);
>  		mmu->pae_root[i] = root | pm_mask;
>  	}
>  
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 05/22] KVM: x86/mmu: Always pass 0 for @quadrant when gptes are 8 bytes
@ 2022-06-17 15:20     ` Sean Christopherson
  0 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 15:20 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, Lai Jiangshan,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Mon, May 16, 2022, David Matlack wrote:
> The quadrant is only used when gptes are 4 bytes, but
> mmu_alloc_{direct,shadow}_roots() pass in a non-zero quadrant for PAE
> page directories regardless. Make this less confusing by only passing in
> a non-zero quadrant when it is actually necessary.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

One nit, otherwise

Reviewed-by: Sean Christopherson <seanjc@google.com>

>  arch/x86/kvm/mmu/mmu.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 515e0b33144a..8508c4bfddb5 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3406,9 +3406,10 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
>  	struct kvm_mmu_page *sp;
>  
>  	role.level = level;
> +	role.quadrant = quadrant;
>  
> -	if (role.has_4_byte_gpte)
> -		role.quadrant = quadrant;
> +	WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
> +	WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
>  
>  	sp = kvm_mmu_get_page(vcpu, gfn, role);
>  	++sp->root_count;
> @@ -3444,7 +3445,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>  		for (i = 0; i < 4; ++i) {
>  			WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
>  
> -			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), i,
> +			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), 0,
>  					      PT32_ROOT_LEVEL);
>  			mmu->pae_root[i] = root | PT_PRESENT_MASK |
>  					   shadow_me_mask;
> @@ -3529,6 +3530,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	struct kvm_mmu *mmu = vcpu->arch.mmu;
>  	u64 pdptrs[4], pm_mask;
>  	gfn_t root_gfn, root_pgd;
> +	unsigned int quadrant;
>  	hpa_t root;
>  	unsigned i;

Not really your fault, but this manages to use three different type declarations
for quadrant.  i is a bare "unsigned", quadrant an "unsigned int" here, and then
@quadrant in mmu_alloc_root() is an "int".

I suspect the "unsigned i" originated with the "i << (30 - PAGE_SHIFT)" in
mmu_alloc_direct_roots(), though even that can't create a negative value.

Given that quadrant is tiny and "int i" is a de facto standard for iterator values,
my preference would be to opportunistically consolidate this to

	int quadrant, i, r;

>  	int r;
> @@ -3614,7 +3616,15 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  			root_gfn = pdptrs[i] >> PAGE_SHIFT;
>  		}
>  
> -		root = mmu_alloc_root(vcpu, root_gfn, i, PT32_ROOT_LEVEL);
> +		/*
> +		 * If shadowing 32-bit non-PAE page tables, each PAE page
> +		 * directory maps one quarter of the guest's non-PAE page
> +		 * directory. Othwerise each PAE page direct shadows one guest
> +		 * PAE page directory so that quadrant should be 0.
> +		 */
> +		quadrant = (mmu->cpu_role.base.level == PT32_ROOT_LEVEL) ? i : 0;
> +
> +		root = mmu_alloc_root(vcpu, root_gfn, quadrant, PT32_ROOT_LEVEL);
>  		mmu->pae_root[i] = root | pm_mask;
>  	}
>  
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 13/22] KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page()
  2022-05-16 23:21   ` David Matlack
@ 2022-06-17 15:28     ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 15:28 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 16, 2022, David Matlack wrote:
> Allow @vcpu to be NULL in kvm_mmu_find_shadow_page() (and its only
> caller __kvm_mmu_get_shadow_page()). @vcpu is only required to sync
> indirect shadow pages, so it's safe to pass in NULL when looking up
> direct shadow pages.
> 
> This will be used for doing eager page splitting, which allocates direct

"hugepage" again, because I need constant reminders :-)

> shadow pages from the context of a VM ioctl without access to a vCPU
> pointer.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

With nits addressed,

Reviewed-by: Sean Christopherson <seanjc@google.com>

>  arch/x86/kvm/mmu/mmu.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4fbc2da47428..acb54d6e0ea5 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1850,6 +1850,7 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  
>  	if (ret < 0)
>  		kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
> +

Unrelated whitespace change leftover from the previous approach.

>  	return ret;
>  }
>  
> @@ -2001,6 +2002,7 @@ static void clear_sp_write_flooding_count(u64 *spte)
>  	__clear_sp_write_flooding_count(sptep_to_sp(spte));
>  }
>  
> +/* Note, @vcpu may be NULL if @role.direct is true. */
>  static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
>  						     struct kvm_vcpu *vcpu,
>  						     gfn_t gfn,
> @@ -2039,6 +2041,16 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
>  			goto out;
>  
>  		if (sp->unsync) {
> +			/*
> +			 * A vCPU pointer should always be provided when finding

s/should/must, and "be provided" is unnecessarily ambiguous; simply state that
"@vcpu must be non-NULL".  E.g. if a caller provides a pointer, but that pointer
happens to be NULL.

> +			 * indirect shadow pages, as that shadow page may
> +			 * already exist and need to be synced using the vCPU
> +			 * pointer. Direct shadow pages are never unsync and
> +			 * thus do not require a vCPU pointer.
> +			 */

"vCPU pointer" over and over is a bit versbose, and I prefer to refer to vCPUs/VMs
as objects themselves.  E.g. "XYZ requires a vCPU" versus "XYZ requires a vCPU
pointer" since it's not the pointer itself that's required, it's all the context
of the vCPU that is needed.

			/*
			 * @vcpu must be non-NULL when finding indirect shadow
			 * pages, as such pages may already exist and need to
			 * be synced, which requires a vCPU.  Direct pages are
			 * never unsync and thus do not require a vCPU.
			 */

> +			if (KVM_BUG_ON(!vcpu, kvm))
> +				break;
> +
>  			/*
>  			 * The page is good, but is stale.  kvm_sync_page does
>  			 * get the latest guest state, but (unlike mmu_unsync_children)
> @@ -2116,6 +2128,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
>  	return sp;
>  }
>  
> +/* Note, @vcpu may be NULL if @role.direct is true. */
>  static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
>  						      struct kvm_vcpu *vcpu,
>  						      struct shadow_page_caches *caches,
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 13/22] KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page()
@ 2022-06-17 15:28     ` Sean Christopherson
  0 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 15:28 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, Lai Jiangshan,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Mon, May 16, 2022, David Matlack wrote:
> Allow @vcpu to be NULL in kvm_mmu_find_shadow_page() (and its only
> caller __kvm_mmu_get_shadow_page()). @vcpu is only required to sync
> indirect shadow pages, so it's safe to pass in NULL when looking up
> direct shadow pages.
> 
> This will be used for doing eager page splitting, which allocates direct

"hugepage" again, because I need constant reminders :-)

> shadow pages from the context of a VM ioctl without access to a vCPU
> pointer.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

With nits addressed,

Reviewed-by: Sean Christopherson <seanjc@google.com>

>  arch/x86/kvm/mmu/mmu.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4fbc2da47428..acb54d6e0ea5 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1850,6 +1850,7 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  
>  	if (ret < 0)
>  		kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
> +

Unrelated whitespace change leftover from the previous approach.

>  	return ret;
>  }
>  
> @@ -2001,6 +2002,7 @@ static void clear_sp_write_flooding_count(u64 *spte)
>  	__clear_sp_write_flooding_count(sptep_to_sp(spte));
>  }
>  
> +/* Note, @vcpu may be NULL if @role.direct is true. */
>  static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
>  						     struct kvm_vcpu *vcpu,
>  						     gfn_t gfn,
> @@ -2039,6 +2041,16 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
>  			goto out;
>  
>  		if (sp->unsync) {
> +			/*
> +			 * A vCPU pointer should always be provided when finding

s/should/must, and "be provided" is unnecessarily ambiguous; simply state that
"@vcpu must be non-NULL".  E.g. if a caller provides a pointer, but that pointer
happens to be NULL.

> +			 * indirect shadow pages, as that shadow page may
> +			 * already exist and need to be synced using the vCPU
> +			 * pointer. Direct shadow pages are never unsync and
> +			 * thus do not require a vCPU pointer.
> +			 */

"vCPU pointer" over and over is a bit versbose, and I prefer to refer to vCPUs/VMs
as objects themselves.  E.g. "XYZ requires a vCPU" versus "XYZ requires a vCPU
pointer" since it's not the pointer itself that's required, it's all the context
of the vCPU that is needed.

			/*
			 * @vcpu must be non-NULL when finding indirect shadow
			 * pages, as such pages may already exist and need to
			 * be synced, which requires a vCPU.  Direct pages are
			 * never unsync and thus do not require a vCPU.
			 */

> +			if (KVM_BUG_ON(!vcpu, kvm))
> +				break;
> +
>  			/*
>  			 * The page is good, but is stale.  kvm_sync_page does
>  			 * get the latest guest state, but (unlike mmu_unsync_children)
> @@ -2116,6 +2128,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
>  	return sp;
>  }
>  
> +/* Note, @vcpu may be NULL if @role.direct is true. */
>  static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
>  						      struct kvm_vcpu *vcpu,
>  						      struct shadow_page_caches *caches,
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 14/22] KVM: x86/mmu: Pass const memslot to rmap_add()
  2022-05-16 23:21   ` David Matlack
@ 2022-06-17 15:30     ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 15:30 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 16, 2022, David Matlack wrote:

Please restate the shortlog in the changelog, it doesn't require much more typing
and means readers don't have to mentally preserve context across "paragraphs".

  Constify rmap_add()'s @slot parameter, which is just passed on to
  gfn_to_rmap(), which takes a const memslot.

> rmap_add() only uses the slot to call gfn_to_rmap() which takes a const
> memslot.
> 
> No functional change intended.
> 
> Reviewed-by: Ben Gardon <bgardon@google.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 14/22] KVM: x86/mmu: Pass const memslot to rmap_add()
@ 2022-06-17 15:30     ` Sean Christopherson
  0 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 15:30 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, Lai Jiangshan,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Mon, May 16, 2022, David Matlack wrote:

Please restate the shortlog in the changelog, it doesn't require much more typing
and means readers don't have to mentally preserve context across "paragraphs".

  Constify rmap_add()'s @slot parameter, which is just passed on to
  gfn_to_rmap(), which takes a const memslot.

> rmap_add() only uses the slot to call gfn_to_rmap() which takes a const
> memslot.
> 
> No functional change intended.
> 
> Reviewed-by: Ben Gardon <bgardon@google.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 15/22] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  2022-05-16 23:21   ` David Matlack
@ 2022-06-17 16:39     ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 16:39 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 16, 2022, David Matlack wrote:
> @@ -1592,15 +1589,21 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
>  	sp = sptep_to_sp(spte);
>  	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
>  	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
> -	rmap_count = pte_list_add(vcpu, spte, rmap_head);
> +	rmap_count = pte_list_add(cache, spte, rmap_head);
>  
>  	if (rmap_count > RMAP_RECYCLE_THRESHOLD) {
> -		kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
> +		kvm_unmap_rmapp(kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));

Ewww, the existing code is awful.  This call passes NULL for @slot, but it already
has a slot!  This could simply be

		pte_list_destroy(vcpu->kvm, rmap_head);

but that's undesirable with the current name as it's not remotely obvious that
pte_list_destroy() actually zaps rmaps.

I'll send a separate series to clean this up, e.g. rename pte_list_destroy() to
make it clear that it zaps SPTEs.  That'll also give me a good excuse to kill the
"p is for pointer" rmapp() naming scheme.  The only conflict with your series is
this one vcpu->kvm => kvm change, which is easy to note and resolve.

>  		kvm_flush_remote_tlbs_with_address(
> -				vcpu->kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
> +				kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
>  	}
>  }
>  
> +static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
> +		     u64 *spte, gfn_t gfn)
> +{
> +	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn);

I prefer to grab "cache" locally,

	struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_pte_list_desc_cache;

	__rmap_add(vcpu->kvm, cache, slot, spte, gfn);

both to keep the lines shorter in the final form (adding "access" runs yours out
to 93 chars), and because I find it easier to read the call without a gigantic
parameter in the middle.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 15/22] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
@ 2022-06-17 16:39     ` Sean Christopherson
  0 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 16:39 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, Lai Jiangshan,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Mon, May 16, 2022, David Matlack wrote:
> @@ -1592,15 +1589,21 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
>  	sp = sptep_to_sp(spte);
>  	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
>  	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
> -	rmap_count = pte_list_add(vcpu, spte, rmap_head);
> +	rmap_count = pte_list_add(cache, spte, rmap_head);
>  
>  	if (rmap_count > RMAP_RECYCLE_THRESHOLD) {
> -		kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
> +		kvm_unmap_rmapp(kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));

Ewww, the existing code is awful.  This call passes NULL for @slot, but it already
has a slot!  This could simply be

		pte_list_destroy(vcpu->kvm, rmap_head);

but that's undesirable with the current name as it's not remotely obvious that
pte_list_destroy() actually zaps rmaps.

I'll send a separate series to clean this up, e.g. rename pte_list_destroy() to
make it clear that it zaps SPTEs.  That'll also give me a good excuse to kill the
"p is for pointer" rmapp() naming scheme.  The only conflict with your series is
this one vcpu->kvm => kvm change, which is easy to note and resolve.

>  		kvm_flush_remote_tlbs_with_address(
> -				vcpu->kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
> +				kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
>  	}
>  }
>  
> +static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
> +		     u64 *spte, gfn_t gfn)
> +{
> +	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn);

I prefer to grab "cache" locally,

	struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_pte_list_desc_cache;

	__rmap_add(vcpu->kvm, cache, slot, spte, gfn);

both to keep the lines shorter in the final form (adding "access" runs yours out
to 93 chars), and because I find it easier to read the call without a gigantic
parameter in the middle.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 16/22] KVM: x86/mmu: Update page stats in __rmap_add()
  2022-05-16 23:21   ` David Matlack
@ 2022-06-17 16:40     ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 16:40 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 16, 2022, David Matlack wrote:
> Update the page stats in __rmap_add() rather than at the call site. This
> will avoid having to manually update page stats when splitting huge
> pages in a subsequent commit.
> 
> No functional change intended.
> 
> Reviewed-by: Ben Gardon <bgardon@google.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 16/22] KVM: x86/mmu: Update page stats in __rmap_add()
@ 2022-06-17 16:40     ` Sean Christopherson
  0 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 16:40 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, Lai Jiangshan,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Mon, May 16, 2022, David Matlack wrote:
> Update the page stats in __rmap_add() rather than at the call site. This
> will avoid having to manually update page stats when splitting huge
> pages in a subsequent commit.
> 
> No functional change intended.
> 
> Reviewed-by: Ben Gardon <bgardon@google.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 17/22] KVM: x86/mmu: Cache the access bits of shadowed translations
  2022-05-16 23:21   ` David Matlack
@ 2022-06-17 16:53     ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 16:53 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, Lai Jiangshan,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Mon, May 16, 2022, David Matlack wrote:

Please lead with what the patch actually does, e.g. move paragraphs three and four
to the top and reword paragraph three to be a command.  I already know what this
patch does and still had a hard time finding that information in the changelog.

> Splitting huge pages requires allocating/finding shadow pages to replace
> the huge page. Shadow pages are keyed, in part, off the guest access
> permissions they are shadowing. For fully direct MMUs, there is no
> shadowing so the access bits in the shadow page role are always ACC_ALL.
> But during shadow paging, the guest can enforce whatever access
> permissions it wants.
> 
> When KVM is resolving a fault, it walks the guest pages tables to
> determine the guest access permissions. But that is difficult to plumb
> when splitting huge pages outside of a fault context, e.g. for eager
> page splitting.
> 
> To enable eager page splitting, KVM can cache the shadowed (guest)
> access permissions whenever it updates the shadow page tables (e.g.
> during fault, or FNAME(sync_page)). In fact KVM already does this to
> cache the shadowed GFN using the gfns array in the shadow page.
> The access bits only take up 3 bits, which leaves 61 bits left over for
> gfns, which is more than enough. So this change does not require any
> additional memory.
> 
> Now that the gfns array caches more information than just GFNs, rename
> it to shadowed_translation.
> 
> While here, preemptively fix up the WARN_ON() that detects gfn
> mismatches in direct SPs. The WARN_ON() was paired with a
> pr_err_ratelimited(), which means that users could sometimes see the
> WARN without the accompanying error message. Fix this by outputting the
> error message as part of the WARN splat.

If you're going to do this cleanup, I vote to make them WARN_ONCE().  If these fire,
then they are all but guaranteed to fire _a lot_ and will bring down the kernel.
Spamming the log is unlikely to help debug problems, i.e. a single splat should
be sufficient to alert a downstream debugger that a VM crash was more than likely
due to a KVM MMU bug.

> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

...

> +static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index, gfn_t gfn, u32 access)

"unsigned int access", and I would prefer that we are a bit more agressive in
wrapping, i.e. run past 80 chars only when it's silly to wrap or when not wrapping
is inarguably easier to read.

E.g. I completely agree that letting this

	sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);

is better than

	sp->shadowed_translation =
		kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);

but I'd prefer we don't end up with function prototypes that have multiple parameters
ending after 80 chars.


diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 09135fcfbfcf..36176af6e4c3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -743,7 +743,8 @@ static u32 kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
        return sp->role.access;
 }

-static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index, gfn_t gfn, u32 access)
+static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index,
+                                        gfn_t gfn, unsigned int access)
 {
        if (sp_has_gptes(sp)) {
                sp->shadowed_translation[index] = (gfn << PAGE_SHIFT) | access;
@@ -761,7 +762,8 @@ static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index, gfn
             sp->gfn, kvm_mmu_page_get_gfn(sp, index), gfn);
 }

-static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index, u32 access)
+static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index,
+                                   unsigned int access)
 {
        gfn_t gfn = kvm_mmu_page_get_gfn(sp, index);

@@ -2201,7 +2203,8 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
        return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
 }

-static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
+static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
+                                                 unsigned int access)
 {
        struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
        union kvm_mmu_page_role role;


> @@ -1054,12 +1055,15 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  		if (sync_mmio_spte(vcpu, &sp->spt[i], gfn, pte_access))
>  			continue;
>  
> -		if (gfn != sp->gfns[i]) {
> +		if (gfn != kvm_mmu_page_get_gfn(sp, i)) {

This will conflict with kvm/queue; the resolution is straightforward:

		if ((!pte_access && !shadow_present_mask) ||
		    gfn != kvm_mmu_page_get_gfn(sp, i)) {

>  			drop_spte(vcpu->kvm, &sp->spt[i]);
>  			flush = true;
>  			continue;
>  		}
>  
> +		/* Update the shadowed access bits in case they changed. */
> +		kvm_mmu_page_set_access(sp, i, pte_access);
> +
>  		sptep = &sp->spt[i];
>  		spte = *sptep;
>  		host_writable = spte & shadow_host_writable_mask;
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 18/22] KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU
  2022-05-16 23:21   ` David Matlack
@ 2022-06-17 16:56     ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 16:56 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 16, 2022, David Matlack wrote:
> Currently make_huge_page_split_spte() assumes execute permissions can be
> granted to any 4K SPTE when splitting huge pages. This is true for the
> TDP MMU but is not necessarily true for the shadow MMU, since KVM may be
> shadowing a non-executable huge page.
> 
> To fix this, pass in the role of the child shadow page where the huge
> page will be split and derive the execution permission from that.  This
> is correct because huge pages are always split with direct shadow pages
> and thus the shadow page role contains the correct access permissions.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
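
The idea described in the changelog above, roughly (a sketch only; the exact v6
signature of make_huge_page_split_spte() and the surrounding code may differ):

	/* Inside make_huge_page_split_spte(), after the common child SPTE setup: */
	if (role.access & ACC_EXEC_MASK)
		child_spte = make_spte_executable(child_spte);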

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 19/22] KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible levels
  2022-05-16 23:21   ` David Matlack
@ 2022-06-17 17:01     ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 17:01 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 16, 2022, David Matlack wrote:
> Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU. This
> is fine for now since KVM never creates intermediate huge pages during
> dirty logging. In other words, KVM always replaces 1GiB pages directly
> with 4KiB pages, so there is no reason to look for collapsible 2MiB
> pages.
> 
> However, this will stop being true once the shadow MMU participates in
> eager page splitting. During eager page splitting, each 1GiB page is first
> split into 2MiB pages and then those are split into 4KiB pages. The
> intermediate 2MiB pages may be left behind if an error condition causes
> eager page splitting to bail early.
> 
> No functional change intended.
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f83de72feeac..a5d96d452f42 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6177,18 +6177,25 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>  	return need_tlb_flush;
>  }
>  
> +static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
> +					   const struct kvm_memory_slot *slot)
> +{
> +	/*
> +	 * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
> +	 * pages that are already mapped at the maximum possible level.
> +	 */
> +	if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
> +			      PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1,
> +			      true))

No need to wrap, "true" fits easily on the previous line.  That said, I don't see
any point in adding a helper.  It's highly unlikely there will be another caller,
and IMO it's not any more readable since I have to go look at another function
when reading kvm_mmu_zap_collapsible_sptes().

With some gentle massaging, the comment can squeeze onto two lines even with the
extra level of indentation.

		/*
		 * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1, there's no need to zap
		 * pages that are already mapped at the maximum hugepage level.
		 */
		if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
				      PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true))
			kvm_arch_flush_remote_tlbs_memslot(kvm, slot);

> +		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> +}
> +
>  void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>  				   const struct kvm_memory_slot *slot)
>  {
>  	if (kvm_memslots_have_rmaps(kvm)) {
>  		write_lock(&kvm->mmu_lock);
> -		/*
> -		 * Zap only 4k SPTEs since the legacy MMU only supports dirty
> -		 * logging at a 4k granularity and never creates collapsible
> -		 * 2m SPTEs during dirty logging.
> -		 */
> -		if (slot_handle_level_4k(kvm, slot, kvm_mmu_zap_collapsible_spte, true))
> -			kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> +		kvm_rmap_zap_collapsible_sptes(kvm, slot);
>  		write_unlock(&kvm->mmu_lock);
>  	}
>  
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 20/22] KVM: x86/mmu: Refactor drop_large_spte()
  2022-05-16 23:21   ` David Matlack
@ 2022-06-17 17:11     ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 17:11 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 16, 2022, David Matlack wrote:
>  static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
>  {
> -	if (__drop_large_spte(vcpu->kvm, sptep)) {
> -		struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> -
> -		kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
> -			KVM_PAGES_PER_HPAGE(sp->role.level));
> -	}
> +	return __drop_large_spte(vcpu->kvm, sptep, true);

A "return" for a void function is unnecessary.  And since the shortlog is already
a somewhat vague "do a refactor", I vote to opportunistically:

  - rename drop_large_spte() to drop_spte_if_huge()
  - rename __drop_large_spte() to drop_huge_spte()
  - move "if (!is_large_pte(*sptep))" to drop_spte_if_huge() since the split path
    should never pass in a non-huge SPTE.

That last point will also clean up an oddity with the "flush" parameter; given
the command-like name of "flush", it's a bit weird that __drop_large_spte() doesn't
flush when the SPTE is large.


static void drop_huge_spte(struct kvm *kvm, u64 *sptep, bool flush)
{
	struct kvm_mmu_page *sp;

	sp = sptep_to_sp(sptep);
	WARN_ON(sp->role.level == PG_LEVEL_4K);

	drop_spte(kvm, sptep);

	if (flush)
		kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
			KVM_PAGES_PER_HPAGE(sp->role.level));
}

static void drop_spte_if_huge(struct kvm_vcpu *vcpu, u64 *sptep)
{
	if (is_large_pte(*sptep))
		drop_huge_spte(vcpu->kvm, sptep, true);
}


>  }
>  
>  /*
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 21/22] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-05-16 23:21   ` David Matlack
@ 2022-06-17 17:41     ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 17:41 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 16, 2022, David Matlack wrote:
> -int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> +static int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)

I still find it somewhat kludgy to have callers provide a capacity.  It's not
terrible while there's only a single call site, but if a use case comes along
for using an oversized cache with multiple call sites, it'll be gross.

Tweaking my idea of a "custom" wrapper, what about adding an "oversized" wrapper?
That yields clear, firm rules on when to use each helper, guards against calling
the "standard" flavor with an impossible @min, and addresses Mingwei's concern
that a misguided user could specify a nonsensically small capacity.

The only quirk is that kvm_mmu_topup_memory_cache_oversized() has a fixed min,
but IMO that's an acceptable tradeoff, and it's a non-issue until another user
pops up.

static int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc,
					int capacity, int min)
{
	gfp_t gfp = GFP_KERNEL_ACCOUNT;
	void *obj;

	if (mc->nobjs >= min)
		return 0;

	if (unlikely(!mc->objects)) {
		capacity = max(min, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE);

		mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
		if (!mc->objects)
			return -ENOMEM;

		mc->capacity = capacity;
	}

	/* It is illegal to request a different capacity across topups. */
	if (WARN_ON_ONCE(mc->capacity != capacity))
		return -EIO;

	while (mc->nobjs < mc->capacity) {
		obj = mmu_memory_cache_alloc_obj(mc, gfp);
		if (!obj)
			return mc->nobjs >= min ? 0 : -ENOMEM;
		mc->objects[mc->nobjs++] = obj;
	}
	return 0;
}

int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
{
	const int capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;

	if (WARN_ON_ONCE(min > capacity))
		min = capacity;

	return __kvm_mmu_topup_memory_cache(mc, capacity, min);
}

/* Oversized caches have a fixed size, i.e. min == capacity == size. */
int kvm_mmu_topup_memory_cache_oversized(struct kvm_mmu_memory_cache *mc, int size)
{
	if (WARN_ON_ONCE(size < KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE))
		size = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;

	return __kvm_mmu_topup_memory_cache(mc, size, size);
}
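
A hypothetical caller, just to illustrate the intended usage (the cache field and
capacity constant are the ones proposed in patch 22, not anything in kvm/queue):

	/* Oversized cache for eager page splitting: min == capacity == size. */
	r = kvm_mmu_topup_memory_cache_oversized(&kvm->arch.split_desc_cache,
						 SPLIT_DESC_CACHE_CAPACITY);
	if (r)
		return r;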

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 21/22] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-06-17 17:41     ` Sean Christopherson
@ 2022-06-17 18:34       ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 18:34 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Fri, Jun 17, 2022, Sean Christopherson wrote:
> On Mon, May 16, 2022, David Matlack wrote:
> > -int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> > +static int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> 
> I still find it somewhat kludgy to have callers provide an capacity.  It's not
> terrible while there's only a single call site, but if a use case comes along
> for using an oversized cache with multiple call sites, it'll be gross.
> 
> Tweaking my idea of a "custom" wrapper, what about adding an "oversized" wrapper?
> That yields clear, firm rules on when to use each helper, guards against calling
> the "standard" flavor with an impossible @min, and addresses Mingwei's concern
> that a misguided user could specify a nonsensically small capacity.

Drat, arguing against my own idea.

The future usage in nested_mmu_try_split_huge_page() is going to be inefficient.
With capacity==min, consuming even a single entry, which is guaranteed to happen
on a successful huge page split, means the cache needs to be topped up.  In other
words, I'm pretty sure need_topup_split_caches_or_resched() will always return
true between different huge pages and so KVM will drop mmu_lock and reschedule
after every huge page.  Maybe that's not a big deal, but at the very least it's
odd, and it's a nasty gotcha with forcing capacity==min.

So I'm ok with this patch, though it should yell if min > capacity.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 932abb4fb67e..14e807501229 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -388,7 +388,7 @@ int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity,
                return 0;

        if (unlikely(!mc->objects)) {
-               if (WARN_ON_ONCE(!capacity))
+               if (WARN_ON_ONCE(!capacity || min > capacity))
                        return -EIO;

                mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);

^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 22/22] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
  2022-05-16 23:21   ` David Matlack
@ 2022-06-17 19:08     ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-17 19:08 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Mon, May 16, 2022, David Matlack wrote:
> +	/*
> +	 * Memory cache used to allocate pte_list_desc structs while splitting
> +	 * huge pages. In the worst case, to split one huge page, 512
> +	 * pte_list_desc structs are needed to add each lower level leaf sptep
> +	 * to the rmap plus 1 to extend the parent_ptes rmap of the lower level
> +	 * page table.
> +	 *
> +	 * Protected by kvm->slots_lock.
> +	 */
> +#define SPLIT_DESC_CACHE_CAPACITY 513

I would strongly prefer to programmatically define this (note that SPTE_ENT_PER_PAGE
doesn't yet exist in kvm/queue, but hopefully will by the time you rebase; it's
PT64_ENT_PER_PAGE at the moment).  And I think we should define the min number of
objects separately from the capacity (see below).

	/*
	 * Memory cache used to allocate pte_list_desc structs while splitting
	 * huge pages.  In the worst case, to split one huge page, a struct will
	 * be needed to rmap every possible new child SPTE, plus one to extend
	 * the parent_ptes rmap of the newly created page table.
	 */
#define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)

> +	struct kvm_mmu_memory_cache split_desc_cache;
>  };
>  

...

> +static int topup_split_caches(struct kvm *kvm)
> +{
> +     int r;
> +
> +     lockdep_assert_held(&kvm->slots_lock);
> +
> +     r = __kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache,
> +                                      SPLIT_DESC_CACHE_CAPACITY,
> +                                      SPLIT_DESC_CACHE_CAPACITY);

min==capacity will be inefficient as consuming just one object from the cache
will force KVM to drop mmu_lock and topup the cache.

2*min seems like the logical choice.  Presumably it's common to need all 513
objects when splitting a page, so that at least lets KVM handle two huge pages
without having to drop mmu_lock.
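
E.g. something along these lines, assuming the SPLIT_DESC_CACHE_MIN_NR_OBJECTS
define suggested above (a sketch, not the exact resolution):

	r = __kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache,
					 2 * SPLIT_DESC_CACHE_MIN_NR_OBJECTS,
					 SPLIT_DESC_CACHE_MIN_NR_OBJECTS);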

> +     if (r)
> +             return r;
> +
> +     r = kvm_mmu_topup_memory_cache(&kvm->arch.split_page_header_cache, 1);
> +     if (r)
> +             return r;
> +
> +     return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
> +}

...

> @@ -6097,15 +6106,252 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>  		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
>  }
>  
> +void free_split_caches(struct kvm *kvm)

This should be prefixed with kvm_mmu_, and since it's short, make it more explicit:

void kvm_mmu_free_eager_page_split_caches(struct kvm *kvm)

> +{
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
> +	kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
> +	kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
> +}
> +

...

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 04812eaaf61b..4fe018ddd1cd 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12197,6 +12197,12 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
>  		 * page faults will create the large-page sptes.
>  		 */
>  		kvm_mmu_zap_collapsible_sptes(kvm, new);
> +
> +		/*
> +		 * Free any memory left behind by eager page splitting. Ignore
> +		 * the module parameter since userspace might have changed it.
> +		 */
> +		free_split_caches(kvm);

Freeing the caches only in kvm_mmu_slot_apply_flags() will leak memory, and the
kmem_cache code will yell about objects being in the cache when the global caches
are destroyed by mmu_destroy_caches().  When KVM destroys a VM, it directly frees
the memslots without updating struct kvm_memslots; see kvm_free_memslot().

kvm_mmu_uninit_vm() is probably the best landing spot even though it's called
before memslots are freed.  The VM is unreachable so nothing can be triggering
page splitting.
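
Roughly (a sketch only; the helper name assumes the rename suggested above, and
the existing body of kvm_mmu_uninit_vm() is elided):

	void kvm_mmu_uninit_vm(struct kvm *kvm)
	{
		...
		kvm_mmu_free_eager_page_split_caches(kvm);
		kvm_mmu_uninit_tdp_mmu(kvm);
	}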

All that said, I don't know that I agree that kvm_mmu_slot_apply_flags() is the
right place to free the caches.  I agree that _most_ use cases will toggle dirty
logging on all memslots, but I don't know that that holds true for _all_ use
cases as dirty logging is used for things other than live migration.

Even if we expand the capacity of the pte_list_desc cache (see below), at worst,
it's still less than 16kb of memory per VM, i.e. quite small.  And if the host is
under memory pressure, KVM really should purge the caches in mmu_shrink_scan().

I know we proposed dropping mmu_shrink_scan(), but the more I think about that,
the more I think that an outright drop is wrong.  The real issue is that KVM as
quite literally the dumbest possible "algorithm" for zapping possibly-in-use
shadow pages, and doesn't target the zapping to fit the cgroup that's under
pressure.

So for this, IMO rather than freeing the caches when any memslot disables dirty
logging, I think it makes sense to initially keep the caches and only free them
at VM destruction.  Then in follow-up patches, if we want, free
the caches in the mmu_shrink_scan(), and/or add a function hook for toggling
eager_page_split to topup/free the caches accordingly.  That gives userspace
explicit control over when the caches are purged, and does the logical thing of
freeing the caches when eager_page_split is disabled.

>  	} else {
>  		/*
>  		 * Initially-all-set does not require write protecting any page,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f94f72bbd2d3..17fc9247504d 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1336,6 +1336,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
>  
>  #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
>  int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> +int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
>  int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
>  void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
>  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5e2e75014256..b9573e958a03 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -369,7 +369,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
>  		return (void *)__get_free_page(gfp_flags);
>  }
>  
> -static int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> +int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)

+1 to Ricardo's feedback, expose this function in patch 21.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 10/22] KVM: x86/mmu: Pass memory caches to allocate SPs separately
  2022-06-17 15:01     ` Sean Christopherson
@ 2022-06-21 17:06       ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-06-21 17:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Fri, Jun 17, 2022 at 8:02 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, May 16, 2022, David Matlack wrote:
> > Refactor kvm_mmu_alloc_shadow_page() to receive the caches from which it
> > will allocate the various pieces of memory for shadow pages as a
> > parameter, rather than deriving them from the vcpu pointer. This will be
> > useful in a future commit where shadow pages are allocated during VM
> > ioctls for eager page splitting, and thus will use a different set of
> > caches.
> >
> > Preemptively pull the caches out all the way to
> > kvm_mmu_get_shadow_page() since eager page splitting will not be calling
>
> Uber nit, "eager hugepage splitting" to provide a mental cue/reminder for why
> those pages are direct.

I think it may be too late to move away from the term "eager page
splitting" (it is already in commit messages and the module param is
called "eager_page_split"). Using a slightly different name here might
produce more confusion, or at least cause readers to do a double-take.

But naming aside, I don't follow what you mean here, i.e. what does
the fact that page splitting uses direct shadow pages have to do with
this patch?


>
> > kvm_mmu_alloc_shadow_page() directly.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
>
> Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 19/22] KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible levels
  2022-06-17 17:01     ` Sean Christopherson
@ 2022-06-21 17:24       ` David Matlack
  -1 siblings, 0 replies; 111+ messages in thread
From: David Matlack @ 2022-06-21 17:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Fri, Jun 17, 2022 at 10:01 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, May 16, 2022, David Matlack wrote:
> > Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU. This
> > is fine for now since KVM never creates intermediate huge pages during
> > dirty logging. In other words, KVM always replaces 1GiB pages directly
> > with 4KiB pages, so there is no reason to look for collapsible 2MiB
> > pages.
> >
> > However, this will stop being true once the shadow MMU participates in
> > eager page splitting. During eager page splitting, each 1GiB page is first
> > split into 2MiB pages and then those are split into 4KiB pages. The
> > intermediate 2MiB pages may be left behind if an error condition causes
> > eager page splitting to bail early.
> >
> > No functional change intended.
> >
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++-------
> >  1 file changed, 14 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index f83de72feeac..a5d96d452f42 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6177,18 +6177,25 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> >       return need_tlb_flush;
> >  }
> >
> > +static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
> > +                                        const struct kvm_memory_slot *slot)
> > +{
> > +     /*
> > +      * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
> > +      * pages that are already mapped at the maximum possible level.
> > +      */
> > +     if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
> > +                           PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1,
> > +                           true))
>
> No need to wrap, "true" fits easily on the previous line.  That said, I don't see
> any point in adding a helper.  It's highly unlikely there will be another caller,
> and IMO it's not any more readable since I have to go look at another function
> when reading kvm_mmu_zap_collapsible_sptes().

I could see an argument for readability either way. Putting it in a
helper function abstracts away the details, which would aid
readability if the reader does not care about the implementation
details of the rmap case.

I have also been thinking about splitting the rmap code out of mmu.c
(e.g. into rmap.c or shadow_mmu.c) to mirror the TDP MMU. That way we
would have a clearer split between the TDP MMU and the shadow MMU, each
with its own file, while the higher-level MMU operations that need to
operate on either or both MMUs stay in mmu.c.
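
Roughly along these lines (a hypothetical layout just to illustrate the
idea; the file names are not settled):

    arch/x86/kvm/mmu/
      mmu.c         - higher-level operations spanning both MMUs
      tdp_mmu.c     - the TDP MMU (already exists)
      shadow_mmu.c  - the shadow MMU, including the rmap helpers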

>
> With some gentle massaging, the comment can squeeze onto two lines even with the
> extra level of indentation.
>
>                 /*
>                  * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1, there's no need to zap
>                  * pages that are already mapped at the maximum hugepage level.
>                  */
>                 if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
>                                       PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true))
>                         kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
>
> > +             kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > +}
> > +
> >  void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
> >                                  const struct kvm_memory_slot *slot)
> >  {
> >       if (kvm_memslots_have_rmaps(kvm)) {
> >               write_lock(&kvm->mmu_lock);
> > -             /*
> > -              * Zap only 4k SPTEs since the legacy MMU only supports dirty
> > -              * logging at a 4k granularity and never creates collapsible
> > -              * 2m SPTEs during dirty logging.
> > -              */
> > -             if (slot_handle_level_4k(kvm, slot, kvm_mmu_zap_collapsible_spte, true))
> > -                     kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > +             kvm_rmap_zap_collapsible_sptes(kvm, slot);
> >               write_unlock(&kvm->mmu_lock);
> >       }
> >
> > --
> > 2.36.0.550.gb090851708-goog
> >

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 10/22] KVM: x86/mmu: Pass memory caches to allocate SPs separately
  2022-06-21 17:06       ` David Matlack
@ 2022-06-21 17:27         ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-21 17:27 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Tue, Jun 21, 2022, David Matlack wrote:
> On Fri, Jun 17, 2022 at 8:02 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, May 16, 2022, David Matlack wrote:
> > > Refactor kvm_mmu_alloc_shadow_page() to receive the caches from which it
> > > will allocate the various pieces of memory for shadow pages as a
> > > parameter, rather than deriving them from the vcpu pointer. This will be
> > > useful in a future commit where shadow pages are allocated during VM
> > > ioctls for eager page splitting, and thus will use a different set of
> > > caches.
> > >
> > > Preemptively pull the caches out all the way to
> > > kvm_mmu_get_shadow_page() since eager page splitting will not be calling
> >
> > Uber nit, "eager hugepage splitting" to provide a mental cue/reminder for why
> > those pages are direct.
> 
> I think it may be too late to move away from the term "eager page
> splitting" (it is already in commit messages and the module param is
> called "eager_page_split"). Using a slightly different name here might
> produce more confusion, or at least cause readers to do a double-take.

True.  I'm totally fine omitting "huge".

> But naming aside, I don't follow what you mean here. i.e. What does
> the fact that page splitting uses direct shadow pages have to do with
> this patch?

I have no idea.  I suspect I was looking at a different patch when replying to
this one.  I distinctly remember pausing for a few seconds to recall the direct
aspect, but looking back at this patch I don't see what I could possibly have
been wondering about.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 19/22] KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible levels
  2022-06-21 17:24       ` David Matlack
@ 2022-06-21 17:59         ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-21 17:59 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Tue, Jun 21, 2022, David Matlack wrote:
> On Fri, Jun 17, 2022 at 10:01 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, May 16, 2022, David Matlack wrote:
> > > +static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
> > > +                                        const struct kvm_memory_slot *slot)
> > > +{
> > > +     /*
> > > +      * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
> > > +      * pages that are already mapped at the maximum possible level.
> > > +      */
> > > +     if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
> > > +                           PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1,
> > > +                           true))
> >
> > No need to wrap, "true" fits easily on the previous line.  That said, I don't see
> > any point in adding a helper.  It's highly unlikely there will be another caller,
> > and IMO it's not any more readable since I have to go look at another function
> > when reading kvm_mmu_zap_collapsible_sptes().
> 
> I could see an argument for readability either way. Putting it in a
> helper function abstracts away the details, which would aid
> readability if the reader does not care about the implementation
> details of the rmap case.

I'm ok either way, dealer's choice.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 03/22] KVM: x86/mmu: Stop passing @direct to mmu_alloc_root()
  2022-06-16 18:47     ` Sean Christopherson
@ 2022-06-22 14:06       ` Paolo Bonzini
  -1 siblings, 0 replies; 111+ messages in thread
From: Paolo Bonzini @ 2022-06-22 14:06 UTC (permalink / raw)
  To: Sean Christopherson, David Matlack
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Jones,
	Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On 6/16/22 20:47, Sean Christopherson wrote:
>> The argument @direct is vcpu->arch.mmu->root_role.direct, so just use
>> that.
> It's worth calling out that, unlike non-root page tables, it's impossible to have
> a direct root in an indirect MMU.  I.e. provide a hint as to why there's a need to
> pass @direct in the first place.
> 

I suppose there's *no* need to pass direct?  Also, there's the trivial 
(but less interesting) justification that kvm_mmu_load does

         if (vcpu->arch.mmu->root_role.direct)
                 r = mmu_alloc_direct_roots(vcpu);
         else
                 r = mmu_alloc_shadow_roots(vcpu);

and those are the only callers of mmu_alloc_root.

Paolo

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 03/22] KVM: x86/mmu: Stop passing @direct to mmu_alloc_root()
  2022-06-22 14:06       ` Paolo Bonzini
@ 2022-06-22 14:19         ` Sean Christopherson
  -1 siblings, 0 replies; 111+ messages in thread
From: Sean Christopherson @ 2022-06-22 14:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Matlack, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On Wed, Jun 22, 2022, Paolo Bonzini wrote:
> On 6/16/22 20:47, Sean Christopherson wrote:
> > > The argument @direct is vcpu->arch.mmu->root_role.direct, so just use
> > > that.
> > It's worth calling out that, unlike non-root page tables, it's impossible to have
> > a direct root in an indirect MMU.  I.e. provide a hint as to why there's a need to
> > pass @direct in the first place.
> > 
> 
> I suppose there's *no* need to pass direct?  Also, there's the trivial (but
> less interesting) justification that kvm_mmu_load does
> 
>         if (vcpu->arch.mmu->root_role.direct)
>                 r = mmu_alloc_direct_roots(vcpu);
>         else
>                 r = mmu_alloc_shadow_roots(vcpu);
> 
> and those are the only callers of mmu_alloc_root.

Duh, you're right, grabbing root_role.direct in mmu_alloc_root() is much better.
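
Something like this, I assume (just a sketch going from memory of the current
signature, not a tested patch):

	static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
				    u8 level)
	{
		bool direct = vcpu->arch.mmu->root_role.direct;
		struct kvm_mmu_page *sp;

		sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
		++sp->root_count;

		return __pa(sp->spt);
	}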

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 13/22] KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page()
  2022-06-17 15:28     ` Sean Christopherson
@ 2022-06-22 14:26       ` Paolo Bonzini
  -1 siblings, 0 replies; 111+ messages in thread
From: Paolo Bonzini @ 2022-06-22 14:26 UTC (permalink / raw)
  To: Sean Christopherson, David Matlack
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Jones,
	Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On 6/17/22 17:28, Sean Christopherson wrote:
> On Mon, May 16, 2022, David Matlack wrote:
>> Allow @vcpu to be NULL in kvm_mmu_find_shadow_page() (and its only
>> caller __kvm_mmu_get_shadow_page()). @vcpu is only required to sync
>> indirect shadow pages, so it's safe to pass in NULL when looking up
>> direct shadow pages.
>>
>> This will be used for doing eager page splitting, which allocates direct
> 
> "hugepage" again, because I need constant reminders :-)
> 
>> shadow pages from the context of a VM ioctl without access to a vCPU
>> pointer.
>>
>> Signed-off-by: David Matlack <dmatlack@google.com>
>> ---
> 
> With nits addressed,
> 
> Reviewed-by: Sean Christopherson <seanjc@google.com>
> 
>>   arch/x86/kvm/mmu/mmu.c | 13 +++++++++++++
>>   1 file changed, 13 insertions(+)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 4fbc2da47428..acb54d6e0ea5 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -1850,6 +1850,7 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>>   
>>   	if (ret < 0)
>>   		kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
>> +
> 
> Unrelated whitespace change leftover from the previous approach.
> 
>>   	return ret;
>>   }
>>   
>> @@ -2001,6 +2002,7 @@ static void clear_sp_write_flooding_count(u64 *spte)
>>   	__clear_sp_write_flooding_count(sptep_to_sp(spte));
>>   }
>>   
>> +/* Note, @vcpu may be NULL if @role.direct is true. */
>>   static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
>>   						     struct kvm_vcpu *vcpu,
>>   						     gfn_t gfn,
>> @@ -2039,6 +2041,16 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
>>   			goto out;
>>   
>>   		if (sp->unsync) {
>> +			/*
>> +			 * A vCPU pointer should always be provided when finding
> 
> s/should/must, and "be provided" is unnecessarily ambiguous, simply state that
> "@vcpu must be non-NULL".  E.g. if a caller provides a pointer, but that pointer
> happens to be NULL.
> 
>> +			 * indirect shadow pages, as that shadow page may
>> +			 * already exist and need to be synced using the vCPU
>> +			 * pointer. Direct shadow pages are never unsync and
>> +			 * thus do not require a vCPU pointer.
>> +			 */
> 
> "vCPU pointer" over and over is a bit versbose, and I prefer to refer to vCPUs/VMs
> as objects themselves.  E.g. "XYZ requires a vCPU" versus "XYZ requires a vCPU
> pointer" since it's not the pointer itself that's required, it's all the context
> of the vCPU that is needed.
> 
> 			/*
> 			 * @vcpu must be non-NULL when finding indirect shadow
> 			 * pages, as such pages may already exist and need to
> 			 * be synced, which requires a vCPU.  Direct pages are
> 			 * never unsync and thus do not require a vCPU.
> 			 */

My own take:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d7987420bb26..a7748c5a2385 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1975,7 +1975,12 @@ static void clear_sp_write_flooding_count(u64 *spte)
  	__clear_sp_write_flooding_count(sptep_to_sp(spte));
  }
  
-/* Note, @vcpu may be NULL if @role.direct is true. */
+/*
+ * The vCPU is required when finding indirect shadow pages; the shadow
+ * page may already exist and syncing it needs the vCPU pointer in
+ * order to read guest page tables.  Direct shadow pages are never
+ * unsync, thus @vcpu can be NULL if @role.direct is true.
+ */
  static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
  						     struct kvm_vcpu *vcpu,
  						     gfn_t gfn,
@@ -2014,13 +2019,6 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
  			goto out;
  
  		if (sp->unsync) {
-			/*
-			 * The vCPU pointer is required when finding indirect
-			 * shadow pages, as that shadow page may already exist
-			 * and need to be synced using the vCPU pointer.
-			 * Direct shadow pages are never unsync and thus do not
-			 * require a vCPU pointer.
-			 */
  			if (KVM_BUG_ON(!vcpu, kvm))
  				break;
  
@@ -2101,7 +2099,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
  	return sp;
  }
  
-/* Note, @vcpu may be NULL if @role.direct is true. */
+/* Note, @vcpu may be NULL if @role.direct is true; see kvm_mmu_find_shadow_page. */
  static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
  						      struct kvm_vcpu *vcpu,
  						      struct shadow_page_caches *caches,


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 20/22] KVM: x86/mmu: Refactor drop_large_spte()
  2022-06-17 17:11     ` Sean Christopherson
@ 2022-06-22 16:13       ` Paolo Bonzini
  -1 siblings, 0 replies; 111+ messages in thread
From: Paolo Bonzini @ 2022-06-22 16:13 UTC (permalink / raw)
  To: Sean Christopherson, David Matlack
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Jones,
	Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On 6/17/22 19:11, Sean Christopherson wrote:
> since the shortlog is already
> a somewhat vague "do a refactor", I vote to opportunistically:
> 
>    - rename drop_large_spte() to drop_spte_if_huge()
>    - rename __drop_large_spte() to drop_huge_spte()
>    - move "if (!is_large_pte(*sptep))" to drop_spte_if_huge() since the split path
>      should never pass in a non-huge SPTE.
> 
> That last point will also clean up an oddity with the "flush" parameter; given
> the command-like name of "flush", it's a bit weird that __drop_large_spte() doesn't
> flush when the SPTE is large.
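
That would look more or less like this (sketch only, keeping the existing
bodies and the "flush" argument; untested):

	static void drop_huge_spte(struct kvm *kvm, u64 *sptep, bool flush)
	{
		struct kvm_mmu_page *sp = sptep_to_sp(sptep);

		WARN_ON(sp->role.level == PG_LEVEL_4K);
		drop_spte(kvm, sptep);

		if (flush)
			kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
					KVM_PAGES_PER_HPAGE(sp->role.level));
	}

	static void drop_spte_if_huge(struct kvm_vcpu *vcpu, u64 *sptep)
	{
		if (is_large_pte(*sptep))
			drop_huge_spte(vcpu->kvm, sptep, true);
	}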

Even better, drop_large_spte() is always called right before 
kvm_mmu_get_child_sp(), so:

 From 86a9490972a1e959a4df114678719494b5475720 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Wed, 22 Jun 2022 12:11:44 -0400
Subject: [PATCH] KVM: MMU: pull drop_large_spte into kvm_mmu_get_child_sp

Before allocating a child shadow page table, all callers need to
check whether the parent already points to a huge page and, if so,
drop it.  This is done by drop_large_spte(), but it can be moved
to kvm_mmu_get_child_sp().

To ensure that the shadow page is not linked twice if it was
present, do _not_ opportunistically make kvm_mmu_get_child_sp()
idempotent: instead, return an error value if the shadow page
already existed.  This is a bit more verbose, but clearer than
NULL.

Now that the drop_large_spte() name is not taken anymore,
remove the two underscores in front of __drop_large_spte().

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 36bc49f08d60..7f52870ee062 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1135,26 +1135,16 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
  		rmap_remove(kvm, sptep);
  }

-
-static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
+static void drop_large_spte(struct kvm *kvm, u64 *sptep)
  {
-	if (is_large_pte(*sptep)) {
-		WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K);
-		drop_spte(kvm, sptep);
-		return true;
-	}
-
-	return false;
-}
+	struct kvm_mmu_page *sp;

-static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
-{
-	if (__drop_large_spte(vcpu->kvm, sptep)) {
-		struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+	sp = sptep_to_sp(sptep);
+	WARN_ON(sp->role.level == PG_LEVEL_4K);

-		kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
+	drop_spte(kvm, sptep);
+	kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
  			KVM_PAGES_PER_HPAGE(sp->role.level));
-	}
  }

  /*
@@ -2221,6 +2211,13 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
  {
  	union kvm_mmu_page_role role;

+	if (is_shadow_present_pte(*sptep)) {
+		if (!is_large_pte(*sptep))
+			return ERR_PTR(-EEXIST);
+
+		drop_large_spte(vcpu->kvm, sptep, true);
+	}
+
  	role = kvm_mmu_child_role(sptep, direct, access);
  	return kvm_mmu_get_shadow_page(vcpu, gfn, role);
  }
@@ -3080,11 +3077,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
  		if (it.level == fault->goal_level)
  			break;

-		drop_large_spte(vcpu, it.sptep);
-		if (is_shadow_present_pte(*it.sptep))
-			continue;
-
  		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
+		if (sp == ERR_PTR(-EEXIST))
+			continue;

  		link_shadow_page(vcpu, it.sptep, sp);
  		if (fault->is_tdp && fault->huge_page_disallowed &&
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 24f292f3f93f..2448fa8d8438 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -648,15 +648,13 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
  		gfn_t table_gfn;

  		clear_sp_write_flooding_count(it.sptep);
-		drop_large_spte(vcpu, it.sptep);

-		sp = NULL;
-		if (!is_shadow_present_pte(*it.sptep)) {
-			table_gfn = gw->table_gfn[it.level - 2];
-			access = gw->pt_access[it.level - 2];
-			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
-						  false, access);
+		table_gfn = gw->table_gfn[it.level - 2];
+		access = gw->pt_access[it.level - 2];
+		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
+					  false, access);

+		if (sp != ERR_PTR(-EEXIST)) {
  			/*
  			 * We must synchronize the pagetable before linking it
  			 * because the guest doesn't need to flush tlb when
@@ -685,7 +683,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
  		if (FNAME(gpte_changed)(vcpu, gw, it.level - 1))
  			goto out_gpte_changed;

-		if (sp)
+		if (sp != ERR_PTR(-EEXIST))
  			link_shadow_page(vcpu, it.sptep, sp);
  	}

@@ -709,16 +707,15 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,

  		validate_direct_spte(vcpu, it.sptep, direct_access);

-		drop_large_spte(vcpu, it.sptep);
+		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
+					  true, direct_access);
+		if (sp == ERR_PTR(-EEXIST))
+			continue;

-		if (!is_shadow_present_pte(*it.sptep)) {
-			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
-						  true, direct_access);
-			link_shadow_page(vcpu, it.sptep, sp);
-			if (fault->huge_page_disallowed &&
-			    fault->req_level >= it.level)
-				account_huge_nx_page(vcpu->kvm, sp);
-		}
+		link_shadow_page(vcpu, it.sptep, sp);
+		if (fault->huge_page_disallowed &&
+		    fault->req_level >= it.level)
+			account_huge_nx_page(vcpu->kvm, sp);
  	}

  	if (WARN_ON_ONCE(it.level != fault->goal_level))

with the obvious patch on top to add the flush argument.

The ERR_PTR(-EEXIST) is a bit heavy, but at least conveys what's going 
on.  Thoughts?

Paolo

^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 20/22] KVM: x86/mmu: Refactor drop_large_spte()
  2022-06-22 16:13       ` Paolo Bonzini
@ 2022-06-22 16:50         ` Paolo Bonzini
  -1 siblings, 0 replies; 111+ messages in thread
From: Paolo Bonzini @ 2022-06-22 16:50 UTC (permalink / raw)
  To: Sean Christopherson, David Matlack
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen, Lai Jiangshan,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On 6/22/22 18:13, Paolo Bonzini wrote:
> Even better, drop_large_spte() is always called right before 
> kvm_mmu_get_child_sp(), so:

Actually, we can even include the call from eager page splitting if
__link_shadow_page() is the one that takes care of dropping the large
SPTE:

 From bea344e409bb8329ca69aca0a63f97537a7ec798 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Wed, 22 Jun 2022 12:11:44 -0400
Subject: [PATCH] KVM: MMU: pull call to drop_large_spte() into
  __link_shadow_page()

Before allocating a child shadow page table, all callers check
whether the parent already points to a huge page and, if so, they
drop that SPTE.  This is done by drop_large_spte().

However, the act that requires dropping the large SPTE is the
installation of the sp that is returned by kvm_mmu_get_child_sp(),
which happens in __link_shadow_page().  Move the call there
instead of having it in each and every caller.

To ensure that the shadow page is not linked twice if it was
present, do _not_ opportunistically make kvm_mmu_get_child_sp()
idempotent: instead, return an error value if the shadow page
already existed.  This is a bit more verbose, but clearer than
NULL.

Now that the drop_large_spte() name is not taken anymore,
remove the two underscores in front of __drop_large_spte().

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 36bc49f08d60..64c1191be4ae 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1135,26 +1135,16 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
  		rmap_remove(kvm, sptep);
  }
  
-
-static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
+static void drop_large_spte(struct kvm *kvm, u64 *sptep)
  {
-	if (is_large_pte(*sptep)) {
-		WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K);
-		drop_spte(kvm, sptep);
-		return true;
-	}
-
-	return false;
-}
+	struct kvm_mmu_page *sp;
  
-static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
-{
-	if (__drop_large_spte(vcpu->kvm, sptep)) {
-		struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+	sp = sptep_to_sp(sptep);
+	WARN_ON(sp->role.level == PG_LEVEL_4K);
  
-		kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
+	drop_spte(kvm, sptep);
+	kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
  			KVM_PAGES_PER_HPAGE(sp->role.level));
-	}
  }
  
  /*
@@ -2221,6 +2211,9 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
  {
  	union kvm_mmu_page_role role;
  
+	if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
+		return ERR_PTR(-EEXIST);
+
  	role = kvm_mmu_child_role(sptep, direct, access);
  	return kvm_mmu_get_shadow_page(vcpu, gfn, role);
  }
@@ -2295,6 +2288,13 @@ static void __link_shadow_page(struct kvm_mmu_memory_cache *cache, u64 *sptep,
  
  	BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
  
+	/*
+	 * If an SPTE is present already, it must be a leaf and therefore
+	 * a large one.  Drop it and flush the TLB before installing sp.
+	 */
+	if (is_shadow_present_pte(*sptep))
+		drop_large_spte(vcpu->kvm, sptep);
+
  	spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
  
  	mmu_spte_set(sptep, spte);
@@ -3080,11 +3080,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
  		if (it.level == fault->goal_level)
  			break;
  
-		drop_large_spte(vcpu, it.sptep);
-		if (is_shadow_present_pte(*it.sptep))
-			continue;
-
  		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
+		if (sp == ERR_PTR(-EEXIST))
+			continue;
  
  		link_shadow_page(vcpu, it.sptep, sp);
  		if (fault->is_tdp && fault->huge_page_disallowed &&
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 24f292f3f93f..2448fa8d8438 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -648,15 +648,13 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
  		gfn_t table_gfn;
  
  		clear_sp_write_flooding_count(it.sptep);
-		drop_large_spte(vcpu, it.sptep);
  
-		sp = NULL;
-		if (!is_shadow_present_pte(*it.sptep)) {
-			table_gfn = gw->table_gfn[it.level - 2];
-			access = gw->pt_access[it.level - 2];
-			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
-						  false, access);
+		table_gfn = gw->table_gfn[it.level - 2];
+		access = gw->pt_access[it.level - 2];
+		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
+					  false, access);
  
+		if (sp != ERR_PTR(-EEXIST)) {
  			/*
  			 * We must synchronize the pagetable before linking it
  			 * because the guest doesn't need to flush tlb when
@@ -685,7 +683,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
  		if (FNAME(gpte_changed)(vcpu, gw, it.level - 1))
  			goto out_gpte_changed;
  
-		if (sp)
+		if (sp != ERR_PTR(-EEXIST))
  			link_shadow_page(vcpu, it.sptep, sp);
  	}
  
@@ -709,16 +707,15 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
  
  		validate_direct_spte(vcpu, it.sptep, direct_access);
  
-		drop_large_spte(vcpu, it.sptep);
+		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
+					  true, direct_access);
+		if (sp == ERR_PTR(-EEXIST))
+			continue;
  
-		if (!is_shadow_present_pte(*it.sptep)) {
-			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
-						  true, direct_access);
-			link_shadow_page(vcpu, it.sptep, sp);
-			if (fault->huge_page_disallowed &&
-			    fault->req_level >= it.level)
-				account_huge_nx_page(vcpu->kvm, sp);
-		}
+		link_shadow_page(vcpu, it.sptep, sp);
+		if (fault->huge_page_disallowed &&
+		    fault->req_level >= it.level)
+			account_huge_nx_page(vcpu->kvm, sp);
  	}
  
  	if (WARN_ON_ONCE(it.level != fault->goal_level))


I'll test the resulting series and then send a v7.

Paolo

^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 20/22] KVM: x86/mmu: Refactor drop_large_spte()
@ 2022-06-22 16:50         ` Paolo Bonzini
  0 siblings, 0 replies; 111+ messages in thread
From: Paolo Bonzini @ 2022-06-22 16:50 UTC (permalink / raw)
  To: Sean Christopherson, David Matlack
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Jones,
	Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, Lai Jiangshan

On 6/22/22 18:13, Paolo Bonzini wrote:
> Even better, drop_large_spte() is always called right before 
> kvm_mmu_get_child_sp(), so:

Actually, we can even include the call from eager page splitting if
__link_shadow_page() is the one that takes care of dropping the large
SPTE:

 From bea344e409bb8329ca69aca0a63f97537a7ec798 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Wed, 22 Jun 2022 12:11:44 -0400
Subject: [PATCH] KVM: MMU: pull call to drop_large_spte() into
  __link_shadow_page()

Before allocating a child shadow page table, all callers check
whether the parent already points to a huge page and, if so, they
drop that SPTE.  This is done by drop_large_spte().

However, the act that requires dropping the large SPTE is the
installation of the sp that is returned by kvm_mmu_get_child_sp(),
which happens in __link_shadow_page().  Move the call there
instead of having it in each and every caller.

To ensure that the shadow page is not linked twice if it was
present, do _not_ opportunistically make kvm_mmu_get_child_sp()
idempotent: instead, return an error value if the shadow page
already existed.  This is a bit more verbose, but clearer than
NULL.

Now that the drop_large_spte() name is not taken anymore,
remove the two underscores in front of __drop_large_spte().

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 36bc49f08d60..64c1191be4ae 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1135,26 +1135,16 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
  		rmap_remove(kvm, sptep);
  }
  
-
-static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
+static void drop_large_spte(struct kvm *kvm, u64 *sptep)
  {
-	if (is_large_pte(*sptep)) {
-		WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K);
-		drop_spte(kvm, sptep);
-		return true;
-	}
-
-	return false;
-}
+	struct kvm_mmu_page *sp;
  
-static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
-{
-	if (__drop_large_spte(vcpu->kvm, sptep)) {
-		struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+	sp = sptep_to_sp(sptep);
+	WARN_ON(sp->role.level == PG_LEVEL_4K);
  
-		kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
+	drop_spte(kvm, sptep);
+	kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
  			KVM_PAGES_PER_HPAGE(sp->role.level));
-	}
  }
  
  /*
@@ -2221,6 +2211,9 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
  {
  	union kvm_mmu_page_role role;
  
+	if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
+		return ERR_PTR(-EEXIST);
+
  	role = kvm_mmu_child_role(sptep, direct, access);
  	return kvm_mmu_get_shadow_page(vcpu, gfn, role);
  }
@@ -2295,6 +2288,13 @@ static void __link_shadow_page(struct kvm_mmu_memory_cache *cache, u64 *sptep,
  
  	BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
  
+	/*
+	 * If an SPTE is present already, it must be a leaf and therefore
+	 * a large one.  Drop it and flush the TLB before installing sp.
+	 */
+	if (is_shadow_present_pte(*sptep)
+		drop_large_spte(vcpu->kvm, sptep);
+
  	spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
  
  	mmu_spte_set(sptep, spte);
@@ -3080,11 +3080,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
  		if (it.level == fault->goal_level)
  			break;
  
-		drop_large_spte(vcpu, it.sptep);
-		if (is_shadow_present_pte(*it.sptep))
-			continue;
-
  		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
+		if (sp == ERR_PTR(-EEXIST))
+			continue;
  
  		link_shadow_page(vcpu, it.sptep, sp);
  		if (fault->is_tdp && fault->huge_page_disallowed &&
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 24f292f3f93f..2448fa8d8438 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -648,15 +648,13 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
  		gfn_t table_gfn;
  
  		clear_sp_write_flooding_count(it.sptep);
-		drop_large_spte(vcpu, it.sptep);
  
-		sp = NULL;
-		if (!is_shadow_present_pte(*it.sptep)) {
-			table_gfn = gw->table_gfn[it.level - 2];
-			access = gw->pt_access[it.level - 2];
-			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
-						  false, access);
+		table_gfn = gw->table_gfn[it.level - 2];
+		access = gw->pt_access[it.level - 2];
+		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
+					  false, access);
  
+		if (sp != ERR_PTR(-EEXIST)) {
  			/*
  			 * We must synchronize the pagetable before linking it
  			 * because the guest doesn't need to flush tlb when
@@ -685,7 +683,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
  		if (FNAME(gpte_changed)(vcpu, gw, it.level - 1))
  			goto out_gpte_changed;
  
-		if (sp)
+		if (sp != ERR_PTR(-EEXIST))
  			link_shadow_page(vcpu, it.sptep, sp);
  	}
  
@@ -709,16 +707,15 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
  
  		validate_direct_spte(vcpu, it.sptep, direct_access);
  
-		drop_large_spte(vcpu, it.sptep);
+		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
+					  true, direct_access);
+		if (sp == ERR_PTR(-EEXIST))
+			continue;
  
-		if (!is_shadow_present_pte(*it.sptep)) {
-			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
-						  true, direct_access);
-			link_shadow_page(vcpu, it.sptep, sp);
-			if (fault->huge_page_disallowed &&
-			    fault->req_level >= it.level)
-				account_huge_nx_page(vcpu->kvm, sp);
-		}
+		link_shadow_page(vcpu, it.sptep, sp);
+		if (fault->huge_page_disallowed &&
+		    fault->req_level >= it.level)
+			account_huge_nx_page(vcpu->kvm, sp);
  	}
  
  	if (WARN_ON_ONCE(it.level != fault->goal_level))


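[ Aside: the refactoring above leans on the kernel's ERR_PTR() idiom,
  where a small negative errno is encoded in the pointer value itself,
  so callers compare against ERR_PTR(-EEXIST) to mean "a child was
  already linked here, skip it".  The standalone userspace sketch below
  mocks that idiom with stand-in types -- it is not KVM code; the
  struct, the allocator and the MAX_ERRNO mock are assumptions made
  purely for illustration -- to show why an error pointer is less
  ambiguous for callers than returning NULL. ]

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(). */
#define MAX_ERRNO	4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* Stand-in for a child shadow page; not the real kvm_mmu_page. */
struct child_sp {
	int level;
};

/*
 * Stand-in for kvm_mmu_get_child_sp(): if the parent entry already has
 * a present, non-large child, report that with -EEXIST instead of NULL
 * so the caller can tell "already there" apart from a fresh allocation
 * or a genuine failure.
 */
static struct child_sp *get_child_sp(int already_present)
{
	if (already_present)
		return ERR_PTR(-EEXIST);

	return calloc(1, sizeof(struct child_sp));
}

int main(void)
{
	for (int present = 0; present <= 1; present++) {
		struct child_sp *sp = get_child_sp(present);

		if (sp == ERR_PTR(-EEXIST)) {
			printf("child already linked, nothing to do\n");
			continue;
		}
		if (IS_ERR(sp) || !sp) {
			printf("allocation failed: %ld\n",
			       IS_ERR(sp) ? PTR_ERR(sp) : (long)-ENOMEM);
			continue;
		}
		printf("allocated new child, link it under the parent\n");
		free(sp);
	}
	return 0;
}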
I'll test the resulting series and then send a v7.

Paolo



Thread overview: 111+ messages
2022-05-16 23:21 [PATCH v6 00/22] KVM: Extend Eager Page Splitting to the shadow MMU David Matlack
2022-05-16 23:21 ` David Matlack
2022-05-16 23:21 ` [PATCH v6 01/22] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs David Matlack
2022-05-16 23:21   ` David Matlack
2022-05-16 23:21 ` [PATCH v6 02/22] KVM: x86/mmu: Use a bool for direct David Matlack
2022-05-16 23:21   ` David Matlack
2022-05-16 23:21 ` [PATCH v6 03/22] KVM: x86/mmu: Stop passing @direct to mmu_alloc_root() David Matlack
2022-05-16 23:21   ` David Matlack
2022-06-16 18:47   ` Sean Christopherson
2022-06-16 18:47     ` Sean Christopherson
2022-06-22 14:06     ` Paolo Bonzini
2022-06-22 14:06       ` Paolo Bonzini
2022-06-22 14:19       ` Sean Christopherson
2022-06-22 14:19         ` Sean Christopherson
2022-05-16 23:21 ` [PATCH v6 04/22] KVM: x86/mmu: Derive shadow MMU page role from parent David Matlack
2022-05-16 23:21   ` David Matlack
2022-06-17  1:19   ` Sean Christopherson
2022-06-17  1:19     ` Sean Christopherson
2022-06-17 15:12   ` Sean Christopherson
2022-06-17 15:12     ` Sean Christopherson
2022-05-16 23:21 ` [PATCH v6 05/22] KVM: x86/mmu: Always pass 0 for @quadrant when gptes are 8 bytes David Matlack
2022-05-16 23:21   ` David Matlack
2022-06-17 15:20   ` Sean Christopherson
2022-06-17 15:20     ` Sean Christopherson
2022-05-16 23:21 ` [PATCH v6 06/22] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions David Matlack
2022-05-16 23:21   ` David Matlack
2022-05-16 23:21 ` [PATCH v6 07/22] KVM: x86/mmu: Consolidate shadow page allocation and initialization David Matlack
2022-05-16 23:21   ` David Matlack
2022-05-16 23:21 ` [PATCH v6 08/22] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages David Matlack
2022-05-16 23:21   ` David Matlack
2022-05-16 23:21 ` [PATCH v6 09/22] KVM: x86/mmu: Move guest PT write-protection to account_shadowed() David Matlack
2022-05-16 23:21   ` David Matlack
2022-05-16 23:21 ` [PATCH v6 10/22] KVM: x86/mmu: Pass memory caches to allocate SPs separately David Matlack
2022-05-16 23:21   ` David Matlack
2022-06-17 15:01   ` Sean Christopherson
2022-06-17 15:01     ` Sean Christopherson
2022-06-21 17:06     ` David Matlack
2022-06-21 17:06       ` David Matlack
2022-06-21 17:27       ` Sean Christopherson
2022-06-21 17:27         ` Sean Christopherson
2022-05-16 23:21 ` [PATCH v6 11/22] KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page() David Matlack
2022-05-16 23:21   ` David Matlack
2022-05-16 23:21 ` [PATCH v6 12/22] KVM: x86/mmu: Pass kvm pointer separately from vcpu to kvm_mmu_find_shadow_page() David Matlack
2022-05-16 23:21   ` David Matlack
2022-05-16 23:21 ` [PATCH v6 13/22] KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page() David Matlack
2022-05-16 23:21   ` David Matlack
2022-06-17 15:28   ` Sean Christopherson
2022-06-17 15:28     ` Sean Christopherson
2022-06-22 14:26     ` Paolo Bonzini
2022-06-22 14:26       ` Paolo Bonzini
2022-05-16 23:21 ` [PATCH v6 14/22] KVM: x86/mmu: Pass const memslot to rmap_add() David Matlack
2022-05-16 23:21   ` David Matlack
2022-06-17 15:30   ` Sean Christopherson
2022-06-17 15:30     ` Sean Christopherson
2022-05-16 23:21 ` [PATCH v6 15/22] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu David Matlack
2022-05-16 23:21   ` David Matlack
2022-06-17 16:39   ` Sean Christopherson
2022-06-17 16:39     ` Sean Christopherson
2022-05-16 23:21 ` [PATCH v6 16/22] KVM: x86/mmu: Update page stats in __rmap_add() David Matlack
2022-05-16 23:21   ` David Matlack
2022-06-17 16:40   ` Sean Christopherson
2022-06-17 16:40     ` Sean Christopherson
2022-05-16 23:21 ` [PATCH v6 17/22] KVM: x86/mmu: Cache the access bits of shadowed translations David Matlack
2022-05-16 23:21   ` David Matlack
2022-06-17 16:53   ` Sean Christopherson
2022-06-17 16:53     ` Sean Christopherson
2022-05-16 23:21 ` [PATCH v6 18/22] KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU David Matlack
2022-05-16 23:21   ` David Matlack
2022-06-17 16:56   ` Sean Christopherson
2022-06-17 16:56     ` Sean Christopherson
2022-05-16 23:21 ` [PATCH v6 19/22] KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible levels David Matlack
2022-05-16 23:21   ` David Matlack
2022-06-17 17:01   ` Sean Christopherson
2022-06-17 17:01     ` Sean Christopherson
2022-06-21 17:24     ` David Matlack
2022-06-21 17:24       ` David Matlack
2022-06-21 17:59       ` Sean Christopherson
2022-06-21 17:59         ` Sean Christopherson
2022-05-16 23:21 ` [PATCH v6 20/22] KVM: x86/mmu: Refactor drop_large_spte() David Matlack
2022-05-16 23:21   ` David Matlack
2022-06-17 17:11   ` Sean Christopherson
2022-06-17 17:11     ` Sean Christopherson
2022-06-22 16:13     ` Paolo Bonzini
2022-06-22 16:13       ` Paolo Bonzini
2022-06-22 16:50       ` Paolo Bonzini
2022-06-22 16:50         ` Paolo Bonzini
2022-05-16 23:21 ` [PATCH v6 21/22] KVM: Allow for different capacities in kvm_mmu_memory_cache structs David Matlack
2022-05-16 23:21   ` David Matlack
2022-05-19 15:33   ` Anup Patel
2022-05-19 15:33     ` Anup Patel
2022-05-20 23:21   ` Mingwei Zhang
2022-05-23 17:37     ` Sean Christopherson
2022-05-23 17:37       ` Sean Christopherson
2022-05-23 17:44       ` David Matlack
2022-05-23 17:44         ` David Matlack
2022-05-23 18:13         ` Mingwei Zhang
2022-05-23 18:13           ` Mingwei Zhang
2022-05-23 18:22           ` David Matlack
2022-05-23 18:22             ` David Matlack
2022-05-23 23:53             ` David Matlack
2022-05-23 23:53               ` David Matlack
2022-06-17 17:41   ` Sean Christopherson
2022-06-17 17:41     ` Sean Christopherson
2022-06-17 18:34     ` Sean Christopherson
2022-06-17 18:34       ` Sean Christopherson
2022-05-16 23:21 ` [PATCH v6 22/22] KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs David Matlack
2022-05-16 23:21   ` David Matlack
2022-06-01 21:50   ` Ricardo Koller
2022-06-01 21:50     ` Ricardo Koller
2022-06-17 19:08   ` Sean Christopherson
2022-06-17 19:08     ` Sean Christopherson
