* [PATCH v3 00/23] KVM: Extend Eager Page Splitting to the shadow MMU
@ 2022-04-01 17:55 ` David Matlack
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

This series extends KVM's Eager Page Splitting to also split huge pages
mapped by the shadow MMU, i.e. huge pages present in the memslot rmaps.
This will be useful for configurations that use Nested Virtualization,
disable the TDP MMU, or disable/lack TDP hardware support.

For background on Eager Page Splitting, see:
 - Proposal: https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/
 - TDP MMU support: https://lore.kernel.org/kvm/20220119230739.2234394-1-dmatlack@google.com/

Splitting huge pages mapped by the shadow MMU is more complicated than
the TDP MMU, but it is also more important for performance as the shadow
MMU handles huge page write-protection faults under the write lock.  See
the Performance section for more details.

The extra complexity of splitting huge pages mapped by the shadow MMU
comes from a few places:

(1) The shadow MMU has a limit on the number of shadow pages that are
    allowed to be allocated. So, as a policy, Eager Page Splitting
    refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
    pages available.

(2) Huge pages may be mapped by indirect shadow pages.

    - Indirect shadow pages have the possibility of being unsync. As a
      policy, we opt not to split such pages because their translation
      may no longer be valid.
    - Huge pages on indirect shadow pages may have access permission
      constraints from the guest (unlike the TDP MMU which is ACC_ALL
      by default).

(3) Splitting a huge page may end up re-using an existing lower level
    shadow page table. This is unlike the TDP MMU, which always
    allocates new shadow page tables when splitting.

(4) When the lower level SPTEs are installed, they must be added to the
    rmap, which may require allocating additional pte_list_desc structs.

Google's internal implementation of Eager Page Splitting does not
handle cases (3) and (4), and instead opts to skip splitting entirely
(case 3) or to split only partially (case 4). This series handles the
additional cases, which requires an additional 4KiB of memory per VM to
store the extra pte_list_desc cache. However, it also avoids the need
for TLB flushes in most cases and allows KVM to split more pages mapped
by shadow paging.

The bulk of this series is just refactoring the existing MMU code in
preparation for splitting, specifically to make it possible to operate
on the MMU outside of a vCPU context.

Motivation
----------

During dirty logging, VMs using the shadow MMU suffer from:

(1) Write-protection faults on huge pages that take the MMU lock to
    unmap the huge page, map a 4KiB page, and update the dirty log.

(2) Non-present faults caused by (1) that take the MMU lock to map in
    the missing page.

(3) Write-protection faults on 4KiB pages that take the MMU lock to
    make the page writable and update the dirty log. [Note: These faults
    only take the MMU lock during shadow paging.]

The lock contention from (1), (2) and (3) can severely degrade
application performance to the point of failure.  Eager page splitting
eliminates (1) by moving the splitting of huge pages off the vCPU
threads onto the thread invoking VM-ioctls to configure dirty logging,
and eliminates (2) by fully splitting each huge page into its
constituent small pages. (3) is still a concern for shadow paging
workloads (e.g. nested virtualization) but is not addressed by this
series.

Splitting in the VM-ioctl thread is useful because it can run in the
background without interrupting vCPU execution. However, it does take
the MMU lock so it may introduce some extra contention if vCPUs are
hammering the MMU lock. This is offset by the fact that eager page
splitting drops the MMU lock after splitting each SPTE if there is any
contention, and the fact that eager page splitting is reducing the MMU
lock contention from (1) and (2) above. Even workloads that only write
to 5% of their memory see massive MMU lock contention reduction during
dirty logging thanks to Eager Page Splitting (see Performance data
below).

A downside of Eager Page Splitting is that it splits all huge pages,
which may include ranges of memory that are never written to by the
guest and thus could theoretically stay huge. Workloads that write to
only a fraction of their memory may see higher TLB miss costs with Eager
Page Splitting enabled. However, that is secondary to the application
failure that otherwise may occur without Eager Page Splitting.

Further work is necessary to improve the TLB miss performance for
read-heavy workloads, such as dirty logging at 2M instead of 4K.

Performance
-----------

To measure the performance impact of Eager Page Splitting I ran
dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
vCPU, and backed by 1GiB HugeTLB memory. The amount of memory that was
written to versus read was controlled with the -f option.

To measure the impact on customer performance, we can look at the time
it takes all vCPUs to dirty memory after dirty logging has been enabled.
Without Eager Page Splitting enabled, such dirtying must take faults to
split huge pages and bottleneck on the MMU lock.

             | Config: ept=Y, tdp_mmu=N, 100% writes                   |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.326340139s               | 0.058645119s               |
4            | 0.425658733s               | 0.059211364s               |
8            | 1.392495283s               | 0.059992269s               |
16           | 2.891475203s               | 0.074386427s               |
32           | 7.077453255s               | 0.074484273s               |
64           | 17.907075277s              | 0.080433025s               |

             | Config: ept=Y, tdp_mmu=N, 5% writes                     |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.333304003s               | 0.005811521s               |
4            | 0.418382740s               | 0.006195093s               |
8            | 1.127732621s               | 0.007388453s               |
16           | 3.003522635s               | 0.007854799s               |
32           | 7.341293635s               | 0.012048705s               |
64           | 16.555752029s              | 0.016820654s               |

Eager Page Splitting does increase the time it takes to enable dirty
logging when not using initially-all-set, since that's when KVM splits
huge pages. However, this runs in parallel with vCPU execution and drops
the MMU lock whenever there is contention.

             | Config: ept=Y, tdp_mmu=N, 100% writes                   |
             | Enabling dirty logging time                             |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.001618723s               | 0.026362044s               |
4            | 0.003214434s               | 0.052079721s               |
8            | 0.006392632s               | 0.106090423s               |
16           | 0.012733697s               | 0.212877154s               |
32           | 0.060493391s               | 0.438669189s               |
64           | 0.104983842s               | 1.199435360s               |

Similarly, Eager Page Splitting increases the time it takes to clear the
dirty log when using initially-all-set. The first time userspace clears
the dirty log, KVM splits huge pages:

             | Config: ept=Y, tdp_mmu=N, 100% writes initially-all-set |
             | Iteration 1 clear dirty log time                        |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.001599286s               | 0.027095008s               |
4            | 0.003180496s               | 0.053376070s               |
8            | 0.006417064s               | 0.106120759s               |
16           | 0.012826026s               | 0.215146223s               |
32           | 0.027089937s               | 0.444193363s               |
64           | 0.090911208s               | 1.200149758s               |

Subsequent calls to clear the dirty log incur almost no additional cost
since KVM can very quickly determine there are no more huge pages to
split via the RMAP. This is unlike the TDP MMU which must re-traverse
the entire page table to check for huge pages.

             | Config: ept=Y, tdp_mmu=N, 100% writes initially-all-set |
             | Iteration 2 clear dirty log time                        |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.016009056s               | 0.016397093s               |
4            | 0.031897037s               | 0.032602744s               |
8            | 0.063964980s               | 0.064671106s               |
16           | 0.128522311s               | 0.131025221s               |
32           | 0.259647643s               | 0.273217378s               |
64           | 0.647532182s               | 0.705208495s               |


Eager Page Splitting also improves the performance for shadow paging
configurations, as measured with ept=N. The absolute gains are smaller
for write-heavy workloads since KVM's shadow paging takes the write
lock to track 4KiB writes (i.e. no fast_page_fault() or PML), but there
are still major gains for read/write and read-heavy workloads.

             | Config: ept=N, tdp_mmu=Y, 100% writes                   |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.369208141s               | 0.346649742s               |
4            | 0.586295755s               | 0.486400573s               |
8            | 1.605597377s               | 1.433775448s               |
16           | 3.499827702s               | 3.424215648s               |
32           | 9.219003076s               | 8.142864934s               |
64           | 20.876099825s              | 19.342073745s              |

             | Config: ept=N, tdp_mmu=Y, 50% writes                    |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.381155928s               | 0.179281063s               |
4            | 0.530227136s               | 0.262919652s               |
8            | 1.771578193s               | 0.732303441s               |
16           | 3.744348496s               | 1.633369935s               |
32           | 9.558809131s               | 4.343945991s               |
64           | 20.257131790s              | 9.609166533s               |

             | Config: ept=N, tdp_mmu=Y, 5% writes                     |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.385243149s               | 0.020628192s               |
4            | 0.506447280s               | 0.024045445s               |
8            | 1.635995700s               | 0.064108556s               |
16           | 3.752015939s               | 0.131484155s               |
32           | 9.493783332s               | 0.422104253s               |
64           | 21.250099240s              | 0.984948363s               |

Testing
-------

- Ran all kvm-unit-tests and KVM selftests with all combinations of
  ept=[NY] and tdp_mmu=[NY].
- Booted a 32-bit non-PAE kernel with shadow paging to verify the
  quadrant change in patch 3.
- Tested VM live migration [*] with ept=N and ept=Y and observed pages
  being split via tracepoint and the pages_* stats.

[*] The live migration setup consisted of an 8 vCPU 8 GiB VM running
    on an Intel Cascade Lake host and backed by 1GiB HugeTLBFS memory.
    The VM was running Debian 10 and a workload that consisted of 16
    independent processes that each dirty memory. The tests were run
    with ept=N to exercise the interaction of Eager Page Splitting and
    shadow paging.

Version Log
-----------

v3:
 - Add R-b tags from Peter.
 - Explain direct SPs in indirect MMUs in commit message [Peter]
 - Change BUG_ON() to WARN_ON_ONCE() in quadrant calculation [me]
 - Eliminate unnecessary gotos [Peter]
 - Drop mmu_alloc_pte_list_desc() [Peter]
 - Also update access cache in mmu_set_spte() if was_rmapped [Peter]
 - Fix number of gfn bits in shadowed_translation cache [Peter]
 - Pass sp to make_huge_page_split_spte() to derive level and exec [me]
 - Eliminate flush var in kvm_rmap_zap_collapsible_sptes() [Peter]
 - Drop NULL pte_list_desc cache fallback [Peter]
 - Fix get_access to return sp->role.access. [me]
 - Re-use split cache across calls to CLEAR_DIRTY_LOG for better perf [me]
 - Top-up the split cache outside of the MMU lock when possible [me]
 - Refactor prepare_to_split_huge_page() into try_split_huge_page() [me]
 - Collapse PATCH 20, 23, and 24 to avoid intermediate complexity [Peter]
 - Update the RISC-V function stage2_ioremap() [Anup]

v2: https://lore.kernel.org/kvm/20220311002528.2230172-1-dmatlack@google.com/
 - Add performance data for workloads that mix reads and writes [Peter]
 - Collect R-b tags from Ben and Sean.
 - Fix quadrant calculation when deriving role from parent [Sean]
 - Tweak new shadow page function names [Sean]
 - Move set_page_private() to allocation functions [Ben]
 - Only zap collapsible SPTEs up to MAX_LEVEL-1 [Ben]
 - Always top-up pte_list_desc cache to reduce complexity [Ben]
 - Require mmu cache capacity field to be initialized and add WARN()
   to reduce chance of programmer error [Marc]
 - Fix up kvm_mmu_memory_cache struct initialization in arm64 [Marc]

v1: https://lore.kernel.org/kvm/20220203010051.2813563-1-dmatlack@google.com/


David Matlack (23):
  KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
  KVM: x86/mmu: Use a bool for direct
  KVM: x86/mmu: Derive shadow MMU page role from parent
  KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
  KVM: x86/mmu: Pass memslot to kvm_mmu_new_shadow_page()
  KVM: x86/mmu: Separate shadow MMU sp allocation from initialization
  KVM: x86/mmu: Link spt to sp during allocation
  KVM: x86/mmu: Move huge page split sp allocation code to mmu.c
  KVM: x86/mmu: Use common code to free kvm_mmu_page structs
  KVM: x86/mmu: Use common code to allocate shadow pages from vCPU
    caches
  KVM: x86/mmu: Pass const memslot to rmap_add()
  KVM: x86/mmu: Pass const memslot to init_shadow_page() and descendants
  KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  KVM: x86/mmu: Update page stats in __rmap_add()
  KVM: x86/mmu: Cache the access bits of shadowed translations
  KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU
  KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU
  KVM: x86/mmu: Refactor drop_large_spte()
  KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  KVM: Allow GFP flags to be passed when topping up MMU caches
  KVM: x86/mmu: Support Eager Page Splitting in the shadow MMU
  KVM: selftests: Map x86_64 guest virtual memory with huge pages

 .../admin-guide/kernel-parameters.txt         |   3 -
 arch/arm64/include/asm/kvm_host.h             |   2 +-
 arch/arm64/kvm/arm.c                          |   1 +
 arch/arm64/kvm/mmu.c                          |  13 +-
 arch/mips/include/asm/kvm_host.h              |   2 +-
 arch/mips/kvm/mips.c                          |   2 +
 arch/riscv/include/asm/kvm_host.h             |   2 +-
 arch/riscv/kvm/mmu.c                          |  17 +-
 arch/riscv/kvm/vcpu.c                         |   1 +
 arch/x86/include/asm/kvm_host.h               |  20 +-
 arch/x86/include/asm/kvm_page_track.h         |   2 +-
 arch/x86/kvm/mmu/mmu.c                        | 732 ++++++++++++++----
 arch/x86/kvm/mmu/mmu_internal.h               |  27 +-
 arch/x86/kvm/mmu/page_track.c                 |   4 +-
 arch/x86/kvm/mmu/paging_tmpl.h                |  22 +-
 arch/x86/kvm/mmu/spte.c                       |  18 +-
 arch/x86/kvm/mmu/spte.h                       |   2 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    |  48 +-
 arch/x86/kvm/mmu/tdp_mmu.h                    |   2 +-
 arch/x86/kvm/x86.c                            |   6 +
 include/linux/kvm_host.h                      |   1 +
 include/linux/kvm_types.h                     |  19 +-
 .../selftests/kvm/include/x86_64/processor.h  |   6 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |   4 +-
 .../selftests/kvm/lib/x86_64/processor.c      |  31 +
 virt/kvm/kvm_main.c                           |  19 +-
 26 files changed, 780 insertions(+), 226 deletions(-)


base-commit: 19164ad08bf668bca4f4bfbaacaa0a47c1b737a6
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [PATCH v3 00/23] KVM: Extend Eager Page Splitting to the shadow MMU
@ 2022-04-01 17:55 ` David Matlack
  0 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	David Matlack, Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

This series extends KVM's Eager Page Splitting to also split huge pages
mapped by the shadow MMU, i.e. huge pages present in the memslot rmaps.
This will be useful for configurations that use Nested Virtualization,
disable the TDP MMU, or disable/lack TDP hardware support.

For background on Eager Page Splitting, see:
 - Proposal: https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/
 - TDP MMU support: https://lore.kernel.org/kvm/20220119230739.2234394-1-dmatlack@google.com/

Splitting huge pages mapped by the shadow MMU is more complicated than
the TDP MMU, but it is also more important for performance as the shadow
MMU handles huge page write-protection faults under the write lock.  See
the Performance section for more details.

The extra complexity of splitting huge pages mapped by the shadow MMU
comes from a few places:

(1) The shadow MMU has a limit on the number of shadow pages that are
    allowed to be allocated. So, as a policy, Eager Page Splitting
    refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
    pages available.

(2) Huge pages may be mapped by indirect shadow pages.

    - Indirect shadow pages have the possibilty of being unsync. As a
      policy we opt not to split such pages as their translation may no
      longer be valid.
    - Huge pages on indirect shadow pages may have access permission
      constraints from the guest (unlike the TDP MMU which is ACC_ALL
      by default).

(3) Splitting a huge page may end up re-using an existing lower level
    shadow page tables. This is unlike the TDP MMU which always allocates
    new shadow page tables when splitting.

(4) When installing the lower level SPTEs, they must be added to the
    rmap which may require allocating additional pte_list_desc structs.

In Google's internal implementation of Eager Page Splitting, we do not
handle cases (3) and (4), and intstead opts to skip splitting entirely
(case 3) or only partially splitting (case 4). This series handles the
additional cases, which requires an additional 4KiB of memory per VM to
store the extra pte_list_desc cache. However it does also avoids the need
for TLB flushes in most cases and allows KVM to split more pages mapped
by shadow paging.

The bulk of this series is just refactoring the existing MMU code in
preparation for splitting, specifically to make it possible to operate
on the MMU outside of a vCPU context.

Motivation
----------

During dirty logging, VMs using the shadow MMU suffer from:

(1) Write-protection faults on huge pages that take the MMU lock to
    unmap the huge page, map a 4KiB page, and update the dirty log.

(2) Non-present faults caused by (1) that take the MMU lock to map in
    the missing page.

(3) Write-protection faults on 4KiB pages that take the MMU lock to
    make the page writable and update the dirty log. [Note: These faults
    only take the MMU lock during shadow paging.]

The lock contention from (1), (2) and (3) can severely degrade
application performance to the point of failure.  Eager page splitting
eliminates (1) by moving the splitting of huge pages off the vCPU
threads onto the thread invoking VM-ioctls to configure dirty logging,
and eliminates (2) by fully splitting each huge page into its
constituent small pages. (3) is still a concern for shadow paging
workloads (e.g. nested virtualization) but is not addressed by this
series.

Splitting in the VM-ioctl thread is useful because it can run in the
background without interrupting vCPU execution. However, it does take
the MMU lock so it may introduce some extra contention if vCPUs are
hammering the MMU lock. This is offset by the fact that eager page
splitting drops the MMU lock after splitting each SPTE if there is any
contention, and the fact that eager page splitting is reducing the MMU
lock contention from (1) and (2) above. Even workloads that only write
to 5% of their memory see massive MMU lock contention reduction during
dirty logging thanks to Eager Page Splitting (see Performance data
below).

A downside of Eager Page Splitting is that it splits all huge pages,
which may include ranges of memory that are never written to by the
guest and thus could theoretically stay huge. Workloads that write to
only a fraction of their memory may see higher TLB miss costs with Eager
Page Splitting enabled. However, that is secondary to the application
failure that otherwise may occur without Eager Page Splitting.

Further work is necessary to improve the TLB miss performance for
read-heavy workoads, such as dirty logging at 2M instead of 4K.

Performance
-----------

To measure the performance impact of Eager Page Splitting I ran
dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
vCPU, and backed by 1GiB HugeTLB memory. The amount of memory that was
written to versus read was controlled with the -f option.

To measure the imapct of customer performance, we can look at the time
it takes all vCPUs to dirty memory after dirty logging has been enabled.
Without Eager Page Splitting enabled, such dirtying must take faults to
split huge pages and bottleneck on the MMU lock.

             | Config: ept=Y, tdp_mmu=N, 100% writes                   |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.326340139s               | 0.058645119s               |
4            | 0.425658733s               | 0.059211364s               |
8            | 1.392495283s               | 0.059992269s               |
16           | 2.891475203s               | 0.074386427s               |
32           | 7.077453255s               | 0.074484273s               |
64           | 17.907075277s              | 0.080433025s               |

             | Config: ept=Y, tdp_mmu=N, 5% writes                     |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.333304003s               | 0.005811521s               |
4            | 0.418382740s               | 0.006195093s               |
8            | 1.127732621s               | 0.007388453s               |
16           | 3.003522635s               | 0.007854799s               |
32           | 7.341293635s               | 0.012048705s               |
64           | 16.555752029s              | 0.016820654s               |

Eager Page Splitting does increase the time it takes to enable dirty
logging when not using initially-all-set, since that's when KVM splits
huge pages. However, this runs in parallel with vCPU execution and drops
the MMU lock whenever there is contention.

             | Config: ept=Y, tdp_mmu=N, 100% writes                   |
             | Enabling dirty logging time                             |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.001618723s               | 0.026362044s               |
4            | 0.003214434s               | 0.052079721s               |
8            | 0.006392632s               | 0.106090423s               |
16           | 0.012733697s               | 0.212877154s               |
32           | 0.060493391s               | 0.438669189s               |
64           | 0.104983842s               | 1.199435360s               |

Similarly, Eager Page Splitting increases the time it takes to clear the
dirty log for when using initially-all-set. The first time userspace
clears the dirty log, KVM will split huge pages:

             | Config: ept=Y, tdp_mmu=N, 100% writes initially-all-set |
             | Iteration 1 clear dirty log time                        |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.001599286s               | 0.027095008s               |
4            | 0.003180496s               | 0.053376070s               |
8            | 0.006417064s               | 0.106120759s               |
16           | 0.012826026s               | 0.215146223s               |
32           | 0.027089937s               | 0.444193363s               |
64           | 0.090911208s               | 1.200149758s               |

Subsequent calls to clear the dirty log incur almost no additional cost
since KVM can very quickly determine there are no more huge pages to
split via the RMAP. This is unlike the TDP MMU which must re-traverse
the entire page table to check for huge pages.

             | Config: ept=Y, tdp_mmu=N, 100% writes initially-all-set |
             | Iteration 2 clear dirty log time                        |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.016009056s               | 0.016397093s               |
4            | 0.031897037s               | 0.032602744s               |
8            | 0.063964980s               | 0.064671106s               |
16           | 0.128522311s               | 0.131025221s               |
32           | 0.259647643s               | 0.273217378s               |
64           | 0.647532182s               | 0.705208495s               |


Eager Page Splitting also improves the performance for shadow paging
configurations, as measured with ept=N. Although the absolute gains are
less for write-heavy workloads since KVM's shadow paging takes the write
lock to track 4KiB writes (i.e. no fast_page_faut() or PML). However
there are still major gains for read/write and read-heavy workloads.

             | Config: ept=N, tdp_mmu=Y, 100% writes                   |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.369208141s               | 0.346649742s               |
4            | 0.586295755s               | 0.486400573s               |
8            | 1.605597377s               | 1.433775448s               |
16           | 3.499827702s               | 3.424215648s               |
32           | 9.219003076s               | 8.142864934s               |
64           | 20.876099825s              | 19.342073745s              |

             | Config: ept=N, tdp_mmu=Y, 50% writes                    |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.381155928s               | 0.179281063s               |
4            | 0.530227136s               | 0.262919652s               |
8            | 1.771578193s               | 0.732303441s               |
16           | 3.744348496s               | 1.633369935s               |
32           | 9.558809131s               | 4.343945991s               |
64           | 20.257131790s              | 9.609166533s               |

             | Config: ept=N, tdp_mmu=Y, 5% writes                     |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.385243149s               | 0.020628192s               |
4            | 0.506447280s               | 0.024045445s               |
8            | 1.635995700s               | 0.064108556s               |
16           | 3.752015939s               | 0.131484155s               |
32           | 9.493783332s               | 0.422104253s               |
64           | 21.250099240s              | 0.984948363s               |

Testing
-------

- Ran all kvm-unit-tests and KVM selftests with all combinations of
  ept=[NY] and tdp_mmu=[NY].
- Booted a 32-bit non-PAE kernel with shadow paging to verify the
  quadrant change in patch 3.
- Tested VM live migration [*] with ept=N and ept=Y and observed pages
  being split via tracepoint and the pages_* stats.

[*] The live migration setup consisted of an 8 vCPU 8 GiB VM running
    on an Intel Cascade Lake host and backed by 1GiB HugeTLBFS memory.
    The VM was running Debian 10 and a workload that consisted of 16
    independent processes that each dirty memory. The tests were run
    with ept=N to exercise the interaction of Eager Page Splitting and
    shadow paging.

Version Log
-----------

v3:
 - Add R-b tags from Peter.
 - Explain direct SPs in indirect MMUs in commit message [Peter]
 - Change BUG_ON() to WARN_ON_ONCE() in quadrant calculation [me]
 - Eliminate unnecessary gotos [Peter]
 - Drop mmu_alloc_pte_list_desc() [Peter]
 - Also update access cache in mmu_set_spte() if was_rmapped [Peter]
 - Fix number of gfn bits in shadowed_translation cache [Peter]
 - Pass sp to make_huge_page_split_spte() to derive level and exec [me]
 - Eliminate flush var in kvm_rmap_zap_collapsible_sptes() [Peter]
 - Drop NULL pte_list_desc cache fallback [Peter]
 - Fix get_access to return sp->role.access. [me]
 - Re-use split cache across calls to CLEAR_DIRTY_LOG for better perf [me]
 - Top-up the split cache outside of the MMU lock when possible [me]
 - Refactor prepare_to_split_huge_page() into try_split_huge_page() [me]
 - Collapse PATCH 20, 23, and 24 to avoid intermediate complexity [Peter]
 - Update the RISC-V function stage2_ioremap() [Anup]

v2: https://lore.kernel.org/kvm/20220311002528.2230172-1-dmatlack@google.com/
 - Add performance data for workloads that mix reads and writes [Peter]
 - Collect R-b tags from Ben and Sean.
 - Fix quadrant calculation when deriving role from parent [Sean]
 - Tweak new shadow page function names [Sean]
 - Move set_page_private() to allocation functions [Ben]
 - Only zap collapsible SPTEs up to MAX_LEVEL-1 [Ben]
 - Always top-up pte_list_desc cache to reduce complexity [Ben]
 - Require mmu cache capacity field to be initialized and add WARN()
   to reduce chance of programmer error [Marc]
 - Fix up kvm_mmu_memory_cache struct initialization in arm64 [Marc]

v1: https://lore.kernel.org/kvm/20220203010051.2813563-1-dmatlack@google.com/


David Matlack (23):
  KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
  KVM: x86/mmu: Use a bool for direct
  KVM: x86/mmu: Derive shadow MMU page role from parent
  KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
  KVM: x86/mmu: Pass memslot to kvm_mmu_new_shadow_page()
  KVM: x86/mmu: Separate shadow MMU sp allocation from initialization
  KVM: x86/mmu: Link spt to sp during allocation
  KVM: x86/mmu: Move huge page split sp allocation code to mmu.c
  KVM: x86/mmu: Use common code to free kvm_mmu_page structs
  KVM: x86/mmu: Use common code to allocate shadow pages from vCPU
    caches
  KVM: x86/mmu: Pass const memslot to rmap_add()
  KVM: x86/mmu: Pass const memslot to init_shadow_page() and descendants
  KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  KVM: x86/mmu: Update page stats in __rmap_add()
  KVM: x86/mmu: Cache the access bits of shadowed translations
  KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU
  KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU
  KVM: x86/mmu: Refactor drop_large_spte()
  KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  KVM: Allow GFP flags to be passed when topping up MMU caches
  KVM: x86/mmu: Support Eager Page Splitting in the shadow MMU
  KVM: selftests: Map x86_64 guest virtual memory with huge pages

 .../admin-guide/kernel-parameters.txt         |   3 -
 arch/arm64/include/asm/kvm_host.h             |   2 +-
 arch/arm64/kvm/arm.c                          |   1 +
 arch/arm64/kvm/mmu.c                          |  13 +-
 arch/mips/include/asm/kvm_host.h              |   2 +-
 arch/mips/kvm/mips.c                          |   2 +
 arch/riscv/include/asm/kvm_host.h             |   2 +-
 arch/riscv/kvm/mmu.c                          |  17 +-
 arch/riscv/kvm/vcpu.c                         |   1 +
 arch/x86/include/asm/kvm_host.h               |  20 +-
 arch/x86/include/asm/kvm_page_track.h         |   2 +-
 arch/x86/kvm/mmu/mmu.c                        | 732 ++++++++++++++----
 arch/x86/kvm/mmu/mmu_internal.h               |  27 +-
 arch/x86/kvm/mmu/page_track.c                 |   4 +-
 arch/x86/kvm/mmu/paging_tmpl.h                |  22 +-
 arch/x86/kvm/mmu/spte.c                       |  18 +-
 arch/x86/kvm/mmu/spte.h                       |   2 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    |  48 +-
 arch/x86/kvm/mmu/tdp_mmu.h                    |   2 +-
 arch/x86/kvm/x86.c                            |   6 +
 include/linux/kvm_host.h                      |   1 +
 include/linux/kvm_types.h                     |  19 +-
 .../selftests/kvm/include/x86_64/processor.h  |   6 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |   4 +-
 .../selftests/kvm/lib/x86_64/processor.c      |  31 +
 virt/kvm/kvm_main.c                           |  19 +-
 26 files changed, 780 insertions(+), 226 deletions(-)


base-commit: 19164ad08bf668bca4f4bfbaacaa0a47c1b737a6
-- 
2.35.1.1094.g7c7d902a7c-goog

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [PATCH v3 01/23] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Commit fb58a9c345f6 ("KVM: x86/mmu: Optimize MMU page cache lookup for
fully direct MMUs") skipped the unsync checks and write flood clearing
for full direct MMUs. We can extend this further to skip the checks for
all direct shadow pages. Direct shadow pages in indirect MMUs (i.e.
shadow paging) are used when shadowing a guest huge page with smaller
pages. Such direct shadow pages, like their counterparts in fully direct
MMUs, are never marked unsync and never have a non-zero write-flooding
count.

Checking sp->role.direct also generates better code than checking
direct_map because, due to register pressure, direct_map has to get
shoved onto the stack and then pulled back off.

No functional change intended.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1361eb4599b4..dbfda133adbe 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2034,7 +2034,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     int direct,
 					     unsigned int access)
 {
-	bool direct_mmu = vcpu->arch.mmu->direct_map;
 	union kvm_mmu_page_role role;
 	struct hlist_head *sp_list;
 	unsigned quadrant;
@@ -2075,7 +2074,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			continue;
 		}
 
-		if (direct_mmu)
+		/* unsync and write-flooding only apply to indirect SPs. */
+		if (sp->role.direct)
 			goto trace_get_page;
 
 		if (sp->unsync) {
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 02/23] KVM: x86/mmu: Use a bool for direct
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

The parameter "direct" can either be true or false, and all of the
callers pass in a bool variable or true/false literal, so just use the
type bool.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index dbfda133adbe..1c8d157c097b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1706,7 +1706,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 	mmu_spte_clear_no_track(parent_pte);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
+static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, bool direct)
 {
 	struct kvm_mmu_page *sp;
 
@@ -2031,7 +2031,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     gfn_t gfn,
 					     gva_t gaddr,
 					     unsigned level,
-					     int direct,
+					     bool direct,
 					     unsigned int access)
 {
 	union kvm_mmu_page_role role;
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 03/23] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Instead of computing the shadow page role from scratch for every new
page, we can derive most of the information from the parent shadow page.
This avoids redundant calculations and reduces the number of parameters
to kvm_mmu_get_page().

Preemptively split out the role calculation to a separate function for
use in a following commit.

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c         | 96 +++++++++++++++++++++++-----------
 arch/x86/kvm/mmu/paging_tmpl.h |  9 ++--
 2 files changed, 71 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1c8d157c097b..8253d68cc30b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2027,30 +2027,14 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
-					     gfn_t gfn,
-					     gva_t gaddr,
-					     unsigned level,
-					     bool direct,
-					     unsigned int access)
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+					     union kvm_mmu_page_role role)
 {
-	union kvm_mmu_page_role role;
 	struct hlist_head *sp_list;
-	unsigned quadrant;
 	struct kvm_mmu_page *sp;
 	int collisions = 0;
 	LIST_HEAD(invalid_list);
 
-	role = vcpu->arch.mmu->mmu_role.base;
-	role.level = level;
-	role.direct = direct;
-	role.access = access;
-	if (role.has_4_byte_gpte) {
-		quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
-		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
-		role.quadrant = quadrant;
-	}
-
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
 		if (sp->gfn != gfn) {
@@ -2068,7 +2052,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			 * Unsync pages must not be left as is, because the new
 			 * upper-level page will be write-protected.
 			 */
-			if (level > PG_LEVEL_4K && sp->unsync)
+			if (role.level > PG_LEVEL_4K && sp->unsync)
 				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 							 &invalid_list);
 			continue;
@@ -2107,14 +2091,14 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 
 	++vcpu->kvm->stat.mmu_cache_miss;
 
-	sp = kvm_mmu_alloc_page(vcpu, direct);
+	sp = kvm_mmu_alloc_page(vcpu, role.direct);
 
 	sp->gfn = gfn;
 	sp->role = role;
 	hlist_add_head(&sp->hash_link, sp_list);
-	if (!direct) {
+	if (!role.direct) {
 		account_shadowed(vcpu->kvm, sp);
-		if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
+		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
 	}
 	trace_kvm_mmu_get_page(sp, true);
@@ -2126,6 +2110,51 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
+static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
+{
+	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
+	union kvm_mmu_page_role role;
+
+	role = parent_sp->role;
+	role.level--;
+	role.access = access;
+	role.direct = direct;
+
+	/*
+	 * If the guest has 4-byte PTEs then that means it's using 32-bit,
+	 * 2-level, non-PAE paging. KVM shadows such guests using 4 PAE page
+	 * directories, each mapping 1/4 of the guest's linear address space
+	 * (1GiB). The shadow pages for those 4 page directories are
+	 * pre-allocated and assigned a separate quadrant in their role.
+	 *
+	 * Since we are allocating a child shadow page and there are only 2
+	 * levels, this must be a PG_LEVEL_4K shadow page. Here the quadrant
+	 * will either be 0 or 1 because it maps 1/2 of the address space mapped
+	 * by the guest's PG_LEVEL_4K page table (or 4MiB huge page) that it
+	 * is shadowing. In this case, the quadrant can be derived by the index
+	 * of the SPTE that points to the new child shadow page in the page
+	 * directory (parent_sp). Specifically, every 2 SPTEs in parent_sp
+	 * shadow one half of a guest's page table (or 4MiB huge page) so the
+	 * quadrant is just the parity of the index of the SPTE.
+	 */
+	if (role.has_4_byte_gpte) {
+		WARN_ON_ONCE(role.level != PG_LEVEL_4K);
+		role.quadrant = (sptep - parent_sp->spt) % 2;
+	}
+
+	return role;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
+						 u64 *sptep, gfn_t gfn,
+						 bool direct, u32 access)
+{
+	union kvm_mmu_page_role role;
+
+	role = kvm_mmu_child_role(sptep, direct, access);
+	return kvm_mmu_get_page(vcpu, gfn, role);
+}
+
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
 					struct kvm_vcpu *vcpu, hpa_t root,
 					u64 addr)
@@ -2930,8 +2959,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (is_shadow_present_pte(*it.sptep))
 			continue;
 
-		sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
-				      it.level - 1, true, ACC_ALL);
+		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
 
 		link_shadow_page(vcpu, it.sptep, sp);
 		if (fault->is_tdp && fault->huge_page_disallowed &&
@@ -3313,12 +3341,21 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
 	return ret;
 }
 
-static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
+static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
 			    u8 level, bool direct)
 {
+	union kvm_mmu_page_role role;
 	struct kvm_mmu_page *sp;
 
-	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
+	role = vcpu->arch.mmu->mmu_role.base;
+	role.level = level;
+	role.direct = direct;
+	role.access = ACC_ALL;
+
+	if (role.has_4_byte_gpte)
+		role.quadrant = quadrant;
+
+	sp = kvm_mmu_get_page(vcpu, gfn, role);
 	++sp->root_count;
 
 	return __pa(sp->spt);
@@ -3352,8 +3389,8 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 		for (i = 0; i < 4; ++i) {
 			WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
 
-			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
-					      i << 30, PT32_ROOT_LEVEL, true);
+			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), i,
+					      PT32_ROOT_LEVEL, true);
 			mmu->pae_root[i] = root | PT_PRESENT_MASK |
 					   shadow_me_mask;
 		}
@@ -3522,8 +3559,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 			root_gfn = pdptrs[i] >> PAGE_SHIFT;
 		}
 
-		root = mmu_alloc_root(vcpu, root_gfn, i << 30,
-				      PT32_ROOT_LEVEL, false);
+		root = mmu_alloc_root(vcpu, root_gfn, i, PT32_ROOT_LEVEL, false);
 		mmu->pae_root[i] = root | pm_mask;
 	}
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 8621188b46df..729394de2658 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -683,8 +683,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		if (!is_shadow_present_pte(*it.sptep)) {
 			table_gfn = gw->table_gfn[it.level - 2];
 			access = gw->pt_access[it.level - 2];
-			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
-					      it.level-1, false, access);
+			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
+						  false, access);
+
 			/*
 			 * We must synchronize the pagetable before linking it
 			 * because the guest doesn't need to flush tlb when
@@ -740,8 +741,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		drop_large_spte(vcpu, it.sptep);
 
 		if (!is_shadow_present_pte(*it.sptep)) {
-			sp = kvm_mmu_get_page(vcpu, base_gfn, fault->addr,
-					      it.level - 1, true, direct_access);
+			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
+						  true, direct_access);
 			link_shadow_page(vcpu, it.sptep, sp);
 			if (fault->huge_page_disallowed &&
 			    fault->req_level >= it.level)
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 04/23] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Decompose kvm_mmu_get_page() into separate helper functions to increase
readability and prepare for allocating shadow pages without a vcpu
pointer.

Specifically, pull the guts of kvm_mmu_get_page() into 3 helper
functions:

__kvm_mmu_find_shadow_page() -
  Walks the page hash checking for any existing mmu pages that match the
  given gfn and role. Does not attempt to synchronize the page if it is
  unsync.

kvm_mmu_find_shadow_page() -
  Wraps __kvm_mmu_find_shadow_page() and handles syncing if necessary.

kvm_mmu_new_shadow_page() -
  Allocates and initializes an entirely new kvm_mmu_page. This currently
  requires a vcpu pointer for allocation and for looking up the memslot,
  but that will be removed in a future commit.

  Note, kvm_mmu_new_shadow_page() is temporary and will be removed in a
  subsequent commit. The name uses "new" rather than the more typical
  "alloc" to avoid clashing with the existing kvm_mmu_alloc_page().

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c         | 124 +++++++++++++++++++++++----------
 arch/x86/kvm/mmu/paging_tmpl.h |   5 +-
 arch/x86/kvm/mmu/spte.c        |   5 +-
 3 files changed, 94 insertions(+), 40 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8253d68cc30b..8fdddd25029d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2027,16 +2027,25 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-					     union kvm_mmu_page_role role)
+/*
+ * Searches for an existing SP for the given gfn and role. Makes no attempt to
+ * sync the SP if it is marked unsync.
+ *
+ * If creating an upper-level page table, zaps unsynced pages for the same
+ * gfn and adds them to the invalid_list. It is the caller's responsibility
+ * to call kvm_mmu_commit_zap_page() on invalid_list.
+ */
+static struct kvm_mmu_page *__kvm_mmu_find_shadow_page(struct kvm *kvm,
+						       gfn_t gfn,
+						       union kvm_mmu_page_role role,
+						       struct list_head *invalid_list)
 {
 	struct hlist_head *sp_list;
 	struct kvm_mmu_page *sp;
 	int collisions = 0;
-	LIST_HEAD(invalid_list);
 
-	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
-	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
+	sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+	for_each_valid_sp(kvm, sp, sp_list) {
 		if (sp->gfn != gfn) {
 			collisions++;
 			continue;
@@ -2053,60 +2062,103 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 			 * upper-level page will be write-protected.
 			 */
 			if (role.level > PG_LEVEL_4K && sp->unsync)
-				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
-							 &invalid_list);
+				kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
+
 			continue;
 		}
 
-		/* unsync and write-flooding only apply to indirect SPs. */
-		if (sp->role.direct)
-			goto trace_get_page;
+		/* Write-flooding is only tracked for indirect SPs. */
+		if (!sp->role.direct)
+			__clear_sp_write_flooding_count(sp);
 
-		if (sp->unsync) {
-			/*
-			 * The page is good, but is stale.  kvm_sync_page does
-			 * get the latest guest state, but (unlike mmu_unsync_children)
-			 * it doesn't write-protect the page or mark it synchronized!
-			 * This way the validity of the mapping is ensured, but the
-			 * overhead of write protection is not incurred until the
-			 * guest invalidates the TLB mapping.  This allows multiple
-			 * SPs for a single gfn to be unsync.
-			 *
-			 * If the sync fails, the page is zapped.  If so, break
-			 * in order to rebuild it.
-			 */
-			if (!kvm_sync_page(vcpu, sp, &invalid_list))
-				break;
+		goto out;
+	}
 
+	sp = NULL;
+
+out:
+	if (collisions > kvm->stat.max_mmu_page_hash_collisions)
+		kvm->stat.max_mmu_page_hash_collisions = collisions;
+
+	return sp;
+}
+
+/*
+ * Looks up an existing SP for the given gfn and role. The returned SP, if
+ * any, is guaranteed to be synced.
+ */
+static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
+						     gfn_t gfn,
+						     union kvm_mmu_page_role role)
+{
+	struct kvm_mmu_page *sp;
+	LIST_HEAD(invalid_list);
+
+	sp = __kvm_mmu_find_shadow_page(vcpu->kvm, gfn, role, &invalid_list);
+
+	if (sp && sp->unsync) {
+		/*
+		 * The page is good, but is stale.  kvm_sync_page does
+		 * get the latest guest state, but (unlike mmu_unsync_children)
+		 * it doesn't write-protect the page or mark it synchronized!
+		 * This way the validity of the mapping is ensured, but the
+		 * overhead of write protection is not incurred until the
+		 * guest invalidates the TLB mapping.  This allows multiple
+		 * SPs for a single gfn to be unsync.
+		 *
+		 * If the sync fails, the page is zapped and added to the
+		 * invalid_list.
+		 */
+		if (kvm_sync_page(vcpu, sp, &invalid_list)) {
 			WARN_ON(!list_empty(&invalid_list));
 			kvm_flush_remote_tlbs(vcpu->kvm);
+		} else {
+			sp = NULL;
 		}
+	}
 
-		__clear_sp_write_flooding_count(sp);
+	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+	return sp;
+}
 
-trace_get_page:
-		trace_kvm_mmu_get_page(sp, false);
-		goto out;
-	}
+static struct kvm_mmu_page *kvm_mmu_new_shadow_page(struct kvm_vcpu *vcpu,
+						    gfn_t gfn,
+						    union kvm_mmu_page_role role)
+{
+	struct kvm_mmu_page *sp;
+	struct hlist_head *sp_list;
 
 	++vcpu->kvm->stat.mmu_cache_miss;
 
 	sp = kvm_mmu_alloc_page(vcpu, role.direct);
-
 	sp->gfn = gfn;
 	sp->role = role;
+
+	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	hlist_add_head(&sp->hash_link, sp_list);
+
 	if (!role.direct) {
 		account_shadowed(vcpu->kvm, sp);
 		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
 	}
-	trace_kvm_mmu_get_page(sp, true);
-out:
-	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
 
-	if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions)
-		vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions;
+	return sp;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+					     union kvm_mmu_page_role role)
+{
+	struct kvm_mmu_page *sp;
+	bool created = false;
+
+	sp = kvm_mmu_find_shadow_page(vcpu, gfn, role);
+	if (!sp) {
+		created = true;
+		sp = kvm_mmu_new_shadow_page(vcpu, gfn, role);
+	}
+
+	trace_kvm_mmu_get_page(sp, created);
 	return sp;
 }
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 729394de2658..db63b5377465 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -692,8 +692,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 			 * the gpte is changed from non-present to present.
 			 * Otherwise, the guest may use the wrong mapping.
 			 *
-			 * For PG_LEVEL_4K, kvm_mmu_get_page() has already
-			 * synchronized it transiently via kvm_sync_page().
+			 * For PG_LEVEL_4K, kvm_mmu_find_shadow_page() has
+			 * already synchronized it transiently via
+			 * kvm_sync_page().
 			 *
 			 * For higher level pagetable, we synchronize it via
 			 * the slower mmu_sync_children().  If it needs to
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4739b53c9734..d10189d9c877 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -150,8 +150,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		/*
 		 * Optimization: for pte sync, if spte was writable the hash
 		 * lookup is unnecessary (and expensive). Write protection
-		 * is responsibility of kvm_mmu_get_page / kvm_mmu_sync_roots.
-		 * Same reasoning can be applied to dirty page accounting.
+		 * is the responsibility of kvm_mmu_new_shadow_page() and
+		 * kvm_mmu_sync_roots(). Same reasoning can be applied to dirty
+		 * page accounting.
 		 */
 		if (is_writable_pte(old_spte))
 			goto out;
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 05/23] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Rename 3 functions:

  kvm_mmu_get_page()   -> kvm_mmu_get_shadow_page()
  kvm_mmu_alloc_page() -> kvm_mmu_alloc_shadow_page()
  kvm_mmu_free_page()  -> kvm_mmu_free_shadow_page()

This change makes it clear that these functions deal with shadow pages
rather than struct pages. Prefer "shadow_page" over the shorter "sp"
since these are core routines.

Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8fdddd25029d..dc1825de0752 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1668,7 +1668,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
 	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
-static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
+static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
 {
 	MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
 	hlist_del(&sp->hash_link);
@@ -1706,7 +1706,8 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 	mmu_spte_clear_no_track(parent_pte);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, bool direct)
+static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
+						      bool direct)
 {
 	struct kvm_mmu_page *sp;
 
@@ -2130,7 +2131,7 @@ static struct kvm_mmu_page *kvm_mmu_new_shadow_page(struct kvm_vcpu *vcpu,
 
 	++vcpu->kvm->stat.mmu_cache_miss;
 
-	sp = kvm_mmu_alloc_page(vcpu, role.direct);
+	sp = kvm_mmu_alloc_shadow_page(vcpu, role.direct);
 	sp->gfn = gfn;
 	sp->role = role;
 
@@ -2146,8 +2147,9 @@ static struct kvm_mmu_page *kvm_mmu_new_shadow_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-					     union kvm_mmu_page_role role)
+static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+						    gfn_t gfn,
+						    union kvm_mmu_page_role role)
 {
 	struct kvm_mmu_page *sp;
 	bool created = false;
@@ -2204,7 +2206,7 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
 	union kvm_mmu_page_role role;
 
 	role = kvm_mmu_child_role(sptep, direct, access);
-	return kvm_mmu_get_page(vcpu, gfn, role);
+	return kvm_mmu_get_shadow_page(vcpu, gfn, role);
 }
 
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
@@ -2480,7 +2482,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 
 	list_for_each_entry_safe(sp, nsp, invalid_list, link) {
 		WARN_ON(!sp->role.invalid || sp->root_count);
-		kvm_mmu_free_page(sp);
+		kvm_mmu_free_shadow_page(sp);
 	}
 }
 
@@ -3407,7 +3409,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
 	if (role.has_4_byte_gpte)
 		role.quadrant = quadrant;
 
-	sp = kvm_mmu_get_page(vcpu, gfn, role);
+	sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
 	++sp->root_count;
 
 	return __pa(sp->spt);
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 06/23] KVM: x86/mmu: Pass memslot to kvm_mmu_new_shadow_page()
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Passing the memslot to kvm_mmu_new_shadow_page() avoids the need for the
vCPU pointer when write-protecting indirect 4K shadow pages. This moves
us closer to being able to create new shadow pages during VM ioctls for
eager page splitting, where there is no vCPU pointer.

This change does not negatively impact "Populate memory time" for ept=Y
or ept=N configurations since kvm_vcpu_gfn_to_memslot() caches the last
use slot. So even though we now look up the slot more often, it is a
very cheap check.

Opportunistically move the code to write-protect GFNs shadowed by
PG_LEVEL_4K shadow pages into account_shadowed() to reduce indentation
and consolidate the code. This also eliminates a memslot lookup.

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index dc1825de0752..abfb3e5d1372 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -793,16 +793,14 @@ void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn)
 	update_gfn_disallow_lpage_count(slot, gfn, -1);
 }
 
-static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
+static void account_shadowed(struct kvm *kvm,
+			     struct kvm_memory_slot *slot,
+			     struct kvm_mmu_page *sp)
 {
-	struct kvm_memslots *slots;
-	struct kvm_memory_slot *slot;
 	gfn_t gfn;
 
 	kvm->arch.indirect_shadow_pages++;
 	gfn = sp->gfn;
-	slots = kvm_memslots_for_spte_role(kvm, sp->role);
-	slot = __gfn_to_memslot(slots, gfn);
 
 	/* the non-leaf shadow pages are keeping readonly. */
 	if (sp->role.level > PG_LEVEL_4K)
@@ -810,6 +808,9 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 						    KVM_PAGE_TRACK_WRITE);
 
 	kvm_mmu_gfn_disallow_lpage(slot, gfn);
+
+	if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
+		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
 }
 
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
@@ -2123,6 +2124,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
 }
 
 static struct kvm_mmu_page *kvm_mmu_new_shadow_page(struct kvm_vcpu *vcpu,
+						    struct kvm_memory_slot *slot,
 						    gfn_t gfn,
 						    union kvm_mmu_page_role role)
 {
@@ -2138,11 +2140,8 @@ static struct kvm_mmu_page *kvm_mmu_new_shadow_page(struct kvm_vcpu *vcpu,
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	hlist_add_head(&sp->hash_link, sp_list);
 
-	if (!role.direct) {
-		account_shadowed(vcpu->kvm, sp);
-		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
-			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
-	}
+	if (!role.direct)
+		account_shadowed(vcpu->kvm, slot, sp);
 
 	return sp;
 }
@@ -2151,13 +2150,15 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
 						    gfn_t gfn,
 						    union kvm_mmu_page_role role)
 {
+	struct kvm_memory_slot *slot;
 	struct kvm_mmu_page *sp;
 	bool created = false;
 
 	sp = kvm_mmu_find_shadow_page(vcpu, gfn, role);
 	if (!sp) {
 		created = true;
-		sp = kvm_mmu_new_shadow_page(vcpu, gfn, role);
+		slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+		sp = kvm_mmu_new_shadow_page(vcpu, slot, gfn, role);
 	}
 
 	trace_kvm_mmu_get_page(sp, created);
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 07/23] KVM: x86/mmu: Separate shadow MMU sp allocation from initialization
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Separate the code that allocates a new shadow page from the vCPU caches
from the code that initializes it. This is in preparation for creating
new shadow pages from VM ioctls for eager page splitting, where we do
not have access to the vCPU caches.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 38 ++++++++++++++++++--------------------
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index abfb3e5d1372..421fcbc97f9e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1716,16 +1716,9 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
 	if (!direct)
 		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
+
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
-	/*
-	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
-	 * depends on valid pages being added to the head of the list.  See
-	 * comments in kvm_zap_obsolete_pages().
-	 */
-	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
-	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
-	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
 	return sp;
 }
 
@@ -2123,27 +2116,31 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
-static struct kvm_mmu_page *kvm_mmu_new_shadow_page(struct kvm_vcpu *vcpu,
-						    struct kvm_memory_slot *slot,
-						    gfn_t gfn,
-						    union kvm_mmu_page_role role)
+static void init_shadow_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+			     struct kvm_memory_slot *slot, gfn_t gfn,
+			     union kvm_mmu_page_role role)
 {
-	struct kvm_mmu_page *sp;
 	struct hlist_head *sp_list;
 
-	++vcpu->kvm->stat.mmu_cache_miss;
+	++kvm->stat.mmu_cache_miss;
 
-	sp = kvm_mmu_alloc_shadow_page(vcpu, role.direct);
 	sp->gfn = gfn;
 	sp->role = role;
+	sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;
 
-	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+	/*
+	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
+	 * depends on valid pages being added to the head of the list.  See
+	 * comments in kvm_zap_obsolete_pages().
+	 */
+	list_add(&sp->link, &kvm->arch.active_mmu_pages);
+	kvm_mod_used_mmu_pages(kvm, 1);
+
+	sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	hlist_add_head(&sp->hash_link, sp_list);
 
 	if (!role.direct)
-		account_shadowed(vcpu->kvm, slot, sp);
-
-	return sp;
+		account_shadowed(kvm, slot, sp);
 }
 
 static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
@@ -2158,7 +2155,8 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
 	if (!sp) {
 		created = true;
 		slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-		sp = kvm_mmu_new_shadow_page(vcpu, slot, gfn, role);
+		sp = kvm_mmu_alloc_shadow_page(vcpu, role.direct);
+		init_shadow_page(vcpu->kvm, sp, slot, gfn, role);
 	}
 
 	trace_kvm_mmu_get_page(sp, created);
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 08/23] KVM: x86/mmu: Link spt to sp during allocation
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Link the shadow page table to the sp (via set_page_private()) during
allocation rather than initialization. This is a more logical place to
do it because allocation time is also where we do the reverse link
(setting sp->spt).

This creates one extra call to set_page_private(), but having multiple
calls to set_page_private() is unavoidable anyway. We either do
set_page_private() during allocation, which requires 1 per allocation
function, or we do it during initialization, which requires 1 per
initialization function.
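The bidirectional link being discussed can be illustrated with a small user-space sketch (invented types; KVM's real code stores the back-pointer in struct page's private field via set_page_private()):

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for struct kvm_mmu_page and its page table page. */
struct page_table {
	unsigned long private;		/* back-pointer, like struct page::private */
	unsigned long entries[512];
};

struct shadow_page {
	struct page_table *spt;		/* forward link: sp -> page table */
};

/*
 * Allocate and wire up both directions in one place, mirroring the
 * patch's choice to do the reverse link at allocation time, right next
 * to where sp->spt is set. (Error handling omitted for brevity.)
 */
static struct shadow_page *alloc_shadow_page(void)
{
	struct shadow_page *sp = calloc(1, sizeof(*sp));

	sp->spt = calloc(1, sizeof(*sp->spt));
	/* reverse link: page table -> sp, analogous to set_page_private() */
	sp->spt->private = (unsigned long)sp;
	return sp;
}
```

Keeping both links in the allocator means no initialization path can observe a page table whose back-pointer is stale or unset.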

No functional change intended.

Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b3b6426725d4..17354e55735f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -274,6 +274,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
 	return sp;
 }
@@ -281,8 +282,6 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
 static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
 			    gfn_t gfn, union kvm_mmu_page_role role)
 {
-	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
-
 	sp->role = role;
 	sp->gfn = gfn;
 	sp->ptep = sptep;
@@ -1435,6 +1434,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
 		return NULL;
 	}
 
+	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+
 	return sp;
 }
 
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 09/23] KVM: x86/mmu: Move huge page split sp allocation code to mmu.c
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Move the code that allocates a new shadow page for splitting huge pages
into mmu.c. Currently this code is only used by the TDP MMU but it will
be reused in subsequent commits to also split huge pages mapped by the
shadow MMU. Move the GFP flags calculation down into the allocation code
so that it does not have to be duplicated when the shadow MMU needs to
start allocating SPs for splitting.

Preemptively split out the GFP flags calculation into a separate helper
for use in a subsequent commit that adds support for eager page
splitting to the shadow MMU.
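The allocate-under-lock control flow that this code implements can be sketched in user space (a hypothetical illustration with fault injection standing in for GFP_NOWAIT failure; none of these names are KVM's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Test knob: how many non-blocking allocations should fail. */
static int failures_to_inject;

/* Hypothetical allocator that may fail when non-blocking is requested. */
static void *try_alloc(bool may_block)
{
	static int storage[16];

	if (!may_block && failures_to_inject > 0) {
		failures_to_inject--;
		return NULL;	/* GFP_NOWAIT-style failure: no reclaim allowed */
	}
	return storage;
}

/*
 * Mirrors tdp_mmu_alloc_sp_for_split()'s shape: first try a non-blocking
 * allocation while "holding the MMU lock"; on failure, note that the lock
 * would be dropped (and the iterator yielded) and retry with blocking and
 * direct reclaim allowed.
 */
static void *alloc_for_split(bool *dropped_lock)
{
	void *p = try_alloc(false);	/* under lock: GFP_NOWAIT analogue */

	if (p) {
		*dropped_lock = false;
		return p;
	}
	*dropped_lock = true;		/* would unlock + yield here */
	return try_alloc(true);		/* retry: GFP_KERNEL analogue */
}
```

The design point is that only the fallback path pays the cost of dropping the lock; when NOWAIT succeeds, splitting proceeds without yielding.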

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 37 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/mmu_internal.h |  2 ++
 arch/x86/kvm/mmu/tdp_mmu.c      | 34 ++----------------------------
 3 files changed, 41 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 421fcbc97f9e..657c2a906c12 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1722,6 +1722,43 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
+static inline gfp_t gfp_flags_for_split(bool locked)
+{
+	/*
+	 * If under the MMU lock, use GFP_NOWAIT to avoid direct reclaim (which
+	 * is slow) and to avoid making any filesystem callbacks (which can end
+	 * up invoking KVM MMU notifiers, resulting in a deadlock).
+	 */
+	return (locked ? GFP_NOWAIT : GFP_KERNEL) | __GFP_ACCOUNT;
+}
+
+/*
+ * Allocate a new shadow page, potentially while holding the MMU lock.
+ *
+ * Huge page splitting always uses direct shadow pages since the huge page is
+ * being mapped directly with a lower level page table. Thus there's no need to
+ * allocate the gfns array.
+ */
+struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(bool locked)
+{
+	gfp_t gfp = gfp_flags_for_split(locked) | __GFP_ZERO;
+	struct kvm_mmu_page *sp;
+
+	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
+	if (!sp)
+		return NULL;
+
+	sp->spt = (void *)__get_free_page(gfp);
+	if (!sp->spt) {
+		kmem_cache_free(mmu_page_header_cache, sp);
+		return NULL;
+	}
+
+	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+
+	return sp;
+}
+
 static void mark_unsync(u64 *spte);
 static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
 {
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1bff453f7cbe..a0648e7ddd33 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -171,4 +171,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
+struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(bool locked);
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 17354e55735f..34e581bcaaf6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1418,43 +1418,13 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
 	return spte_set;
 }
 
-static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
-{
-	struct kvm_mmu_page *sp;
-
-	gfp |= __GFP_ZERO;
-
-	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
-	if (!sp)
-		return NULL;
-
-	sp->spt = (void *)__get_free_page(gfp);
-	if (!sp->spt) {
-		kmem_cache_free(mmu_page_header_cache, sp);
-		return NULL;
-	}
-
-	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
-
-	return sp;
-}
-
 static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 						       struct tdp_iter *iter,
 						       bool shared)
 {
 	struct kvm_mmu_page *sp;
 
-	/*
-	 * Since we are allocating while under the MMU lock we have to be
-	 * careful about GFP flags. Use GFP_NOWAIT to avoid blocking on direct
-	 * reclaim and to avoid making any filesystem callbacks (which can end
-	 * up invoking KVM MMU notifiers, resulting in a deadlock).
-	 *
-	 * If this allocation fails we drop the lock and retry with reclaim
-	 * allowed.
-	 */
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
+	sp = kvm_mmu_alloc_direct_sp_for_split(true);
 	if (sp)
 		return sp;
 
@@ -1466,7 +1436,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 		write_unlock(&kvm->mmu_lock);
 
 	iter->yielded = true;
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
+	sp = kvm_mmu_alloc_direct_sp_for_split(false);
 
 	if (shared)
 		read_lock(&kvm->mmu_lock);
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 10/23] KVM: x86/mmu: Use common code to free kvm_mmu_page structs
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Use a common function to free kvm_mmu_page structs in the TDP MMU and
the shadow MMU. This reduces the amount of duplicate code and is needed
in subsequent commits that allocate and free kvm_mmu_pages for eager
page splitting. Keep tdp_mmu_free_sp() as a wrapper to mirror
tdp_mmu_alloc_sp().

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 8 ++++----
 arch/x86/kvm/mmu/mmu_internal.h | 2 ++
 arch/x86/kvm/mmu/tdp_mmu.c      | 3 +--
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 657c2a906c12..27996fdb0e7e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1669,11 +1669,8 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
 	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
-static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
+void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
 {
-	MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
-	hlist_del(&sp->hash_link);
-	list_del(&sp->link);
 	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct)
 		free_page((unsigned long)sp->gfns);
@@ -2518,6 +2515,9 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 
 	list_for_each_entry_safe(sp, nsp, invalid_list, link) {
 		WARN_ON(!sp->role.invalid || sp->root_count);
+		MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
+		hlist_del(&sp->hash_link);
+		list_del(&sp->link);
 		kvm_mmu_free_shadow_page(sp);
 	}
 }
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index a0648e7ddd33..5f91e4d07a95 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -173,4 +173,6 @@ void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
 struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(bool locked);
 
+void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp);
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 34e581bcaaf6..8b00c868405b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -64,8 +64,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 
 static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
 {
-	free_page((unsigned long)sp->spt);
-	kmem_cache_free(mmu_page_header_cache, sp);
+	kvm_mmu_free_shadow_page(sp);
 }
 
 /*
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 11/23] KVM: x86/mmu: Use common code to allocate shadow pages from vCPU caches
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Now that allocating shadow pages is isolated to a helper function, use
it in the TDP MMU as well. Keep tdp_mmu_alloc_sp() to avoid hard-coding
direct=true in multiple places.

No functional change intended.
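
The shape of the change — a shared allocator parameterized by `direct`, with a TDP wrapper that hard-codes `direct=true` once — can be sketched as below. The types and names are illustrative stand-ins, not the real KVM structures.

```c
#include <stdbool.h>
#include <stdlib.h>

struct shadow_page {
	unsigned long *spt;
	unsigned long *gfns;    /* only indirect pages track gfns in KVM */
	bool direct;
};

/* Shared allocator used by both the shadow MMU and the TDP MMU. */
static struct shadow_page *alloc_shadow_page(bool direct)
{
	struct shadow_page *sp = calloc(1, sizeof(*sp));

	sp->spt = calloc(512, sizeof(unsigned long));
	if (!direct)
		sp->gfns = calloc(512, sizeof(unsigned long));
	sp->direct = direct;
	return sp;
}

/* TDP MMU wrapper: all TDP MMU pages are direct, so encode that here. */
static struct shadow_page *tdp_alloc_sp(void)
{
	return alloc_shadow_page(true);
}
```

Keeping the wrapper means new TDP MMU call sites cannot accidentally pass the wrong `direct` value.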

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 3 +--
 arch/x86/kvm/mmu/mmu_internal.h | 1 +
 arch/x86/kvm/mmu/tdp_mmu.c      | 8 +-------
 3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 27996fdb0e7e..37385835c399 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1704,8 +1704,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 	mmu_spte_clear_no_track(parent_pte);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
-						      bool direct)
+struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, bool direct)
 {
 	struct kvm_mmu_page *sp;
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 5f91e4d07a95..d4e2de5f2a6d 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -173,6 +173,7 @@ void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
 struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(bool locked);
 
+struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, bool direct);
 void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp);
 
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 8b00c868405b..f6201b89045b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -269,13 +269,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 
 static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
 {
-	struct kvm_mmu_page *sp;
-
-	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
-	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
-
-	return sp;
+	return kvm_mmu_alloc_shadow_page(vcpu, true);
 }
 
 static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 12/23] KVM: x86/mmu: Pass const memslot to rmap_add()
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

rmap_add() only uses the slot to call gfn_to_rmap(), which already takes a
const memslot, so rmap_add()'s parameter can be made const as well.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 37385835c399..1efe161f9c02 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1596,7 +1596,7 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 
 #define RMAP_RECYCLE_THRESHOLD 1000
 
-static void rmap_add(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
+static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
 		     u64 *spte, gfn_t gfn)
 {
 	struct kvm_mmu_page *sp;
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 13/23] KVM: x86/mmu: Pass const memslot to init_shadow_page() and descendants
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Use a const pointer so that init_shadow_page() can be called from
contexts that only have a const pointer to the memslot.

No functional change intended.
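
The motivation is the usual const-correctness propagation rule: a caller holding only a `const` pointer cannot pass it to a callee that takes a non-const one. A minimal sketch with hypothetical names (not the real KVM API):

```c
struct memory_slot {
	unsigned long base_gfn;
	unsigned long npages;
};

/* Callee takes const, so read-only callers need no cast. */
static int slot_contains(const struct memory_slot *slot, unsigned long gfn)
{
	return gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages;
}

/*
 * This would not compile if slot_contains() took a non-const pointer,
 * because gfn_is_mapped() itself only holds a const pointer.
 */
static int gfn_is_mapped(const struct memory_slot *slot, unsigned long gfn)
{
	return slot_contains(slot, gfn);
}
```

Constifying the leaf functions first (here `slot_contains()`) is what lets the const qualifier ripple up through init_shadow_page() and its callers.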

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_page_track.h | 2 +-
 arch/x86/kvm/mmu/mmu.c                | 6 +++---
 arch/x86/kvm/mmu/mmu_internal.h       | 2 +-
 arch/x86/kvm/mmu/page_track.c         | 4 ++--
 arch/x86/kvm/mmu/tdp_mmu.c            | 2 +-
 arch/x86/kvm/mmu/tdp_mmu.h            | 2 +-
 6 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index eb186bc57f6a..3a2dc183ae9a 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -58,7 +58,7 @@ int kvm_page_track_create_memslot(struct kvm *kvm,
 				  unsigned long npages);
 
 void kvm_slot_page_track_add_page(struct kvm *kvm,
-				  struct kvm_memory_slot *slot, gfn_t gfn,
+				  const struct kvm_memory_slot *slot, gfn_t gfn,
 				  enum kvm_page_track_mode mode);
 void kvm_slot_page_track_remove_page(struct kvm *kvm,
 				     struct kvm_memory_slot *slot, gfn_t gfn,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1efe161f9c02..39d9cccbdc7e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -794,7 +794,7 @@ void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn)
 }
 
 static void account_shadowed(struct kvm *kvm,
-			     struct kvm_memory_slot *slot,
+			     const struct kvm_memory_slot *slot,
 			     struct kvm_mmu_page *sp)
 {
 	gfn_t gfn;
@@ -1373,7 +1373,7 @@ int kvm_cpu_dirty_log_size(void)
 }
 
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
-				    struct kvm_memory_slot *slot, u64 gfn,
+				    const struct kvm_memory_slot *slot, u64 gfn,
 				    int min_level)
 {
 	struct kvm_rmap_head *rmap_head;
@@ -2150,7 +2150,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
 }
 
 static void init_shadow_page(struct kvm *kvm, struct kvm_mmu_page *sp,
-			     struct kvm_memory_slot *slot, gfn_t gfn,
+			     const struct kvm_memory_slot *slot, gfn_t gfn,
 			     union kvm_mmu_page_role role)
 {
 	struct hlist_head *sp_list;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index d4e2de5f2a6d..b6e22ba9c654 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -134,7 +134,7 @@ int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
 void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
-				    struct kvm_memory_slot *slot, u64 gfn,
+				    const struct kvm_memory_slot *slot, u64 gfn,
 				    int min_level);
 void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
 					u64 start_gfn, u64 pages);
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 2e09d1b6249f..3e7901294573 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -84,7 +84,7 @@ int kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot)
 	return 0;
 }
 
-static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn,
+static void update_gfn_track(const struct kvm_memory_slot *slot, gfn_t gfn,
 			     enum kvm_page_track_mode mode, short count)
 {
 	int index, val;
@@ -112,7 +112,7 @@ static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn,
  * @mode: tracking mode, currently only write track is supported.
  */
 void kvm_slot_page_track_add_page(struct kvm *kvm,
-				  struct kvm_memory_slot *slot, gfn_t gfn,
+				  const struct kvm_memory_slot *slot, gfn_t gfn,
 				  enum kvm_page_track_mode mode)
 {
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f6201b89045b..a04262bc34e2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1793,7 +1793,7 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
  * Returns true if an SPTE was set and a TLB flush is needed.
  */
 bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
-				   struct kvm_memory_slot *slot, gfn_t gfn,
+				   const struct kvm_memory_slot *slot, gfn_t gfn,
 				   int min_level)
 {
 	struct kvm_mmu_page *root;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 5e5ef2576c81..c139635d4209 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -48,7 +48,7 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				       const struct kvm_memory_slot *slot);
 
 bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
-				   struct kvm_memory_slot *slot, gfn_t gfn,
+				   const struct kvm_memory_slot *slot, gfn_t gfn,
 				   int min_level);
 
 void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread


* [PATCH v3 14/23] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Allow adding new entries to the rmap and linking shadow pages without a
struct kvm_vcpu pointer by moving the implementation of rmap_add() and
link_shadow_page() into inner helper functions.

No functional change intended.
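
The __helper()/wrapper() split used here can be sketched in miniature as below. The structs and counters are illustrative stand-ins, not the real KVM types; the point is that the inner helper only needs a memory cache and a struct kvm analogue, so non-vCPU contexts (such as eager page splitting) can call it directly.

```c
struct obj_cache {
	int nobjs;              /* objects remaining in the cache */
};

struct vm {
	int rmap_entries;       /* stand-in for per-VM rmap state */
};

struct vcpu {
	struct vm *vm;
	struct obj_cache cache; /* per-vCPU pte_list_desc cache */
};

/* Inner helper: usable from any context that can supply a cache. */
static void __rmap_add(struct vm *vm, struct obj_cache *cache)
{
	cache->nobjs--;         /* pretend one descriptor was consumed */
	vm->rmap_entries++;
}

/* vCPU wrapper preserves the existing call sites unchanged. */
static void rmap_add(struct vcpu *vcpu)
{
	__rmap_add(vcpu->vm, &vcpu->cache);
}
```

A later eager-page-splitting path can then pass its own cache to `__rmap_add()` without fabricating a vCPU.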

Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 39d9cccbdc7e..7305a8c625c0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -725,11 +725,6 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
-static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
-{
-	return kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_list_desc_cache);
-}
-
 static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 {
 	kmem_cache_free(pte_list_desc_cache, pte_list_desc);
@@ -874,7 +869,7 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
 /*
  * Returns the number of pointers in the rmap chain, not counting the new one.
  */
-static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
+static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 			struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
@@ -885,7 +880,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 		rmap_head->val = (unsigned long)spte;
 	} else if (!(rmap_head->val & 1)) {
 		rmap_printk("%p %llx 1->many\n", spte, *spte);
-		desc = mmu_alloc_pte_list_desc(vcpu);
+		desc = kvm_mmu_memory_cache_alloc(cache);
 		desc->sptes[0] = (u64 *)rmap_head->val;
 		desc->sptes[1] = spte;
 		desc->spte_count = 2;
@@ -897,7 +892,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 		while (desc->spte_count == PTE_LIST_EXT) {
 			count += PTE_LIST_EXT;
 			if (!desc->more) {
-				desc->more = mmu_alloc_pte_list_desc(vcpu);
+				desc->more = kvm_mmu_memory_cache_alloc(cache);
 				desc = desc->more;
 				desc->spte_count = 0;
 				break;
@@ -1596,8 +1591,10 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 
 #define RMAP_RECYCLE_THRESHOLD 1000
 
-static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
-		     u64 *spte, gfn_t gfn)
+static void __rmap_add(struct kvm *kvm,
+		       struct kvm_mmu_memory_cache *cache,
+		       const struct kvm_memory_slot *slot,
+		       u64 *spte, gfn_t gfn)
 {
 	struct kvm_mmu_page *sp;
 	struct kvm_rmap_head *rmap_head;
@@ -1606,15 +1603,21 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
 	sp = sptep_to_sp(spte);
 	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
-	rmap_count = pte_list_add(vcpu, spte, rmap_head);
+	rmap_count = pte_list_add(cache, spte, rmap_head);
 
 	if (rmap_count > RMAP_RECYCLE_THRESHOLD) {
-		kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
+		kvm_unmap_rmapp(kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
 		kvm_flush_remote_tlbs_with_address(
-				vcpu->kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
+				kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
 	}
 }
 
+static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
+		     u64 *spte, gfn_t gfn)
+{
+	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn);
+}
+
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
@@ -1682,13 +1685,13 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn)
 	return hash_64(gfn, KVM_MMU_HASH_SHIFT);
 }
 
-static void mmu_page_add_parent_pte(struct kvm_vcpu *vcpu,
+static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache,
 				    struct kvm_mmu_page *sp, u64 *parent_pte)
 {
 	if (!parent_pte)
 		return;
 
-	pte_list_add(vcpu, parent_pte, &sp->parent_ptes);
+	pte_list_add(cache, parent_pte, &sp->parent_ptes);
 }
 
 static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp,
@@ -2304,8 +2307,8 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 	__shadow_walk_next(iterator, *iterator->sptep);
 }
 
-static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
-			     struct kvm_mmu_page *sp)
+static void __link_shadow_page(struct kvm_mmu_memory_cache *cache, u64 *sptep,
+			       struct kvm_mmu_page *sp)
 {
 	u64 spte;
 
@@ -2315,12 +2318,17 @@ static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
 
 	mmu_spte_set(sptep, spte);
 
-	mmu_page_add_parent_pte(vcpu, sp, sptep);
+	mmu_page_add_parent_pte(cache, sp, sptep);
 
 	if (sp->unsync_children || sp->unsync)
 		mark_unsync(sptep);
 }
 
+static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, struct kvm_mmu_page *sp)
+{
+	__link_shadow_page(&vcpu->arch.mmu_pte_list_desc_cache, sptep, sp);
+}
+
 static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 				   unsigned direct_access)
 {
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 15/23] KVM: x86/mmu: Update page stats in __rmap_add()
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Update the page stats in __rmap_add() rather than at the call site. This
will avoid having to manually update page stats when splitting huge
pages in a subsequent commit.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7305a8c625c0..5e1002d57689 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1602,6 +1602,8 @@ static void __rmap_add(struct kvm *kvm,
 
 	sp = sptep_to_sp(spte);
 	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
+	kvm_update_page_stats(kvm, sp->role.level, 1);
+
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
 	rmap_count = pte_list_add(cache, spte, rmap_head);
 
@@ -2839,7 +2841,6 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 
 	if (!was_rmapped) {
 		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
-		kvm_update_page_stats(vcpu->kvm, level, 1);
 		rmap_add(vcpu, slot, sptep, gfn);
 	}
 
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread


* [PATCH v3 16/23] KVM: x86/mmu: Cache the access bits of shadowed translations
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

In order to split a huge page we need to know what access bits to assign
to the role of the new child page table. This can't be easily derived
from the huge page SPTE itself since KVM applies its own access policies
on top, such as the NX HugePages mitigation.

We could walk the guest page tables to determine the correct access
bits, but that is difficult to plumb outside of a vCPU fault context.
Instead, we can store the original access bits for each leaf SPTE
alongside the GFN in the gfns array. The access bits occupy only 3
bits, leaving 61 bits for the GFN, which is more than enough. So this
change does not require any additional memory.

In order to keep the access bit cache in sync with the guest, we have to
extend FNAME(sync_page) to also update the access bits.

Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu/mmu.c          | 71 ++++++++++++++++++++++++++++-----
 arch/x86/kvm/mmu/mmu_internal.h | 20 +++++++++-
 arch/x86/kvm/mmu/paging_tmpl.h  |  8 +++-
 4 files changed, 85 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9694dd5e6ccc..be4349c9ffea 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -696,7 +696,7 @@ struct kvm_vcpu_arch {
 
 	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
-	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
+	struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
 	/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5e1002d57689..3a425ed80e23 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -708,7 +708,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 	if (r)
 		return r;
 	if (maybe_indirect) {
-		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_gfn_array_cache,
+		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
 					       PT64_ROOT_MAX_LEVEL);
 		if (r)
 			return r;
@@ -721,7 +721,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
-	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
@@ -733,7 +733,7 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 {
 	if (!sp->role.direct)
-		return sp->gfns[index];
+		return sp->shadowed_translation[index].gfn;
 
 	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
 }
@@ -741,7 +741,7 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
 {
 	if (!sp->role.direct) {
-		sp->gfns[index] = gfn;
+		sp->shadowed_translation[index].gfn = gfn;
 		return;
 	}
 
@@ -752,6 +752,47 @@ static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
 				   kvm_mmu_page_get_gfn(sp, index), gfn);
 }
 
+static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index, u32 access)
+{
+	if (!sp->role.direct) {
+		sp->shadowed_translation[index].access = access;
+		return;
+	}
+
+	if (WARN_ON(access != sp->role.access))
+		pr_err_ratelimited("access mismatch under direct page %llx "
+				   "(expected %x, got %x)\n",
+				   kvm_mmu_page_get_gfn(sp, index),
+				   sp->role.access, access);
+}
+
+/*
+ * For leaf SPTEs, fetch the *guest* access permissions being shadowed. Note
+ * that the SPTE itself may have more constrained access permissions than
+ * what the guest enforces. For example, a guest may create an executable
+ * huge PTE but KVM may disallow execution to mitigate iTLB multihit.
+ */
+static u32 kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
+{
+	if (!sp->role.direct)
+		return sp->shadowed_translation[index].access;
+
+	/*
+	 * For direct MMUs (e.g. TDP or non-paging guests) there are no *guest*
+	 * access permissions being shadowed. So we can just return ACC_ALL
+	 * here.
+	 *
+	 * For indirect MMUs (shadow paging), direct shadow pages exist when KVM
+	 * is shadowing a guest huge page with smaller pages, since the guest
+	 * huge page is being directly mapped. In this case the guest access
+	 * permissions being shadowed are the access permissions of the huge
+	 * page.
+	 *
+	 * In both cases, sp->role.access contains exactly what we want.
+	 */
+	return sp->role.access;
+}
+
 /*
  * Return the pointer to the large page information for a given gfn,
  * handling slots that are not large page aligned.
@@ -1594,7 +1635,7 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 static void __rmap_add(struct kvm *kvm,
 		       struct kvm_mmu_memory_cache *cache,
 		       const struct kvm_memory_slot *slot,
-		       u64 *spte, gfn_t gfn)
+		       u64 *spte, gfn_t gfn, u32 access)
 {
 	struct kvm_mmu_page *sp;
 	struct kvm_rmap_head *rmap_head;
@@ -1602,6 +1643,7 @@ static void __rmap_add(struct kvm *kvm,
 
 	sp = sptep_to_sp(spte);
 	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
+	kvm_mmu_page_set_access(sp, spte - sp->spt, access);
 	kvm_update_page_stats(kvm, sp->role.level, 1);
 
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
@@ -1615,9 +1657,9 @@ static void __rmap_add(struct kvm *kvm,
 }
 
 static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
-		     u64 *spte, gfn_t gfn)
+		     u64 *spte, gfn_t gfn, u32 access)
 {
-	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn);
+	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn, access);
 }
 
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
@@ -1678,7 +1720,7 @@ void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
 {
 	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct)
-		free_page((unsigned long)sp->gfns);
+		free_page((unsigned long)sp->shadowed_translation);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -1715,8 +1757,12 @@ struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, bool direc
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+
+	BUILD_BUG_ON(sizeof(sp->shadowed_translation[0]) != sizeof(u64));
+
 	if (!direct)
-		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
+		sp->shadowed_translation =
+			kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadowed_info_cache);
 
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
@@ -1738,7 +1784,7 @@ static inline gfp_t gfp_flags_for_split(bool locked)
  *
  * Huge page splitting always uses direct shadow pages since the huge page is
  * being mapped directly with a lower level page table. Thus there's no need to
- * allocate the gfns array.
+ * allocate the shadowed_translation array.
  */
 struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(bool locked)
 {
@@ -2841,7 +2887,10 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 
 	if (!was_rmapped) {
 		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
-		rmap_add(vcpu, slot, sptep, gfn);
+		rmap_add(vcpu, slot, sptep, gfn, pte_access);
+	} else {
+		/* Already rmapped but the pte_access bits may have changed. */
+		kvm_mmu_page_set_access(sp, sptep - sp->spt, pte_access);
 	}
 
 	return ret;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index b6e22ba9c654..3f76f4c1ae59 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -32,6 +32,18 @@ extern bool dbg;
 
 typedef u64 __rcu *tdp_ptep_t;
 
+/*
+ * Stores the result of the guest translation being shadowed by an SPTE. KVM
+ * shadows two types of guest translations: nGPA -> GPA (shadow EPT/NPT) and
+ * GVA -> GPA (traditional shadow paging). In both cases the result of the
+ * translation is a GPA and a set of access constraints.
+ */
+struct shadowed_translation_entry {
+	/* Note, GFNs can have at most 64 - PAGE_SHIFT = 52 bits. */
+	u64 gfn:52;
+	u64 access:3;
+};
+
 struct kvm_mmu_page {
 	/*
 	 * Note, "link" through "spt" fit in a single 64 byte cache line on
@@ -53,8 +65,12 @@ struct kvm_mmu_page {
 	gfn_t gfn;
 
 	u64 *spt;
-	/* hold the gfn of each spte inside spt */
-	gfn_t *gfns;
+	/*
+	 * Caches the result of the intermediate guest translation being
+	 * shadowed by each SPTE. NULL for direct shadow pages.
+	 */
+	struct shadowed_translation_entry *shadowed_translation;
+
 	/* Currently serving as active root */
 	union {
 		int root_count;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index db63b5377465..91c2088464ce 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1014,7 +1014,8 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 }
 
 /*
- * Using the cached information from sp->gfns is safe because:
+ * Using the information in sp->shadowed_translation (kvm_mmu_page_get_gfn()
+ * and kvm_mmu_page_get_access()) is safe because:
  * - The spte has a reference to the struct page, so the pfn for a given gfn
  *   can't change unless all sptes pointing to it are nuked first.
  *
@@ -1088,12 +1089,15 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		if (sync_mmio_spte(vcpu, &sp->spt[i], gfn, pte_access))
 			continue;
 
-		if (gfn != sp->gfns[i]) {
+		if (gfn != kvm_mmu_page_get_gfn(sp, i)) {
 			drop_spte(vcpu->kvm, &sp->spt[i]);
 			flush = true;
 			continue;
 		}
 
+		if (pte_access != kvm_mmu_page_get_access(sp, i))
+			kvm_mmu_page_set_access(sp, i, pte_access);
+
 		sptep = &sp->spt[i];
 		spte = *sptep;
 		host_writable = spte & shadow_host_writable_mask;
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread


* [PATCH v3 17/23] KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Currently make_huge_page_split_spte() assumes execute permissions can be
granted to any 4K SPTE when splitting huge pages. This is true for the
TDP MMU but is not necessarily true for the shadow MMU, since we may be
splitting a huge page that shadows a non-executable guest huge page.

To fix this, pass in the child shadow page where the huge page will be
split and derive the execution permission from the shadow page's role.
This is correct because huge pages are always split with direct shadow
pages and thus the shadow page role contains the correct access
permissions.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/spte.c    | 13 +++++++------
 arch/x86/kvm/mmu/spte.h    |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.c |  2 +-
 3 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index d10189d9c877..ef6537c6f5ef 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -216,10 +216,11 @@ static u64 make_spte_executable(u64 spte)
  * This is used during huge page splitting to build the SPTEs that make up the
  * new page table.
  */
-u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
+u64 make_huge_page_split_spte(u64 huge_spte, struct kvm_mmu_page *sp, int index)
 {
+	bool exec_allowed = sp->role.access & ACC_EXEC_MASK;
+	int child_level = sp->role.level;
 	u64 child_spte;
-	int child_level;
 
 	if (WARN_ON_ONCE(!is_shadow_present_pte(huge_spte)))
 		return 0;
@@ -228,7 +229,6 @@ u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
 		return 0;
 
 	child_spte = huge_spte;
-	child_level = huge_level - 1;
 
 	/*
 	 * The child_spte already has the base address of the huge page being
@@ -241,10 +241,11 @@ u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
 		child_spte &= ~PT_PAGE_SIZE_MASK;
 
 		/*
-		 * When splitting to a 4K page, mark the page executable as the
-		 * NX hugepage mitigation no longer applies.
+		 * When splitting to a 4K page where execution is allowed, mark
+		 * the page executable as the NX hugepage mitigation no longer
+		 * applies.
 		 */
-		if (is_nx_huge_page_enabled())
+		if (exec_allowed && is_nx_huge_page_enabled())
 			child_spte = make_spte_executable(child_spte);
 	}
 
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 73f12615416f..921ea77f1b5e 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -415,7 +415,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
 	       bool host_writable, u64 *new_spte);
-u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index);
+u64 make_huge_page_split_spte(u64 huge_spte, struct kvm_mmu_page *sp, int index);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
 u64 mark_spte_for_access_track(u64 spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a04262bc34e2..36d241405ecc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1455,7 +1455,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	 * not been linked in yet and thus is not reachable from any other CPU.
 	 */
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++)
-		sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i);
+		sp->spt[i] = make_huge_page_split_spte(huge_spte, sp, i);
 
 	/*
 	 * Replace the huge spte with a pointer to the populated lower level
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 17/23] KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU
@ 2022-04-01 17:55   ` David Matlack
  0 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	David Matlack, Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

Currently make_huge_page_split_spte() assumes execute permissions can be
granted to any 4K SPTE when splitting huge pages. This is true for the
TDP MMU but is not necessarily true for the shadow MMU, since we may be
splitting a huge page that shadows a non-executable guest huge page.

To fix this, pass in the child shadow page where the huge page will be
split and derive the execution permission from the shadow page's role.
This is correct because huge pages are always split with a direct shadow
page, and thus the shadow page's role contains the correct access
permissions.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/spte.c    | 13 +++++++------
 arch/x86/kvm/mmu/spte.h    |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.c |  2 +-
 3 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index d10189d9c877..ef6537c6f5ef 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -216,10 +216,11 @@ static u64 make_spte_executable(u64 spte)
  * This is used during huge page splitting to build the SPTEs that make up the
  * new page table.
  */
-u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
+u64 make_huge_page_split_spte(u64 huge_spte, struct kvm_mmu_page *sp, int index)
 {
+	bool exec_allowed = sp->role.access & ACC_EXEC_MASK;
+	int child_level = sp->role.level;
 	u64 child_spte;
-	int child_level;
 
 	if (WARN_ON_ONCE(!is_shadow_present_pte(huge_spte)))
 		return 0;
@@ -228,7 +229,6 @@ u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
 		return 0;
 
 	child_spte = huge_spte;
-	child_level = huge_level - 1;
 
 	/*
 	 * The child_spte already has the base address of the huge page being
@@ -241,10 +241,11 @@ u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
 		child_spte &= ~PT_PAGE_SIZE_MASK;
 
 		/*
-		 * When splitting to a 4K page, mark the page executable as the
-		 * NX hugepage mitigation no longer applies.
+		 * When splitting to a 4K page where execution is allowed, mark
+		 * the page executable as the NX hugepage mitigation no longer
+		 * applies.
 		 */
-		if (is_nx_huge_page_enabled())
+		if (exec_allowed && is_nx_huge_page_enabled())
 			child_spte = make_spte_executable(child_spte);
 	}
 
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 73f12615416f..921ea77f1b5e 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -415,7 +415,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
 	       bool host_writable, u64 *new_spte);
-u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index);
+u64 make_huge_page_split_spte(u64 huge_spte, struct kvm_mmu_page *sp, int index);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
 u64 mark_spte_for_access_track(u64 spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a04262bc34e2..36d241405ecc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1455,7 +1455,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	 * not been linked in yet and thus is not reachable from any other CPU.
 	 */
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++)
-		sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i);
+		sp->spt[i] = make_huge_page_split_spte(huge_spte, sp, i);
 
 	/*
 	 * Replace the huge spte with a pointer to the populated lower level
-- 
2.35.1.1094.g7c7d902a7c-goog

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 18/23] KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU (i.e.
in the rmap). This is fine for now because KVM never creates intermediate
huge pages during dirty logging, i.e. a 1GiB page is never partially
split down to a 2MiB page.

However, this will stop being true once the shadow MMU participates in
eager page splitting, which can in fact leave behind partially split
huge pages. In preparation for that change, make the shadow MMU iterate
over all necessary levels when zapping collapsible SPTEs.

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3a425ed80e23..6390b23d286a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6172,18 +6172,25 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 	return need_tlb_flush;
 }
 
+static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
+					   const struct kvm_memory_slot *slot)
+{
+	/*
+	 * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
+	 * pages that are already mapped at the maximum possible level.
+	 */
+	if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
+			      PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1,
+			      true))
+		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+}
+
 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *slot)
 {
 	if (kvm_memslots_have_rmaps(kvm)) {
 		write_lock(&kvm->mmu_lock);
-		/*
-		 * Zap only 4k SPTEs since the legacy MMU only supports dirty
-		 * logging at a 4k granularity and never creates collapsible
-		 * 2m SPTEs during dirty logging.
-		 */
-		if (slot_handle_level_4k(kvm, slot, kvm_mmu_zap_collapsible_spte, true))
-			kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+		kvm_rmap_zap_collapsible_sptes(kvm, slot);
 		write_unlock(&kvm->mmu_lock);
 	}
 
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 18/23] KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU
@ 2022-04-01 17:55   ` David Matlack
  0 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	David Matlack, Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU (i.e.
in the rmap). This is fine for now because KVM never creates intermediate
huge pages during dirty logging, i.e. a 1GiB page is never partially
split down to a 2MiB page.

However, this will stop being true once the shadow MMU participates in
eager page splitting, which can in fact leave behind partially split
huge pages. In preparation for that change, make the shadow MMU iterate
over all necessary levels when zapping collapsible SPTEs.

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3a425ed80e23..6390b23d286a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6172,18 +6172,25 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 	return need_tlb_flush;
 }
 
+static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
+					   const struct kvm_memory_slot *slot)
+{
+	/*
+	 * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
+	 * pages that are already mapped at the maximum possible level.
+	 */
+	if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
+			      PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1,
+			      true))
+		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+}
+
 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *slot)
 {
 	if (kvm_memslots_have_rmaps(kvm)) {
 		write_lock(&kvm->mmu_lock);
-		/*
-		 * Zap only 4k SPTEs since the legacy MMU only supports dirty
-		 * logging at a 4k granularity and never creates collapsible
-		 * 2m SPTEs during dirty logging.
-		 */
-		if (slot_handle_level_4k(kvm, slot, kvm_mmu_zap_collapsible_spte, true))
-			kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+		kvm_rmap_zap_collapsible_sptes(kvm, slot);
 		write_unlock(&kvm->mmu_lock);
 	}
 
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 19/23] KVM: x86/mmu: Refactor drop_large_spte()
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

drop_large_spte() drops a large SPTE if it exists and then flushes TLBs.
Its helper function, __drop_large_spte(), does the drop without the
flush.

In preparation for eager page splitting, which will need to sometimes
flush when dropping large SPTEs (and sometimes not), push the flushing
logic down into __drop_large_spte() and add a bool parameter to control
it.

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6390b23d286a..f058f28909ea 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1184,28 +1184,29 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
 		rmap_remove(kvm, sptep);
 }
 
-
-static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
+static void __drop_large_spte(struct kvm *kvm, u64 *sptep, bool flush)
 {
-	if (is_large_pte(*sptep)) {
-		WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K);
-		drop_spte(kvm, sptep);
-		return true;
-	}
+	struct kvm_mmu_page *sp;
 
-	return false;
-}
+	if (!is_large_pte(*sptep))
+		return;
 
-static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
-{
-	if (__drop_large_spte(vcpu->kvm, sptep)) {
-		struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+	sp = sptep_to_sp(sptep);
+	WARN_ON(sp->role.level == PG_LEVEL_4K);
 
-		kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
+	drop_spte(kvm, sptep);
+
+	if (flush) {
+		kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
 			KVM_PAGES_PER_HPAGE(sp->role.level));
 	}
 }
 
+static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
+{
+	return __drop_large_spte(vcpu->kvm, sptep, true);
+}
+
 /*
  * Write-protect on the specified @sptep, @pt_protect indicates whether
  * spte write-protection is caused by protecting shadow page table.
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 19/23] KVM: x86/mmu: Refactor drop_large_spte()
@ 2022-04-01 17:55   ` David Matlack
  0 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	David Matlack, Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

drop_large_spte() drops a large SPTE if it exists and then flushes TLBs.
Its helper function, __drop_large_spte(), does the drop without the
flush.

In preparation for eager page splitting, which will need to sometimes
flush when dropping large SPTEs (and sometimes not), push the flushing
logic down into __drop_large_spte() and add a bool parameter to control
it.

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6390b23d286a..f058f28909ea 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1184,28 +1184,29 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
 		rmap_remove(kvm, sptep);
 }
 
-
-static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
+static void __drop_large_spte(struct kvm *kvm, u64 *sptep, bool flush)
 {
-	if (is_large_pte(*sptep)) {
-		WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K);
-		drop_spte(kvm, sptep);
-		return true;
-	}
+	struct kvm_mmu_page *sp;
 
-	return false;
-}
+	if (!is_large_pte(*sptep))
+		return;
 
-static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
-{
-	if (__drop_large_spte(vcpu->kvm, sptep)) {
-		struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+	sp = sptep_to_sp(sptep);
+	WARN_ON(sp->role.level == PG_LEVEL_4K);
 
-		kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
+	drop_spte(kvm, sptep);
+
+	if (flush) {
+		kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
 			KVM_PAGES_PER_HPAGE(sp->role.level));
 	}
 }
 
+static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
+{
+	return __drop_large_spte(vcpu->kvm, sptep, true);
+}
+
 /*
  * Write-protect on the specified @sptep, @pt_protect indicates whether
  * spte write-protection is caused by protecting shadow page table.
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 20/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
declaration time rather than being fixed for all declarations. This will
be used in a follow-up commit to declare a cache in x86 with a capacity
of 512+ objects without having to increase the capacity of all caches in
KVM.

This change requires that each cache now specify its capacity at
runtime, since the cache struct itself no longer has a fixed capacity
known at compile time. To protect against someone accidentally defining
a kvm_mmu_memory_cache struct directly (without the extra storage), this
commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().

This change, unfortunately, adds some grottiness to
kvm_phys_addr_ioremap() in arm64, which uses a function-local (i.e.
stack-allocated) kvm_mmu_memory_cache struct. Since C does not allow
anonymous structs in functions, the new wrapper struct that contains
kvm_mmu_memory_cache and the objects pointer array must be named, which
means dealing with an outer and inner struct. The outer struct can't be
dropped since then there would be no guarantee that the
kvm_mmu_memory_cache struct and objects array would be laid out
consecutively on the stack.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  2 +-
 arch/arm64/kvm/arm.c              |  1 +
 arch/arm64/kvm/mmu.c              | 13 +++++++++----
 arch/mips/include/asm/kvm_host.h  |  2 +-
 arch/mips/kvm/mips.c              |  2 ++
 arch/riscv/include/asm/kvm_host.h |  2 +-
 arch/riscv/kvm/mmu.c              | 17 ++++++++++-------
 arch/riscv/kvm/vcpu.c             |  1 +
 arch/x86/include/asm/kvm_host.h   |  8 ++++----
 arch/x86/kvm/mmu/mmu.c            |  9 +++++++++
 include/linux/kvm_types.h         | 19 +++++++++++++++++--
 virt/kvm/kvm_main.c               | 10 +++++++++-
 12 files changed, 65 insertions(+), 21 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 0e96087885fe..4670491899de 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -362,7 +362,7 @@ struct kvm_vcpu_arch {
 	bool pause;
 
 	/* Cache some mmu pages needed inside spinlock regions */
-	struct kvm_mmu_memory_cache mmu_page_cache;
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
 
 	/* Target CPU and feature flags */
 	int target;
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index ba9165e84396..af4d8a490af5 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -320,6 +320,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.target = -1;
 	bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
 
+	vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
 
 	/* Set up the timer */
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 0d19259454d8..01e15bcb7be2 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -764,7 +764,12 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 {
 	phys_addr_t addr;
 	int ret = 0;
-	struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
+	DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
+		.cache = {
+			.gfp_zero = __GFP_ZERO,
+			.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
+		},
+	};
 	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
 				     KVM_PGTABLE_PROT_R |
@@ -777,14 +782,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 	guest_ipa &= PAGE_MASK;
 
 	for (addr = guest_ipa; addr < guest_ipa + size; addr += PAGE_SIZE) {
-		ret = kvm_mmu_topup_memory_cache(&cache,
+		ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
 						 kvm_mmu_cache_min_pages(kvm));
 		if (ret)
 			break;
 
 		write_lock(&kvm->mmu_lock);
 		ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
-					     &cache);
+					     &page_cache.cache);
 		write_unlock(&kvm->mmu_lock);
 		if (ret)
 			break;
@@ -792,7 +797,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 		pa += PAGE_SIZE;
 	}
 
-	kvm_mmu_free_memory_cache(&cache);
+	kvm_mmu_free_memory_cache(&page_cache.cache);
 	return ret;
 }
 
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 717716cc51c5..935511d7fc3a 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -347,7 +347,7 @@ struct kvm_vcpu_arch {
 	unsigned long pending_exceptions_clr;
 
 	/* Cache some mmu pages needed inside spinlock regions */
-	struct kvm_mmu_memory_cache mmu_page_cache;
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
 
 	/* vcpu's vzguestid is different on each host cpu in an smp system */
 	u32 vzguestid[NR_CPUS];
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index a25e0b73ee70..45c7179144dc 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -387,6 +387,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	if (err)
 		goto out_free_gebase;
 
+	vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
+
 	return 0;
 
 out_free_gebase:
diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 78da839657e5..4ec0b7a3d515 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -186,7 +186,7 @@ struct kvm_vcpu_arch {
 	struct kvm_sbi_context sbi_context;
 
 	/* Cache pages needed to program page tables with spinlock held */
-	struct kvm_mmu_memory_cache mmu_page_cache;
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
 
 	/* VCPU power-off state */
 	bool power_off;
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index f80a34fbf102..5ffd164a5aeb 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -347,10 +347,12 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
 	int ret = 0;
 	unsigned long pfn;
 	phys_addr_t addr, end;
-	struct kvm_mmu_memory_cache pcache;
-
-	memset(&pcache, 0, sizeof(pcache));
-	pcache.gfp_zero = __GFP_ZERO;
+	DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
+		.cache = {
+			.gfp_zero = __GFP_ZERO,
+			.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
+		},
+	};
 
 	end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
 	pfn = __phys_to_pfn(hpa);
@@ -361,12 +363,13 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
 		if (!writable)
 			pte = pte_wrprotect(pte);
 
-		ret = kvm_mmu_topup_memory_cache(&pcache, stage2_pgd_levels);
+		ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
+						 stage2_pgd_levels);
 		if (ret)
 			goto out;
 
 		spin_lock(&kvm->mmu_lock);
-		ret = stage2_set_pte(kvm, 0, &pcache, addr, &pte);
+		ret = stage2_set_pte(kvm, 0, &page_cache.cache, addr, &pte);
 		spin_unlock(&kvm->mmu_lock);
 		if (ret)
 			goto out;
@@ -375,7 +378,7 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
 	}
 
 out:
-	kvm_mmu_free_memory_cache(&pcache);
+	kvm_mmu_free_memory_cache(&page_cache.cache);
 	return ret;
 }
 
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 624166004e36..6a5f5aa45bac 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -94,6 +94,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 
 	/* Mark this VCPU never ran */
 	vcpu->arch.ran_atleast_once = false;
+	vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
 
 	/* Setup ISA features available to VCPU */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index be4349c9ffea..ffb2b99f3a60 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -694,10 +694,10 @@ struct kvm_vcpu_arch {
 	 */
 	struct kvm_mmu *walk_mmu;
 
-	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
-	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
-	struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
-	struct kvm_mmu_memory_cache mmu_page_header_cache;
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_pte_list_desc_cache);
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadow_page_cache);
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadowed_info_cache);
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_header_cache);
 
 	/*
 	 * QEMU userspace and the guest each have their own FPU state.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f058f28909ea..a8200b3f8782 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5800,12 +5800,21 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
 {
 	int ret;
 
+	vcpu->arch.mmu_pte_list_desc_cache.capacity =
+		KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
 	vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
 
+	vcpu->arch.mmu_page_header_cache.capacity =
+		KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
 	vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
 
+	vcpu->arch.mmu_shadowed_info_cache.capacity =
+		KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
+
+	vcpu->arch.mmu_shadow_page_cache.capacity =
+		KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
 
 	vcpu->arch.mmu = &vcpu->arch.root_mmu;
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index ac1ebb37a0ff..579cf39986ec 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -83,14 +83,29 @@ struct gfn_to_pfn_cache {
  * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
  * holding MMU locks.  Note, these caches act more like prefetch buffers than
  * classical caches, i.e. objects are not returned to the cache on being freed.
+ *
+ * The storage for the cache object pointers is laid out after the struct, to
+ * allow different declarations to choose different capacities. The capacity
+ * field defines the number of object pointers available after the struct.
  */
 struct kvm_mmu_memory_cache {
 	int nobjs;
+	int capacity;
 	gfp_t gfp_zero;
 	struct kmem_cache *kmem_cache;
-	void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
+	void *objects[];
 };
-#endif
+
+#define __DEFINE_KVM_MMU_MEMORY_CACHE(_name, _capacity)		\
+	struct {						\
+		struct kvm_mmu_memory_cache _name;		\
+		void *_name##_objects[_capacity];		\
+	}
+
+#define DEFINE_KVM_MMU_MEMORY_CACHE(_name) \
+	__DEFINE_KVM_MMU_MEMORY_CACHE(_name, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE)
+
+#endif /* KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE */
 
 #define HALT_POLL_HIST_COUNT			32
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 70e05af5ebea..c4cac4195f4a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -373,9 +373,17 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
 {
 	void *obj;
 
+	/*
+	 * The capacity field must be initialized since the storage for the
+	 * objects pointer array is laid out after the kvm_mmu_memory_cache
+	 * struct and not known at compile time.
+	 */
+	if (WARN_ON(mc->capacity == 0))
+		return -EINVAL;
+
 	if (mc->nobjs >= min)
 		return 0;
-	while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
+	while (mc->nobjs < mc->capacity) {
 		obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
 		if (!obj)
 			return mc->nobjs >= min ? 0 : -ENOMEM;
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 20/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
@ 2022-04-01 17:55   ` David Matlack
  0 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	David Matlack, Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
declaration time rather than being fixed for all declarations. This will
be used in a follow-up commit to declare a cache in x86 with a capacity
of 512+ objects without having to increase the capacity of all caches in
KVM.

This change requires that each cache now specify its capacity at
runtime, since the cache struct itself no longer has a fixed capacity
known at compile time. To protect against someone accidentally defining
a kvm_mmu_memory_cache struct directly (without the extra storage), this
commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().

This change, unfortunately, adds some grottiness to
kvm_phys_addr_ioremap() in arm64, which uses a function-local (i.e.
stack-allocated) kvm_mmu_memory_cache struct. Since C does not allow
anonymous structs in functions, the new wrapper struct that contains
kvm_mmu_memory_cache and the objects pointer array must be named, which
means dealing with an outer and inner struct. The outer struct can't be
dropped since then there would be no guarantee that the
kvm_mmu_memory_cache struct and objects array would be laid out
consecutively on the stack.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  2 +-
 arch/arm64/kvm/arm.c              |  1 +
 arch/arm64/kvm/mmu.c              | 13 +++++++++----
 arch/mips/include/asm/kvm_host.h  |  2 +-
 arch/mips/kvm/mips.c              |  2 ++
 arch/riscv/include/asm/kvm_host.h |  2 +-
 arch/riscv/kvm/mmu.c              | 17 ++++++++++-------
 arch/riscv/kvm/vcpu.c             |  1 +
 arch/x86/include/asm/kvm_host.h   |  8 ++++----
 arch/x86/kvm/mmu/mmu.c            |  9 +++++++++
 include/linux/kvm_types.h         | 19 +++++++++++++++++--
 virt/kvm/kvm_main.c               | 10 +++++++++-
 12 files changed, 65 insertions(+), 21 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 0e96087885fe..4670491899de 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -362,7 +362,7 @@ struct kvm_vcpu_arch {
 	bool pause;
 
 	/* Cache some mmu pages needed inside spinlock regions */
-	struct kvm_mmu_memory_cache mmu_page_cache;
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
 
 	/* Target CPU and feature flags */
 	int target;
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index ba9165e84396..af4d8a490af5 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -320,6 +320,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.target = -1;
 	bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
 
+	vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
 
 	/* Set up the timer */
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 0d19259454d8..01e15bcb7be2 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -764,7 +764,12 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 {
 	phys_addr_t addr;
 	int ret = 0;
-	struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
+	DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
+		.cache = {
+			.gfp_zero = __GFP_ZERO,
+			.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
+		},
+	};
 	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
 				     KVM_PGTABLE_PROT_R |
@@ -777,14 +782,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 	guest_ipa &= PAGE_MASK;
 
 	for (addr = guest_ipa; addr < guest_ipa + size; addr += PAGE_SIZE) {
-		ret = kvm_mmu_topup_memory_cache(&cache,
+		ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
 						 kvm_mmu_cache_min_pages(kvm));
 		if (ret)
 			break;
 
 		write_lock(&kvm->mmu_lock);
 		ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
-					     &cache);
+					     &page_cache.cache);
 		write_unlock(&kvm->mmu_lock);
 		if (ret)
 			break;
@@ -792,7 +797,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 		pa += PAGE_SIZE;
 	}
 
-	kvm_mmu_free_memory_cache(&cache);
+	kvm_mmu_free_memory_cache(&page_cache.cache);
 	return ret;
 }
 
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 717716cc51c5..935511d7fc3a 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -347,7 +347,7 @@ struct kvm_vcpu_arch {
 	unsigned long pending_exceptions_clr;
 
 	/* Cache some mmu pages needed inside spinlock regions */
-	struct kvm_mmu_memory_cache mmu_page_cache;
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
 
 	/* vcpu's vzguestid is different on each host cpu in an smp system */
 	u32 vzguestid[NR_CPUS];
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index a25e0b73ee70..45c7179144dc 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -387,6 +387,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	if (err)
 		goto out_free_gebase;
 
+	vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
+
 	return 0;
 
 out_free_gebase:
diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 78da839657e5..4ec0b7a3d515 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -186,7 +186,7 @@ struct kvm_vcpu_arch {
 	struct kvm_sbi_context sbi_context;
 
 	/* Cache pages needed to program page tables with spinlock held */
-	struct kvm_mmu_memory_cache mmu_page_cache;
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
 
 	/* VCPU power-off state */
 	bool power_off;
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index f80a34fbf102..5ffd164a5aeb 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -347,10 +347,12 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
 	int ret = 0;
 	unsigned long pfn;
 	phys_addr_t addr, end;
-	struct kvm_mmu_memory_cache pcache;
-
-	memset(&pcache, 0, sizeof(pcache));
-	pcache.gfp_zero = __GFP_ZERO;
+	DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
+		.cache = {
+			.gfp_zero = __GFP_ZERO,
+			.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
+		},
+	};
 
 	end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
 	pfn = __phys_to_pfn(hpa);
@@ -361,12 +363,13 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
 		if (!writable)
 			pte = pte_wrprotect(pte);
 
-		ret = kvm_mmu_topup_memory_cache(&pcache, stage2_pgd_levels);
+		ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
+						 stage2_pgd_levels);
 		if (ret)
 			goto out;
 
 		spin_lock(&kvm->mmu_lock);
-		ret = stage2_set_pte(kvm, 0, &pcache, addr, &pte);
+		ret = stage2_set_pte(kvm, 0, &page_cache.cache, addr, &pte);
 		spin_unlock(&kvm->mmu_lock);
 		if (ret)
 			goto out;
@@ -375,7 +378,7 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
 	}
 
 out:
-	kvm_mmu_free_memory_cache(&pcache);
+	kvm_mmu_free_memory_cache(&page_cache.cache);
 	return ret;
 }
 
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 624166004e36..6a5f5aa45bac 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -94,6 +94,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 
 	/* Mark this VCPU never ran */
 	vcpu->arch.ran_atleast_once = false;
+	vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
 
 	/* Setup ISA features available to VCPU */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index be4349c9ffea..ffb2b99f3a60 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -694,10 +694,10 @@ struct kvm_vcpu_arch {
 	 */
 	struct kvm_mmu *walk_mmu;
 
-	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
-	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
-	struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
-	struct kvm_mmu_memory_cache mmu_page_header_cache;
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_pte_list_desc_cache);
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadow_page_cache);
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadowed_info_cache);
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_header_cache);
 
 	/*
 	 * QEMU userspace and the guest each have their own FPU state.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f058f28909ea..a8200b3f8782 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5800,12 +5800,21 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
 {
 	int ret;
 
+	vcpu->arch.mmu_pte_list_desc_cache.capacity =
+		KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
 	vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
 
+	vcpu->arch.mmu_page_header_cache.capacity =
+		KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
 	vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
 
+	vcpu->arch.mmu_shadowed_info_cache.capacity =
+		KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
+
+	vcpu->arch.mmu_shadow_page_cache.capacity =
+		KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
 
 	vcpu->arch.mmu = &vcpu->arch.root_mmu;
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index ac1ebb37a0ff..579cf39986ec 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -83,14 +83,29 @@ struct gfn_to_pfn_cache {
  * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
  * holding MMU locks.  Note, these caches act more like prefetch buffers than
  * classical caches, i.e. objects are not returned to the cache on being freed.
+ *
+ * The storage for the cache object pointers is laid out after the struct, to
+ * allow different declarations to choose different capacities. The capacity
+ * field defines the number of object pointers available after the struct.
  */
 struct kvm_mmu_memory_cache {
 	int nobjs;
+	int capacity;
 	gfp_t gfp_zero;
 	struct kmem_cache *kmem_cache;
-	void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
+	void *objects[];
 };
-#endif
+
+#define __DEFINE_KVM_MMU_MEMORY_CACHE(_name, _capacity)		\
+	struct {						\
+		struct kvm_mmu_memory_cache _name;		\
+		void *_name##_objects[_capacity];		\
+	}
+
+#define DEFINE_KVM_MMU_MEMORY_CACHE(_name) \
+	__DEFINE_KVM_MMU_MEMORY_CACHE(_name, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE)
+
+#endif /* KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE */
 
 #define HALT_POLL_HIST_COUNT			32
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 70e05af5ebea..c4cac4195f4a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -373,9 +373,17 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
 {
 	void *obj;
 
+	/*
+	 * The capacity field must be initialized since the storage for the
+	 * objects pointer array is laid out after the kvm_mmu_memory_cache
+	 * struct and not known at compile time.
+	 */
+	if (WARN_ON(mc->capacity == 0))
+		return -EINVAL;
+
 	if (mc->nobjs >= min)
 		return 0;
-	while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
+	while (mc->nobjs < mc->capacity) {
 		obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
 		if (!obj)
 			return mc->nobjs >= min ? 0 : -ENOMEM;
-- 
2.35.1.1094.g7c7d902a7c-goog

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 21/23] KVM: Allow GFP flags to be passed when topping up MMU caches
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

This will be used in a subsequent commit to top up MMU caches under the
MMU lock with GFP_NOWAIT as part of eager page splitting.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/kvm_main.c      | 9 +++++++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 252ee4a61b58..7d3a1f28beb2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1335,6 +1335,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
 
 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
 int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
+int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min, gfp_t gfp);
 int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
 void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c4cac4195f4a..554148ea0c30 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -369,7 +369,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
 		return (void *)__get_free_page(gfp_flags);
 }
 
-int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
+int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min, gfp_t gfp)
 {
 	void *obj;
 
@@ -384,7 +384,7 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
 	if (mc->nobjs >= min)
 		return 0;
 	while (mc->nobjs < mc->capacity) {
-		obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
+		obj = mmu_memory_cache_alloc_obj(mc, gfp);
 		if (!obj)
 			return mc->nobjs >= min ? 0 : -ENOMEM;
 		mc->objects[mc->nobjs++] = obj;
@@ -392,6 +392,11 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
 	return 0;
 }
 
+int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
+{
+	return __kvm_mmu_topup_memory_cache(mc, min, GFP_KERNEL_ACCOUNT);
+}
+
 int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
 {
 	return mc->nobjs;
-- 
2.35.1.1094.g7c7d902a7c-goog



* [PATCH v3 22/23] KVM: x86/mmu: Support Eager Page Splitting in the shadow MMU
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Add support for Eager Page Splitting pages that are mapped by the shadow
MMU. Walk through the rmap first splitting all 1GiB pages to 2MiB pages,
and then splitting all 2MiB pages to 4KiB pages.

Splitting huge pages mapped by the shadow MMU requires dealing with some
extra complexity beyond that of the TDP MMU:

(1) The shadow MMU has a limit on the number of shadow pages that are
    allowed to be allocated. So, as a policy, Eager Page Splitting
    refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
    pages available.

(2) Huge pages may be mapped by indirect shadow pages which have the
    possibility of being unsync. As a policy we opt not to split such
    pages as their translation may no longer be valid.

(3) Splitting a huge page may end up re-using an existing lower level
    shadow page table. This is unlike the TDP MMU, which always allocates
    new shadow page tables when splitting.

(4) When installing the lower level SPTEs, they must be added to the
    rmap which may require allocating additional pte_list_desc structs.

Note, for case (3) we have to be careful about dealing with what's
already in the lower level page table. Specifically the lower level page
table may only be partially filled in and may point to even lower level
page tables that are partially filled in. We can fill in non-present
entries, but recursing into the lower level page tables would be too
complex.

This means that Eager Page Splitting may partially unmap a huge page.
To handle this we flush TLBs after dropping the huge SPTE whenever we
are about to install a lower level page table that was partially filled
in (*). We can skip the TLB flush if the lower level page table was
empty (no aliasing) or identical to what we were already going to
populate it with (aliased huge page that was just eagerly split).

(*) This TLB flush could probably be delayed until we're about to drop
the MMU lock, which would also let us batch flushes for multiple splits.
However such scenarios should be rare in practice (a huge page must be
aliased in multiple SPTEs and have been split for NX Huge Pages in only
some of them). Flushing immediately is simpler to plumb and also reduces
the chances of tripping over a CPU bug (e.g. see iTLB multi-hit).

Suggested-by: Peter Feiner <pfeiner@google.com>
[ This commit is based off of the original implementation of Eager Page
  Splitting from Peter in Google's kernel from 2016 that handles
  cases (1) and (2) above. ]
Signed-off-by: David Matlack <dmatlack@google.com>
---
 .../admin-guide/kernel-parameters.txt         |   3 -
 arch/x86/include/asm/kvm_host.h               |  12 +
 arch/x86/kvm/mmu/mmu.c                        | 268 ++++++++++++++++++
 arch/x86/kvm/x86.c                            |   6 +
 4 files changed, 286 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 05161afd7642..495f6ac53801 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2360,9 +2360,6 @@
 			the KVM_CLEAR_DIRTY ioctl, and only for the pages being
 			cleared.
 
-			Eager page splitting currently only supports splitting
-			huge pages mapped by the TDP MMU.
-
 			Default is Y (on).
 
 	kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ffb2b99f3a60..053a32afd18b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1246,6 +1246,16 @@ struct kvm_arch {
 	hpa_t	hv_root_tdp;
 	spinlock_t hv_root_tdp_lock;
 #endif
+
+	/*
+	 * Memory cache used to allocate pte_list_desc structs while splitting
+	 * huge pages. In the worst case, to split one huge page we need 512
+	 * pte_list_desc structs to add each lower level leaf sptep to the rmap
+	 * plus 1 to extend the parent_ptes rmap of the lower level page table.
+	 */
+#define HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY 513
+	__DEFINE_KVM_MMU_MEMORY_CACHE(huge_page_split_desc_cache,
+				      HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY);
 };
 
 struct kvm_vm_stat {
@@ -1621,6 +1631,8 @@ void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
 
+void free_huge_page_split_desc_cache(struct kvm *kvm);
+
 int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
 
 int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a8200b3f8782..9adafed43048 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5972,6 +5972,11 @@ void kvm_mmu_init_vm(struct kvm *kvm)
 	node->track_write = kvm_mmu_pte_write;
 	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
 	kvm_page_track_register_notifier(kvm, node);
+
+	kvm->arch.huge_page_split_desc_cache.capacity =
+		HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY;
+	kvm->arch.huge_page_split_desc_cache.kmem_cache = pte_list_desc_cache;
+	kvm->arch.huge_page_split_desc_cache.gfp_zero = __GFP_ZERO;
 }
 
 void kvm_mmu_uninit_vm(struct kvm *kvm)
@@ -6102,12 +6107,267 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
+static int topup_huge_page_split_desc_cache(struct kvm *kvm, bool locked)
+{
+	gfp_t gfp = gfp_flags_for_split(locked);
+
+	/*
+	 * We may need up to HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY descriptors
+	 * to split any given huge page. We could more accurately calculate how
+	 * many we actually need by inspecting all the rmaps and checking which
+	 * will need new descriptors, but that's not worth the extra cost or
+	 * code complexity.
+	 */
+	return __kvm_mmu_topup_memory_cache(
+			&kvm->arch.huge_page_split_desc_cache,
+			HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY,
+			gfp);
+}
+
+void free_huge_page_split_desc_cache(struct kvm *kvm)
+{
+	kvm_mmu_free_memory_cache(&kvm->arch.huge_page_split_desc_cache);
+}
+
+static int alloc_memory_for_split(struct kvm *kvm, struct kvm_mmu_page **spp,
+				  bool locked)
+{
+	int r;
+
+	r = topup_huge_page_split_desc_cache(kvm, locked);
+	if (r)
+		return r;
+
+	if (!*spp) {
+		*spp = kvm_mmu_alloc_direct_sp_for_split(locked);
+		r = *spp ? 0 : -ENOMEM;
+	}
+
+	return r;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
+						     const struct kvm_memory_slot *slot,
+						     u64 *huge_sptep,
+						     struct kvm_mmu_page **spp)
+{
+	struct kvm_mmu_page *sp, *huge_sp = sptep_to_sp(huge_sptep);
+	union kvm_mmu_page_role role;
+	LIST_HEAD(invalid_list);
+	unsigned int access;
+	gfn_t gfn;
+
+	gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
+	access = kvm_mmu_page_get_access(huge_sp, huge_sptep - huge_sp->spt);
+
+	/*
+	 * Huge page splitting always uses direct shadow pages since we are
+	 * directly mapping the huge page GFN region with smaller pages.
+	 */
+	role = kvm_mmu_child_role(huge_sptep, true, access);
+
+	sp = __kvm_mmu_find_shadow_page(kvm, gfn, role, &invalid_list);
+	if (sp) {
+		/* Direct SPs should never be unsync. */
+		WARN_ON_ONCE(sp->unsync);
+		trace_kvm_mmu_get_page(sp, false);
+	} else {
+		swap(sp, *spp);
+		init_shadow_page(kvm, sp, slot, gfn, role);
+		trace_kvm_mmu_get_page(sp, true);
+	}
+
+	kvm_mmu_commit_zap_page(kvm, &invalid_list);
+
+	return sp;
+}
+
+static void kvm_mmu_split_huge_page(struct kvm *kvm,
+				    const struct kvm_memory_slot *slot,
+				    u64 *huge_sptep, struct kvm_mmu_page **spp)
+
+{
+	struct kvm_mmu_memory_cache *cache = &kvm->arch.huge_page_split_desc_cache;
+	u64 huge_spte = READ_ONCE(*huge_sptep);
+	struct kvm_mmu_page *sp;
+	bool flush = false;
+	u64 *sptep, spte;
+	gfn_t gfn;
+	int index;
+
+	sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
+
+	for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
+		sptep = &sp->spt[index];
+		gfn = kvm_mmu_page_get_gfn(sp, index);
+
+		/*
+		 * sp may have populated page table entries, e.g. if this huge
+		 * page is aliased by multiple sptes with the same access
+		 * permissions. We know the sptes will be mapping the same
+		 * gfn-to-pfn translation since sp is direct. However, a given
+		 * spte may point to an even lower level page table. We don't
+		 * know if that lower level page table is completely filled in,
+		 * i.e. we may be effectively unmapping a region of memory, so
+		 * we must flush the TLB.
+		 */
+		if (is_shadow_present_pte(*sptep)) {
+			flush |= !is_last_spte(*sptep, sp->role.level);
+			continue;
+		}
+
+		spte = make_huge_page_split_spte(huge_spte, sp, index);
+		mmu_spte_set(sptep, spte);
+		__rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access);
+	}
+
+	/*
+	 * Replace the huge spte with a pointer to the populated lower level
+	 * page table. If the lower-level page table identically maps the huge
+	 * page, there's no need for a TLB flush. Otherwise, flush TLBs after
+	 * dropping the huge page and before installing the shadow page table.
+	 */
+	__drop_large_spte(kvm, huge_sptep, flush);
+	__link_shadow_page(cache, huge_sptep, sp);
+}
+
+static int __try_split_huge_page(struct kvm *kvm,
+				 const struct kvm_memory_slot *slot,
+				 u64 *huge_sptep, struct kvm_mmu_page **spp)
+{
+	int r = 0;
+
+	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
+		return -ENOSPC;
+
+	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
+		goto drop_lock;
+
+	r = alloc_memory_for_split(kvm, spp, true);
+	if (r)
+		goto drop_lock;
+
+	kvm_mmu_split_huge_page(kvm, slot, huge_sptep, spp);
+
+	return 0;
+
+drop_lock:
+	write_unlock(&kvm->mmu_lock);
+	cond_resched();
+	r = alloc_memory_for_split(kvm, spp, false);
+	write_lock(&kvm->mmu_lock);
+
+	/*
+	 * Ask the caller to try again if the allocation succeeded. We dropped
+	 * the MMU lock so huge_sptep may no longer be valid.
+	 */
+	return r ?: -EAGAIN;
+}
+
+static int try_split_huge_page(struct kvm *kvm,
+			       const struct kvm_memory_slot *slot,
+			       u64 *huge_sptep, struct kvm_mmu_page **spp)
+{
+	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+	int level, r;
+	gfn_t gfn;
+	u64 spte;
+
+	/*
+	 * Record information about the huge page being split to use in the
+	 * tracepoint below. Do this now because __try_split_huge_page() may
+	 * drop the MMU lock, after which huge_sptep may no longer be a valid
+	 * pointer.
+	 */
+	gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
+	level = huge_sp->role.level;
+	spte = *huge_sptep;
+
+	r = __try_split_huge_page(kvm, slot, huge_sptep, spp);
+
+	trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
+
+	return r;
+}
+
+
+static bool skip_split_huge_page(u64 *huge_sptep)
+{
+	struct kvm_mmu_page *sp = sptep_to_sp(huge_sptep);
+
+	if (WARN_ON_ONCE(!is_large_pte(*huge_sptep)))
+		return true;
+
+	/*
+	 * As a policy, do not split huge pages if the sp on which they reside
+	 * is unsync. Unsync means the guest is modifying the page table being
+	 * shadowed, so splitting may be a waste of cycles and memory.
+	 */
+	return sp->role.invalid || sp->unsync;
+}
+
+static bool rmap_try_split_huge_pages(struct kvm *kvm,
+				      struct kvm_rmap_head *rmap_head,
+				      const struct kvm_memory_slot *slot)
+{
+	struct kvm_mmu_page *sp = NULL;
+	struct rmap_iterator iter;
+	u64 *huge_sptep;
+	int r;
+
+restart:
+	for_each_rmap_spte(rmap_head, &iter, huge_sptep) {
+		if (skip_split_huge_page(huge_sptep))
+			continue;
+
+		r = try_split_huge_page(kvm, slot, huge_sptep, &sp);
+		if (r < 0 && r != -EAGAIN)
+			break;
+
+		/*
+		 * Splitting succeeded (and removed huge_sptep from the
+		 * iterator) or we had to drop the MMU lock. Either way, restart
+		 * the iterator to get it back into a consistent state.
+		 */
+		goto restart;
+	}
+
+	if (sp)
+		kvm_mmu_free_shadow_page(sp);
+
+	return false;
+}
+
+static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
+					  const struct kvm_memory_slot *slot,
+					  gfn_t start, gfn_t end,
+					  int target_level)
+{
+	int level;
+
+	/*
+	 * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
+	 * down to the target level. This ensures pages are recursively split
+	 * all the way to the target level. There's no need to split pages
+	 * already at the target level.
+	 */
+	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
+		slot_handle_level_range(kvm, slot,
+					rmap_try_split_huge_pages,
+					level, level, start, end - 1,
+					true, false);
+	}
+}
+
 /* Must be called with the mmu_lock held in write-mode. */
 void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot,
 				   u64 start, u64 end,
 				   int target_level)
 {
+	if (kvm_memslots_have_rmaps(kvm))
+		kvm_rmap_try_split_huge_pages(kvm, memslot, start, end, target_level);
+
 	if (is_tdp_mmu_enabled(kvm))
 		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end,
 						 target_level, false);
@@ -6125,6 +6385,14 @@ void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
 	u64 start = memslot->base_gfn;
 	u64 end = start + memslot->npages;
 
+	if (kvm_memslots_have_rmaps(kvm)) {
+		topup_huge_page_split_desc_cache(kvm, false);
+		write_lock(&kvm->mmu_lock);
+		kvm_rmap_try_split_huge_pages(kvm, memslot, start, end, target_level);
+		write_unlock(&kvm->mmu_lock);
+		free_huge_page_split_desc_cache(kvm);
+	}
+
 	if (is_tdp_mmu_enabled(kvm)) {
 		read_lock(&kvm->mmu_lock);
 		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d3a9ce07a565..02728c3f088e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12106,6 +12106,12 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 		 * page faults will create the large-page sptes.
 		 */
 		kvm_mmu_zap_collapsible_sptes(kvm, new);
+
+		/*
+		 * Free any memory left behind by eager page splitting. Ignore
+		 * the module parameter since userspace might have changed it.
+		 */
+		free_huge_page_split_desc_cache(kvm);
 	} else {
 		/*
 		 * Initially-all-set does not require write protecting any page,
-- 
2.35.1.1094.g7c7d902a7c-goog



* [PATCH v3 22/23] KVM: x86/mmu: Support Eager Page Splitting in the shadow MMU
@ 2022-04-01 17:55   ` David Matlack
  0 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	David Matlack, Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

Add support for Eager Page Splitting pages that are mapped by the shadow
MMU. Walk through the rmap first splitting all 1GiB pages to 2MiB pages,
and then splitting all 2MiB pages to 4KiB pages.

Splitting huge pages mapped by the shadow MMU requires dealing with some
extra complexity beyond that of the TDP MMU:

(1) The shadow MMU has a limit on the number of shadow pages that are
    allowed to be allocated. So, as a policy, Eager Page Splitting
    refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
    pages available.

(2) Huge pages may be mapped by indirect shadow pages which have the
    possibility of being unsync. As a policy we opt not to split such
    pages as their translation may no longer be valid.

(3) Splitting a huge page may end up re-using an existing lower level
    shadow page table. This is unlike the TDP MMU, which always allocates
    new shadow page tables when splitting.

(4) When installing the lower level SPTEs, they must be added to the
    rmap which may require allocating additional pte_list_desc structs.

Note, for case (3) we have to be careful about dealing with what's
already in the lower level page table. Specifically the lower level page
table may only be partially filled in and may point to even lower level
page tables that are partially filled in. We can fill in non-present
entries, but recursing into the lower level page tables would be too
complex.

This means that Eager Page Splitting may partially unmap a huge page.
To handle this we flush TLBs after dropping the huge SPTE whenever we
are about to install a lower level page table that was partially filled
in (*). We can skip the TLB flush if the lower level page table was
empty (no aliasing) or identical to what we were already going to
populate it with (aliased huge page that was just eagerly split).

(*) This TLB flush could probably be delayed until we're about to drop
the MMU lock, which would also let us batch flushes for multiple splits.
However such scenarios should be rare in practice (a huge page must be
aliased in multiple SPTEs and have been split for NX Huge Pages in only
some of them). Flushing immediately is simpler to plumb and also reduces
the chances of tripping over a CPU bug (e.g. see iTLB multi-hit).

Suggested-by: Peter Feiner <pfeiner@google.com>
[ This commit is based off of the original implementation of Eager Page
  Splitting from Peter in Google's kernel from 2016 that handles
  cases (1) and (2) above. ]
Signed-off-by: David Matlack <dmatlack@google.com>
---
 .../admin-guide/kernel-parameters.txt         |   3 -
 arch/x86/include/asm/kvm_host.h               |  12 +
 arch/x86/kvm/mmu/mmu.c                        | 268 ++++++++++++++++++
 arch/x86/kvm/x86.c                            |   6 +
 4 files changed, 286 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 05161afd7642..495f6ac53801 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2360,9 +2360,6 @@
 			the KVM_CLEAR_DIRTY ioctl, and only for the pages being
 			cleared.
 
-			Eager page splitting currently only supports splitting
-			huge pages mapped by the TDP MMU.
-
 			Default is Y (on).
 
 	kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ffb2b99f3a60..053a32afd18b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1246,6 +1246,16 @@ struct kvm_arch {
 	hpa_t	hv_root_tdp;
 	spinlock_t hv_root_tdp_lock;
 #endif
+
+	/*
+	 * Memory cache used to allocate pte_list_desc structs while splitting
+	 * huge pages. In the worst case, to split one huge page we need 512
+	 * pte_list_desc structs to add each lower level leaf sptep to the rmap
+	 * plus 1 to extend the parent_ptes rmap of the lower level page table.
+	 */
+#define HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY 513
+	__DEFINE_KVM_MMU_MEMORY_CACHE(huge_page_split_desc_cache,
+				      HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY);
 };
 
 struct kvm_vm_stat {
@@ -1621,6 +1631,8 @@ void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
 
+void free_huge_page_split_desc_cache(struct kvm *kvm);
+
 int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
 
 int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a8200b3f8782..9adafed43048 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5972,6 +5972,11 @@ void kvm_mmu_init_vm(struct kvm *kvm)
 	node->track_write = kvm_mmu_pte_write;
 	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
 	kvm_page_track_register_notifier(kvm, node);
+
+	kvm->arch.huge_page_split_desc_cache.capacity =
+		HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY;
+	kvm->arch.huge_page_split_desc_cache.kmem_cache = pte_list_desc_cache;
+	kvm->arch.huge_page_split_desc_cache.gfp_zero = __GFP_ZERO;
 }
 
 void kvm_mmu_uninit_vm(struct kvm *kvm)
@@ -6102,12 +6107,267 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
+static int topup_huge_page_split_desc_cache(struct kvm *kvm, bool locked)
+{
+	gfp_t gfp = gfp_flags_for_split(locked);
+
+	/*
+	 * We may need up to HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY descriptors
+	 * to split any given huge page. We could more accurately calculate how
+	 * many we actually need by inspecting all the rmaps and check which
+	 * will need new descriptors, but that's not worth the extra cost or
+	 * code complexity.
+	 */
+	return __kvm_mmu_topup_memory_cache(
+			&kvm->arch.huge_page_split_desc_cache,
+			HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY,
+			gfp);
+}
+
+void free_huge_page_split_desc_cache(struct kvm *kvm)
+{
+	kvm_mmu_free_memory_cache(&kvm->arch.huge_page_split_desc_cache);
+}
+
+static int alloc_memory_for_split(struct kvm *kvm, struct kvm_mmu_page **spp,
+				  bool locked)
+{
+	int r;
+
+	r = topup_huge_page_split_desc_cache(kvm, locked);
+	if (r)
+		return r;
+
+	if (!*spp) {
+		*spp = kvm_mmu_alloc_direct_sp_for_split(locked);
+		r = *spp ? 0 : -ENOMEM;
+	}
+
+	return r;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
+						     const struct kvm_memory_slot *slot,
+						     u64 *huge_sptep,
+						     struct kvm_mmu_page **spp)
+{
+	struct kvm_mmu_page *sp, *huge_sp = sptep_to_sp(huge_sptep);
+	union kvm_mmu_page_role role;
+	LIST_HEAD(invalid_list);
+	unsigned int access;
+	gfn_t gfn;
+
+	gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
+	access = kvm_mmu_page_get_access(huge_sp, huge_sptep - huge_sp->spt);
+
+	/*
+	 * Huge page splitting always uses direct shadow pages since we are
+	 * directly mapping the huge page GFN region with smaller pages.
+	 */
+	role = kvm_mmu_child_role(huge_sptep, true, access);
+
+	sp = __kvm_mmu_find_shadow_page(kvm, gfn, role, &invalid_list);
+	if (sp) {
+		/* Direct SPs should never be unsync. */
+		WARN_ON_ONCE(sp->unsync);
+		trace_kvm_mmu_get_page(sp, false);
+	} else {
+		swap(sp, *spp);
+		init_shadow_page(kvm, sp, slot, gfn, role);
+		trace_kvm_mmu_get_page(sp, true);
+	}
+
+	kvm_mmu_commit_zap_page(kvm, &invalid_list);
+
+	return sp;
+}
+
+static void kvm_mmu_split_huge_page(struct kvm *kvm,
+				    const struct kvm_memory_slot *slot,
+				    u64 *huge_sptep, struct kvm_mmu_page **spp)
+
+{
+	struct kvm_mmu_memory_cache *cache = &kvm->arch.huge_page_split_desc_cache;
+	u64 huge_spte = READ_ONCE(*huge_sptep);
+	struct kvm_mmu_page *sp;
+	bool flush = false;
+	u64 *sptep, spte;
+	gfn_t gfn;
+	int index;
+
+	sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
+
+	for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
+		sptep = &sp->spt[index];
+		gfn = kvm_mmu_page_get_gfn(sp, index);
+
+		/*
+		 * sp may have populated page table entries, e.g. if this huge
+		 * page is aliased by multiple sptes with the same access
+		 * permissions. We know the sptes will be mapping the same
+		 * gfn-to-pfn translation since sp is direct. However, a given
+		 * spte may point to an even lower level page table. We don't
+		 * know if that lower level page table is completely filled in,
+		 * i.e. we may be effectively unmapping a region of memory, so
+		 * we must flush the TLB.
+		 */
+		if (is_shadow_present_pte(*sptep)) {
+			flush |= !is_last_spte(*sptep, sp->role.level);
+			continue;
+		}
+
+		spte = make_huge_page_split_spte(huge_spte, sp, index);
+		mmu_spte_set(sptep, spte);
+		__rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access);
+	}
+
+	/*
+	 * Replace the huge spte with a pointer to the populated lower level
+	 * page table. If the lower-level page table identically maps the huge
+	 * page, there's no need for a TLB flush. Otherwise, flush TLBs after
+	 * dropping the huge page and before installing the shadow page table.
+	 */
+	__drop_large_spte(kvm, huge_sptep, flush);
+	__link_shadow_page(cache, huge_sptep, sp);
+}
+
+static int __try_split_huge_page(struct kvm *kvm,
+				 const struct kvm_memory_slot *slot,
+				 u64 *huge_sptep, struct kvm_mmu_page **spp)
+{
+	int r = 0;
+
+	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
+		return -ENOSPC;
+
+	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
+		goto drop_lock;
+
+	r = alloc_memory_for_split(kvm, spp, true);
+	if (r)
+		goto drop_lock;
+
+	kvm_mmu_split_huge_page(kvm, slot, huge_sptep, spp);
+
+	return 0;
+
+drop_lock:
+	write_unlock(&kvm->mmu_lock);
+	cond_resched();
+	r = alloc_memory_for_split(kvm, spp, false);
+	write_lock(&kvm->mmu_lock);
+
+	/*
+	 * Ask the caller to try again if the allocation succeeded. We dropped
+	 * the MMU lock so huge_sptep may no longer be valid.
+	 */
+	return r ?: -EAGAIN;
+}
+
+static int try_split_huge_page(struct kvm *kvm,
+			       const struct kvm_memory_slot *slot,
+			       u64 *huge_sptep, struct kvm_mmu_page **spp)
+{
+	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+	int level, r;
+	gfn_t gfn;
+	u64 spte;
+
+	/*
+	 * Record information about the huge page being split to use in the
+	 * tracepoint below. Do this now because __try_split_huge_page() may
+	 * drop the MMU lock, after which huge_sptep may no longer be a valid
+	 * pointer.
+	 */
+	gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
+	level = huge_sp->role.level;
+	spte = *huge_sptep;
+
+	r = __try_split_huge_page(kvm, slot, huge_sptep, spp);
+
+	trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
+
+	return r;
+}
+
+
+static bool skip_split_huge_page(u64 *huge_sptep)
+{
+	struct kvm_mmu_page *sp = sptep_to_sp(huge_sptep);
+
+	if (WARN_ON_ONCE(!is_large_pte(*huge_sptep)))
+		return true;
+
+	/*
+	 * As a policy, do not split huge pages if the sp on which they reside
+	 * is unsync. Unsync means the guest is modifying the page table being
+	 * shadowed, so splitting may be a waste of cycles and memory.
+	 */
+	return sp->role.invalid || sp->unsync;
+}
+
+static bool rmap_try_split_huge_pages(struct kvm *kvm,
+				      struct kvm_rmap_head *rmap_head,
+				      const struct kvm_memory_slot *slot)
+{
+	struct kvm_mmu_page *sp = NULL;
+	struct rmap_iterator iter;
+	u64 *huge_sptep;
+	int r;
+
+restart:
+	for_each_rmap_spte(rmap_head, &iter, huge_sptep) {
+		if (skip_split_huge_page(huge_sptep))
+			continue;
+
+		r = try_split_huge_page(kvm, slot, huge_sptep, &sp);
+		if (r < 0 && r != -EAGAIN)
+			break;
+
+		/*
+		 * Splitting succeeded (and removed huge_sptep from the
+		 * iterator) or we had to drop the MMU lock. Either way, restart
+		 * the iterator to get it back into a consistent state.
+		 */
+		goto restart;
+	}
+
+	if (sp)
+		kvm_mmu_free_shadow_page(sp);
+
+	return false;
+}
+
+static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
+					  const struct kvm_memory_slot *slot,
+					  gfn_t start, gfn_t end,
+					  int target_level)
+{
+	int level;
+
+	/*
+	 * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
+	 * down to the target level. This ensures pages are recursively split
+	 * all the way to the target level. There's no need to split pages
+	 * already at the target level.
+	 */
+	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
+		slot_handle_level_range(kvm, slot,
+					rmap_try_split_huge_pages,
+					level, level, start, end - 1,
+					true, false);
+	}
+}
+
 /* Must be called with the mmu_lock held in write-mode. */
 void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot,
 				   u64 start, u64 end,
 				   int target_level)
 {
+	if (kvm_memslots_have_rmaps(kvm))
+		kvm_rmap_try_split_huge_pages(kvm, memslot, start, end, target_level);
+
 	if (is_tdp_mmu_enabled(kvm))
 		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end,
 						 target_level, false);
@@ -6125,6 +6385,14 @@ void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
 	u64 start = memslot->base_gfn;
 	u64 end = start + memslot->npages;
 
+	if (kvm_memslots_have_rmaps(kvm)) {
+		topup_huge_page_split_desc_cache(kvm, false);
+		write_lock(&kvm->mmu_lock);
+		kvm_rmap_try_split_huge_pages(kvm, memslot, start, end, target_level);
+		write_unlock(&kvm->mmu_lock);
+		free_huge_page_split_desc_cache(kvm);
+	}
+
 	if (is_tdp_mmu_enabled(kvm)) {
 		read_lock(&kvm->mmu_lock);
 		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d3a9ce07a565..02728c3f088e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12106,6 +12106,12 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 		 * page faults will create the large-page sptes.
 		 */
 		kvm_mmu_zap_collapsible_sptes(kvm, new);
+
+		/*
+		 * Free any memory left behind by eager page splitting. Ignore
+		 * the module parameter since userspace might have changed it.
+		 */
+		free_huge_page_split_desc_cache(kvm);
 	} else {
 		/*
 		 * Initially-all-set does not require write protecting any page,
-- 
2.35.1.1094.g7c7d902a7c-goog

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v3 23/23] KVM: selftests: Map x86_64 guest virtual memory with huge pages
  2022-04-01 17:55 ` David Matlack
@ 2022-04-01 17:55   ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-01 17:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Override virt_map() in x86_64 selftests to use the largest page size
possible when mapping guest virtual memory. This enables testing eager
page splitting with shadow paging (e.g. kvm_intel.ept=N), as it allows
KVM to shadow guest memory with huge pages.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 .../selftests/kvm/include/x86_64/processor.h  |  6 ++++
 tools/testing/selftests/kvm/lib/kvm_util.c    |  4 +--
 .../selftests/kvm/lib/x86_64/processor.c      | 31 +++++++++++++++++++
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 37db341d4cc5..efb228d2fbf7 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -470,6 +470,12 @@ enum x86_page_size {
 	X86_PAGE_SIZE_2M,
 	X86_PAGE_SIZE_1G,
 };
+
+static inline size_t page_size_bytes(enum x86_page_size page_size)
+{
+	return 1UL << (page_size * 9 + 12);
+}
+
 void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr,
 		   enum x86_page_size page_size);
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 1665a220abcb..60198587236d 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1432,8 +1432,8 @@ vm_vaddr_t vm_vaddr_alloc_page(struct kvm_vm *vm)
  * Within the VM given by @vm, creates a virtual translation for
  * @npages starting at @vaddr to the page range starting at @paddr.
  */
-void virt_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr,
-	      unsigned int npages)
+void __weak virt_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr,
+		     unsigned int npages)
 {
 	size_t page_size = vm->page_size;
 	size_t size = npages * page_size;
diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c
index 9f000dfb5594..7df84292d5de 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c
@@ -282,6 +282,37 @@ void virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr)
 	__virt_pg_map(vm, vaddr, paddr, X86_PAGE_SIZE_4K);
 }
 
+void virt_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, unsigned int npages)
+{
+	size_t size = (size_t) npages * vm->page_size;
+	size_t vend = vaddr + size;
+	enum x86_page_size page_size;
+	size_t stride;
+
+	TEST_ASSERT(vaddr + size > vaddr, "Vaddr overflow");
+	TEST_ASSERT(paddr + size > paddr, "Paddr overflow");
+
+	/*
+	 * Map the region with all 1G pages if possible, falling back to all
+	 * 2M pages, and finally all 4K pages. This could be improved to use
+	 * a mix of page sizes so that more of the region is mapped with large
+	 * pages.
+	 */
+	for (page_size = X86_PAGE_SIZE_1G; page_size >= X86_PAGE_SIZE_4K; page_size--) {
+		stride = page_size_bytes(page_size);
+
+		if (!(vaddr % stride) && !(paddr % stride) && !(size % stride))
+			break;
+	}
+
+	TEST_ASSERT(page_size >= X86_PAGE_SIZE_4K,
+		    "Cannot map unaligned region: vaddr 0x%lx paddr 0x%lx npages 0x%x\n",
+		    vaddr, paddr, npages);
+
+	for (; vaddr < vend; vaddr += stride, paddr += stride)
+		__virt_pg_map(vm, vaddr, paddr, page_size);
+}
+
 static struct pageTableEntry *_vm_get_page_table_entry(struct kvm_vm *vm, int vcpuid,
 						       uint64_t vaddr)
 {
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH v3 16/23] KVM: x86/mmu: Cache the access bits of shadowed translations
  2022-04-01 17:55   ` David Matlack
@ 2022-04-02  6:19     ` kernel test robot
  -1 siblings, 0 replies; 88+ messages in thread
From: kernel test robot @ 2022-04-02  6:19 UTC (permalink / raw)
  To: David Matlack, Paolo Bonzini
  Cc: kbuild-all, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, Peter Xu,
	maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Hi David,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on 19164ad08bf668bca4f4bfbaacaa0a47c1b737a6]

url:    https://github.com/intel-lab-lkp/linux/commits/David-Matlack/KVM-Extend-Eager-Page-Splitting-to-the-shadow-MMU/20220402-015911
base:   19164ad08bf668bca4f4bfbaacaa0a47c1b737a6
config: x86_64-randconfig-a002 (https://download.01.org/0day-ci/archive/20220402/202204021419.Rh6Pgcl9-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.2.0-19) 11.2.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/5f7a06676291033d880081035c2efae13702a0c4
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review David-Matlack/KVM-Extend-Eager-Page-Splitting-to-the-shadow-MMU/20220402-015911
        git checkout 5f7a06676291033d880081035c2efae13702a0c4
        # save the config file to linux build tree
        mkdir build_dir
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash arch/x86/kvm/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from include/linux/kernel.h:29,
                    from include/linux/cpumask.h:10,
                    from include/linux/mm_types_task.h:14,
                    from include/linux/mm_types.h:5,
                    from arch/x86/kvm/irq.h:13,
                    from arch/x86/kvm/mmu/mmu.c:18:
   arch/x86/kvm/mmu/mmu.c: In function 'kvm_mmu_page_set_access':
>> include/linux/kern_levels.h:5:25: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 3 has type 'int' [-Wformat=]
       5 | #define KERN_SOH        "\001"          /* ASCII Start Of Header */
         |                         ^~~~~~
   include/linux/printk.h:418:25: note: in definition of macro 'printk_index_wrap'
     418 |                 _p_func(_fmt, ##__VA_ARGS__);                           \
         |                         ^~~~
   include/linux/printk.h:640:17: note: in expansion of macro 'printk'
     640 |                 printk(fmt, ##__VA_ARGS__);                             \
         |                 ^~~~~~
   include/linux/printk.h:654:9: note: in expansion of macro 'printk_ratelimited'
     654 |         printk_ratelimited(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
         |         ^~~~~~~~~~~~~~~~~~
   include/linux/kern_levels.h:11:25: note: in expansion of macro 'KERN_SOH'
      11 | #define KERN_ERR        KERN_SOH "3"    /* error conditions */
         |                         ^~~~~~~~
   include/linux/printk.h:654:28: note: in expansion of macro 'KERN_ERR'
     654 |         printk_ratelimited(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
         |                            ^~~~~~~~
   arch/x86/kvm/mmu/mmu.c:763:17: note: in expansion of macro 'pr_err_ratelimited'
     763 |                 pr_err_ratelimited("access mismatch under direct page %llx "
         |                 ^~~~~~~~~~~~~~~~~~
>> include/linux/kern_levels.h:5:25: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 4 has type 'u32' {aka 'unsigned int'} [-Wformat=]
       5 | #define KERN_SOH        "\001"          /* ASCII Start Of Header */
         |                         ^~~~~~
   include/linux/printk.h:418:25: note: in definition of macro 'printk_index_wrap'
     418 |                 _p_func(_fmt, ##__VA_ARGS__);                           \
         |                         ^~~~
   include/linux/printk.h:640:17: note: in expansion of macro 'printk'
     640 |                 printk(fmt, ##__VA_ARGS__);                             \
         |                 ^~~~~~
   include/linux/printk.h:654:9: note: in expansion of macro 'printk_ratelimited'
     654 |         printk_ratelimited(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
         |         ^~~~~~~~~~~~~~~~~~
   include/linux/kern_levels.h:11:25: note: in expansion of macro 'KERN_SOH'
      11 | #define KERN_ERR        KERN_SOH "3"    /* error conditions */
         |                         ^~~~~~~~
   include/linux/printk.h:654:28: note: in expansion of macro 'KERN_ERR'
     654 |         printk_ratelimited(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
         |                            ^~~~~~~~
   arch/x86/kvm/mmu/mmu.c:763:17: note: in expansion of macro 'pr_err_ratelimited'
     763 |                 pr_err_ratelimited("access mismatch under direct page %llx "
         |                 ^~~~~~~~~~~~~~~~~~


vim +5 include/linux/kern_levels.h

314ba3520e513a7 Joe Perches 2012-07-30  4  
04d2c8c83d0e3ac Joe Perches 2012-07-30 @5  #define KERN_SOH	"\001"		/* ASCII Start Of Header */
04d2c8c83d0e3ac Joe Perches 2012-07-30  6  #define KERN_SOH_ASCII	'\001'
04d2c8c83d0e3ac Joe Perches 2012-07-30  7  

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 88+ messages in thread


* Re: [PATCH v3 16/23] KVM: x86/mmu: Cache the access bits of shadowed translations
  2022-04-01 17:55   ` David Matlack
@ 2022-04-02  7:01     ` kernel test robot
  -1 siblings, 0 replies; 88+ messages in thread
From: kernel test robot @ 2022-04-02  7:01 UTC (permalink / raw)
  To: David Matlack, Paolo Bonzini
  Cc: llvm, kbuild-all, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, Peter Xu,
	maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Hi David,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on 19164ad08bf668bca4f4bfbaacaa0a47c1b737a6]

url:    https://github.com/intel-lab-lkp/linux/commits/David-Matlack/KVM-Extend-Eager-Page-Splitting-to-the-shadow-MMU/20220402-015911
base:   19164ad08bf668bca4f4bfbaacaa0a47c1b737a6
config: x86_64-randconfig-a014 (https://download.01.org/0day-ci/archive/20220402/202204021411.oOhP2vFP-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project c4a1b07d0979e7ff20d7d541af666d822d66b566)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/5f7a06676291033d880081035c2efae13702a0c4
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review David-Matlack/KVM-Extend-Eager-Page-Splitting-to-the-shadow-MMU/20220402-015911
        git checkout 5f7a06676291033d880081035c2efae13702a0c4
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash arch/x86/kvm/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> arch/x86/kvm/mmu/mmu.c:766:8: warning: format specifies type 'unsigned long long' but the argument has type 'int' [-Wformat]
                                      sp->role.access, access);
                                      ^~~~~~~~~~~~~~~
   include/linux/printk.h:654:45: note: expanded from macro 'pr_err_ratelimited'
           printk_ratelimited(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
                                              ~~~     ^~~~~~~~~~~
   include/linux/printk.h:640:17: note: expanded from macro 'printk_ratelimited'
                   printk(fmt, ##__VA_ARGS__);                             \
                          ~~~    ^~~~~~~~~~~
   include/linux/printk.h:446:60: note: expanded from macro 'printk'
   #define printk(fmt, ...) printk_index_wrap(_printk, fmt, ##__VA_ARGS__)
                                                       ~~~    ^~~~~~~~~~~
   include/linux/printk.h:418:19: note: expanded from macro 'printk_index_wrap'
                   _p_func(_fmt, ##__VA_ARGS__);                           \
                           ~~~~    ^~~~~~~~~~~
>> arch/x86/kvm/mmu/mmu.c:766:25: warning: format specifies type 'unsigned long long' but the argument has type 'u32' (aka 'unsigned int') [-Wformat]
                                      sp->role.access, access);
                                                       ^~~~~~
   include/linux/printk.h:654:45: note: expanded from macro 'pr_err_ratelimited'
           printk_ratelimited(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
                                              ~~~     ^~~~~~~~~~~
   include/linux/printk.h:640:17: note: expanded from macro 'printk_ratelimited'
                   printk(fmt, ##__VA_ARGS__);                             \
                          ~~~    ^~~~~~~~~~~
   include/linux/printk.h:446:60: note: expanded from macro 'printk'
   #define printk(fmt, ...) printk_index_wrap(_printk, fmt, ##__VA_ARGS__)
                                                       ~~~    ^~~~~~~~~~~
   include/linux/printk.h:418:19: note: expanded from macro 'printk_index_wrap'
                   _p_func(_fmt, ##__VA_ARGS__);                           \
                           ~~~~    ^~~~~~~~~~~
   2 warnings generated.


vim +766 arch/x86/kvm/mmu/mmu.c

   754	
   755	static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index, u32 access)
   756	{
   757		if (!sp->role.direct) {
   758			sp->shadowed_translation[index].access = access;
   759			return;
   760		}
   761	
   762		if (WARN_ON(access != sp->role.access))
   763			pr_err_ratelimited("access mismatch under direct page %llx "
   764					   "(expected %llx, got %llx)\n",
   765					   kvm_mmu_page_get_gfn(sp, index),
 > 766					   sp->role.access, access);
   767	}
   768	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v3 02/23] KVM: x86/mmu: Use a bool for direct
  2022-04-01 17:55   ` David Matlack
@ 2022-04-08 22:24     ` Sean Christopherson
  -1 siblings, 0 replies; 88+ messages in thread
From: Sean Christopherson @ 2022-04-08 22:24 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 01, 2022, David Matlack wrote:
> The parameter "direct" can either be true or false, and all of the
> callers pass in a bool variable or true/false literal, so just use the
> type bool.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v3 16/23] KVM: x86/mmu: Cache the access bits of shadowed translations
  2022-04-01 17:55   ` David Matlack
@ 2022-04-09  0:02     ` Sean Christopherson
  -1 siblings, 0 replies; 88+ messages in thread
From: Sean Christopherson @ 2022-04-09  0:02 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 01, 2022, David Matlack wrote:
> @@ -733,7 +733,7 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
>  static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>  {
>  	if (!sp->role.direct)
> -		return sp->gfns[index];
> +		return sp->shadowed_translation[index].gfn;
>  
>  	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
>  }
> @@ -741,7 +741,7 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>  static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)

This should be replaced with a single helper to set the gfn+access.  Under no
circumstance should _just_ the gfn change, and that will allow us to optimize
writing the entry.  More below.

>  {
>  	if (!sp->role.direct) {
> -		sp->gfns[index] = gfn;
> +		sp->shadowed_translation[index].gfn = gfn;
>  		return;
>  	}
>  
> @@ -752,6 +752,47 @@ static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
>  				   kvm_mmu_page_get_gfn(sp, index), gfn);
>  }
>  
> +static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index, u32 access)
> +{
> +	if (!sp->role.direct) {
> +		sp->shadowed_translation[index].access = access;
> +		return;
> +	}
> +
> +	if (WARN_ON(access != sp->role.access))
> +		pr_err_ratelimited("access mismatch under direct page %llx "

LOL, I realize this is not your code, but ratelimiting under a WARN ain't gonna
help much :-)

This also generates a warning and fails to compile with KVM_WERROR=y, though I
believe the test bots already reported that.


arch/x86/kvm/mmu/mmu.c: In function ‘kvm_mmu_page_set_access’:
include/linux/kern_levels.h:5:25: error: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 3 has type ‘int’ [-Werror=format=]
    5 | #define KERN_SOH        "\001"          /* ASCII Start Of Header */
      |                         ^~~~~~
include/linux/printk.h:418:25: note: in definition of macro ‘printk_index_wrap’
  418 |                 _p_func(_fmt, ##__VA_ARGS__);                           \
      |                         ^~~~
include/linux/printk.h:640:17: note: in expansion of macro ‘printk’
  640 |                 printk(fmt, ##__VA_ARGS__);                             \
      |                 ^~~~~~
include/linux/printk.h:654:9: note: in expansion of macro ‘printk_ratelimited’
  654 |         printk_ratelimited(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
      |         ^~~~~~~~~~~~~~~~~~
include/linux/kern_levels.h:11:25: note: in expansion of macro ‘KERN_SOH’
   11 | #define KERN_ERR        KERN_SOH "3"    /* error conditions */
      |                         ^~~~~~~~
include/linux/printk.h:654:28: note: in expansion of macro ‘KERN_ERR’
  654 |         printk_ratelimited(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
      |                            ^~~~~~~~
arch/x86/kvm/mmu/mmu.c:763:17: note: in expansion of macro ‘pr_err_ratelimited’
  763 |                 pr_err_ratelimited("access mismatch under direct page %llx "
      |                 ^~~~~~~~~~~~~~~~~~


> +				   "(expected %llx, got %llx)\n",
> +				   kvm_mmu_page_get_gfn(sp, index),
> +				   sp->role.access, access);
> +}
> +
> +/*
> + * For leaf SPTEs, fetch the *guest* access permissions being shadowed. Note
> + * that the SPTE itself may have more constrained access permissions than
> + * what the guest enforces. For example, a guest may create an executable
> + * huge PTE but KVM may disallow execution to mitigate iTLB multihit.
> + */
> +static u32 kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
> +{
> +	if (!sp->role.direct)
> +		return sp->shadowed_translation[index].access;
> +
> +	/*
> +	 * For direct MMUs (e.g. TDP or non-paging guests) there are no *guest*
> +	 * access permissions being shadowed. So we can just return ACC_ALL
> +	 * here.
> +	 *
> +	 * For indirect MMUs (shadow paging), direct shadow pages exist when KVM
> +	 * is shadowing a guest huge page with smaller pages, since the guest
> +	 * huge page is being directly mapped. In this case the guest access
> +	 * permissions being shadowed are the access permissions of the huge
> +	 * page.
> +	 *
> +	 * In both cases, sp->role.access contains exactly what we want.
> +	 */
> +	return sp->role.access;
> +}

...

> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index b6e22ba9c654..3f76f4c1ae59 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -32,6 +32,18 @@ extern bool dbg;
>  
>  typedef u64 __rcu *tdp_ptep_t;
>  
> +/*
> + * Stores the result of the guest translation being shadowed by an SPTE. KVM
> + * shadows two types of guest translations: nGPA -> GPA (shadow EPT/NPT) and
> + * GVA -> GPA (traditional shadow paging). In both cases the result of the
> + * translation is a GPA and a set of access constraints.
> + */
> +struct shadowed_translation_entry {
> +	/* Note, GFNs can have at most 64 - PAGE_SHIFT = 52 bits. */
> +	u64 gfn:52;
> +	u64 access:3;

A bitfield is completely unnecessary and generates bad code.  As is, it generates
_really_ bad code because extracting and setting requires non-standard 64-bit value
masks, multiple operations, and accesses to unaligned data.  The generated code can
be made slightly less awful by using a full byte for access and 64 bits for GFN,
but it still sucks compared to what we can hand generate.

The other aspect of this is that retrieving the GFN is a frequent operation,
whereas the access is almost never read.  I.e. we should bias for reading the GFN
above all else.

The simple and obvious thing is to not reinvent the wheel.  GFN = (GPA >> PAGE_SHIFT),
and ignoring NX, access lives in the lower 12 bits of a PTE.  Then reading the GFN is
a simple SHR, and reading access info is a simple AND.

We might also be able to optimize FNAME(sync_page), but I don't care much about
that, it's rarely used for nested TDP.

So, keep translation_entry a gfn_t *, then do:

static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
{
	if (!sp->role.direct)
		return sp->shadowed_translation[index] >> PAGE_SHIFT;

	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
}

static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index,
					 gfn_t gfn, unsigned int access)
{
	if (!sp->role.direct) {
		sp->shadowed_translation[index] = (gfn << PAGE_SHIFT) | access;
		return;
	}

	if (WARN_ON(gfn != kvm_mmu_page_get_gfn(sp, index)))
		pr_err_ratelimited("gfn mismatch under direct page %llx "
				   "(expected %llx, got %llx)\n",
				   sp->gfn,
				   kvm_mmu_page_get_gfn(sp, index), gfn);
}

static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index,
				    unsigned int access)
{
	if (sp->role.direct)
		return;

	sp->shadowed_translation[index] &= PAGE_MASK;
	sp->shadowed_translation[index] |= access;
}


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v3 22/23] KVM: x86/mmu: Support Eager Page Splitting in the shadow MMU
  2022-04-01 17:55   ` David Matlack
@ 2022-04-09  0:39     ` Sean Christopherson
  -1 siblings, 0 replies; 88+ messages in thread
From: Sean Christopherson @ 2022-04-09  0:39 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 01, 2022, David Matlack wrote:
> Add support for Eager Page Splitting pages that are mapped by the shadow
> MMU. Walk through the rmap first splitting all 1GiB pages to 2MiB pages,
> and then splitting all 2MiB pages to 4KiB pages.
> 
> Splitting huge pages mapped by the shadow MMU requires dealing with some
> extra complexity beyond that of the TDP MMU:
> 
> (1) The shadow MMU has a limit on the number of shadow pages that are
>     allowed to be allocated. So, as a policy, Eager Page Splitting
>     refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
>     pages available.
> 
> (2) Huge pages may be mapped by indirect shadow pages which have the
>     possibility of being unsync. As a policy we opt not to split such
>     pages as their translation may no longer be valid.

This shouldn't be possible, shadow pages whose role is > 4k are always write-protected
and not allowed to become unsync.

> 
> (3) Splitting a huge page may end up re-using an existing lower level
>     shadow page tables. This is unlike the TDP MMU which always allocates
>     new shadow page tables when splitting.

...

> +static void kvm_mmu_split_huge_page(struct kvm *kvm,
> +				    const struct kvm_memory_slot *slot,
> +				    u64 *huge_sptep, struct kvm_mmu_page **spp)
> +
> +{
> +	struct kvm_mmu_memory_cache *cache = &kvm->arch.huge_page_split_desc_cache;
> +	u64 huge_spte = READ_ONCE(*huge_sptep);
> +	struct kvm_mmu_page *sp;
> +	bool flush = false;
> +	u64 *sptep, spte;
> +	gfn_t gfn;
> +	int index;
> +
> +	sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
> +
> +	for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
> +		sptep = &sp->spt[index];
> +		gfn = kvm_mmu_page_get_gfn(sp, index);
> +
> +		/*
> +		 * sp may have populated page table entries, e.g. if this huge
> +		 * page is aliased by multiple sptes with the same access
> +		 * permissions. We know the sptes will be mapping the same
> +		 * gfn-to-pfn translation since sp is direct. However, a given
> +		 * spte may point to an even lower level page table. We don't
> +		 * know if that lower level page table is completely filled in,
> +		 * i.e. we may be effectively unmapping a region of memory, so
> +		 * we must flush the TLB.

Random side topic, please avoid "we" and other pronouns in comments and changelogs,
it gets real easy to lose track of what a pronoun is referring to, especially in
changelogs where "we" might be KVM, might be the kernel, might be the team that's
using the patch, might be an author that's prone to illeism, etc...

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v3 22/23] KVM: x86/mmu: Support Eager Page Splitting in the shadow MMU
@ 2022-04-09  0:39     ` Sean Christopherson
  0 siblings, 0 replies; 88+ messages in thread
From: Sean Christopherson @ 2022-04-09  0:39 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Fri, Apr 01, 2022, David Matlack wrote:
> Add support for Eager Page Splitting pages that are mapped by the shadow
> MMU. Walk through the rmap first splitting all 1GiB pages to 2MiB pages,
> and then splitting all 2MiB pages to 4KiB pages.
> 
> Splitting huge pages mapped by the shadow MMU requires dealing with some
> extra complexity beyond that of the TDP MMU:
> 
> (1) The shadow MMU has a limit on the number of shadow pages that are
>     allowed to be allocated. So, as a policy, Eager Page Splitting
>     refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
>     pages available.
> 
> (2) Huge pages may be mapped by indirect shadow pages which have the
>     possibility of being unsync. As a policy we opt not to split such
>     pages as their translation may no longer be valid.

This shouldn't be possible, shadow pages whose role is > 4k are always write-protected
and not allowed to become unsync.

> 
> (3) Splitting a huge page may end up re-using an existing lower level
>     shadow page tables. This is unlike the TDP MMU which always allocates
>     new shadow page tables when splitting.

...

> +static void kvm_mmu_split_huge_page(struct kvm *kvm,
> +				    const struct kvm_memory_slot *slot,
> +				    u64 *huge_sptep, struct kvm_mmu_page **spp)
> +
> +{
> +	struct kvm_mmu_memory_cache *cache = &kvm->arch.huge_page_split_desc_cache;
> +	u64 huge_spte = READ_ONCE(*huge_sptep);
> +	struct kvm_mmu_page *sp;
> +	bool flush = false;
> +	u64 *sptep, spte;
> +	gfn_t gfn;
> +	int index;
> +
> +	sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
> +
> +	for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
> +		sptep = &sp->spt[index];
> +		gfn = kvm_mmu_page_get_gfn(sp, index);
> +
> +		/*
> +		 * sp may have populated page table entries, e.g. if this huge
> +		 * page is aliased by multiple sptes with the same access
> +		 * permissions. We know the sptes will be mapping the same
> +		 * gfn-to-pfn translation since sp is direct. However, a given
> +		 * spte may point to an even lower level page table. We don't
> +		 * know if that lower level page table is completely filled in,
> +		 * i.e. we may be effectively unmapping a region of memory, so
> +		 * we must flush the TLB.

Random side topic, please avoid "we" and other pronouns in comments and changelogs,
it gets real easy to lose track of what a pronoun is referring to, especially in
changelogs where "we" might be KVM, might be the kernel, might be the team that's
using the patch, might be an author that's prone to illeism, etc...

* Re: [PATCH v3 00/23] KVM: Extend Eager Page Splitting to the shadow MMU
  2022-04-01 17:55 ` David Matlack
@ 2022-04-11 17:12   ` Sean Christopherson
  -1 siblings, 0 replies; 88+ messages in thread
From: Sean Christopherson @ 2022-04-11 17:12 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 01, 2022, David Matlack wrote:
> This series extends KVM's Eager Page Splitting to also split huge pages
> mapped by the shadow MMU, i.e. huge pages present in the memslot rmaps.
> This will be useful for configurations that use Nested Virtualization,
> disable the TDP MMU, or disable/lack TDP hardware support.
> 
> For background on Eager Page Splitting, see:
>  - Proposal: https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/
>  - TDP MMU support: https://lore.kernel.org/kvm/20220119230739.2234394-1-dmatlack@google.com/
> 
> Splitting huge pages mapped by the shadow MMU is more complicated than
> the TDP MMU, but it is also more important for performance as the shadow
> MMU handles huge page write-protection faults under the write lock.  See
> the Performance section for more details.
> 
> The extra complexity of splitting huge pages mapped by the shadow MMU
> comes from a few places:

I think we should restrict eager page splitting to the TDP MMU being enabled,
i.e. restrict shadow MMU support to nested MMUs.

A decent chunk of the churn and complexity in this series comes from having to
deal with the intersection of things no one cares about in practice (!TDP shadow
paging), and/or things we should be putting into maintenance-only mode (legacy MMU
with TDP enabled).  I see zero reason to support this for legacy shadow paging
without a very concrete, very performance sensitive use case, and the legacy MMU
with TDP should be a hard "no".

With those out of the way, unsync support can also be jettisoned, because barring
use cases I don't know about, hypervisors don't modify TDP entries in the same way
that kernels modify native page tables, i.e. don't benefit from allowing SPTEs to
go unsync.

The other feature that I think we should deprecate (which I'm pretty sure someone on
our team, maybe even you, is planning on proposing upstream) is support for zapping
KVM shadow pages for the shrinker.  In hindsight, we should have done that a few
years ago instead of fixing the bug that made KVM's support meaningful (see commit
ebdb292dac79 ("KVM: x86/mmu: Batch zap MMU pages when shrinking the slab")).  Doing
that for nested MMUs only (or at least first) should be less controversial.

The other thing we want to do sooner than later is improve the scalability of the
nested MMU.  A relatively simple way to pick some juicy low hanging fruit, if we
drop the aforementioned features we don't actually need for nested MMUs, would be
to turn all of the tracking structures needed for handling a page fault into
per-root lists/structures, e.g. active_mmu_pages and mmu_page_hash.  Unless L1 is
doing something funky, there is unlikely to be overlap between nested TDP page
tables, i.e. per-root tracking shouldn't cause a memory explosion.

At that point, as a first step/stopgap toward a more scalable nested MMU implementation,
nested TDP page faults, zapping of obsolete pages (memslot updates), and eager page
splitting (I think) can take mmu_lock for read and then take a per-root spinlock.

At a bare minimum, taking mmu_lock for read would prevent a nested vCPU from blocking
the TDP MMU, which in itself should be a big win.  Zapping after a memslot update
would not interfere at all with re-faulting memory since zapping the obsolete roots
would never get a lock conflict.  And for use cases that spin up a large number of small
L2 VMs, per-root locking will allow KVM to handle page faults for each L2 in parallel,
which could be a huge performance boost for select use cases.

Circling back to eager page splitting, this series could be reworked to take the
first step of forking FNAME(page_fault), FNAME(fetch) and kvm_mmu_get_page() in
order to provide the necessary path for reworking nested MMU page faults.  Then it
can remove unsync and shrinker support for nested MMUs.  With those gone,
dissecting the nested MMU variant of kvm_mmu_get_page() should be simpler/cleaner
than dealing with the existing kvm_mmu_get_page(), i.e. should eliminate at least
some of the complexity/churn.

> Performance
> -----------
> 
> To measure the performance impact of Eager Page Splitting I ran
> dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
> vCPU, and backed by 1GiB HugeTLB memory. The amount of memory that was
> written to versus read was controlled with the -f option.
> 
> To measure the impact on customer performance, we can look at the time
> it takes all vCPUs to dirty memory after dirty logging has been enabled.
> Without Eager Page Splitting enabled, such dirtying must take faults to
> split huge pages and bottleneck on the MMU lock.
> 
>              | Config: ept=Y, tdp_mmu=N, 100% writes                   |
>              | Config: ept=Y, tdp_mmu=N, 100% writes                   |
>              | Config: ept=Y, tdp_mmu=N, 100% writes initially-all-set |
>              | Config: ept=Y, tdp_mmu=N, 100% writes initially-all-set |
>              | Config: ept=N, tdp_mmu=Y, 100% writes                   |
>              | Config: ept=N, tdp_mmu=Y, 50% writes                    |
>              | Config: ept=N, tdp_mmu=Y, 5% writes                     |

IMO, to justify this there needs to be performance numbers for ept=Y, tdp_mmu=Y,
i.e. for the use case we actually care about.  I don't expect the outcome to be
any different, but it really should be explicitly tested.


* Re: [PATCH v3 00/23] KVM: Extend Eager Page Splitting to the shadow MMU
  2022-04-11 17:12   ` Sean Christopherson
@ 2022-04-11 17:54     ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-11 17:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Mon, Apr 11, 2022 at 10:12 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 01, 2022, David Matlack wrote:
> > This series extends KVM's Eager Page Splitting to also split huge pages
> > mapped by the shadow MMU, i.e. huge pages present in the memslot rmaps.
> > This will be useful for configurations that use Nested Virtualization,
> > disable the TDP MMU, or disable/lack TDP hardware support.
> >
> > For background on Eager Page Splitting, see:
> >  - Proposal: https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/
> >  - TDP MMU support: https://lore.kernel.org/kvm/20220119230739.2234394-1-dmatlack@google.com/
> >
> > Splitting huge pages mapped by the shadow MMU is more complicated than
> > the TDP MMU, but it is also more important for performance as the shadow
> > MMU handles huge page write-protection faults under the write lock.  See
> > the Performance section for more details.
> >
> > The extra complexity of splitting huge pages mapped by the shadow MMU
> > comes from a few places:
>
> I think we should restrict eager page splitting to the TDP MMU being enabled,
> i.e. restrict shadow MMU support to nested MMUs.
>
> A decent chunk of the churn and complexity in this series comes from having to
> deal with the intersection of things no one cares about in practice (!TDP shadow
> paging), and/or things we should be putting into maintenance-only mode (legacy MMU
> with TDP enabled).  I see zero reason to support this for legacy shadow paging
> without a very concrete, very performance sensitive use case, and the legacy MMU
> with TDP should be a hard "no".
>
> With those out of the way, unsync support can also be jettisoned, because barring
> use cases I don't know about, hypervisors don't modify TDP entries in the same way
> that kernels modify native page tables, i.e. don't benefit from allowing SPTEs to
> go unsync.
>
> The other feature that I think we should deprecate (which I'm pretty sure someone on
> our team, maybe even you, is planning on proposing upstream) is support for zapping
> KVM shadow pages for the shrinker.  In hindsight, we should have done that a few
> years ago instead of fixing the bug that made KVM's support meaningful (see commit
> ebdb292dac79 ("KVM: x86/mmu: Batch zap MMU pages when shrinking the slab")).  Doing
> that for nested MMUs only (or at least first) should be less controversial.
>
> The other thing we want to do sooner than later is improve the scalability of the
> nested MMU.  A relatively simple way to pick some juicy low hanging fruit, if we
> drop the aforementioned features we don't actually need for nested MMUs, would be
> to turn all of the tracking structures needed for handling a page fault into
> per-root lists/structures, e.g. active_mmu_pages and mmu_page_hash.  Unless L1 is
> doing something funky, there is unlikely to be overlap between nested TDP page
> tables, i.e. per-root tracking shouldn't cause a memory explosion.
>
> At that point, as a first step/stopgap toward a more scalable nested MMU implementation,
> nested TDP page faults, zapping of obsolete pages (memslot updates), and eager page
> splitting (I think) can take mmu_lock for read and then take a per-root spinlock.
>
> At a bare minimum, taking mmu_lock for read would prevent a nested vCPU from blocking
> the TDP MMU, which in itself should be a big win.  Zapping after a memslot updates
> would not interfere at all with re-faulting memory since zapping the obsolete roots
> would never get a lock conflict.  And for use cases that spin up a large number of small
> L2 VMs, per-root locking will allow KVM to handle page faults for each L2 in parallel,
> which could be a huge performance boost for select use cases.
>
> Circling back to eager page splitting, this series could be reworked to take the
> first step of forking FNAME(page_fault), FNAME(fetch) and kvm_mmu_get_page() in
> order to provide the necessary path for reworking nested MMU page faults.  Then it
> can remove unsync and shrinker support for nested MMUs.  With those gone,
> dissecting the nested MMU variant of kvm_mmu_get_page() should be simpler/cleaner
> than dealing with the existing kvm_mmu_get_page(), i.e. should eliminate at least
> some of the complexity/churn.

These sound like useful improvements but I am not really seeing the
value of sequencing them before this series:

 - IMO the "churn" in patches 1-14 is a net improvement to the
existing code. They improve readability by decomposing the shadow page
creation path into smaller functions with better names, reduce the
amount of redundant calculations, and reduce the dependence on struct
kvm_vcpu where it is not needed. Even if eager page splitting is
completely dropped I think they would be useful to merge.

 - Patches 15-21 are necessary complexity to support eager page
splitting, but wouldn't change at all if this splitting was specific
to splitting nested MMUs.

 - Outside of patches 1-14, unsync really doesn't play a role other
than to skip splitting if sp->unsync is true. But as you pointed out
in patch 22, that check can already be dropped since SPs with roles
>4k are never marked unsync.

I'd be fine with limiting eager page splitting to tdp_mmu=Y since
nested is the primary use-case for Google and I agree TDP with the
shadow MMU should be phased out. This would be an artificial
limitation in the short term, but I imagine it would make all those
improvements easier to make down the road.

>
> > Performance
> > -----------
> >
> > To measure the performance impact of Eager Page Splitting I ran
> > dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
> > vCPU, and backed by 1GiB HugeTLB memory. The amount of memory that was
> > written to versus read was controlled with the -f option.
> >
> > To measure the impact on customer performance, we can look at the time
> > it takes all vCPUs to dirty memory after dirty logging has been enabled.
> > Without Eager Page Splitting enabled, such dirtying must take faults to
> > split huge pages and bottleneck on the MMU lock.
> >
> >              | Config: ept=Y, tdp_mmu=N, 100% writes                   |
> >              | Config: ept=Y, tdp_mmu=N, 100% writes                   |
> >              | Config: ept=Y, tdp_mmu=N, 100% writes initially-all-set |
> >              | Config: ept=Y, tdp_mmu=N, 100% writes initially-all-set |
> >              | Config: ept=N, tdp_mmu=Y, 100% writes                   |
> >              | Config: ept=N, tdp_mmu=Y, 50% writes                    |
> >              | Config: ept=N, tdp_mmu=Y, 5% writes                     |
>
> IMO, to justify this there needs to be performance numbers for ept=Y, tdp_mmu=Y,
> i.e. for the use case we actually care about.  I don't expect the outcome to be
> any different, but it really should be explicitly tested.

That's a fair request I guess. There should be no difference in
performance from the ept=N results but it requires a lot more effort to
rig up, which is why I tested this way.

I'll look into collecting some results with nested MMUs. On the plus
side, better selftests support for nested MMUs will be useful as the
various improvements you suggested are implemented.



* Re: [PATCH v3 00/23] KVM: Extend Eager Page Splitting to the shadow MMU
  2022-04-11 17:54     ` David Matlack
@ 2022-04-11 20:12       ` Sean Christopherson
  -1 siblings, 0 replies; 88+ messages in thread
From: Sean Christopherson @ 2022-04-11 20:12 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Mon, Apr 11, 2022, David Matlack wrote:
> On Mon, Apr 11, 2022 at 10:12 AM Sean Christopherson <seanjc@google.com> wrote:
> > Circling back to eager page splitting, this series could be reworked to take the
> > first step of forking FNAME(page_fault), FNAME(fetch) and kvm_mmu_get_page() in
> > order to provide the necessary path for reworking nested MMU page faults.  Then it
> > can remove unsync and shrinker support for nested MMUs.  With those gone,
> > dissecting the nested MMU variant of kvm_mmu_get_page() should be simpler/cleaner
> > than dealing with the existing kvm_mmu_get_page(), i.e. should eliminate at least
> > some of the complexity/churn.
> 
> These sound like useful improvements but I am not really seeing the
> value of sequencing them before this series:
> 
>  - IMO the "churn" in patches 1-14 are a net improvement to the
> existing code. They improve readability by decomposing the shadow page
> creation path into smaller functions with better names, reduce the
> amount of redundant calculations, and reduce the dependence on struct
> kvm_vcpu where it is not needed. Even if eager page splitting is
> completely dropped I think they would be useful to merge.

I definitely like some of patches 1-14, probably most after a few read throughs.
But there are key parts that I do not like that are motivated almost entirely by
the desire to support page splitting.  Specifically, I don't like splitting the
logic of finding a page, and I don't like having a separate alloc vs. initializer
(though I'm guessing this will be needed somewhere to split huge pages for nested
MMUs).

E.g. I'd prefer the "get" flow look like the below (completely untested, for
discussion purposes only).  There's still churn, but the core loop is almost
entirely unchanged.

And it's not just this series: I don't want future improvements to nested TDP to
have to deal with the legacy baggage.

Waaaay off topic, why do we still bother with stat.max_mmu_page_hash_collisions?
I assume it was originally added to tune the hashing logic?  At this point is it
anything but wasted cycles?

static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
						     gfn_t gfn,
						     unsigned int gfn_hash,
						     union kvm_mmu_page_role role)
{
	struct kvm *kvm = vcpu->kvm;
	struct hlist_head *sp_list = &kvm->arch.mmu_page_hash[gfn_hash];
	struct kvm_mmu_page *sp;
	LIST_HEAD(invalid_list);
	int collisions = 0;

	for_each_valid_sp(kvm, sp, sp_list) {
		if (sp->gfn != gfn) {
			collisions++;
			continue;
		}

		if (sp->role.word != role.word) {
			/*
			 * If the guest is creating an upper-level page, zap
			 * unsync pages for the same gfn.  While it's possible
			 * the guest is using recursive page tables, in all
			 * likelihood the guest has stopped using the unsync
			 * page and is installing a completely unrelated page.
			 * Unsync pages must not be left as is, because the new
			 * upper-level page will be write-protected.
			 */
			if (role.level > PG_LEVEL_4K && sp->unsync)
				kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);

			continue;
		}

		/* unsync and write-flooding only apply to indirect SPs. */
		if (sp->role.direct)
			goto out;

		if (sp->unsync) {
			/*
			 * The page is good, but is stale.  kvm_sync_page does
			 * get the latest guest state, but (unlike mmu_unsync_children)
			 * it doesn't write-protect the page or mark it synchronized!
			 * This way the validity of the mapping is ensured, but the
			 * overhead of write protection is not incurred until the
			 * guest invalidates the TLB mapping.  This allows multiple
			 * SPs for a single gfn to be unsync.
			 *
			 * If the sync fails, the page is zapped.  If so, break
			 * in order to rebuild it.
			 */
			if (!kvm_sync_page(vcpu, sp, &invalid_list))
				break;

			WARN_ON(!list_empty(&invalid_list));
			kvm_flush_remote_tlbs(vcpu->kvm);
		}

		__clear_sp_write_flooding_count(sp);
		goto out;
	}

	sp = NULL;

out:
	if (collisions > kvm->stat.max_mmu_page_hash_collisions)
		kvm->stat.max_mmu_page_hash_collisions = collisions;

	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
	return sp;
}

static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
						      gfn_t gfn,
						      unsigned int gfn_hash,
						      union kvm_mmu_page_role role)
{
	struct kvm *kvm = vcpu->kvm;
	struct kvm_mmu_page *sp = __kvm_mmu_alloc_shadow_page(vcpu, role.direct);
	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
	struct hlist_head *sp_list = &kvm->arch.mmu_page_hash[gfn_hash];

	++kvm->stat.mmu_cache_miss;

	sp->gfn = gfn;
	sp->role = role;
	sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;

	/*
	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
	 * depends on valid pages being added to the head of the list.  See
	 * comments in kvm_zap_obsolete_pages().
	 */
	list_add(&sp->link, &kvm->arch.active_mmu_pages);
	kvm_mod_used_mmu_pages(kvm, 1);

	hlist_add_head(&sp->hash_link, sp_list);

	if (!role.direct)
		account_shadowed(kvm, slot, sp);

	return sp;
}


static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
						    gfn_t gfn,
						    union kvm_mmu_page_role role)
{
	unsigned int gfn_hash = kvm_page_table_hashfn(gfn);
	struct kvm_mmu_page *sp;
	bool created = false;

	sp = kvm_mmu_find_shadow_page(vcpu, gfn, gfn_hash, role);
	if (!sp) {
		created = true;
		sp = kvm_mmu_alloc_shadow_page(vcpu, gfn, gfn_hash, role);
	}

	trace_kvm_mmu_get_page(sp, created);
	return sp;
}
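
[Editor's note] The find-or-create shape of the sketch above is, at its core, a
hash-bucket lookup with fall-back allocation, including the collision and
cache-miss accounting. A minimal userspace analogue of that flow (hypothetical
names, no KVM internals, for illustration only):

```c
#include <assert.h>
#include <stdlib.h>

#define HASH_BUCKETS 64

struct page {
	unsigned long gfn;
	struct page *next;	/* hash-bucket chain */
};

struct mmu {
	struct page *buckets[HASH_BUCKETS];
	unsigned long cache_miss;
	unsigned long max_collisions;
};

static unsigned int hashfn(unsigned long gfn)
{
	return gfn % HASH_BUCKETS;
}

/* Walk the bucket, counting entries that hash alike but don't match. */
static struct page *find_page(struct mmu *mmu, unsigned long gfn)
{
	unsigned long collisions = 0;
	struct page *p;

	for (p = mmu->buckets[hashfn(gfn)]; p; p = p->next) {
		if (p->gfn != gfn) {
			collisions++;
			continue;
		}
		break;
	}
	if (collisions > mmu->max_collisions)
		mmu->max_collisions = collisions;
	return p;
}

/* Allocate a new entry and link it at the head of its bucket. */
static struct page *alloc_page(struct mmu *mmu, unsigned long gfn)
{
	struct page *p = calloc(1, sizeof(*p));
	unsigned int h = hashfn(gfn);

	assert(p);
	mmu->cache_miss++;
	p->gfn = gfn;
	p->next = mmu->buckets[h];
	mmu->buckets[h] = p;
	return p;
}

/* The "get" flow: find, else allocate, as in the sketch above. */
static struct page *get_page(struct mmu *mmu, unsigned long gfn)
{
	struct page *p = find_page(mmu, gfn);

	if (!p)
		p = alloc_page(mmu, gfn);
	return p;
}
```

The KVM version layers role matching, unsync handling, and zapping on top of
this skeleton, but the control flow of kvm_mmu_get_shadow_page() is the same.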

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v3 00/23] KVM: Extend Eager Page Splitting to the shadow MMU
  2022-04-11 20:12       ` Sean Christopherson
@ 2022-04-11 23:41         ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-11 23:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Mon, Apr 11, 2022 at 1:12 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Apr 11, 2022, David Matlack wrote:
> > On Mon, Apr 11, 2022 at 10:12 AM Sean Christopherson <seanjc@google.com> wrote:
> > > Circling back to eager page splitting, this series could be reworked to take the
> > > first step of forking FNAME(page_fault), FNAME(fetch) and kvm_mmu_get_page() in
> > > order to provide the necessary path for reworking nested MMU page faults.  Then it
> > > can remove unsync and shrinker support for nested MMUs.  With those gone,
> > > dissecting the nested MMU variant of kvm_mmu_get_page() should be simpler/cleaner
> > > than dealing with the existing kvm_mmu_get_page(), i.e. should eliminate at least
> > > some of the complexity/churn.
> >
> > These sound like useful improvements but I am not really seeing the
> > value of sequencing them before this series:
> >
> >  - IMO the "churn" in patches 1-14 are a net improvement to the
> > existing code. They improve readability by decomposing the shadow page
> > creation path into smaller functions with better names, reduce the
> > amount of redundant calculations, and reduce the dependence on struct
> > kvm_vcpu where it is not needed. Even if eager page splitting is
> > completely dropped I think they would be useful to merge.
>
> I definitely like some of patches 1-14, probably most after a few read throughs.
> But there are key parts that I do not like that are motivated almost entirely by
> the desire to support page splitting.  Specifically, I don't like splitting the
> logic of finding a page, and I don't like having a separate alloc vs. initializer
> (though I'm guessing this will be needed somewhere to split huge pages for nested
> MMUs).
>
> E.g. I'd prefer the "get" flow look like the below (completely untested, for
> discussion purposes only).  There's still churn, but the core loop is almost
> entirely unchanged.
>
> And it's not just this series, I don't want future improvements nested TDP to have
> to deal with the legacy baggage.

One thing that would be helpful is if you can explain in a bit more
specifically what you'd like to see. Part of the reason why I prefer
to sequence your proposal after eager page splitting is that I do not
fully understand what you're proposing, and how complex it would be.
e.g. Forking FNAME(fetch), FNAME(page_fault), and kvm_mmu_get_page()
for nested MMUs does not sound like less churn.

From my perspective, this series is a net improvement to the
readability and maintainability of existing code, while adding a
performance improvement (eager page splitting). All of the changes you
are proposing can still be implemented on top if and when they become
a priority (including merging {,__}kvm_find_shadow_page()). And if we
limit eager page splitting to nested MMUs, we don't have to worry
about maintaining eager page splitting with TDP shadow MMU or legacy
shadow paging over time.


>
> Waaaay off topic, why do we still bother with stat.max_mmu_page_hash_collision?
> I assume it was originally added to tune the hashing logic?  At this point is it
> anything but wasted cycles?
>
> static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
>                                                      gfn_t gfn,
>                                                      unsigned int gfn_hash,
>                                                      union kvm_mmu_page_role role)
> {
>         struct hlist_head *sp_list = &kvm->arch.mmu_page_hash[gfn_hash];
>         struct kvm_mmu_page *sp;
>         LIST_HEAD(invalid_list);
>
>         int collisions = 0;
>
>         for_each_valid_sp(kvm, sp, sp_list) {
>                 if (sp->gfn != gfn) {
>                         collisions++;
>                         continue;
>                 }
>
>                 if (sp->role.word != role.word) {
>                         /*
>                          * If the guest is creating an upper-level page, zap
>                          * unsync pages for the same gfn.  While it's possible
>                          * the guest is using recursive page tables, in all
>                          * likelihood the guest has stopped using the unsync
>                          * page and is installing a completely unrelated page.
>                          * Unsync pages must not be left as is, because the new
>                          * upper-level page will be write-protected.
>                          */
>                         if (role.level > PG_LEVEL_4K && sp->unsync)
>                                 kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
>
>                         continue;
>                 }
>
>                 /* unsync and write-flooding only apply to indirect SPs. */
>                 if (sp->role.direct)
>                         goto out;
>
>                 if (sp->unsync) {
>                         /*
>                          * The page is good, but is stale.  kvm_sync_page does
>                          * get the latest guest state, but (unlike mmu_unsync_children)
>                          * it doesn't write-protect the page or mark it synchronized!
>                          * This way the validity of the mapping is ensured, but the
>                          * overhead of write protection is not incurred until the
>                          * guest invalidates the TLB mapping.  This allows multiple
>                          * SPs for a single gfn to be unsync.
>                          *
>                          * If the sync fails, the page is zapped.  If so, break
>                          * in order to rebuild it.
>                          */
>                         if (!kvm_sync_page(vcpu, sp, &invalid_list))
>                                 break;
>
>                         WARN_ON(!list_empty(&invalid_list));
>                         kvm_flush_remote_tlbs(vcpu->kvm);
>                 }
>
>                 __clear_sp_write_flooding_count(sp);
>                 goto out;
>         }
>
>         sp = NULL;
>
> out:
>         if (collisions > kvm->stat.max_mmu_page_hash_collisions)
>                 kvm->stat.max_mmu_page_hash_collisions = collisions;
>
>         kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
>         return sp;
> }
>
> static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
>                                                       gfn_t gfn,
>                                                       unsigned int gfn_hash,
>                                                       union kvm_mmu_page_role role)
> {
>         struct kvm_mmu_page *sp = __kvm_mmu_alloc_shadow_page(vcpu, role.direct);
>         struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>         struct hlist_head *sp_list = &kvm->arch.mmu_page_hash[gfn_hash];
>
>         ++kvm->stat.mmu_cache_miss;
>
>         sp->gfn = gfn;
>         sp->role = role;
>         sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;
>
>         /*
>          * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
>          * depends on valid pages being added to the head of the list.  See
>          * comments in kvm_zap_obsolete_pages().
>          */
>         list_add(&sp->link, &kvm->arch.active_mmu_pages);
>         kvm_mod_used_mmu_pages(kvm, 1);
>
>         sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>         hlist_add_head(&sp->hash_link, sp_list);
>
>         if (!role.direct)
>                 account_shadowed(kvm, slot, sp);
> }
>
>
> static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
>                                                     gfn_t gfn,
>                                                     union kvm_mmu_page_role role)
> {
>         unsigned int gfn_hash = kvm_page_table_hashfn(gfn);
>         struct kvm_mmu_page *sp;
>         bool created = false;
>
>         sp = kvm_mmu_find_shadow_page(vcpu, gfn, gfn_hash, role);
>         if (!sp) {
>                 created = true;
>                 sp = kvm_mmu_alloc_shadow_page(vcpu, gfn, gfn_hash, role);
>         }
>
>         trace_kvm_mmu_get_page(sp, created);
>         return sp;
> }

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v3 00/23] KVM: Extend Eager Page Splitting to the shadow MMU
  2022-04-11 23:41         ` David Matlack
@ 2022-04-12  0:39           ` Sean Christopherson
  -1 siblings, 0 replies; 88+ messages in thread
From: Sean Christopherson @ 2022-04-12  0:39 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Mon, Apr 11, 2022, David Matlack wrote:
> On Mon, Apr 11, 2022 at 1:12 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Apr 11, 2022, David Matlack wrote:
> > > On Mon, Apr 11, 2022 at 10:12 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > Circling back to eager page splitting, this series could be reworked to take the
> > > > first step of forking FNAME(page_fault), FNAME(fetch) and kvm_mmu_get_page() in
> > > > order to provide the necessary path for reworking nested MMU page faults.  Then it
> > > > can remove unsync and shrinker support for nested MMUs.  With those gone,
> > > > dissecting the nested MMU variant of kvm_mmu_get_page() should be simpler/cleaner
> > > > than dealing with the existing kvm_mmu_get_page(), i.e. should eliminate at least
> > > > some of the complexity/churn.
> > >
> > > These sound like useful improvements but I am not really seeing the
> > > value of sequencing them before this series:
> > >
> > >  - IMO the "churn" in patches 1-14 are a net improvement to the
> > > existing code. They improve readability by decomposing the shadow page
> > > creation path into smaller functions with better names, reduce the
> > > amount of redundant calculations, and reduce the dependence on struct
> > > kvm_vcpu where it is not needed. Even if eager page splitting is
> > > completely dropped I think they would be useful to merge.
> >
> > I definitely like some of patches 1-14, probably most after a few read throughs.
> > But there are key parts that I do not like that are motivated almost entirely by
> > the desire to support page splitting.  Specifically, I don't like splitting the
> > logic of finding a page, and I don't like having a separate alloc vs. initializer
> > (though I'm guessing this will be needed somewhere to split huge pages for nested
> > MMUs).
> >
> > E.g. I'd prefer the "get" flow look like the below (completely untested, for
> > discussion purposes only).  There's still churn, but the core loop is almost
> > entirely unchanged.
> >
> > And it's not just this series, I don't want future improvements to nested TDP to have
> > to deal with the legacy baggage.
> 
> One thing that would be helpful is if you can explain in a bit more
> specifically what you'd like to see. Part of the reason why I prefer
> to sequence your proposal after eager page splitting is that I do not
> fully understand what you're proposing, and how complex it would be.
> e.g. Forking FNAME(fetch), FNAME(page_fault), and kvm_mmu_get_page()
> for nested MMUs does not sound like less churn.

Oh, it's most definitely not less code, and probably more churn.  But, it's churn
that pushes us in a more favorable direction and that is desirable long term.  I
don't mind churning code, but I want the churn to make future life easier, not
harder.  Details below.

> From my perspective, this series is a net improvement to the
> readability and maintainability of existing code, while adding a
> performance improvement (eager page splitting). All of the changes you
> are proposing can still be implemented on top if

They can be implemented on top, but I want to avoid inheriting complexity we
don't actually want/need, unsync support being the most notable.

What I mean by "fork" is that after the cleanups that make sense irrespective of
eager page splitting, we make a copy of FNAME(page_fault) and add FNAME(get_shadow_page),
extracting common logic where we can and probably doing something fancy to avoid
having multiple copies of FNAME(get_shadow_page).  Looking again at the code, it's
probably best to keep FNAME(fetch), at least for now, as it's only the single unsync
check that we can purge at this point.

That gives us e.g. FNAME(nested_page_fault) that supports EPT and 64-bit NPT, and
a nested TDP specific get_shadow_page().

Then we rip out the unsync stuff for nested MMUs, which is quite clean because we
can key off of tdp_enabled.  It'll leave dead code for 32-bit hosts running nested
VMs, but I highly doubt anyone will notice the perf hit.

At that point, dissect kvm_nested_mmu_get_page() for eager page splitting and
continue on.

It's not drastically different than what you have now, but it avoids the nastiness
around unsync pages, e.g. I'm pretty sure kvm_mmu_alloc_shadow_page() can be reused
as I proposed and the "find" becomes something like:

static struct kvm_mmu_page *kvm_mmu_nested_tdp_find_sp(struct kvm_vcpu *vcpu,
						       gfn_t gfn,
						       unsigned int gfn_hash,
						       union kvm_mmu_page_role role)
{
	struct kvm *kvm = vcpu->kvm;
	struct hlist_head *sp_list = &kvm->arch.mmu_page_hash[gfn_hash];
	struct kvm_mmu_page *sp;

	for_each_valid_sp(kvm, sp, sp_list) {
		if (sp->gfn != gfn || sp->role.word != role.word)
			continue;

		__clear_sp_write_flooding_count(sp);
		return sp;
	}

	return NULL;
}
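To pin down the intended semantics, the lookup boils down to something like the following self-contained toy model (the `toy_*` names and types are purely illustrative stand-ins, not kernel code): pages are keyed by (gfn, full role word), and a hit resets the write-flooding count.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long gfn_t;

/* Stand-in for kvm_mmu_page: the bucket chain models the hlist. */
struct toy_sp {
	gfn_t gfn;
	unsigned long role_word;	/* models role.word */
	int write_flooding_count;
	struct toy_sp *next;		/* models the hash-bucket chain */
};

/* Walk one hash bucket; a page matches only if both the gfn and the
 * entire role word match.  A hit clears the write-flooding count,
 * modeling __clear_sp_write_flooding_count(). */
static struct toy_sp *toy_find_sp(struct toy_sp *bucket, gfn_t gfn,
				  unsigned long role_word)
{
	struct toy_sp *sp;

	for (sp = bucket; sp; sp = sp->next) {
		if (sp->gfn != gfn || sp->role_word != role_word)
			continue;
		sp->write_flooding_count = 0;
		return sp;
	}
	return NULL;
}
```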

Having the separate page fault and get_shadow_page(), without the baggage of unsync
in particular, sets us up for switching to taking mmu_lock for read, and in the
distant future, implementing whatever new scheme someone concocts for shadowing
nested TDP.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v3 00/23] KVM: Extend Eager Page Splitting to the shadow MMU
  2022-04-12  0:39           ` Sean Christopherson
@ 2022-04-12 16:49             ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-12 16:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Mon, Apr 11, 2022 at 5:39 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Apr 11, 2022, David Matlack wrote:
> >
> > One thing that would be helpful is if you can explain in a bit more
> > specifically what you'd like to see. Part of the reason why I prefer
> > to sequence your proposal after eager page splitting is that I do not
> > fully understand what you're proposing, and how complex it would be.
> > e.g. Forking FNAME(fetch), FNAME(page_fault), and kvm_mmu_get_page()
> > for nested MMUs does not sound like less churn.
>
> Oh, it's most definitely not less code, and probably more churn.  But, it's churn
> that pushes us in a more favorable direction and that is desirable long term.  I
> don't mind churning code, but I want the churn to make future life easier, not
> harder.  Details below.

Of course. Let's make sure we're on the same page about which churn
introduced by this series would make future life harder, i.e. what we hope
to avoid. If I understand you correctly, it's the following two changes:

 (a.) Using separate functions to allocate SPs and initialize SPs.
 (b.) Separating kvm_mmu_find_shadow_page() from __kvm_mmu_find_shadow_page().

(a.) stems from the fact that SP allocation during eager page
splitting is made directly rather than through kvm_mmu_memory_caches,
which was what you pushed for in the TDP MMU implementation. We could
instead use kvm_mmu_memory_caches for the shadow MMU eager page
splitting to eliminate (a.). But otherwise (a.) is necessary
complexity of eager page splitting because it needs to allocate SPs
differently from the vCPU fault path.

As for (b.), see below...

>
> > From my perspective, this series is a net improvement to the
> > readability and maintainability of existing code, while adding a
> > performance improvement (eager page splitting). All of the changes you
> > are proposing can still be implemented on top if
>
> They can be implemented on top, but I want to avoid inheriting complexity we
> don't actually want/need, unsync support being the most notable.
>
> What I mean by "fork" is that after the cleanups that make sense irrespective of
> eager page splitting, we make a copy of FNAME(page_fault) and add FNAME(get_shadow_page),
> extracting common logic where we can and probably doing something fancy to avoid
> having multiple copies of FNAME(get_shadow_page).  Looking again at the code, it's
> probably best to keep FNAME(fetch), at least for now, as it's only the single unsync
> check that we can purge at this point.
>
> That gives us e.g. FNAME(nested_page_fault) that supports EPT and 64-bit NPT, and
> a nested TDP specific get_shadow_page().
>
> Then we rip out the unsync stuff for nested MMUs, which is quite clean because we
> can key off of tdp_enabled.  It'll leave dead code for 32-bit hosts running nested
> VMs, but I highly doubt anyone will notice the perf hit.
>
> At that point, dissect kvm_nested_mmu_get_page() for eager page splitting and
> continue on.
>
> It's not drastically different than what you have now, but it avoids the nastiness
> around unsync pages, e.g. I'm pretty sure kvm_mmu_alloc_shadow_page() can be reused
> as I proposed and the "find" becomes something like:
>
> static struct kvm_mmu_page *kvm_mmu_nested_tdp_find_sp(struct kvm_vcpu *vcpu,
>                                                        gfn_t gfn,
>                                                        unsigned int gfn_hash,
>                                                        union kvm_mmu_page_role role)
> {
>         struct kvm *kvm = vcpu->kvm;
>         struct hlist_head *sp_list = &kvm->arch.mmu_page_hash[gfn_hash];
>         struct kvm_mmu_page *sp;
>
>         for_each_valid_sp(kvm, sp, sp_list) {
>                 if (sp->gfn != gfn || sp->role.word != role.word)
>                         continue;
>
>                 __clear_sp_write_flooding_count(sp);
>                 return sp;
>         }
>
>         return NULL;
> }

IIUC all of this would be to avoid separating
kvm_mmu_find_shadow_page() from __kvm_mmu_find_shadow_page(), correct?
i.e. Nested MMUs would have their own "find" function, which is called
by eager page splitting, and thus no separate
__kvm_mmu_find_shadow_page().

But __kvm_mmu_find_shadow_page(), as implemented in this series, is
about 90% similar to what you proposed for
kvm_mmu_nested_tdp_find_sp(). And in fact it would work correctly to
use __kvm_mmu_find_shadow_page() for nested MMUs, since we know the
sp->unsync condition would just be skipped.

So even if we did everything you proposed (which seems like an awful
lot just to avoid __kvm_mmu_find_shadow_page()), there's a chance we
would still end up with the exact same code. i.e.
kvm_mmu_nested_tdp_find_sp() would be implemented by calling
__kvm_mmu_find_shadow_page(), because it would be a waste to
re-implement an almost identical function?

>
> Having the separate page fault and get_shadow_page(), without the baggage of unsync
> in particular, sets us up for switching to taking mmu_lock for read, and in the
> distant future, implementing whatever new scheme someone concocts for shadowing
> nested TDP.

Taking MMU read lock with per-root spinlocks for nested MMUs is a
great idea btw. I think it would be a great improvement.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v3 00/23] KVM: Extend Eager Page Splitting to the shadow MMU
  2022-04-12 16:49             ` David Matlack
@ 2022-04-13  1:02               ` Sean Christopherson
  -1 siblings, 0 replies; 88+ messages in thread
From: Sean Christopherson @ 2022-04-13  1:02 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Tue, Apr 12, 2022, David Matlack wrote:
> On Mon, Apr 11, 2022 at 5:39 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Apr 11, 2022, David Matlack wrote:
> > >
> > > One thing that would be helpful is if you can explain in a bit more
> > > specifically what you'd like to see. Part of the reason why I prefer
> > > to sequence your proposal after eager page splitting is that I do not
> > > fully understand what you're proposing, and how complex it would be.
> > > e.g. Forking FNAME(fetch), FNAME(page_fault), and kvm_mmu_get_page()
> > > for nested MMUs does not sound like less churn.
> >
> > Oh, it's most definitely not less code, and probably more churn.  But, it's churn
> > that pushes us in a more favorable direction and that is desirable long term.  I
> > don't mind churning code, but I want the churn to make future life easier, not
> > harder.  Details below.
> 
> Of course. Let's make sure we're on the same page about what churn
> introduced by this series will make future life harder that we hope to
> avoid. If I understand you correctly, it's the following 2 changes:
> 
>  (a.) Using separate functions to allocate SPs and initialize SPs.
>  (b.) Separating kvm_mmu_find_shadow_page() from __kvm_mmu_find_shadow_page().
> 
> (a.) stems from the fact that SP allocation during eager page
> splitting is made directly rather than through kvm_mmu_memory_caches,
> which was what you pushed for in the TDP MMU implementation. We could
> instead use kvm_mmu_memory_caches for the shadow MMU eager page

...

> So even if we did everything you proposed (which seems like an awful
> lot just to avoid __kvm_mmu_find_shadow_page()), there's a chance we
> would still end up with the exact same code. i.e.
> kvm_mmu_nested_tdp_find_sp() would be implemented by calling
> __kvm_mmu_find_shadow_page(), because it would be a waste to
> re-implement an almost identical function?

I went far enough down this path to know that my idea isn't completely awful,
and wouldn't actually need to fork FNAME(page_fault) at this time, but sadly I
still dislike the end result.

Your assessment that we'd still end up with very similar (if not exactly the same)
code is spot on.  Ditto for your other assertion in (a) about using the caches.

My vote for this series is to go the cache route, e.g. wrap kvm_mmu_memory_caches
in a struct and pass that into kvm_mmu_get_page().  I still think it was the right
call to ignore the caches for the TDP MMU, it gives the TDP MMU more flexibility
and it was trivial to bypass the caches since the TDP MMU was doing its own thing
anyways.

But for the shadow MMU, IMO the cons outweigh the pros.  E.g. in addition to
ending up with two similar but subtly different "get page" flows, passing around
"struct kvm_mmu_page **spp" is a bit unpleasant.  Ditto for having a partially
initialized kvm_mmu_page.  The split code also ends up in a weird state where it
uses the caches for the pte_list, but not the other allocations.
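Concretely, the cache-route idea is something like the following toy model (all `toy_*` names are hypothetical, not the kernel's): both the fault path and eager page splitting fill the same bundle of caches, so the get-page path consumes it without caring which caller it serves, and no `**spp` or partially initialized page is needed.

```c
#include <assert.h>

/* Stand-in for a kvm_mmu_memory_cache: just a count of ready objects. */
struct toy_cache {
	int nobjs;
};

/* The proposed wrapper: every allocation a shadow page needs, bundled
 * so either caller can hand it to the get-page path. */
struct toy_sp_caches {
	struct toy_cache *header_cache;
	struct toy_cache *page_cache;
};

/* Consume one object from each cache; fails if either cache is empty,
 * modeling the "caches must be topped up beforehand" contract. */
static int toy_sp_alloc_from(struct toy_sp_caches *caches)
{
	if (!caches->header_cache->nobjs || !caches->page_cache->nobjs)
		return 0;
	caches->header_cache->nobjs--;
	caches->page_cache->nobjs--;
	return 1;
}
```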

There will be one wart due to unsync pages needing @vcpu, but we can pass in NULL
for the split case and assert that @vcpu is non-null whenever an unsync page is
encountered, since all of the children created during a split should be direct.

		if (sp->unsync) {
			if (WARN_ON_ONCE(!vcpu)) {
				kvm_mmu_prepare_zap_page(kvm, sp,
							 &invalid_list);
				continue;
			}

			/*
			 * The page is good, but is stale.  kvm_sync_page does
			 * get the latest guest state, but (unlike mmu_unsync_children)
			 * it doesn't write-protect the page or mark it synchronized!
			 * This way the validity of the mapping is ensured, but the
			 * overhead of write protection is not incurred until the
			 * guest invalidates the TLB mapping.  This allows multiple
			 * SPs for a single gfn to be unsync.
			 *
			 * If the sync fails, the page is zapped.  If so, break
			 * in order to rebuild it.
			 */
			if (!kvm_sync_page(vcpu, sp, &invalid_list))
				break;

			WARN_ON(!list_empty(&invalid_list));
			kvm_flush_remote_tlbs(kvm);
		}
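A self-contained toy of that @vcpu contract, with purely illustrative names (none of this is kernel code): the split path passes NULL, and an unsync page seen with a NULL vcpu is zapped instead of synced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_vcpu { int id; };

struct toy_sp {
	bool unsync;
	bool zapped;
	bool synced;
};

/* Returns true if the found page can be reused, false if it was zapped.
 * A NULL vcpu (the split case) models WARN_ON_ONCE(!vcpu) followed by
 * kvm_mmu_prepare_zap_page(); a non-NULL vcpu models a successful
 * kvm_sync_page(). */
static bool toy_handle_found_sp(struct toy_vcpu *vcpu, struct toy_sp *sp)
{
	if (!sp->unsync)
		return true;

	if (vcpu == NULL) {
		sp->zapped = true;
		return false;
	}

	sp->synced = true;
	sp->unsync = false;
	return true;
}
```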

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v3 00/23] KVM: Extend Eager Page Splitting to the shadow MMU
  2022-04-13  1:02               ` Sean Christopherson
@ 2022-04-13 17:57                 ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-13 17:57 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Wed, Apr 13, 2022 at 01:02:51AM +0000, Sean Christopherson wrote:
> On Tue, Apr 12, 2022, David Matlack wrote:
> > On Mon, Apr 11, 2022 at 5:39 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Mon, Apr 11, 2022, David Matlack wrote:
> > > >
> > > > One thing that would be helpful is if you can explain in a bit more
> > > > specifically what you'd like to see. Part of the reason why I prefer
> > > > to sequence your proposal after eager page splitting is that I do not
> > > > fully understand what you're proposing, and how complex it would be.
> > > > e.g. Forking FNAME(fetch), FNAME(page_fault), and kvm_mmu_get_page()
> > > > for nested MMUs does not sound like less churn.
> > >
> > > Oh, it's most definitely not less code, and probably more churn.  But, it's churn
> > > that pushes us in a more favorable direction and that is desirable long term.  I
> > > don't mind churning code, but I want the churn to make future life easier, not
> > > harder.  Details below.
> > 
> > Of course. Let's make sure we're on the same page about what churn
> > introduced by this series will make future life harder that we hope to
> > avoid. If I understand you correctly, it's the following 2 changes:
> > 
> >  (a.) Using separate functions to allocate SPs and initialize SPs.
> >  (b.) Separating kvm_mmu_find_shadow_page() from __kvm_mmu_find_shadow_page().
> > 
> > (a.) stems from the fact that SP allocation during eager page
> > splitting is made directly rather than through kvm_mmu_memory_caches,
> > which was what you pushed for in the TDP MMU implementation. We could
> > instead use kvm_mmu_memory_caches for the shadow MMU eager page
> 
> ...
> 
> > So even if we did everything you proposed (which seems like an awful
> > lot just to avoid __kvm_mmu_find_shadow_page()), there's a chance we
> > would still end up with the exact same code. i.e.
> > kvm_mmu_nested_tdp_find_sp() would be implemented by calling
> > __kvm_mmu_find_shadow_page(), because it would be a waste to
> > re-implement an almost identical function?
> 
> I went far enough down this path to know that my idea isn't completely awful,
> and wouldn't actually need to fork FNAME(page_fault) at this time, but sadly I
> still dislike the end result.

Thanks for looking into it so quickly so we could figure out a path
forward.

> 
> Your assessment that we'd still end up with very similar (if not quite exact)
> code is spot on.  Ditto for your other assertion in (a) about using the caches.
> 
> My vote for this series is to go the cache route, e.g. wrap kvm_mmu_memory_caches
> in a struct and pass that into kvm_mmu_get_page().  I still think it was the right
> call to ignore the caches for the TDP MMU, it gives the TDP MMU more flexibility
> and it was trivial to bypass the caches since the TDP MMU was doing its own thing
> anyways.
> 
> But for the shadow MMU, IMO the cons outweigh the pros.  E.g. in addition to
> ending up with two similar but subtly different "get page" flows, passing around
> "struct kvm_mmu_page **spp" is a bit unpleasant.  Ditto for having a partially
> initialized kvm_mmu_page.  The split code also ends up in a weird state where it
> uses the caches for the pte_list, but not the other allocations.

Sounds good. I will rework the series to use kvm_mmu_memory_cache
structs for the SP allocation during eager page splitting. That will
eliminate the separate allocation and initialization which will be a
nice cleanup. And it will be great to get rid of the spp crud.

And per your earlier feedback, I will also limit eager page splitting to
nested MMUs.
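
For illustration only, the "wrap the caches in a struct" idea could look
something like the sketch below. All names (shadow_page_caches,
shadow_page_caches_topup, etc.) are hypothetical, and the types are
simplified userspace stand-ins rather than the actual KVM definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for the kernel's kvm_mmu_memory_cache. */
struct mmu_memory_cache {
	size_t obj_size;
	int nobjs;
	void *objs[40];
};

/*
 * Hypothetical wrapper bundling every cache needed to allocate a shadow
 * page, so the fault path and eager page splitting could pass the same
 * struct into the "get page" flow instead of a kvm_mmu_page **spp.
 */
struct shadow_page_caches {
	struct mmu_memory_cache *page_header;
	struct mmu_memory_cache *shadow_page;
	struct mmu_memory_cache *gfn_array;
};

/* Fill one cache up to @min objects (rough analog of topup). */
static int cache_topup(struct mmu_memory_cache *mc, int min)
{
	while (mc->nobjs < min) {
		void *obj = calloc(1, mc->obj_size);
		if (!obj)
			return -1;
		mc->objs[mc->nobjs++] = obj;
	}
	return 0;
}

/* Pop a pre-allocated object; callers must have topped up first. */
static void *cache_alloc(struct mmu_memory_cache *mc)
{
	assert(mc->nobjs > 0);
	return mc->objs[--mc->nobjs];
}

/* Top up all caches in the wrapper before taking the MMU lock. */
static int shadow_page_caches_topup(struct shadow_page_caches *c, int min)
{
	if (cache_topup(c->page_header, min) ||
	    cache_topup(c->shadow_page, min) ||
	    cache_topup(c->gfn_array, min))
		return -1;
	return 0;
}
```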

> 
> There will be one wart due to unsync pages needing @vcpu, but we can pass in NULL
> for the split case and assert that @vcpu is non-null since all of the children
> should be direct.

The NULL vcpu check will be a little gross, but it should never trigger
in practice since eager page splitting always requests direct SPs. My
preference has been to enforce that in code by splitting out
__kvm_mmu_find_shadow_page(), but I can see the advantage of your
proposal is that eager page splitting and faults will go through the
exact same code path to get a kvm_mmu_page.

> 
> 		if (sp->unsync) {
> 			if (WARN_ON_ONCE(!vcpu)) {
> 				kvm_mmu_prepare_zap_page(kvm, sp,
> 							 &invalid_list);
> 				continue;
> 			}
> 
> 			/*
> 			 * The page is good, but is stale.  kvm_sync_page does
> 			 * get the latest guest state, but (unlike mmu_unsync_children)
> 			 * it doesn't write-protect the page or mark it synchronized!
> 			 * This way the validity of the mapping is ensured, but the
> 			 * overhead of write protection is not incurred until the
> 			 * guest invalidates the TLB mapping.  This allows multiple
> 			 * SPs for a single gfn to be unsync.
> 			 *
> 			 * If the sync fails, the page is zapped.  If so, break
> 			 * in order to rebuild it.
> 			 */
> 			if (!kvm_sync_page(vcpu, sp, &invalid_list))
> 				break;
> 
> 			WARN_ON(!list_empty(&invalid_list));
> 			kvm_flush_remote_tlbs(kvm);
> 		}

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v3 00/23] KVM: Extend Eager Page Splitting to the shadow MMU
  2022-04-13 17:57                 ` David Matlack
@ 2022-04-13 18:28                   ` Sean Christopherson
  -1 siblings, 0 replies; 88+ messages in thread
From: Sean Christopherson @ 2022-04-13 18:28 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Wed, Apr 13, 2022, David Matlack wrote:
> On Wed, Apr 13, 2022 at 01:02:51AM +0000, Sean Christopherson wrote:
> > There will be one wart due to unsync pages needing @vcpu, but we can pass in NULL
> > for the split case and assert that @vcpu is non-null since all of the children
> > should be direct.
> 
> The NULL vcpu check will be a little gross,

Yeah, I would even call it a lot gross :-)

> but it should never trigger in practice since eager page splitting always
> requests direct SPs. My preference has been to enforce that in code by
> splitting out

It still is enforced in code, just at different points.  The split version WARNs
and continues after finding a page, the below WARNs and rejects _while_ finding
the page.

Speaking of WARNs, that reminds me... it might be worth adding a WARN in
kvm_mmu_get_child_sp() to document (and detect, but more to document) that @direct
should never encounter a page with unsync or unsync_children, e.g.

	union kvm_mmu_page_role role;
	struct kvm_mmu_page *sp;

	role = kvm_mmu_child_role(sptep, direct, access);
	sp = kvm_mmu_get_page(vcpu, gfn, role);

	/* Comment goes here about direct pages in shadow MMUs? */
	WARN_ON(direct && (sp->unsync || sp->unsync_children));
	return sp;

The indirect walk of FNAME(fetch)() handles unsync_children, but none of the other
callers do.  Obviously shouldn't happen, but especially in the huge page split
case it took me a second to understand exactly why it can't happen.

> __kvm_mmu_find_shadow_page(), but I can see the advantage of your
> proposal is that eager page splitting and faults will go through the
> exact same code path to get a kvm_mmu_page.
> 
> > 
> > 		if (sp->unsync) {
> > 			if (WARN_ON_ONCE(!vcpu)) {
> > 				kvm_mmu_prepare_zap_page(kvm, sp,
> > 							 &invalid_list);
> > 				continue;
> > 			}
> > 
> > 			/*
> > 			 * The page is good, but is stale.  kvm_sync_page does
> > 			 * get the latest guest state, but (unlike mmu_unsync_children)
> > 			 * it doesn't write-protect the page or mark it synchronized!
> > 			 * This way the validity of the mapping is ensured, but the
> > 			 * overhead of write protection is not incurred until the
> > 			 * guest invalidates the TLB mapping.  This allows multiple
> > 			 * SPs for a single gfn to be unsync.
> > 			 *
> > 			 * If the sync fails, the page is zapped.  If so, break
> > 			 * in order to rebuild it.
> > 			 */
> > 			if (!kvm_sync_page(vcpu, sp, &invalid_list))
> > 				break;
> > 
> > 			WARN_ON(!list_empty(&invalid_list));
> > 			kvm_flush_remote_tlbs(kvm);
> > 		}

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v3 00/23] KVM: Extend Eager Page Splitting to the shadow MMU
  2022-04-13 18:28                   ` Sean Christopherson
@ 2022-04-13 21:22                     ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-13 21:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Wed, Apr 13, 2022 at 11:28 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Apr 13, 2022, David Matlack wrote:
> > On Wed, Apr 13, 2022 at 01:02:51AM +0000, Sean Christopherson wrote:
> > > There will be one wart due to unsync pages needing @vcpu, but we can pass in NULL
> > > for the split case and assert that @vcpu is non-null since all of the children
> > > should be direct.
> >
> > The NULL vcpu check will be a little gross,
>
> Yeah, I would even call it a lot gross :-)
>
> > but it should never trigger in practice since eager page splitting always
> > requests direct SPs. My preference has been to enforce that in code by
> > splitting out
>
> It still is enforced in code, just at different points.  The split version WARNs
> and continues after finding a page, the below WARNs and rejects _while_ finding
> the page.
>
> Speaking of WARNs, that reminds me... it might be worth adding a WARN in
> kvm_mmu_get_child_sp() to document (and detect, but more to document) that @direct
> should never encounter a page with unsync or unsync_children, e.g.
>
>         union kvm_mmu_page_role role;
>         struct kvm_mmu_page *sp;
>
>         role = kvm_mmu_child_role(sptep, direct, access);
>         sp = kvm_mmu_get_page(vcpu, gfn, role);
>
>         /* Comment goes here about direct pages in shadow MMUs? */
>         WARN_ON(direct && (sp->unsync || sp->unsync_children));
>         return sp;
>
> The indirect walk of FNAME(fetch)() handles unsync_children, but none of the other
> callers do.  Obviously shouldn't happen, but especially in the huge page split
> case it took me a second to understand exactly why it can't happen.

Will do.

>
> > __kvm_mmu_find_shadow_page(), but I can see the advantage of your
> > proposal is that eager page splitting and faults will go through the
> > exact same code path to get a kvm_mmu_page.
> >
> > >
> > >             if (sp->unsync) {
> > >                     if (WARN_ON_ONCE(!vcpu)) {
> > >                             kvm_mmu_prepare_zap_page(kvm, sp,
> > >                                                      &invalid_list);
> > >                             continue;
> > >                     }
> > >
> > >                     /*
> > >                      * The page is good, but is stale.  kvm_sync_page does
> > >                      * get the latest guest state, but (unlike mmu_unsync_children)
> > >                      * it doesn't write-protect the page or mark it synchronized!
> > >                      * This way the validity of the mapping is ensured, but the
> > >                      * overhead of write protection is not incurred until the
> > >                      * guest invalidates the TLB mapping.  This allows multiple
> > >                      * SPs for a single gfn to be unsync.
> > >                      *
> > >                      * If the sync fails, the page is zapped.  If so, break
> > >                      * in order to rebuild it.
> > >                      */
> > >                     if (!kvm_sync_page(vcpu, sp, &invalid_list))
> > >                             break;
> > >
> > >                     WARN_ON(!list_empty(&invalid_list));
> > >                     kvm_flush_remote_tlbs(kvm);
> > >             }

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v3 16/23] KVM: x86/mmu: Cache the access bits of shadowed translations
  2022-04-09  0:02     ` Sean Christopherson
@ 2022-04-14 16:47       ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-14 16:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 8, 2022 at 5:02 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 01, 2022, David Matlack wrote:
> > @@ -733,7 +733,7 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
> >  static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
> >  {
> >       if (!sp->role.direct)
> > -             return sp->gfns[index];
> > +             return sp->shadowed_translation[index].gfn;
> >
> >       return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
> >  }
> > @@ -741,7 +741,7 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
> >  static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
>
> This should be replaced with a single helper to set the gfn+access.  Under no
> circumstance should _just_ the gfn change, and that will allow us to optimize
> writing the entry.  More below.
>
> >  {
> >       if (!sp->role.direct) {
> > -             sp->gfns[index] = gfn;
> > +             sp->shadowed_translation[index].gfn = gfn;
> >               return;
> >       }
> >
> > @@ -752,6 +752,47 @@ static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
> >                                  kvm_mmu_page_get_gfn(sp, index), gfn);
> >  }
> >
> > +static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index, u32 access)
> > +{
> > +     if (!sp->role.direct) {
> > +             sp->shadowed_translation[index].access = access;
> > +             return;
> > +     }
> > +
> > +     if (WARN_ON(access != sp->role.access))
> > +             pr_err_ratelimited("access mismatch under direct page %llx "
>
> LOL, I realize this is not your code, but ratelimiting under a WARN ain't gonna
> help much :-)

Ha! Yeah this silly. I'll see about adding a precursor patch to make
it less terrible.

>
> This also generates a warning and fails to compile with KVM_WERROR=y, though I
> believe the test bots already reported that.
>
>
> arch/x86/kvm/mmu/mmu.c: In function ‘kvm_mmu_page_set_access’:
> include/linux/kern_levels.h:5:25: error: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 3 has type ‘int’ [-Werror=format=]
>     5 | #define KERN_SOH        "\001"          /* ASCII Start Of Header */
>       |                         ^~~~~~
> include/linux/printk.h:418:25: note: in definition of macro ‘printk_index_wrap’
>   418 |                 _p_func(_fmt, ##__VA_ARGS__);                           \
>       |                         ^~~~
> include/linux/printk.h:640:17: note: in expansion of macro ‘printk’
>   640 |                 printk(fmt, ##__VA_ARGS__);                             \
>       |                 ^~~~~~
> include/linux/printk.h:654:9: note: in expansion of macro ‘printk_ratelimited’
>   654 |         printk_ratelimited(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
>       |         ^~~~~~~~~~~~~~~~~~
> include/linux/kern_levels.h:11:25: note: in expansion of macro ‘KERN_SOH’
>    11 | #define KERN_ERR        KERN_SOH "3"    /* error conditions */
>       |                         ^~~~~~~~
> include/linux/printk.h:654:28: note: in expansion of macro ‘KERN_ERR’
>   654 |         printk_ratelimited(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
>       |                            ^~~~~~~~
> arch/x86/kvm/mmu/mmu.c:763:17: note: in expansion of macro ‘pr_err_ratelimited’
>   763 |                 pr_err_ratelimited("access mismatch under direct page %llx "
>       |                 ^~~~~~~~~~~~~~~~~~
>
>
> > +                                "(expected %llx, got %llx)\n",
> > +                                kvm_mmu_page_get_gfn(sp, index),
> > +                                sp->role.access, access);
> > +}
> > +
> > +/*
> > + * For leaf SPTEs, fetch the *guest* access permissions being shadowed. Note
> > + * that the SPTE itself may have more constrained access permissions than
> > + * what the guest enforces. For example, a guest may create an executable
> > + * huge PTE but KVM may disallow execution to mitigate iTLB multihit.
> > + */
> > +static u32 kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
> > +{
> > +     if (!sp->role.direct)
> > +             return sp->shadowed_translation[index].access;
> > +
> > +     /*
> > +      * For direct MMUs (e.g. TDP or non-paging guests) there are no *guest*
> > +      * access permissions being shadowed. So we can just return ACC_ALL
> > +      * here.
> > +      *
> > +      * For indirect MMUs (shadow paging), direct shadow pages exist when KVM
> > +      * is shadowing a guest huge page with smaller pages, since the guest
> > +      * huge page is being directly mapped. In this case the guest access
> > +      * permissions being shadowed are the access permissions of the huge
> > +      * page.
> > +      *
> > +      * In both cases, sp->role.access contains exactly what we want.
> > +      */
> > +     return sp->role.access;
> > +}
>
> ...
>
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index b6e22ba9c654..3f76f4c1ae59 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -32,6 +32,18 @@ extern bool dbg;
> >
> >  typedef u64 __rcu *tdp_ptep_t;
> >
> > +/*
> > + * Stores the result of the guest translation being shadowed by an SPTE. KVM
> > + * shadows two types of guest translations: nGPA -> GPA (shadow EPT/NPT) and
> > + * GVA -> GPA (traditional shadow paging). In both cases the result of the
> > + * translation is a GPA and a set of access constraints.
> > + */
> > +struct shadowed_translation_entry {
> > +     /* Note, GFNs can have at most 64 - PAGE_SHIFT = 52 bits. */
> > +     u64 gfn:52;
> > +     u64 access:3;
>
> A bitfield is completely unnecessary and generates bad code.  As is, it generates
> _really_ bad code because extracting and setting requires non-standard 64-bit value
> masks, multiple operations, and accesses to unaligned data.  The generated code can
> be made slightly less awful by using a full byte for access and 64 bits for GFN,
> but it still sucks compared to what we can hand generate.
>
> The other aspect of this is that retrieving the GFN is a frequent operation,
> whereas the access is almost never read.  I.e. we should bias for reading the GFN
> above all else.
>
> The simple and obvious thing is to not reinvent the wheel.  GFN = (GPA >> PAGE_SHIFT),
> and ignoring NX, access lives in the lower 12 bits of a PTE.  Then reading the GFN is
> a simple SHR, and reading access info is a simple AND.
>
> We might also be able to optimize FNAME(sync_page), but I don't care much about
> that, it's rarely used for nested TDP.
>
> So, keep translation_entry a gfn_t *, then do:

Looks good, will do in v4.

>
> static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
> {
>         if (!sp->role.direct)
>                 return sp->shadowed_translation[index] >> PAGE_SHIFT;
>
>         return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
> }
>
> static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index,
>                                          gfn_t gfn, unsigned int access)
> {
>         if (!sp->role.direct) {
>                 sp->shadowed_translation[index] = (gfn << PAGE_SHIFT) | access;
>                 return;
>         }
>
>         if (WARN_ON(gfn != kvm_mmu_page_get_gfn(sp, index)))
>                 pr_err_ratelimited("gfn mismatch under direct page %llx "
>                                    "(expected %llx, got %llx)\n",
>                                    sp->gfn,
>                                    kvm_mmu_page_get_gfn(sp, index), gfn);
> }
>
> static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index,
>                                     unsigned int access)
> {
>         if (sp->role.direct)
>                 return;
>
>         sp->shadowed_translation[index] &= PAGE_MASK;
>         sp->shadowed_translation[index] |= access;
> }
>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v3 22/23] KVM: x86/mmu: Support Eager Page Splitting in the shadow MMU
  2022-04-09  0:39     ` Sean Christopherson
@ 2022-04-14 16:50       ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-14 16:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 8, 2022 at 5:40 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 01, 2022, David Matlack wrote:
> > Add support for Eager Page Splitting pages that are mapped by the shadow
> > MMU. Walk through the rmap first splitting all 1GiB pages to 2MiB pages,
> > and then splitting all 2MiB pages to 4KiB pages.
> >
> > Splitting huge pages mapped by the shadow MMU requires dealing with some
> > extra complexity beyond that of the TDP MMU:
> >
> > (1) The shadow MMU has a limit on the number of shadow pages that are
> >     allowed to be allocated. So, as a policy, Eager Page Splitting
> >     refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
> >     pages available.
> >
> > (2) Huge pages may be mapped by indirect shadow pages which have the
> >     possibility of being unsync. As a policy we opt not to split such
> >     pages as their translation may no longer be valid.
>
> This shouldn't be possible, shadow pages whose role is > 4k are always write-protected
> and not allowed to become unsync.

Ah ok, then the unsync check is unnecessary (or at least could WARN_ON()).

>
> >
> > (3) Splitting a huge page may end up re-using an existing lower level
> >     shadow page tables. This is unlike the TDP MMU which always allocates
> >     new shadow page tables when splitting.
>
> ...
>
> > +static void kvm_mmu_split_huge_page(struct kvm *kvm,
> > +                                 const struct kvm_memory_slot *slot,
> > +                                 u64 *huge_sptep, struct kvm_mmu_page **spp)
> > +
> > +{
> > +     struct kvm_mmu_memory_cache *cache = &kvm->arch.huge_page_split_desc_cache;
> > +     u64 huge_spte = READ_ONCE(*huge_sptep);
> > +     struct kvm_mmu_page *sp;
> > +     bool flush = false;
> > +     u64 *sptep, spte;
> > +     gfn_t gfn;
> > +     int index;
> > +
> > +     sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
> > +
> > +     for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
> > +             sptep = &sp->spt[index];
> > +             gfn = kvm_mmu_page_get_gfn(sp, index);
> > +
> > +             /*
> > +              * sp may have populated page table entries, e.g. if this huge
> > +              * page is aliased by multiple sptes with the same access
> > +              * permissions. We know the sptes will be mapping the same
> > +              * gfn-to-pfn translation since sp is direct. However, a given
> > +              * spte may point to an even lower level page table. We don't
> > +              * know if that lower level page table is completely filled in,
> > +              * i.e. we may be effectively unmapping a region of memory, so
> > +              * we must flush the TLB.
>
> Random side topic, please avoid "we" and other pronouns in comments and changelogs,
> it gets real easy to lose track of what a pronoun is referring to, especially in
> changelogs where "we" might be KVM, might be the kernel, might be the team that's
> using the patch, might be an author that's prone to illeism, etc...

Agreed. It's a bad habit of mine, and despite conscious effort to
limit use of "we" in my comments and change logs, it inevitably creeps
in. I'll do a pass on this series to get rid of the use of "we"
throughout.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v3 20/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-04-01 17:55   ` David Matlack
@ 2022-04-20 10:55     ` Anup Patel
  -1 siblings, 0 replies; 88+ messages in thread
From: Anup Patel @ 2022-04-20 10:55 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 1, 2022 at 11:26 PM David Matlack <dmatlack@google.com> wrote:
>
> Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
> declaration time rather than being fixed for all declarations. This will
> be used in a follow-up commit to declare a cache in x86 with a capacity
> of 512+ objects without having to increase the capacity of all caches in
> KVM.
>
> This change requires each cache now specify its capacity at runtime,
> since the cache struct itself no longer has a fixed capacity known at
> compile time. To protect against someone accidentally defining a
> kvm_mmu_memory_cache struct directly (without the extra storage), this
> commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().
>
> This change, unfortunately, adds some grottiness to
> kvm_phys_addr_ioremap() in arm64, which uses a function-local (i.e.
> stack-allocated) kvm_mmu_memory_cache struct. Since C does not allow
> anonymous structs in functions, the new wrapper struct that contains
> kvm_mmu_memory_cache and the objects pointer array must be named, which
> means dealing with an outer and inner struct. The outer struct can't be
> dropped since then there would be no guarantee the kvm_mmu_memory_cache
> struct and objects array would be laid out consecutively on the stack.
>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Looks good to me.

For KVM RISC-V:
Acked-by: Anup Patel <anup@brainfault.org>

Regards,
Anup

> ---
>  arch/arm64/include/asm/kvm_host.h |  2 +-
>  arch/arm64/kvm/arm.c              |  1 +
>  arch/arm64/kvm/mmu.c              | 13 +++++++++----
>  arch/mips/include/asm/kvm_host.h  |  2 +-
>  arch/mips/kvm/mips.c              |  2 ++
>  arch/riscv/include/asm/kvm_host.h |  2 +-
>  arch/riscv/kvm/mmu.c              | 17 ++++++++++-------
>  arch/riscv/kvm/vcpu.c             |  1 +
>  arch/x86/include/asm/kvm_host.h   |  8 ++++----
>  arch/x86/kvm/mmu/mmu.c            |  9 +++++++++
>  include/linux/kvm_types.h         | 19 +++++++++++++++++--
>  virt/kvm/kvm_main.c               | 10 +++++++++-
>  12 files changed, 65 insertions(+), 21 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 0e96087885fe..4670491899de 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -362,7 +362,7 @@ struct kvm_vcpu_arch {
>         bool pause;
>
>         /* Cache some mmu pages needed inside spinlock regions */
> -       struct kvm_mmu_memory_cache mmu_page_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
>         /* Target CPU and feature flags */
>         int target;
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index ba9165e84396..af4d8a490af5 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -320,6 +320,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>         vcpu->arch.target = -1;
>         bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
>
> +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>
>         /* Set up the timer */
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 0d19259454d8..01e15bcb7be2 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -764,7 +764,12 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>  {
>         phys_addr_t addr;
>         int ret = 0;
> -       struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> +               .cache = {
> +                       .gfp_zero = __GFP_ZERO,
> +                       .capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> +               },
> +       };
>         struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
>         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
>                                      KVM_PGTABLE_PROT_R |
> @@ -777,14 +782,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>         guest_ipa &= PAGE_MASK;
>
>         for (addr = guest_ipa; addr < guest_ipa + size; addr += PAGE_SIZE) {
> -               ret = kvm_mmu_topup_memory_cache(&cache,
> +               ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
>                                                  kvm_mmu_cache_min_pages(kvm));
>                 if (ret)
>                         break;
>
>                 write_lock(&kvm->mmu_lock);
>                 ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
> -                                            &cache);
> +                                            &page_cache.cache);
>                 write_unlock(&kvm->mmu_lock);
>                 if (ret)
>                         break;
> @@ -792,7 +797,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>                 pa += PAGE_SIZE;
>         }
>
> -       kvm_mmu_free_memory_cache(&cache);
> +       kvm_mmu_free_memory_cache(&page_cache.cache);
>         return ret;
>  }
>
> diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
> index 717716cc51c5..935511d7fc3a 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -347,7 +347,7 @@ struct kvm_vcpu_arch {
>         unsigned long pending_exceptions_clr;
>
>         /* Cache some mmu pages needed inside spinlock regions */
> -       struct kvm_mmu_memory_cache mmu_page_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
>         /* vcpu's vzguestid is different on each host cpu in an smp system */
>         u32 vzguestid[NR_CPUS];
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index a25e0b73ee70..45c7179144dc 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -387,6 +387,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>         if (err)
>                 goto out_free_gebase;
>
> +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> +
>         return 0;
>
>  out_free_gebase:
> diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> index 78da839657e5..4ec0b7a3d515 100644
> --- a/arch/riscv/include/asm/kvm_host.h
> +++ b/arch/riscv/include/asm/kvm_host.h
> @@ -186,7 +186,7 @@ struct kvm_vcpu_arch {
>         struct kvm_sbi_context sbi_context;
>
>         /* Cache pages needed to program page tables with spinlock held */
> -       struct kvm_mmu_memory_cache mmu_page_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
>         /* VCPU power-off state */
>         bool power_off;
> diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> index f80a34fbf102..5ffd164a5aeb 100644
> --- a/arch/riscv/kvm/mmu.c
> +++ b/arch/riscv/kvm/mmu.c
> @@ -347,10 +347,12 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
>         int ret = 0;
>         unsigned long pfn;
>         phys_addr_t addr, end;
> -       struct kvm_mmu_memory_cache pcache;
> -
> -       memset(&pcache, 0, sizeof(pcache));
> -       pcache.gfp_zero = __GFP_ZERO;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> +               .cache = {
> +                       .gfp_zero = __GFP_ZERO,
> +                       .capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> +               },
> +       };
>
>         end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
>         pfn = __phys_to_pfn(hpa);
> @@ -361,12 +363,13 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
>                 if (!writable)
>                         pte = pte_wrprotect(pte);
>
> -               ret = kvm_mmu_topup_memory_cache(&pcache, stage2_pgd_levels);
> +               ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
> +                                                stage2_pgd_levels);
>                 if (ret)
>                         goto out;
>
>                 spin_lock(&kvm->mmu_lock);
> -               ret = stage2_set_pte(kvm, 0, &pcache, addr, &pte);
> +               ret = stage2_set_pte(kvm, 0, &page_cache.cache, addr, &pte);
>                 spin_unlock(&kvm->mmu_lock);
>                 if (ret)
>                         goto out;
> @@ -375,7 +378,7 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
>         }
>
>  out:
> -       kvm_mmu_free_memory_cache(&pcache);
> +       kvm_mmu_free_memory_cache(&page_cache.cache);
>         return ret;
>  }
>
> diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> index 624166004e36..6a5f5aa45bac 100644
> --- a/arch/riscv/kvm/vcpu.c
> +++ b/arch/riscv/kvm/vcpu.c
> @@ -94,6 +94,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>
>         /* Mark this VCPU never ran */
>         vcpu->arch.ran_atleast_once = false;
> +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>
>         /* Setup ISA features available to VCPU */
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index be4349c9ffea..ffb2b99f3a60 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -694,10 +694,10 @@ struct kvm_vcpu_arch {
>          */
>         struct kvm_mmu *walk_mmu;
>
> -       struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> -       struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> -       struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> -       struct kvm_mmu_memory_cache mmu_page_header_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_pte_list_desc_cache);
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadow_page_cache);
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadowed_info_cache);
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_header_cache);
>
>         /*
>          * QEMU userspace and the guest each have their own FPU state.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f058f28909ea..a8200b3f8782 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5800,12 +5800,21 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
>  {
>         int ret;
>
> +       vcpu->arch.mmu_pte_list_desc_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
>         vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
>
> +       vcpu->arch.mmu_page_header_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
>         vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> +       vcpu->arch.mmu_shadowed_info_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> +
> +       vcpu->arch.mmu_shadow_page_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
>         vcpu->arch.mmu = &vcpu->arch.root_mmu;
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index ac1ebb37a0ff..579cf39986ec 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -83,14 +83,29 @@ struct gfn_to_pfn_cache {
>   * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
>   * holding MMU locks.  Note, these caches act more like prefetch buffers than
>   * classical caches, i.e. objects are not returned to the cache on being freed.
> + *
> + * The storage for the cache object pointers is laid out after the struct, to
> + * allow different declarations to choose different capacities. The capacity
> + * field defines the number of object pointers available after the struct.
>   */
>  struct kvm_mmu_memory_cache {
>         int nobjs;
> +       int capacity;
>         gfp_t gfp_zero;
>         struct kmem_cache *kmem_cache;
> -       void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
> +       void *objects[];
>  };
> -#endif
> +
> +#define __DEFINE_KVM_MMU_MEMORY_CACHE(_name, _capacity)                \
> +       struct {                                                \
> +               struct kvm_mmu_memory_cache _name;              \
> +               void *_name##_objects[_capacity];               \
> +       }
> +
> +#define DEFINE_KVM_MMU_MEMORY_CACHE(_name) \
> +       __DEFINE_KVM_MMU_MEMORY_CACHE(_name, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE)
> +
> +#endif /* KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE */
>
>  #define HALT_POLL_HIST_COUNT                   32
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 70e05af5ebea..c4cac4195f4a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -373,9 +373,17 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
>  {
>         void *obj;
>
> +       /*
> +        * The capacity field must be initialized since the storage for the
> +        * objects pointer array is laid out after the kvm_mmu_memory_cache
> +        * struct and not known at compile time.
> +        */
> +       if (WARN_ON(mc->capacity == 0))
> +               return -EINVAL;
> +
>         if (mc->nobjs >= min)
>                 return 0;
> -       while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
> +       while (mc->nobjs < mc->capacity) {
>                 obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
>                 if (!obj)
>                         return mc->nobjs >= min ? 0 : -ENOMEM;
> --
> 2.35.1.1094.g7c7d902a7c-goog
>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v3 20/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
@ 2022-04-20 10:55     ` Anup Patel
  0 siblings, 0 replies; 88+ messages in thread
From: Anup Patel @ 2022-04-20 10:55 UTC (permalink / raw)
  To: David Matlack
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Fri, Apr 1, 2022 at 11:26 PM David Matlack <dmatlack@google.com> wrote:
>
> Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
> declaration time rather than being fixed for all declarations. This will
> be used in a follow-up commit to declare a cache in x86 with a capacity
> of 512+ objects without having to increase the capacity of all caches in
> KVM.
>
> This change requires that each cache now specify its capacity at runtime,
> since the cache struct itself no longer has a fixed capacity known at
> compile time. To protect against someone accidentally defining a
> kvm_mmu_memory_cache struct directly (without the extra storage), this
> commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().
>
> This change, unfortunately, adds some grottiness to
> kvm_phys_addr_ioremap() in arm64, which uses a function-local (i.e.
> stack-allocated) kvm_mmu_memory_cache struct. Since C does not allow
> anonymous structs in functions, the new wrapper struct that contains
> kvm_mmu_memory_cache and the objects pointer array must be named, which
> means dealing with an outer and inner struct. The outer struct can't be
> dropped since then there would be no guarantee the kvm_mmu_memory_cache
> struct and objects array would be laid out consecutively on the stack.
>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Looks good to me.

For KVM RISC-V:
Acked-by: Anup Patel <anup@brainfault.org>

Regards,
Anup

> ---
>  arch/arm64/include/asm/kvm_host.h |  2 +-
>  arch/arm64/kvm/arm.c              |  1 +
>  arch/arm64/kvm/mmu.c              | 13 +++++++++----
>  arch/mips/include/asm/kvm_host.h  |  2 +-
>  arch/mips/kvm/mips.c              |  2 ++
>  arch/riscv/include/asm/kvm_host.h |  2 +-
>  arch/riscv/kvm/mmu.c              | 17 ++++++++++-------
>  arch/riscv/kvm/vcpu.c             |  1 +
>  arch/x86/include/asm/kvm_host.h   |  8 ++++----
>  arch/x86/kvm/mmu/mmu.c            |  9 +++++++++
>  include/linux/kvm_types.h         | 19 +++++++++++++++++--
>  virt/kvm/kvm_main.c               | 10 +++++++++-
>  12 files changed, 65 insertions(+), 21 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 0e96087885fe..4670491899de 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -362,7 +362,7 @@ struct kvm_vcpu_arch {
>         bool pause;
>
>         /* Cache some mmu pages needed inside spinlock regions */
> -       struct kvm_mmu_memory_cache mmu_page_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
>         /* Target CPU and feature flags */
>         int target;
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index ba9165e84396..af4d8a490af5 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -320,6 +320,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>         vcpu->arch.target = -1;
>         bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
>
> +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>
>         /* Set up the timer */
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 0d19259454d8..01e15bcb7be2 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -764,7 +764,12 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>  {
>         phys_addr_t addr;
>         int ret = 0;
> -       struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> +               .cache = {
> +                       .gfp_zero = __GFP_ZERO,
> +                       .capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> +               },
> +       };
>         struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
>         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
>                                      KVM_PGTABLE_PROT_R |
> @@ -777,14 +782,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>         guest_ipa &= PAGE_MASK;
>
>         for (addr = guest_ipa; addr < guest_ipa + size; addr += PAGE_SIZE) {
> -               ret = kvm_mmu_topup_memory_cache(&cache,
> +               ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
>                                                  kvm_mmu_cache_min_pages(kvm));
>                 if (ret)
>                         break;
>
>                 write_lock(&kvm->mmu_lock);
>                 ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
> -                                            &cache);
> +                                            &page_cache.cache);
>                 write_unlock(&kvm->mmu_lock);
>                 if (ret)
>                         break;
> @@ -792,7 +797,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>                 pa += PAGE_SIZE;
>         }
>
> -       kvm_mmu_free_memory_cache(&cache);
> +       kvm_mmu_free_memory_cache(&page_cache.cache);
>         return ret;
>  }
>
> diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
> index 717716cc51c5..935511d7fc3a 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -347,7 +347,7 @@ struct kvm_vcpu_arch {
>         unsigned long pending_exceptions_clr;
>
>         /* Cache some mmu pages needed inside spinlock regions */
> -       struct kvm_mmu_memory_cache mmu_page_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
>         /* vcpu's vzguestid is different on each host cpu in an smp system */
>         u32 vzguestid[NR_CPUS];
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index a25e0b73ee70..45c7179144dc 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -387,6 +387,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>         if (err)
>                 goto out_free_gebase;
>
> +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> +
>         return 0;
>
>  out_free_gebase:
> diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> index 78da839657e5..4ec0b7a3d515 100644
> --- a/arch/riscv/include/asm/kvm_host.h
> +++ b/arch/riscv/include/asm/kvm_host.h
> @@ -186,7 +186,7 @@ struct kvm_vcpu_arch {
>         struct kvm_sbi_context sbi_context;
>
>         /* Cache pages needed to program page tables with spinlock held */
> -       struct kvm_mmu_memory_cache mmu_page_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
>         /* VCPU power-off state */
>         bool power_off;
> diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> index f80a34fbf102..5ffd164a5aeb 100644
> --- a/arch/riscv/kvm/mmu.c
> +++ b/arch/riscv/kvm/mmu.c
> @@ -347,10 +347,12 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
>         int ret = 0;
>         unsigned long pfn;
>         phys_addr_t addr, end;
> -       struct kvm_mmu_memory_cache pcache;
> -
> -       memset(&pcache, 0, sizeof(pcache));
> -       pcache.gfp_zero = __GFP_ZERO;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> +               .cache = {
> +                       .gfp_zero = __GFP_ZERO,
> +                       .capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> +               },
> +       };
>
>         end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
>         pfn = __phys_to_pfn(hpa);
> @@ -361,12 +363,13 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
>                 if (!writable)
>                         pte = pte_wrprotect(pte);
>
> -               ret = kvm_mmu_topup_memory_cache(&pcache, stage2_pgd_levels);
> +               ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
> +                                                stage2_pgd_levels);
>                 if (ret)
>                         goto out;
>
>                 spin_lock(&kvm->mmu_lock);
> -               ret = stage2_set_pte(kvm, 0, &pcache, addr, &pte);
> +               ret = stage2_set_pte(kvm, 0, &page_cache.cache, addr, &pte);
>                 spin_unlock(&kvm->mmu_lock);
>                 if (ret)
>                         goto out;
> @@ -375,7 +378,7 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
>         }
>
>  out:
> -       kvm_mmu_free_memory_cache(&pcache);
> +       kvm_mmu_free_memory_cache(&page_cache.cache);
>         return ret;
>  }
>
> diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> index 624166004e36..6a5f5aa45bac 100644
> --- a/arch/riscv/kvm/vcpu.c
> +++ b/arch/riscv/kvm/vcpu.c
> @@ -94,6 +94,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>
>         /* Mark this VCPU never ran */
>         vcpu->arch.ran_atleast_once = false;
> +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>
>         /* Setup ISA features available to VCPU */
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index be4349c9ffea..ffb2b99f3a60 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -694,10 +694,10 @@ struct kvm_vcpu_arch {
>          */
>         struct kvm_mmu *walk_mmu;
>
> -       struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> -       struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> -       struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> -       struct kvm_mmu_memory_cache mmu_page_header_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_pte_list_desc_cache);
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadow_page_cache);
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadowed_info_cache);
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_header_cache);
>
>         /*
>          * QEMU userspace and the guest each have their own FPU state.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f058f28909ea..a8200b3f8782 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5800,12 +5800,21 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
>  {
>         int ret;
>
> +       vcpu->arch.mmu_pte_list_desc_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
>         vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
>
> +       vcpu->arch.mmu_page_header_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
>         vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> +       vcpu->arch.mmu_shadowed_info_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> +
> +       vcpu->arch.mmu_shadow_page_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
>         vcpu->arch.mmu = &vcpu->arch.root_mmu;
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index ac1ebb37a0ff..579cf39986ec 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -83,14 +83,29 @@ struct gfn_to_pfn_cache {
>   * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
>   * holding MMU locks.  Note, these caches act more like prefetch buffers than
>   * classical caches, i.e. objects are not returned to the cache on being freed.
> + *
> + * The storage for the cache object pointers is laid out after the struct, to
> + * allow different declarations to choose different capacities. The capacity
> + * field defines the number of object pointers available after the struct.
>   */
>  struct kvm_mmu_memory_cache {
>         int nobjs;
> +       int capacity;
>         gfp_t gfp_zero;
>         struct kmem_cache *kmem_cache;
> -       void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
> +       void *objects[];
>  };
> -#endif
> +
> +#define __DEFINE_KVM_MMU_MEMORY_CACHE(_name, _capacity)                \
> +       struct {                                                \
> +               struct kvm_mmu_memory_cache _name;              \
> +               void *_name##_objects[_capacity];               \
> +       }
> +
> +#define DEFINE_KVM_MMU_MEMORY_CACHE(_name) \
> +       __DEFINE_KVM_MMU_MEMORY_CACHE(_name, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE)
> +
> +#endif /* KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE */
>
>  #define HALT_POLL_HIST_COUNT                   32
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 70e05af5ebea..c4cac4195f4a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -373,9 +373,17 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
>  {
>         void *obj;
>
> +       /*
> +        * The capacity field must be initialized since the storage for the
> +        * objects pointer array is laid out after the kvm_mmu_memory_cache
> +        * struct and not known at compile time.
> +        */
> +       if (WARN_ON(mc->capacity == 0))
> +               return -EINVAL;
> +
>         if (mc->nobjs >= min)
>                 return 0;
> -       while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
> +       while (mc->nobjs < mc->capacity) {
>                 obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
>                 if (!obj)
>                         return mc->nobjs >= min ? 0 : -ENOMEM;
> --
> 2.35.1.1094.g7c7d902a7c-goog
>
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


* Re: [PATCH v3 20/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-04-01 17:55   ` David Matlack
@ 2022-04-21 16:19     ` Ben Gardon
  -1 siblings, 0 replies; 88+ messages in thread
From: Ben Gardon @ 2022-04-21 16:19 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Paolo Bonzini, Maciej S . Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Fri, Apr 1, 2022 at 10:56 AM David Matlack <dmatlack@google.com> wrote:
>
> Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
> declaration time rather than being fixed for all declarations. This will
> be used in a follow-up commit to declare a cache in x86 with a capacity
> of 512+ objects without having to increase the capacity of all caches in
> KVM.
>
> This change requires that each cache now specify its capacity at runtime,
> since the cache struct itself no longer has a fixed capacity known at
> compile time. To protect against someone accidentally defining a
> kvm_mmu_memory_cache struct directly (without the extra storage), this
> commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().
>
> This change, unfortunately, adds some grottiness to
> kvm_phys_addr_ioremap() in arm64, which uses a function-local (i.e.
> stack-allocated) kvm_mmu_memory_cache struct. Since C does not allow
> anonymous structs in functions, the new wrapper struct that contains
> kvm_mmu_memory_cache and the objects pointer array must be named, which
> means dealing with an outer and inner struct. The outer struct can't be
> dropped since then there would be no guarantee the kvm_mmu_memory_cache
> struct and objects array would be laid out consecutively on the stack.

What do you think about lazily allocating the array to hold the objs
for the cache when we go to top up the cache?
We already have to do error handling there and the allocation for the
array to hold pointers for the cache objs will be small relative to
filling up the cache.
It seems like this introduces a lot of complexity to preserve the
static / stack allocation of that array in an increasingly dynamic
context.

>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/arm64/include/asm/kvm_host.h |  2 +-
>  arch/arm64/kvm/arm.c              |  1 +
>  arch/arm64/kvm/mmu.c              | 13 +++++++++----
>  arch/mips/include/asm/kvm_host.h  |  2 +-
>  arch/mips/kvm/mips.c              |  2 ++
>  arch/riscv/include/asm/kvm_host.h |  2 +-
>  arch/riscv/kvm/mmu.c              | 17 ++++++++++-------
>  arch/riscv/kvm/vcpu.c             |  1 +
>  arch/x86/include/asm/kvm_host.h   |  8 ++++----
>  arch/x86/kvm/mmu/mmu.c            |  9 +++++++++
>  include/linux/kvm_types.h         | 19 +++++++++++++++++--
>  virt/kvm/kvm_main.c               | 10 +++++++++-
>  12 files changed, 65 insertions(+), 21 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 0e96087885fe..4670491899de 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -362,7 +362,7 @@ struct kvm_vcpu_arch {
>         bool pause;
>
>         /* Cache some mmu pages needed inside spinlock regions */
> -       struct kvm_mmu_memory_cache mmu_page_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
>         /* Target CPU and feature flags */
>         int target;
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index ba9165e84396..af4d8a490af5 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -320,6 +320,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>         vcpu->arch.target = -1;
>         bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
>
> +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>
>         /* Set up the timer */
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 0d19259454d8..01e15bcb7be2 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -764,7 +764,12 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>  {
>         phys_addr_t addr;
>         int ret = 0;
> -       struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> +               .cache = {
> +                       .gfp_zero = __GFP_ZERO,
> +                       .capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> +               },
> +       };
>         struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
>         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
>                                      KVM_PGTABLE_PROT_R |
> @@ -777,14 +782,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>         guest_ipa &= PAGE_MASK;
>
>         for (addr = guest_ipa; addr < guest_ipa + size; addr += PAGE_SIZE) {
> -               ret = kvm_mmu_topup_memory_cache(&cache,
> +               ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
>                                                  kvm_mmu_cache_min_pages(kvm));
>                 if (ret)
>                         break;
>
>                 write_lock(&kvm->mmu_lock);
>                 ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
> -                                            &cache);
> +                                            &page_cache.cache);
>                 write_unlock(&kvm->mmu_lock);
>                 if (ret)
>                         break;
> @@ -792,7 +797,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>                 pa += PAGE_SIZE;
>         }
>
> -       kvm_mmu_free_memory_cache(&cache);
> +       kvm_mmu_free_memory_cache(&page_cache.cache);
>         return ret;
>  }
>
> diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
> index 717716cc51c5..935511d7fc3a 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -347,7 +347,7 @@ struct kvm_vcpu_arch {
>         unsigned long pending_exceptions_clr;
>
>         /* Cache some mmu pages needed inside spinlock regions */
> -       struct kvm_mmu_memory_cache mmu_page_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
>         /* vcpu's vzguestid is different on each host cpu in an smp system */
>         u32 vzguestid[NR_CPUS];
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index a25e0b73ee70..45c7179144dc 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -387,6 +387,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>         if (err)
>                 goto out_free_gebase;
>
> +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> +
>         return 0;
>
>  out_free_gebase:
> diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> index 78da839657e5..4ec0b7a3d515 100644
> --- a/arch/riscv/include/asm/kvm_host.h
> +++ b/arch/riscv/include/asm/kvm_host.h
> @@ -186,7 +186,7 @@ struct kvm_vcpu_arch {
>         struct kvm_sbi_context sbi_context;
>
>         /* Cache pages needed to program page tables with spinlock held */
> -       struct kvm_mmu_memory_cache mmu_page_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
>         /* VCPU power-off state */
>         bool power_off;
> diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> index f80a34fbf102..5ffd164a5aeb 100644
> --- a/arch/riscv/kvm/mmu.c
> +++ b/arch/riscv/kvm/mmu.c
> @@ -347,10 +347,12 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
>         int ret = 0;
>         unsigned long pfn;
>         phys_addr_t addr, end;
> -       struct kvm_mmu_memory_cache pcache;
> -
> -       memset(&pcache, 0, sizeof(pcache));
> -       pcache.gfp_zero = __GFP_ZERO;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> +               .cache = {
> +                       .gfp_zero = __GFP_ZERO,
> +                       .capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> +               },
> +       };
>
>         end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
>         pfn = __phys_to_pfn(hpa);
> @@ -361,12 +363,13 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
>                 if (!writable)
>                         pte = pte_wrprotect(pte);
>
> -               ret = kvm_mmu_topup_memory_cache(&pcache, stage2_pgd_levels);
> +               ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
> +                                                stage2_pgd_levels);
>                 if (ret)
>                         goto out;
>
>                 spin_lock(&kvm->mmu_lock);
> -               ret = stage2_set_pte(kvm, 0, &pcache, addr, &pte);
> +               ret = stage2_set_pte(kvm, 0, &page_cache.cache, addr, &pte);
>                 spin_unlock(&kvm->mmu_lock);
>                 if (ret)
>                         goto out;
> @@ -375,7 +378,7 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
>         }
>
>  out:
> -       kvm_mmu_free_memory_cache(&pcache);
> +       kvm_mmu_free_memory_cache(&page_cache.cache);
>         return ret;
>  }
>
> diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> index 624166004e36..6a5f5aa45bac 100644
> --- a/arch/riscv/kvm/vcpu.c
> +++ b/arch/riscv/kvm/vcpu.c
> @@ -94,6 +94,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>
>         /* Mark this VCPU never ran */
>         vcpu->arch.ran_atleast_once = false;
> +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>
>         /* Setup ISA features available to VCPU */
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index be4349c9ffea..ffb2b99f3a60 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -694,10 +694,10 @@ struct kvm_vcpu_arch {
>          */
>         struct kvm_mmu *walk_mmu;
>
> -       struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> -       struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> -       struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> -       struct kvm_mmu_memory_cache mmu_page_header_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_pte_list_desc_cache);
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadow_page_cache);
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadowed_info_cache);
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_header_cache);
>
>         /*
>          * QEMU userspace and the guest each have their own FPU state.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f058f28909ea..a8200b3f8782 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5800,12 +5800,21 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
>  {
>         int ret;
>
> +       vcpu->arch.mmu_pte_list_desc_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
>         vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
>
> +       vcpu->arch.mmu_page_header_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
>         vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> +       vcpu->arch.mmu_shadowed_info_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> +
> +       vcpu->arch.mmu_shadow_page_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
>         vcpu->arch.mmu = &vcpu->arch.root_mmu;
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index ac1ebb37a0ff..579cf39986ec 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -83,14 +83,29 @@ struct gfn_to_pfn_cache {
>   * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
>   * holding MMU locks.  Note, these caches act more like prefetch buffers than
>   * classical caches, i.e. objects are not returned to the cache on being freed.
> + *
> + * The storage for the cache object pointers is laid out after the struct, to
> + * allow different declarations to choose different capacities. The capacity
> + * field defines the number of object pointers available after the struct.
>   */
>  struct kvm_mmu_memory_cache {
>         int nobjs;
> +       int capacity;
>         gfp_t gfp_zero;
>         struct kmem_cache *kmem_cache;
> -       void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
> +       void *objects[];
>  };
> -#endif
> +
> +#define __DEFINE_KVM_MMU_MEMORY_CACHE(_name, _capacity)                \
> +       struct {                                                \
> +               struct kvm_mmu_memory_cache _name;              \
> +               void *_name##_objects[_capacity];               \
> +       }
> +
> +#define DEFINE_KVM_MMU_MEMORY_CACHE(_name) \
> +       __DEFINE_KVM_MMU_MEMORY_CACHE(_name, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE)
> +
> +#endif /* KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE */
>
>  #define HALT_POLL_HIST_COUNT                   32
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 70e05af5ebea..c4cac4195f4a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -373,9 +373,17 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
>  {
>         void *obj;
>
> +       /*
> +        * The capacity field must be initialized since the storage for the
> +        * objects pointer array is laid out after the kvm_mmu_memory_cache
> +        * struct and not known at compile time.
> +        */
> +       if (WARN_ON(mc->capacity == 0))
> +               return -EINVAL;
> +
>         if (mc->nobjs >= min)
>                 return 0;
> -       while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
> +       while (mc->nobjs < mc->capacity) {
>                 obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
>                 if (!obj)
>                         return mc->nobjs >= min ? 0 : -ENOMEM;
> --
> 2.35.1.1094.g7c7d902a7c-goog
>


* Re: [PATCH v3 20/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
@ 2022-04-21 16:19     ` Ben Gardon
  0 siblings, 0 replies; 88+ messages in thread
From: Ben Gardon @ 2022-04-21 16:19 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Peter Xu,
	Maciej S . Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Apr 1, 2022 at 10:56 AM David Matlack <dmatlack@google.com> wrote:
>
> Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
> declaration time rather than being fixed for all declarations. This will
> be used in a follow-up commit to declare a cache in x86 with a capacity
> of 512+ objects without having to increase the capacity of all caches in
> KVM.
>
> This change requires each cache to specify its capacity at runtime,
> since the cache struct itself no longer has a fixed capacity known at
> compile time. To protect against someone accidentally defining a
> kvm_mmu_memory_cache struct directly (without the extra storage), this
> commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().
>
> This change, unfortunately, adds some grottiness to
> kvm_phys_addr_ioremap() in arm64, which uses a function-local (i.e.
> stack-allocated) kvm_mmu_memory_cache struct. Since C does not allow
> anonymous structs in functions, the new wrapper struct that contains
> kvm_mmu_memory_cache and the objects pointer array must be named, which
> means dealing with an outer and inner struct. The outer struct can't be
> dropped since then there would be no guarantee the kvm_mmu_memory_cache
> struct and objects array would be laid out consecutively on the stack.

What do you think about lazily allocating the array to hold the objs
for the cache when we go to top up the cache?
We already have to do error handling there and the allocation for the
array to hold pointers for the cache objs will be small relative to
filling up the cache.
It seems like this introduces a lot of complexity to preserve the
static / stack allocation of that array in an increasingly dynamic
context.

>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/arm64/include/asm/kvm_host.h |  2 +-
>  arch/arm64/kvm/arm.c              |  1 +
>  arch/arm64/kvm/mmu.c              | 13 +++++++++----
>  arch/mips/include/asm/kvm_host.h  |  2 +-
>  arch/mips/kvm/mips.c              |  2 ++
>  arch/riscv/include/asm/kvm_host.h |  2 +-
>  arch/riscv/kvm/mmu.c              | 17 ++++++++++-------
>  arch/riscv/kvm/vcpu.c             |  1 +
>  arch/x86/include/asm/kvm_host.h   |  8 ++++----
>  arch/x86/kvm/mmu/mmu.c            |  9 +++++++++
>  include/linux/kvm_types.h         | 19 +++++++++++++++++--
>  virt/kvm/kvm_main.c               | 10 +++++++++-
>  12 files changed, 65 insertions(+), 21 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 0e96087885fe..4670491899de 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -362,7 +362,7 @@ struct kvm_vcpu_arch {
>         bool pause;
>
>         /* Cache some mmu pages needed inside spinlock regions */
> -       struct kvm_mmu_memory_cache mmu_page_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
>         /* Target CPU and feature flags */
>         int target;
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index ba9165e84396..af4d8a490af5 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -320,6 +320,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>         vcpu->arch.target = -1;
>         bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
>
> +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>
>         /* Set up the timer */
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 0d19259454d8..01e15bcb7be2 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -764,7 +764,12 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>  {
>         phys_addr_t addr;
>         int ret = 0;
> -       struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> +               .cache = {
> +                       .gfp_zero = __GFP_ZERO,
> +                       .capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> +               },
> +       };
>         struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
>         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
>                                      KVM_PGTABLE_PROT_R |
> @@ -777,14 +782,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>         guest_ipa &= PAGE_MASK;
>
>         for (addr = guest_ipa; addr < guest_ipa + size; addr += PAGE_SIZE) {
> -               ret = kvm_mmu_topup_memory_cache(&cache,
> +               ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
>                                                  kvm_mmu_cache_min_pages(kvm));
>                 if (ret)
>                         break;
>
>                 write_lock(&kvm->mmu_lock);
>                 ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
> -                                            &cache);
> +                                            &page_cache.cache);
>                 write_unlock(&kvm->mmu_lock);
>                 if (ret)
>                         break;
> @@ -792,7 +797,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>                 pa += PAGE_SIZE;
>         }
>
> -       kvm_mmu_free_memory_cache(&cache);
> +       kvm_mmu_free_memory_cache(&page_cache.cache);
>         return ret;
>  }
>
> diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
> index 717716cc51c5..935511d7fc3a 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -347,7 +347,7 @@ struct kvm_vcpu_arch {
>         unsigned long pending_exceptions_clr;
>
>         /* Cache some mmu pages needed inside spinlock regions */
> -       struct kvm_mmu_memory_cache mmu_page_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
>         /* vcpu's vzguestid is different on each host cpu in an smp system */
>         u32 vzguestid[NR_CPUS];
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index a25e0b73ee70..45c7179144dc 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -387,6 +387,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>         if (err)
>                 goto out_free_gebase;
>
> +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> +
>         return 0;
>
>  out_free_gebase:
> diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> index 78da839657e5..4ec0b7a3d515 100644
> --- a/arch/riscv/include/asm/kvm_host.h
> +++ b/arch/riscv/include/asm/kvm_host.h
> @@ -186,7 +186,7 @@ struct kvm_vcpu_arch {
>         struct kvm_sbi_context sbi_context;
>
>         /* Cache pages needed to program page tables with spinlock held */
> -       struct kvm_mmu_memory_cache mmu_page_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
>         /* VCPU power-off state */
>         bool power_off;
> diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> index f80a34fbf102..5ffd164a5aeb 100644
> --- a/arch/riscv/kvm/mmu.c
> +++ b/arch/riscv/kvm/mmu.c
> @@ -347,10 +347,12 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
>         int ret = 0;
>         unsigned long pfn;
>         phys_addr_t addr, end;
> -       struct kvm_mmu_memory_cache pcache;
> -
> -       memset(&pcache, 0, sizeof(pcache));
> -       pcache.gfp_zero = __GFP_ZERO;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> +               .cache = {
> +                       .gfp_zero = __GFP_ZERO,
> +                       .capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> +               },
> +       };
>
>         end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
>         pfn = __phys_to_pfn(hpa);
> @@ -361,12 +363,13 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
>                 if (!writable)
>                         pte = pte_wrprotect(pte);
>
> -               ret = kvm_mmu_topup_memory_cache(&pcache, stage2_pgd_levels);
> +               ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
> +                                                stage2_pgd_levels);
>                 if (ret)
>                         goto out;
>
>                 spin_lock(&kvm->mmu_lock);
> -               ret = stage2_set_pte(kvm, 0, &pcache, addr, &pte);
> +               ret = stage2_set_pte(kvm, 0, &page_cache.cache, addr, &pte);
>                 spin_unlock(&kvm->mmu_lock);
>                 if (ret)
>                         goto out;
> @@ -375,7 +378,7 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
>         }
>
>  out:
> -       kvm_mmu_free_memory_cache(&pcache);
> +       kvm_mmu_free_memory_cache(&page_cache.cache);
>         return ret;
>  }
>
> diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> index 624166004e36..6a5f5aa45bac 100644
> --- a/arch/riscv/kvm/vcpu.c
> +++ b/arch/riscv/kvm/vcpu.c
> @@ -94,6 +94,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>
>         /* Mark this VCPU never ran */
>         vcpu->arch.ran_atleast_once = false;
> +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>
>         /* Setup ISA features available to VCPU */
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index be4349c9ffea..ffb2b99f3a60 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -694,10 +694,10 @@ struct kvm_vcpu_arch {
>          */
>         struct kvm_mmu *walk_mmu;
>
> -       struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> -       struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> -       struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> -       struct kvm_mmu_memory_cache mmu_page_header_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_pte_list_desc_cache);
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadow_page_cache);
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadowed_info_cache);
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_header_cache);
>
>         /*
>          * QEMU userspace and the guest each have their own FPU state.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f058f28909ea..a8200b3f8782 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5800,12 +5800,21 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
>  {
>         int ret;
>
> +       vcpu->arch.mmu_pte_list_desc_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
>         vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
>
> +       vcpu->arch.mmu_page_header_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
>         vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> +       vcpu->arch.mmu_shadowed_info_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> +
> +       vcpu->arch.mmu_shadow_page_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
>         vcpu->arch.mmu = &vcpu->arch.root_mmu;
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index ac1ebb37a0ff..579cf39986ec 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -83,14 +83,29 @@ struct gfn_to_pfn_cache {
>   * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
>   * holding MMU locks.  Note, these caches act more like prefetch buffers than
>   * classical caches, i.e. objects are not returned to the cache on being freed.
> + *
> + * The storage for the cache object pointers is laid out after the struct, to
> + * allow different declarations to choose different capacities. The capacity
> + * field defines the number of object pointers available after the struct.
>   */
>  struct kvm_mmu_memory_cache {
>         int nobjs;
> +       int capacity;
>         gfp_t gfp_zero;
>         struct kmem_cache *kmem_cache;
> -       void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
> +       void *objects[];
>  };
> -#endif
> +
> +#define __DEFINE_KVM_MMU_MEMORY_CACHE(_name, _capacity)                \
> +       struct {                                                \
> +               struct kvm_mmu_memory_cache _name;              \
> +               void *_name##_objects[_capacity];               \
> +       }
> +
> +#define DEFINE_KVM_MMU_MEMORY_CACHE(_name) \
> +       __DEFINE_KVM_MMU_MEMORY_CACHE(_name, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE)
> +
> +#endif /* KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE */
>
>  #define HALT_POLL_HIST_COUNT                   32
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 70e05af5ebea..c4cac4195f4a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -373,9 +373,17 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
>  {
>         void *obj;
>
> +       /*
> > +        * The capacity field must be initialized since the storage for the
> +        * objects pointer array is laid out after the kvm_mmu_memory_cache
> +        * struct and not known at compile time.
> +        */
> +       if (WARN_ON(mc->capacity == 0))
> +               return -EINVAL;
> +
>         if (mc->nobjs >= min)
>                 return 0;
> -       while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
> +       while (mc->nobjs < mc->capacity) {
>                 obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
>                 if (!obj)
>                         return mc->nobjs >= min ? 0 : -ENOMEM;
> --
> 2.35.1.1094.g7c7d902a7c-goog
>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v3 20/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-04-21 16:19     ` Ben Gardon
@ 2022-04-21 16:33       ` David Matlack
  -1 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-21 16:33 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Peter Xu,
	Maciej S . Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Thu, Apr 21, 2022 at 9:19 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Fri, Apr 1, 2022 at 10:56 AM David Matlack <dmatlack@google.com> wrote:
> >
> > Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
> > declaration time rather than being fixed for all declarations. This will
> > be used in a follow-up commit to declare a cache in x86 with a capacity
> > of 512+ objects without having to increase the capacity of all caches in
> > KVM.
> >
> > This change requires each cache to specify its capacity at runtime,
> > since the cache struct itself no longer has a fixed capacity known at
> > compile time. To protect against someone accidentally defining a
> > kvm_mmu_memory_cache struct directly (without the extra storage), this
> > commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().
> >
> > This change, unfortunately, adds some grottiness to
> > kvm_phys_addr_ioremap() in arm64, which uses a function-local (i.e.
> > stack-allocated) kvm_mmu_memory_cache struct. Since C does not allow
> > anonymous structs in functions, the new wrapper struct that contains
> > kvm_mmu_memory_cache and the objects pointer array must be named, which
> > means dealing with an outer and inner struct. The outer struct can't be
> > dropped since then there would be no guarantee the kvm_mmu_memory_cache
> > struct and objects array would be laid out consecutively on the stack.
>
> What do you think about lazily allocating the array to hold the objs
> for the cache when we go to top up the cache?
> We already have to do error handling there and the allocation for the
> array to hold pointers for the cache objs will be small relative to
> filling up the cache.
> It seems like this introduces a lot of complexity to preserve the
> static / stack allocation of that array in an increasingly dynamic
> context.

That's a really interesting idea and would eliminate the need for the
outer struct. I'll play around with that in v4. Thanks!

>
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_host.h |  2 +-
> >  arch/arm64/kvm/arm.c              |  1 +
> >  arch/arm64/kvm/mmu.c              | 13 +++++++++----
> >  arch/mips/include/asm/kvm_host.h  |  2 +-
> >  arch/mips/kvm/mips.c              |  2 ++
> >  arch/riscv/include/asm/kvm_host.h |  2 +-
> >  arch/riscv/kvm/mmu.c              | 17 ++++++++++-------
> >  arch/riscv/kvm/vcpu.c             |  1 +
> >  arch/x86/include/asm/kvm_host.h   |  8 ++++----
> >  arch/x86/kvm/mmu/mmu.c            |  9 +++++++++
> >  include/linux/kvm_types.h         | 19 +++++++++++++++++--
> >  virt/kvm/kvm_main.c               | 10 +++++++++-
> >  12 files changed, 65 insertions(+), 21 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 0e96087885fe..4670491899de 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -362,7 +362,7 @@ struct kvm_vcpu_arch {
> >         bool pause;
> >
> >         /* Cache some mmu pages needed inside spinlock regions */
> > -       struct kvm_mmu_memory_cache mmu_page_cache;
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
> >
> >         /* Target CPU and feature flags */
> >         int target;
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index ba9165e84396..af4d8a490af5 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -320,6 +320,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >         vcpu->arch.target = -1;
> >         bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
> >
> > +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> >
> >         /* Set up the timer */
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 0d19259454d8..01e15bcb7be2 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -764,7 +764,12 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> >  {
> >         phys_addr_t addr;
> >         int ret = 0;
> > -       struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> > +               .cache = {
> > +                       .gfp_zero = __GFP_ZERO,
> > +                       .capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> > +               },
> > +       };
> >         struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> >         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> >                                      KVM_PGTABLE_PROT_R |
> > @@ -777,14 +782,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> >         guest_ipa &= PAGE_MASK;
> >
> >         for (addr = guest_ipa; addr < guest_ipa + size; addr += PAGE_SIZE) {
> > -               ret = kvm_mmu_topup_memory_cache(&cache,
> > +               ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
> >                                                  kvm_mmu_cache_min_pages(kvm));
> >                 if (ret)
> >                         break;
> >
> >                 write_lock(&kvm->mmu_lock);
> >                 ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
> > -                                            &cache);
> > +                                            &page_cache.cache);
> >                 write_unlock(&kvm->mmu_lock);
> >                 if (ret)
> >                         break;
> > @@ -792,7 +797,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> >                 pa += PAGE_SIZE;
> >         }
> >
> > -       kvm_mmu_free_memory_cache(&cache);
> > +       kvm_mmu_free_memory_cache(&page_cache.cache);
> >         return ret;
> >  }
> >
> > diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
> > index 717716cc51c5..935511d7fc3a 100644
> > --- a/arch/mips/include/asm/kvm_host.h
> > +++ b/arch/mips/include/asm/kvm_host.h
> > @@ -347,7 +347,7 @@ struct kvm_vcpu_arch {
> >         unsigned long pending_exceptions_clr;
> >
> >         /* Cache some mmu pages needed inside spinlock regions */
> > -       struct kvm_mmu_memory_cache mmu_page_cache;
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
> >
> >         /* vcpu's vzguestid is different on each host cpu in an smp system */
> >         u32 vzguestid[NR_CPUS];
> > diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> > index a25e0b73ee70..45c7179144dc 100644
> > --- a/arch/mips/kvm/mips.c
> > +++ b/arch/mips/kvm/mips.c
> > @@ -387,6 +387,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >         if (err)
> >                 goto out_free_gebase;
> >
> > +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> > +
> >         return 0;
> >
> >  out_free_gebase:
> > diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> > index 78da839657e5..4ec0b7a3d515 100644
> > --- a/arch/riscv/include/asm/kvm_host.h
> > +++ b/arch/riscv/include/asm/kvm_host.h
> > @@ -186,7 +186,7 @@ struct kvm_vcpu_arch {
> >         struct kvm_sbi_context sbi_context;
> >
> >         /* Cache pages needed to program page tables with spinlock held */
> > -       struct kvm_mmu_memory_cache mmu_page_cache;
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
> >
> >         /* VCPU power-off state */
> >         bool power_off;
> > diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> > index f80a34fbf102..5ffd164a5aeb 100644
> > --- a/arch/riscv/kvm/mmu.c
> > +++ b/arch/riscv/kvm/mmu.c
> > @@ -347,10 +347,12 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
> >         int ret = 0;
> >         unsigned long pfn;
> >         phys_addr_t addr, end;
> > -       struct kvm_mmu_memory_cache pcache;
> > -
> > -       memset(&pcache, 0, sizeof(pcache));
> > -       pcache.gfp_zero = __GFP_ZERO;
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> > +               .cache = {
> > +                       .gfp_zero = __GFP_ZERO,
> > +                       .capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> > +               },
> > +       };
> >
> >         end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
> >         pfn = __phys_to_pfn(hpa);
> > @@ -361,12 +363,13 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
> >                 if (!writable)
> >                         pte = pte_wrprotect(pte);
> >
> > -               ret = kvm_mmu_topup_memory_cache(&pcache, stage2_pgd_levels);
> > +               ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
> > +                                                stage2_pgd_levels);
> >                 if (ret)
> >                         goto out;
> >
> >                 spin_lock(&kvm->mmu_lock);
> > -               ret = stage2_set_pte(kvm, 0, &pcache, addr, &pte);
> > +               ret = stage2_set_pte(kvm, 0, &page_cache.cache, addr, &pte);
> >                 spin_unlock(&kvm->mmu_lock);
> >                 if (ret)
> >                         goto out;
> > @@ -375,7 +378,7 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
> >         }
> >
> >  out:
> > -       kvm_mmu_free_memory_cache(&pcache);
> > +       kvm_mmu_free_memory_cache(&page_cache.cache);
> >         return ret;
> >  }
> >
> > diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> > index 624166004e36..6a5f5aa45bac 100644
> > --- a/arch/riscv/kvm/vcpu.c
> > +++ b/arch/riscv/kvm/vcpu.c
> > @@ -94,6 +94,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >
> >         /* Mark this VCPU never ran */
> >         vcpu->arch.ran_atleast_once = false;
> > +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> >
> >         /* Setup ISA features available to VCPU */
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index be4349c9ffea..ffb2b99f3a60 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -694,10 +694,10 @@ struct kvm_vcpu_arch {
> >          */
> >         struct kvm_mmu *walk_mmu;
> >
> > -       struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> > -       struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > -       struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> > -       struct kvm_mmu_memory_cache mmu_page_header_cache;
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_pte_list_desc_cache);
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadow_page_cache);
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadowed_info_cache);
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_header_cache);
> >
> >         /*
> >          * QEMU userspace and the guest each have their own FPU state.
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index f058f28909ea..a8200b3f8782 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5800,12 +5800,21 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> >  {
> >         int ret;
> >
> > +       vcpu->arch.mmu_pte_list_desc_cache.capacity =
> > +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> >         vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> >
> > +       vcpu->arch.mmu_page_header_cache.capacity =
> > +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> >         vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> >
> > +       vcpu->arch.mmu_shadowed_info_cache.capacity =
> > +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> > +
> > +       vcpu->arch.mmu_shadow_page_cache.capacity =
> > +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> >
> >         vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index ac1ebb37a0ff..579cf39986ec 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -83,14 +83,29 @@ struct gfn_to_pfn_cache {
> >   * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
> >   * holding MMU locks.  Note, these caches act more like prefetch buffers than
> >   * classical caches, i.e. objects are not returned to the cache on being freed.
> > + *
> > + * The storage for the cache object pointers is laid out after the struct, to
> > + * allow different declarations to choose different capacities. The capacity
> > + * field defines the number of object pointers available after the struct.
> >   */
> >  struct kvm_mmu_memory_cache {
> >         int nobjs;
> > +       int capacity;
> >         gfp_t gfp_zero;
> >         struct kmem_cache *kmem_cache;
> > -       void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
> > +       void *objects[];
> >  };
> > -#endif
> > +
> > +#define __DEFINE_KVM_MMU_MEMORY_CACHE(_name, _capacity)                \
> > +       struct {                                                \
> > +               struct kvm_mmu_memory_cache _name;              \
> > +               void *_name##_objects[_capacity];               \
> > +       }
> > +
> > +#define DEFINE_KVM_MMU_MEMORY_CACHE(_name) \
> > +       __DEFINE_KVM_MMU_MEMORY_CACHE(_name, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE)
> > +
> > +#endif /* KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE */
> >
> >  #define HALT_POLL_HIST_COUNT                   32
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 70e05af5ebea..c4cac4195f4a 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -373,9 +373,17 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> >  {
> >         void *obj;
> >
> > +       /*
> > > +        * The capacity field must be initialized since the storage for the
> > +        * objects pointer array is laid out after the kvm_mmu_memory_cache
> > +        * struct and not known at compile time.
> > +        */
> > +       if (WARN_ON(mc->capacity == 0))
> > +               return -EINVAL;
> > +
> >         if (mc->nobjs >= min)
> >                 return 0;
> > -       while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
> > +       while (mc->nobjs < mc->capacity) {
> >                 obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
> >                 if (!obj)
> >                         return mc->nobjs >= min ? 0 : -ENOMEM;
> > --
> > 2.35.1.1094.g7c7d902a7c-goog
> >

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v3 20/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
@ 2022-04-21 16:33       ` David Matlack
  0 siblings, 0 replies; 88+ messages in thread
From: David Matlack @ 2022-04-21 16:33 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Paolo Bonzini, Maciej S . Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Thu, Apr 21, 2022 at 9:19 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Fri, Apr 1, 2022 at 10:56 AM David Matlack <dmatlack@google.com> wrote:
> >
> > Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
> > declaration time rather than being fixed for all declarations. This will
> > be used in a follow-up commit to declare an cache in x86 with a capacity
> > of 512+ objects without having to increase the capacity of all caches in
> > KVM.
> >
> > This change requires each cache now specify its capacity at runtime,
> > since the cache struct itself no longer has a fixed capacity known at
> > compile time. To protect against someone accidentally defining a
> > kvm_mmu_memory_cache struct directly (without the extra storage), this
> > commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().
> >
> > This change, unfortunately, adds some grottiness to
> > kvm_phys_addr_ioremap() in arm64, which uses a function-local (i.e.
> > stack-allocated) kvm_mmu_memory_cache struct. Since C does not allow
> > anonymous structs in functions, the new wrapper struct that contains
> > kvm_mmu_memory_cache and the objects pointer array, must be named, which
> > means dealing with an outer and inner struct. The outer struct can't be
> > dropped since then there would be no guarantee the kvm_mmu_memory_cache
> > struct and objects array would be laid out consecutively on the stack.
>
> What do you think about lazily allocating the array to hold the objs
> for the cache when we go to top up the cache?
> We already have to do error handling there and the allocation for the
> array to hold pointers for the cache objs will be small relative to
> filling up the cache.
> It seems like this introduces a lot of complexity to preserve the
> static / stack allocation of that array in an increasingly dynamic
> context.

That's a really interesting idea and would eliminate the need for the
outer struct. I'll play around with that in v4. Thanks!
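
Roughly what I understand the suggestion to be, against a simplified
stand-in for the cache struct (a sketch only, with hypothetical names,
not the real KVM code):

```c
#include <stdlib.h>

/* Simplified stand-in: objects becomes a plain pointer, allocated on
 * first topup instead of being carved out of storage after the struct. */
struct cache {
	int nobjs;
	int capacity;
	void **objects;
};

/* Stand-in for mmu_memory_cache_alloc_obj(). */
static void *alloc_obj(void)
{
	return calloc(1, 64);
}

/* Lazy variant of kvm_mmu_topup_memory_cache(): the pointer array is
 * allocated here, where error handling already exists, and it is small
 * relative to the objects that will fill the cache. */
int cache_topup(struct cache *mc, int min)
{
	if (!mc->objects) {
		mc->objects = calloc(mc->capacity, sizeof(void *));
		if (!mc->objects)
			return -1;	/* -ENOMEM in the real code */
	}
	if (mc->nobjs >= min)
		return 0;
	while (mc->nobjs < mc->capacity) {
		void *obj = alloc_obj();

		if (!obj)
			return mc->nobjs >= min ? 0 : -1;
		mc->objects[mc->nobjs++] = obj;
	}
	return 0;
}
```

This would keep kvm_mmu_memory_cache a plain struct everywhere, at the
cost of one extra small allocation on the first topup.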

>
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_host.h |  2 +-
> >  arch/arm64/kvm/arm.c              |  1 +
> >  arch/arm64/kvm/mmu.c              | 13 +++++++++----
> >  arch/mips/include/asm/kvm_host.h  |  2 +-
> >  arch/mips/kvm/mips.c              |  2 ++
> >  arch/riscv/include/asm/kvm_host.h |  2 +-
> >  arch/riscv/kvm/mmu.c              | 17 ++++++++++-------
> >  arch/riscv/kvm/vcpu.c             |  1 +
> >  arch/x86/include/asm/kvm_host.h   |  8 ++++----
> >  arch/x86/kvm/mmu/mmu.c            |  9 +++++++++
> >  include/linux/kvm_types.h         | 19 +++++++++++++++++--
> >  virt/kvm/kvm_main.c               | 10 +++++++++-
> >  12 files changed, 65 insertions(+), 21 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 0e96087885fe..4670491899de 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -362,7 +362,7 @@ struct kvm_vcpu_arch {
> >         bool pause;
> >
> >         /* Cache some mmu pages needed inside spinlock regions */
> > -       struct kvm_mmu_memory_cache mmu_page_cache;
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
> >
> >         /* Target CPU and feature flags */
> >         int target;
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index ba9165e84396..af4d8a490af5 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -320,6 +320,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >         vcpu->arch.target = -1;
> >         bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
> >
> > +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> >
> >         /* Set up the timer */
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 0d19259454d8..01e15bcb7be2 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -764,7 +764,12 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> >  {
> >         phys_addr_t addr;
> >         int ret = 0;
> > -       struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> > +               .cache = {
> > +                       .gfp_zero = __GFP_ZERO,
> > +                       .capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> > +               },
> > +       };
> >         struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> >         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> >                                      KVM_PGTABLE_PROT_R |
> > @@ -777,14 +782,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> >         guest_ipa &= PAGE_MASK;
> >
> >         for (addr = guest_ipa; addr < guest_ipa + size; addr += PAGE_SIZE) {
> > -               ret = kvm_mmu_topup_memory_cache(&cache,
> > +               ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
> >                                                  kvm_mmu_cache_min_pages(kvm));
> >                 if (ret)
> >                         break;
> >
> >                 write_lock(&kvm->mmu_lock);
> >                 ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
> > -                                            &cache);
> > +                                            &page_cache.cache);
> >                 write_unlock(&kvm->mmu_lock);
> >                 if (ret)
> >                         break;
> > @@ -792,7 +797,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> >                 pa += PAGE_SIZE;
> >         }
> >
> > -       kvm_mmu_free_memory_cache(&cache);
> > +       kvm_mmu_free_memory_cache(&page_cache.cache);
> >         return ret;
> >  }
> >
> > diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
> > index 717716cc51c5..935511d7fc3a 100644
> > --- a/arch/mips/include/asm/kvm_host.h
> > +++ b/arch/mips/include/asm/kvm_host.h
> > @@ -347,7 +347,7 @@ struct kvm_vcpu_arch {
> >         unsigned long pending_exceptions_clr;
> >
> >         /* Cache some mmu pages needed inside spinlock regions */
> > -       struct kvm_mmu_memory_cache mmu_page_cache;
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
> >
> >         /* vcpu's vzguestid is different on each host cpu in an smp system */
> >         u32 vzguestid[NR_CPUS];
> > diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> > index a25e0b73ee70..45c7179144dc 100644
> > --- a/arch/mips/kvm/mips.c
> > +++ b/arch/mips/kvm/mips.c
> > @@ -387,6 +387,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >         if (err)
> >                 goto out_free_gebase;
> >
> > +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> > +
> >         return 0;
> >
> >  out_free_gebase:
> > diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> > index 78da839657e5..4ec0b7a3d515 100644
> > --- a/arch/riscv/include/asm/kvm_host.h
> > +++ b/arch/riscv/include/asm/kvm_host.h
> > @@ -186,7 +186,7 @@ struct kvm_vcpu_arch {
> >         struct kvm_sbi_context sbi_context;
> >
> >         /* Cache pages needed to program page tables with spinlock held */
> > -       struct kvm_mmu_memory_cache mmu_page_cache;
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
> >
> >         /* VCPU power-off state */
> >         bool power_off;
> > diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> > index f80a34fbf102..5ffd164a5aeb 100644
> > --- a/arch/riscv/kvm/mmu.c
> > +++ b/arch/riscv/kvm/mmu.c
> > @@ -347,10 +347,12 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
> >         int ret = 0;
> >         unsigned long pfn;
> >         phys_addr_t addr, end;
> > -       struct kvm_mmu_memory_cache pcache;
> > -
> > -       memset(&pcache, 0, sizeof(pcache));
> > -       pcache.gfp_zero = __GFP_ZERO;
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> > +               .cache = {
> > +                       .gfp_zero = __GFP_ZERO,
> > +                       .capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> > +               },
> > +       };
> >
> >         end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
> >         pfn = __phys_to_pfn(hpa);
> > @@ -361,12 +363,13 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
> >                 if (!writable)
> >                         pte = pte_wrprotect(pte);
> >
> > -               ret = kvm_mmu_topup_memory_cache(&pcache, stage2_pgd_levels);
> > +               ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
> > +                                                stage2_pgd_levels);
> >                 if (ret)
> >                         goto out;
> >
> >                 spin_lock(&kvm->mmu_lock);
> > -               ret = stage2_set_pte(kvm, 0, &pcache, addr, &pte);
> > +               ret = stage2_set_pte(kvm, 0, &page_cache.cache, addr, &pte);
> >                 spin_unlock(&kvm->mmu_lock);
> >                 if (ret)
> >                         goto out;
> > @@ -375,7 +378,7 @@ static int stage2_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
> >         }
> >
> >  out:
> > -       kvm_mmu_free_memory_cache(&pcache);
> > +       kvm_mmu_free_memory_cache(&page_cache.cache);
> >         return ret;
> >  }
> >
> > diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> > index 624166004e36..6a5f5aa45bac 100644
> > --- a/arch/riscv/kvm/vcpu.c
> > +++ b/arch/riscv/kvm/vcpu.c
> > @@ -94,6 +94,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >
> >         /* Mark this VCPU never ran */
> >         vcpu->arch.ran_atleast_once = false;
> > +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> >
> >         /* Setup ISA features available to VCPU */
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index be4349c9ffea..ffb2b99f3a60 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -694,10 +694,10 @@ struct kvm_vcpu_arch {
> >          */
> >         struct kvm_mmu *walk_mmu;
> >
> > -       struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> > -       struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > -       struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> > -       struct kvm_mmu_memory_cache mmu_page_header_cache;
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_pte_list_desc_cache);
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadow_page_cache);
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadowed_info_cache);
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_header_cache);
> >
> >         /*
> >          * QEMU userspace and the guest each have their own FPU state.
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index f058f28909ea..a8200b3f8782 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5800,12 +5800,21 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> >  {
> >         int ret;
> >
> > +       vcpu->arch.mmu_pte_list_desc_cache.capacity =
> > +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> >         vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> >
> > +       vcpu->arch.mmu_page_header_cache.capacity =
> > +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> >         vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> >
> > +       vcpu->arch.mmu_shadowed_info_cache.capacity =
> > +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> > +
> > +       vcpu->arch.mmu_shadow_page_cache.capacity =
> > +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> >
> >         vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index ac1ebb37a0ff..579cf39986ec 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -83,14 +83,29 @@ struct gfn_to_pfn_cache {
> >   * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
> >   * holding MMU locks.  Note, these caches act more like prefetch buffers than
> >   * classical caches, i.e. objects are not returned to the cache on being freed.
> > + *
> > + * The storage for the cache object pointers is laid out after the struct, to
> > + * allow different declarations to choose different capacities. The capacity
> > + * field defines the number of object pointers available after the struct.
> >   */
> >  struct kvm_mmu_memory_cache {
> >         int nobjs;
> > +       int capacity;
> >         gfp_t gfp_zero;
> >         struct kmem_cache *kmem_cache;
> > -       void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
> > +       void *objects[];
> >  };
> > -#endif
> > +
> > +#define __DEFINE_KVM_MMU_MEMORY_CACHE(_name, _capacity)                \
> > +       struct {                                                \
> > +               struct kvm_mmu_memory_cache _name;              \
> > +               void *_name##_objects[_capacity];               \
> > +       }
> > +
> > +#define DEFINE_KVM_MMU_MEMORY_CACHE(_name) \
> > +       __DEFINE_KVM_MMU_MEMORY_CACHE(_name, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE)
> > +
> > +#endif /* KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE */
> >
> >  #define HALT_POLL_HIST_COUNT                   32
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 70e05af5ebea..c4cac4195f4a 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -373,9 +373,17 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> >  {
> >         void *obj;
> >
> > +       /*
> > +        * The capacity field must be initialized since the storage for the
> > +        * objects pointer array is laid out after the kvm_mmu_memory_cache
> > +        * struct and not known at compile time.
> > +        */
> > +       if (WARN_ON(mc->capacity == 0))
> > +               return -EINVAL;
> > +
> >         if (mc->nobjs >= min)
> >                 return 0;
> > -       while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
> > +       while (mc->nobjs < mc->capacity) {
> >                 obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
> >                 if (!obj)
> >                         return mc->nobjs >= min ? 0 : -ENOMEM;
> > --
> > 2.35.1.1094.g7c7d902a7c-goog
> >
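As a sanity check on the layout argument in the commit message above,
the flexible-array-plus-wrapper pattern can be exercised in isolation.
A minimal sketch (the names below are simplified stand-ins, not the
real KVM structs):

```c
#include <stddef.h>

/* Simplified stand-in for struct kvm_mmu_memory_cache. */
struct cache {
	int nobjs;
	int capacity;
	void *objects[];	/* flexible array member, no storage of its own */
};

/* Mirrors __DEFINE_KVM_MMU_MEMORY_CACHE(): the named outer struct is what
 * guarantees the pointer array is laid out immediately after the inner
 * struct, so indexing cache.objects[] lands in the wrapper's storage. */
#define DEFINE_CACHE(_name, _capacity)			\
	struct {					\
		struct cache _name;			\
		void *_name##_objects[_capacity];	\
	}

/* Returns 1 if the flexible array member aliases the wrapper's array. */
int cache_layout_ok(void)
{
	static DEFINE_CACHE(c, 4) page_cache = {
		.c = { .capacity = 4 },
	};

	return (void *)page_cache.c.objects ==
	       (void *)page_cache.c_objects;
}
```

On common ABIs the two addresses coincide, which is what lets
kvm_mmu_topup_memory_cache() fill mc->objects[] up to mc->capacity
without the struct itself carrying any storage.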
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

Thread overview: 88+ messages

2022-04-01 17:55 [PATCH v3 00/23] KVM: Extend Eager Page Splitting to the shadow MMU David Matlack
2022-04-01 17:55 ` [PATCH v3 01/23] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs David Matlack
2022-04-01 17:55 ` [PATCH v3 02/23] KVM: x86/mmu: Use a bool for direct David Matlack
2022-04-08 22:24   ` Sean Christopherson
2022-04-01 17:55 ` [PATCH v3 03/23] KVM: x86/mmu: Derive shadow MMU page role from parent David Matlack
2022-04-01 17:55 ` [PATCH v3 04/23] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions David Matlack
2022-04-01 17:55 ` [PATCH v3 05/23] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages David Matlack
2022-04-01 17:55 ` [PATCH v3 06/23] KVM: x86/mmu: Pass memslot to kvm_mmu_new_shadow_page() David Matlack
2022-04-01 17:55 ` [PATCH v3 07/23] KVM: x86/mmu: Separate shadow MMU sp allocation from initialization David Matlack
2022-04-01 17:55 ` [PATCH v3 08/23] KVM: x86/mmu: Link spt to sp during allocation David Matlack
2022-04-01 17:55 ` [PATCH v3 09/23] KVM: x86/mmu: Move huge page split sp allocation code to mmu.c David Matlack
2022-04-01 17:55 ` [PATCH v3 10/23] KVM: x86/mmu: Use common code to free kvm_mmu_page structs David Matlack
2022-04-01 17:55 ` [PATCH v3 11/23] KVM: x86/mmu: Use common code to allocate shadow pages from vCPU caches David Matlack
2022-04-01 17:55 ` [PATCH v3 12/23] KVM: x86/mmu: Pass const memslot to rmap_add() David Matlack
2022-04-01 17:55 ` [PATCH v3 13/23] KVM: x86/mmu: Pass const memslot to init_shadow_page() and descendants David Matlack
2022-04-01 17:55 ` [PATCH v3 14/23] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu David Matlack
2022-04-01 17:55 ` [PATCH v3 15/23] KVM: x86/mmu: Update page stats in __rmap_add() David Matlack
2022-04-01 17:55 ` [PATCH v3 16/23] KVM: x86/mmu: Cache the access bits of shadowed translations David Matlack
2022-04-02  6:19   ` kernel test robot
2022-04-02  7:01   ` kernel test robot
2022-04-09  0:02   ` Sean Christopherson
2022-04-14 16:47     ` David Matlack
2022-04-01 17:55 ` [PATCH v3 17/23] KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU David Matlack
2022-04-01 17:55 ` [PATCH v3 18/23] KVM: x86/mmu: Zap collapsible SPTEs at all levels in " David Matlack
2022-04-01 17:55 ` [PATCH v3 19/23] KVM: x86/mmu: Refactor drop_large_spte() David Matlack
2022-04-01 17:55 ` [PATCH v3 20/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs David Matlack
2022-04-20 10:55   ` Anup Patel
2022-04-21 16:19   ` Ben Gardon
2022-04-21 16:33     ` David Matlack
2022-04-01 17:55 ` [PATCH v3 21/23] KVM: Allow GFP flags to be passed when topping up MMU caches David Matlack
2022-04-01 17:55 ` [PATCH v3 22/23] KVM: x86/mmu: Support Eager Page Splitting in the shadow MMU David Matlack
2022-04-09  0:39   ` Sean Christopherson
2022-04-14 16:50     ` David Matlack
2022-04-01 17:55 ` [PATCH v3 23/23] KVM: selftests: Map x86_64 guest virtual memory with huge pages David Matlack
2022-04-11 17:12 ` [PATCH v3 00/23] KVM: Extend Eager Page Splitting to the shadow MMU Sean Christopherson
2022-04-11 17:54   ` David Matlack
2022-04-11 20:12     ` Sean Christopherson
2022-04-11 23:41       ` David Matlack
2022-04-12  0:39         ` Sean Christopherson
2022-04-12 16:49           ` David Matlack
2022-04-13  1:02             ` Sean Christopherson
2022-04-13 17:57               ` David Matlack
2022-04-13 18:28                 ` Sean Christopherson
2022-04-13 21:22                   ` David Matlack