* [PATCH v2 00/26] Extend Eager Page Splitting to the shadow MMU
@ 2022-03-11  0:25 ` David Matlack
  0 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

[ NOTE: I would like to do more testing on this series than what is
  described in the Testing section below, but I will be out-of-office all
  next week so I wanted to share what I have so far. I fully expect a v3
  of this series anyway after collecting more reviews from Sean. ]

This series extends KVM's Eager Page Splitting to also split huge pages
mapped by the shadow MMU, i.e. huge pages present in the memslot rmaps.
This will be useful for configurations that use Nested Virtualization,
disable the TDP MMU, or disable/lack TDP hardware support.

For background on Eager Page Splitting, see:
 - Proposal: https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/
 - TDP MMU support: https://lore.kernel.org/kvm/20220119230739.2234394-1-dmatlack@google.com/

Splitting huge pages mapped by the shadow MMU is more complicated than
it is for the TDP MMU, but it is also more important for performance, as
the shadow MMU handles huge page write-protection faults under the write
lock. See the Performance section for more details.

The extra complexity of splitting huge pages mapped by the shadow MMU
comes from a few places:

(1) The shadow MMU has a limit on the number of shadow pages that are
    allowed to be allocated. So, as a policy, Eager Page Splitting
    refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
    pages available.

(2) Huge pages may be mapped by indirect shadow pages.

    - Indirect shadow pages may be unsync. As a policy, we opt not to
      split such pages because their translation may no longer be valid
      (see the sketch after this list).
    - Huge pages on indirect shadow pages may have access permission
      constraints from the guest (unlike the TDP MMU which is ACC_ALL
      by default).

(3) Splitting a huge page may end up re-using an existing lower level
    shadow page table. This is unlike the TDP MMU, which always allocates
    new shadow page tables when splitting.

(4) When installing the lower level SPTEs, they must be added to the
    rmap which may require allocating additional pte_list_desc structs.
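
For illustration, the policy checks in (1) and (2) boil down to roughly
the sketch below. This is not code from the series: the helper is made
up, and only kvm_mmu_available_pages(), KVM_MIN_FREE_MMU_PAGES and
sp->unsync are existing mmu.c symbols.

/* Illustrative only: "sp" is the shadow page containing the huge SPTE. */
static bool eager_split_allowed(struct kvm *kvm, struct kvm_mmu_page *sp)
{
	/* (1) Leave headroom under the shadow page allocation limit. */
	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
		return false;

	/* (2) Skip unsync shadow pages; their translation may be stale. */
	if (sp->unsync)
		return false;

	return true;
}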

In Google's internal implementation of Eager Page Splitting, we do not
handle cases (3) and (4), and instead opt to skip splitting entirely
(case 3) or to only partially split (case 4). This series handles the
additional cases (patches 21-25), which comes with some additional
complexity and an additional 4KiB of memory per VM to store the extra
pte_list_desc cache. However, it also avoids the need for TLB flushes
in most cases.

About half of this series, patches 1-15, is just refactoring the
existing MMU code in preparation for splitting. The bulk of the
refactoring is to make it possible to operate on the MMU outside of a
vCPU context.

Motivation
----------

During dirty logging, VMs using the shadow MMU suffer from:

(1) Write-protection faults on huge pages that take the MMU lock to
    unmap the huge page, map a 4KiB page, and update the dirty log.

(2) Non-present faults caused by (1) that take the MMU lock to map in
    the missing page.

(3) Write-protection faults on 4KiB pages that take the MMU lock to
    make the page writable and update the dirty log. [Note: These faults
    only take the MMU lock during shadow paging.]

The lock contention from (1), (2) and (3) can severely degrade
application performance to the point of failure.  Eager page splitting
eliminates (1) by moving the splitting of huge pages off the vCPU
threads onto the thread invoking VM-ioctls to configure dirty logging,
and eliminates (2) by fully splitting each huge page into its
constituent small pages. (3) is still a concern for shadow paging
workloads (e.g. nested virtualization) but is not addressed by this
series.

Splitting in the VM-ioctl thread is useful because it can run in the
background without interrupting vCPU execution. However, it does take
the MMU lock so it may introduce some extra contention if vCPUs are
hammering the MMU lock. This is offset by the fact that eager page
splitting drops the MMU lock after splitting each SPTE if there is any
contention, and the fact that eager page splitting is reducing the MMU
lock contention from (1) and (2) above. Even workloads that only write
to 5% of their memory see massive MMU lock contention reduction during
dirty logging thanks to Eager Page Splitting (see Performance data
below).
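
For reference, the yield-on-contention check between SPTEs looks roughly
like the snippet below (a sketch built on the existing need_resched(),
rwlock_needbreak() and cond_resched_rwlock_write() helpers, not the exact
code added by this series):

/* Sketch: drop the write lock and reschedule if anyone is waiting on it. */
static void eager_split_yield_if_contended(struct kvm *kvm)
{
	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
		cond_resched_rwlock_write(&kvm->mmu_lock);
}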

A downside of Eager Page Splitting is that it splits all huge pages,
which may include ranges of memory that are never written to by the
guest and thus could theoretically stay huge. Workloads that write to
only a fraction of their memory may see higher TLB miss costs with Eager
Page Splitting enabled. However, that is secondary to the application
failure that otherwise may occur without Eager Page Splitting.

Further work is necessary to improve the TLB miss performance for
read-heavy workloads, such as dirty logging at 2M instead of 4K.

Performance
-----------

To measure the performance impact of Eager Page Splitting I ran
dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
vCPU, and backed by 1GiB HugeTLB memory. The amount of memory that was
written to versus read was controlled with the -f option.

To measure the impact on customer performance, we can look at the time
it takes all vCPUs to dirty memory after dirty logging has been enabled.
Without Eager Page Splitting enabled, such dirtying must take faults to
split huge pages and bottlenecks on the MMU lock.

             | 100% written / 0% read                      |
             | --------------------------------------------|
             | "Iteration 1 dirty memory time" (ept=Y)     |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.310786549s         | 0.058731929s         |
4            | 0.419165587s         | 0.059615316s         |
8            | 1.061233860s         | 0.060945457s         |
16           | 2.852955595s         | 0.067069980s         |
32           | 7.032750509s         | 0.078623606s         |
64           | 16.501287504s        | 0.083914116s         |


             | 5% written / 95% read                       |
             | --------------------------------------------|
             | "Iteration 1 dirty memory time" (ept=Y)     |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.325023846s         | 0.006049684s         |
4            | 0.398393318s         | 0.006275966s         |
8            | 1.242848347s         | 0.006861012s         |
16           | 2.724926895s         | 0.010056859s         |
32           | 7.134648637s         | 0.012153849s         |
64           | 16.804434189s        | 0.017575228s         |

Eager Page Splitting does increase the time it takes to enable dirty
logging when not using initially-all-set, since that's when KVM splits
huge pages. However, this runs in parallel with vCPU execution and drops
the MMU lock whenever there is contention.

             | 100% written / 0% read                      |
             | --------------------------------------------|
             | "Enabling dirty logging time" (ept=Y)       |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.001581619s         |  0.025699730s        |
4            | 0.003138664s         |  0.051510208s        |
8            | 0.006247177s         |  0.102960379s        |
16           | 0.012603892s         |  0.206949435s        |
32           | 0.026428036s         |  0.435855597s        |
64           | 0.103826796s         |  1.199686530s        |

Similarly, Eager Page Splitting increases the time it takes to clear the
dirty log when using initially-all-set. The first time userspace clears
the dirty log, KVM will split huge pages:

             | 100% written / 0% read                      |
             | --------------------------------------------|
             | "Iteration 1 clear dirty log time" (ept=Y)  |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.001544730s         | 0.055327916s         |
4            | 0.003145920s         | 0.111887354s         |
8            | 0.006306964s         | 0.223920530s         |
16           | 0.012681628s         | 0.447849488s         |
32           | 0.026827560s         | 0.943874520s         |
64           | 0.090461490s         | 2.664388025s         |
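
For context, the userspace side of this path is a plain KVM_CLEAR_DIRTY_LOG
ioctl; the wrapper below is a minimal sketch (the function and its arguments
are illustrative, not taken from dirty_log_perf_test):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * With KVM_DIRTY_LOG_INITIALLY_SET, the first clear over a range is when
 * KVM write-protects it and, with eager_page_split=Y, splits its huge pages.
 */
static int clear_dirty_log(int vm_fd, __u32 slot, __u64 first_page,
			   __u32 num_pages, void *harvested_bitmap)
{
	struct kvm_clear_dirty_log clear = {
		.slot = slot,
		.first_page = first_page,
		.num_pages = num_pages,
		.dirty_bitmap = harvested_bitmap,
	};

	return ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
}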

Subsequent calls to clear the dirty log incur almost no additional cost
since KVM can very quickly determine there are no more huge pages to
split via the RMAP. This is unlike the TDP MMU which must re-traverse
the entire page table to check for huge pages.

             | 100% written / 0% read                      |
             | --------------------------------------------|
             | "Iteration 2 clear dirty log time" (ept=Y)  |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.015613726s         | 0.015771982s         |
4            | 0.031456620s         | 0.031911594s         |
8            | 0.063341572s         | 0.063837403s         |
16           | 0.128409332s         | 0.127484064s         |
32           | 0.255635696s         | 0.268837996s         |
64           | 0.695572818s         | 0.700420727s         |
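
To illustrate why the re-check is cheap for the shadow MMU: only the rmaps
at the huge page levels need to be scanned, and those are empty once every
huge page has been split. The sketch below is illustrative only; the helper
is made up, and the gfn_to_rmap()/KVM_PAGES_PER_HPAGE() usage is assumed
from current mmu.c.

/* Illustrative only: scan just the 2M/1G rmaps of a memslot. */
static bool memslot_has_huge_sptes(const struct kvm_memory_slot *slot)
{
	gfn_t gfn;
	int level;

	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
		for (gfn = slot->base_gfn;
		     gfn < slot->base_gfn + slot->npages;
		     gfn += KVM_PAGES_PER_HPAGE(level)) {
			if (gfn_to_rmap(gfn, level, slot)->val)
				return true;
		}
	}

	return false;
}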

Eager Page Splitting also improves the performance of shadow paging
configurations, as measured with ept=N. The absolute gains are smaller
for write-heavy workloads since KVM's shadow paging takes the write lock
to track 4KiB writes (i.e. no fast_page_fault() or PML), but there are
still major gains for read/write and read-heavy workloads.

             | 100% written / 0% read                      |
             | --------------------------------------------|
             | "Iteration 1 dirty memory time" (ept=N)     |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.373022770s         | 0.348926043s         |
4            | 0.563697483s         | 0.453022037s         |
8            | 1.588492808s         | 1.524962010s         |
16           | 3.988934732s         | 3.369129917s         |
32           | 9.470333115s         | 8.292953856s         |
64           | 20.086419186s        | 18.531840021s        |


             | 50% written / 50% read                      |
             | --------------------------------------------|
             | "Iteration 1 dirty memory time" (ept=N)     |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.374301914s         | 0.174864494s         |
4            | 0.539841346s         | 0.213246828s         |
8            | 1.759793717s         | 0.526697696s         |
16           | 3.786053801s         | 1.338638169s         |
32           | 9.603927533s         | 3.869825083s         |
64           | 20.376757135s        | 9.158492731s         |


             | 5% written / 95% read                       |
             | --------------------------------------------|
             | "Iteration 1 dirty memory time"             |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.381538968s         | 0.020396121s         |
4            | 0.511922608s         | 0.022625023s         |
8            | 1.464410632s         | 0.054727913s         |
16           | 3.783471041s         | 0.133412717s         |
32           | 9.519432076s         | 0.390443803s         |
64           | 21.052654299s        | 0.929496710s         |

Testing
-------

- Ran all kvm-unit-tests and KVM selftests with all combinations of
  ept=[NY] and tdp_mmu=[NY].
- Booted a 32-bit non-PAE kernel with shadow paging to verify the
  quadrant change in patch 3.

Version Log
-----------

v2:
 - Add performance data for workloads that mix reads and writes [Peter]
 - Collect R-b tags from Ben and Sean.
 - Fix quadrant calculation when deriving role from parent [Sean]
 - Tweak new shadow page function names [Sean]
 - Move set_page_private() to allocation functions [Ben]
 - Only zap collapsible SPTEs up to MAX_LEVEL-1 [Ben]
 - Always top-up pte_list_desc cache to reduce complexity [Ben]
 - Require mmu cache capacity field to be initialized and add WARN()
   to reduce chance of programmer error [Marc]
 - Fix up kvm_mmu_memory_cache struct initialization in arm64 [Marc]

v1: https://lore.kernel.org/kvm/20220203010051.2813563-1-dmatlack@google.com/

David Matlack (26):
  KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
  KVM: x86/mmu: Use a bool for direct
  KVM: x86/mmu: Derive shadow MMU page role from parent
  KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
  KVM: x86/mmu: Pass memslot to kvm_mmu_new_shadow_page()
  KVM: x86/mmu: Separate shadow MMU sp allocation from initialization
  KVM: x86/mmu: Link spt to sp during allocation
  KVM: x86/mmu: Move huge page split sp allocation code to mmu.c
  KVM: x86/mmu: Use common code to free kvm_mmu_page structs
  KVM: x86/mmu: Use common code to allocate kvm_mmu_page structs from
    vCPU caches
  KVM: x86/mmu: Pass const memslot to rmap_add()
  KVM: x86/mmu: Pass const memslot to init_shadow_page() and descendants
  KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  KVM: x86/mmu: Update page stats in __rmap_add()
  KVM: x86/mmu: Cache the access bits of shadowed translations
  KVM: x86/mmu: Pass access information to make_huge_page_split_spte()
  KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU
  KVM: x86/mmu: Refactor drop_large_spte()
  KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU
  KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  KVM: Allow GFP flags to be passed when topping up MMU caches
  KVM: x86/mmu: Fully split huge pages that require extra pte_list_desc
    structs
  KVM: x86/mmu: Split huge pages aliased by multiple SPTEs
  KVM: x86/mmu: Drop NULL pte_list_desc_cache fallback
  KVM: selftests: Map x86_64 guest virtual memory with huge pages

 .../admin-guide/kernel-parameters.txt         |   3 -
 arch/arm64/include/asm/kvm_host.h             |   2 +-
 arch/arm64/kvm/arm.c                          |   1 +
 arch/arm64/kvm/mmu.c                          |  13 +-
 arch/mips/include/asm/kvm_host.h              |   2 +-
 arch/mips/kvm/mips.c                          |   2 +
 arch/riscv/include/asm/kvm_host.h             |   2 +-
 arch/riscv/kvm/vcpu.c                         |   1 +
 arch/x86/include/asm/kvm_host.h               |  19 +-
 arch/x86/include/asm/kvm_page_track.h         |   2 +-
 arch/x86/kvm/mmu/mmu.c                        | 744 +++++++++++++++---
 arch/x86/kvm/mmu/mmu_internal.h               |  22 +-
 arch/x86/kvm/mmu/page_track.c                 |   4 +-
 arch/x86/kvm/mmu/paging_tmpl.h                |  21 +-
 arch/x86/kvm/mmu/spte.c                       |  10 +-
 arch/x86/kvm/mmu/spte.h                       |   3 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    |  48 +-
 arch/x86/kvm/mmu/tdp_mmu.h                    |   2 +-
 include/linux/kvm_host.h                      |   1 +
 include/linux/kvm_types.h                     |  19 +-
 .../selftests/kvm/include/x86_64/processor.h  |   6 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |   4 +-
 .../selftests/kvm/lib/x86_64/processor.c      |  31 +
 virt/kvm/kvm_main.c                           |  19 +-
 24 files changed, 768 insertions(+), 213 deletions(-)


base-commit: ce41d078aaa9cf15cbbb4a42878cc6160d76525e
-- 
2.35.1.723.g4982287a31-goog



* [PATCH v2 01/26] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Commit fb58a9c345f6 ("KVM: x86/mmu: Optimize MMU page cache lookup for
fully direct MMUs") skipped the unsync checks and write flood clearing
for fully direct MMUs. We can extend this further and skip the checks for
all direct shadow pages. Direct shadow pages are never marked unsync and
never have a non-zero write-flooding count.

Checking sp->role.direct also generates better code than checking
direct_map because, due to register pressure, direct_map has to get
shoved onto the stack and then pulled back off.

No functional change intended.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3b8da8b0745e..3ad67f70e51c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2034,7 +2034,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     int direct,
 					     unsigned int access)
 {
-	bool direct_mmu = vcpu->arch.mmu->direct_map;
 	union kvm_mmu_page_role role;
 	struct hlist_head *sp_list;
 	unsigned quadrant;
@@ -2075,7 +2074,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			continue;
 		}
 
-		if (direct_mmu)
+		/* unsync and write-flooding only apply to indirect SPs. */
+		if (sp->role.direct)
 			goto trace_get_page;
 
 		if (sp->unsync) {

base-commit: ce41d078aaa9cf15cbbb4a42878cc6160d76525e
-- 
2.35.1.723.g4982287a31-goog



* [PATCH v2 02/26] KVM: x86/mmu: Use a bool for direct
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

The parameter "direct" can either be true or false, and all of the
callers pass in a bool variable or true/false literal, so just use the
type bool.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3ad67f70e51c..146df73a982e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1706,7 +1706,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 	mmu_spte_clear_no_track(parent_pte);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
+static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, bool direct)
 {
 	struct kvm_mmu_page *sp;
 
@@ -2031,7 +2031,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     gfn_t gfn,
 					     gva_t gaddr,
 					     unsigned level,
-					     int direct,
+					     bool direct,
 					     unsigned int access)
 {
 	union kvm_mmu_page_role role;
-- 
2.35.1.723.g4982287a31-goog



* [PATCH v2 03/26] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Instead of computing the shadow page role from scratch for every new
page, we can derive most of the information from the parent shadow page.
This avoids redundant calculations and reduces the number of parameters
to kvm_mmu_get_page().

Preemptively split out the role calculation to a separate function for
use in a following commit.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c         | 91 ++++++++++++++++++++++++----------
 arch/x86/kvm/mmu/paging_tmpl.h |  9 ++--
 2 files changed, 71 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 146df73a982e..23c2004c6435 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2027,30 +2027,14 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
-					     gfn_t gfn,
-					     gva_t gaddr,
-					     unsigned level,
-					     bool direct,
-					     unsigned int access)
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+					     union kvm_mmu_page_role role)
 {
-	union kvm_mmu_page_role role;
 	struct hlist_head *sp_list;
-	unsigned quadrant;
 	struct kvm_mmu_page *sp;
 	int collisions = 0;
 	LIST_HEAD(invalid_list);
 
-	role = vcpu->arch.mmu->mmu_role.base;
-	role.level = level;
-	role.direct = direct;
-	role.access = access;
-	if (role.has_4_byte_gpte) {
-		quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
-		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
-		role.quadrant = quadrant;
-	}
-
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
 		if (sp->gfn != gfn) {
@@ -2068,7 +2052,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			 * Unsync pages must not be left as is, because the new
 			 * upper-level page will be write-protected.
 			 */
-			if (level > PG_LEVEL_4K && sp->unsync)
+			if (role.level > PG_LEVEL_4K && sp->unsync)
 				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 							 &invalid_list);
 			continue;
@@ -2107,14 +2091,14 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 
 	++vcpu->kvm->stat.mmu_cache_miss;
 
-	sp = kvm_mmu_alloc_page(vcpu, direct);
+	sp = kvm_mmu_alloc_page(vcpu, role.direct);
 
 	sp->gfn = gfn;
 	sp->role = role;
 	hlist_add_head(&sp->hash_link, sp_list);
-	if (!direct) {
+	if (!role.direct) {
 		account_shadowed(vcpu->kvm, sp);
-		if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
+		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
 	}
 	trace_kvm_mmu_get_page(sp, true);
@@ -2126,6 +2110,51 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
+static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
+{
+	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
+	union kvm_mmu_page_role role;
+
+	role = parent_sp->role;
+	role.level--;
+	role.access = access;
+	role.direct = direct;
+
+	/*
+	 * If the guest has 4-byte PTEs then that means it's using 32-bit,
+	 * 2-level, non-PAE paging. KVM shadows such guests using 4 PAE page
+	 * directories, each mapping 1/4 of the guest's linear address space
+	 * (1GiB). The shadow pages for those 4 page directories are
+	 * pre-allocated and assigned a separate quadrant in their role.
+	 *
+	 * Since we are allocating a child shadow page and there are only 2
+	 * levels, this must be a PG_LEVEL_4K shadow page. Here the quadrant
+	 * will either be 0 or 1 because it maps 1/2 of the address space mapped
+	 * by the guest's PG_LEVEL_4K page table (or 4MiB huge page) that it
+	 * is shadowing. In this case, the quadrant can be derived by the index
+	 * of the SPTE that points to the new child shadow page in the page
+	 * directory (parent_sp). Specifically, every 2 SPTEs in parent_sp
+	 * shadow one half of a guest's page table (or 4MiB huge page) so the
+	 * quadrant is just the parity of the index of the SPTE.
+	 */
+	if (role.has_4_byte_gpte) {
+		BUG_ON(role.level != PG_LEVEL_4K);
+		role.quadrant = (sptep - parent_sp->spt) % 2;
+	}
+
+	return role;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
+						 u64 *sptep, gfn_t gfn,
+						 bool direct, u32 access)
+{
+	union kvm_mmu_page_role role;
+
+	role = kvm_mmu_child_role(sptep, direct, access);
+	return kvm_mmu_get_page(vcpu, gfn, role);
+}
+
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
 					struct kvm_vcpu *vcpu, hpa_t root,
 					u64 addr)
@@ -2930,8 +2959,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (is_shadow_present_pte(*it.sptep))
 			continue;
 
-		sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
-				      it.level - 1, true, ACC_ALL);
+		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
 
 		link_shadow_page(vcpu, it.sptep, sp);
 		if (fault->is_tdp && fault->huge_page_disallowed &&
@@ -3316,9 +3344,22 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
 static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
 			    u8 level, bool direct)
 {
+	union kvm_mmu_page_role role;
 	struct kvm_mmu_page *sp;
+	unsigned int quadrant;
+
+	role = vcpu->arch.mmu->mmu_role.base;
+	role.level = level;
+	role.direct = direct;
+	role.access = ACC_ALL;
+
+	if (role.has_4_byte_gpte) {
+		quadrant = gva >> (PAGE_SHIFT + (PT64_PT_BITS * level));
+		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
+		role.quadrant = quadrant;
+	}
 
-	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
+	sp = kvm_mmu_get_page(vcpu, gfn, role);
 	++sp->root_count;
 
 	return __pa(sp->spt);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 252c77805eb9..c3909a07e938 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -683,8 +683,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		if (!is_shadow_present_pte(*it.sptep)) {
 			table_gfn = gw->table_gfn[it.level - 2];
 			access = gw->pt_access[it.level - 2];
-			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
-					      it.level-1, false, access);
+			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
+						  false, access);
+
 			/*
 			 * We must synchronize the pagetable before linking it
 			 * because the guest doesn't need to flush tlb when
@@ -740,8 +741,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		drop_large_spte(vcpu, it.sptep);
 
 		if (!is_shadow_present_pte(*it.sptep)) {
-			sp = kvm_mmu_get_page(vcpu, base_gfn, fault->addr,
-					      it.level - 1, true, direct_access);
+			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
+						  true, direct_access);
 			link_shadow_page(vcpu, it.sptep, sp);
 			if (fault->huge_page_disallowed &&
 			    fault->req_level >= it.level)
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 03/26] KVM: x86/mmu: Derive shadow MMU page role from parent
@ 2022-03-11  0:25   ` David Matlack
  0 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	David Matlack, Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

Instead of computing the shadow page role from scratch for every new
page, we can derive most of the information from the parent shadow page.
This avoids redundant calculations and reduces the number of parameters
to kvm_mmu_get_page().

Preemptively split out the role calculation to a separate function for
use in a following commit.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c         | 91 ++++++++++++++++++++++++----------
 arch/x86/kvm/mmu/paging_tmpl.h |  9 ++--
 2 files changed, 71 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 146df73a982e..23c2004c6435 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2027,30 +2027,14 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
-					     gfn_t gfn,
-					     gva_t gaddr,
-					     unsigned level,
-					     bool direct,
-					     unsigned int access)
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+					     union kvm_mmu_page_role role)
 {
-	union kvm_mmu_page_role role;
 	struct hlist_head *sp_list;
-	unsigned quadrant;
 	struct kvm_mmu_page *sp;
 	int collisions = 0;
 	LIST_HEAD(invalid_list);
 
-	role = vcpu->arch.mmu->mmu_role.base;
-	role.level = level;
-	role.direct = direct;
-	role.access = access;
-	if (role.has_4_byte_gpte) {
-		quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
-		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
-		role.quadrant = quadrant;
-	}
-
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
 		if (sp->gfn != gfn) {
@@ -2068,7 +2052,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			 * Unsync pages must not be left as is, because the new
 			 * upper-level page will be write-protected.
 			 */
-			if (level > PG_LEVEL_4K && sp->unsync)
+			if (role.level > PG_LEVEL_4K && sp->unsync)
 				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 							 &invalid_list);
 			continue;
@@ -2107,14 +2091,14 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 
 	++vcpu->kvm->stat.mmu_cache_miss;
 
-	sp = kvm_mmu_alloc_page(vcpu, direct);
+	sp = kvm_mmu_alloc_page(vcpu, role.direct);
 
 	sp->gfn = gfn;
 	sp->role = role;
 	hlist_add_head(&sp->hash_link, sp_list);
-	if (!direct) {
+	if (!role.direct) {
 		account_shadowed(vcpu->kvm, sp);
-		if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
+		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
 	}
 	trace_kvm_mmu_get_page(sp, true);
@@ -2126,6 +2110,51 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
+static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
+{
+	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
+	union kvm_mmu_page_role role;
+
+	role = parent_sp->role;
+	role.level--;
+	role.access = access;
+	role.direct = direct;
+
+	/*
+	 * If the guest has 4-byte PTEs then that means it's using 32-bit,
+	 * 2-level, non-PAE paging. KVM shadows such guests using 4 PAE page
+	 * directories, each mapping 1/4 of the guest's linear address space
+	 * (1GiB). The shadow pages for those 4 page directories are
+	 * pre-allocated and assigned a separate quadrant in their role.
+	 *
+	 * Since we are allocating a child shadow page and there are only 2
+	 * levels, this must be a PG_LEVEL_4K shadow page. Here the quadrant
+	 * will either be 0 or 1 because it maps 1/2 of the address space mapped
+	 * by the guest's PG_LEVEL_4K page table (or 4MiB huge page) that it
+	 * is shadowing. In this case, the quadrant can be derived by the index
+	 * of the SPTE that points to the new child shadow page in the page
+	 * directory (parent_sp). Specifically, every 2 SPTEs in parent_sp
+	 * shadow one half of a guest's page table (or 4MiB huge page) so the
+	 * quadrant is just the parity of the index of the SPTE.
+	 */
+	if (role.has_4_byte_gpte) {
+		BUG_ON(role.level != PG_LEVEL_4K);
+		role.quadrant = (sptep - parent_sp->spt) % 2;
+	}
+
+	return role;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
+						 u64 *sptep, gfn_t gfn,
+						 bool direct, u32 access)
+{
+	union kvm_mmu_page_role role;
+
+	role = kvm_mmu_child_role(sptep, direct, access);
+	return kvm_mmu_get_page(vcpu, gfn, role);
+}
+
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
 					struct kvm_vcpu *vcpu, hpa_t root,
 					u64 addr)
@@ -2930,8 +2959,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (is_shadow_present_pte(*it.sptep))
 			continue;
 
-		sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
-				      it.level - 1, true, ACC_ALL);
+		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
 
 		link_shadow_page(vcpu, it.sptep, sp);
 		if (fault->is_tdp && fault->huge_page_disallowed &&
@@ -3316,9 +3344,22 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
 static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
 			    u8 level, bool direct)
 {
+	union kvm_mmu_page_role role;
 	struct kvm_mmu_page *sp;
+	unsigned int quadrant;
+
+	role = vcpu->arch.mmu->mmu_role.base;
+	role.level = level;
+	role.direct = direct;
+	role.access = ACC_ALL;
+
+	if (role.has_4_byte_gpte) {
+		quadrant = gva >> (PAGE_SHIFT + (PT64_PT_BITS * level));
+		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
+		role.quadrant = quadrant;
+	}
 
-	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
+	sp = kvm_mmu_get_page(vcpu, gfn, role);
 	++sp->root_count;
 
 	return __pa(sp->spt);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 252c77805eb9..c3909a07e938 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -683,8 +683,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		if (!is_shadow_present_pte(*it.sptep)) {
 			table_gfn = gw->table_gfn[it.level - 2];
 			access = gw->pt_access[it.level - 2];
-			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
-					      it.level-1, false, access);
+			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
+						  false, access);
+
 			/*
 			 * We must synchronize the pagetable before linking it
 			 * because the guest doesn't need to flush tlb when
@@ -740,8 +741,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		drop_large_spte(vcpu, it.sptep);
 
 		if (!is_shadow_present_pte(*it.sptep)) {
-			sp = kvm_mmu_get_page(vcpu, base_gfn, fault->addr,
-					      it.level - 1, true, direct_access);
+			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
+						  true, direct_access);
 			link_shadow_page(vcpu, it.sptep, sp);
 			if (fault->huge_page_disallowed &&
 			    fault->req_level >= it.level)
-- 
2.35.1.723.g4982287a31-goog
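
For reference, the quadrant logic above can be exercised in isolation. The
following is a standalone C sketch, not kernel code; PAGE_SHIFT, PT64_PT_BITS
and PT32_PT_BITS are assumed to have their usual x86 values (12, 9 and 10):

  #include <stdio.h>
  #include <stdint.h>

  #define PAGE_SHIFT   12	/* assumed: 4KiB pages */
  #define PT64_PT_BITS 9	/* 512 x 8-byte entries per shadow page table */
  #define PT32_PT_BITS 10	/* 1024 x 4-byte entries per guest page table */

  /* Quadrant of a shadow root page, derived from the guest virtual address. */
  static unsigned int root_quadrant(uint32_t gva, int level)
  {
  	unsigned int quadrant = gva >> (PAGE_SHIFT + PT64_PT_BITS * level);

  	return quadrant & ((1u << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1);
  }

  /* Quadrant of a child shadow page: the parity of the parent SPTE index. */
  static unsigned int child_quadrant(unsigned int sptep_index)
  {
  	return sptep_index % 2;
  }

  int main(void)
  {
  	uint32_t gva = 0xc0101000;	/* arbitrary guest virtual address */

  	/* Level 2 (page directory): the top 2 bits pick one of 4 PAE PDs. */
  	printf("PD quadrant: %u\n", root_quadrant(gva, 2));	/* 3 */
  	/* Level 1 (page table): bit 21 picks which 2MiB half of the guest PT. */
  	printf("PT quadrant: %u\n", root_quadrant(gva, 1));	/* 0 */
  	/* SPTE index 259 in the parent page directory is odd -> quadrant 1. */
  	printf("child quadrant: %u\n", child_quadrant(259));
  	return 0;
  }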

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 04/26] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Decompose kvm_mmu_get_page() into separate helper functions to increase
readability and prepare for allocating shadow pages without a vcpu
pointer.

Specifically, pull the guts of kvm_mmu_get_page() into 3 helper
functions:

__kvm_mmu_find_shadow_page() -
  Walks the page hash checking for any existing mmu pages that match the
  given gfn and role. Does not attempt to synchronize the page if it is
  unsync.

kvm_mmu_find_shadow_page() -
  Wraps __kvm_mmu_find_shadow_page() and handles syncing if necessary.

kvm_mmu_new_shadow_page() -
  Allocates and initializes an entirely new kvm_mmu_page. This currently
  requires a vcpu pointer for allocation and for looking up the memslot,
  but that will be removed in a future commit.

  Note, kvm_mmu_new_shadow_page() is temporary and will be removed in a
  subsequent commit. The name uses "new" rather than the more typical
  "alloc" to avoid clashing with the existing kvm_mmu_alloc_page().

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
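As an aside, the find-vs-create split mirrors a generic find-or-create pattern
over a hash table. A minimal, self-contained C sketch of that shape (stand-in
types and names, not KVM code; the "sync" step is reduced to clearing a flag,
whereas kvm_sync_page() does real work and can fail):

  #include <stdio.h>
  #include <stdlib.h>
  #include <stdbool.h>
  #include <stdint.h>

  #define HASH_BUCKETS 16

  struct sp {
  	uint64_t gfn;
  	uint32_t role;
  	bool unsync;
  	struct sp *next;
  };

  static struct sp *buckets[HASH_BUCKETS];

  static unsigned int hashfn(uint64_t gfn)
  {
  	return gfn % HASH_BUCKETS;
  }

  /* Lookup only: return a matching page or NULL, never sync it. */
  static struct sp *find_sp_nosync(uint64_t gfn, uint32_t role)
  {
  	struct sp *sp;

  	for (sp = buckets[hashfn(gfn)]; sp; sp = sp->next) {
  		if (sp->gfn == gfn && sp->role == role)
  			return sp;
  	}
  	return NULL;
  }

  /* Lookup plus sync; the real sync step can fail and zap the page. */
  static struct sp *find_sp(uint64_t gfn, uint32_t role)
  {
  	struct sp *sp = find_sp_nosync(gfn, role);

  	if (sp && sp->unsync)
  		sp->unsync = false;	/* toy stand-in for the sync step */
  	return sp;
  }

  /* Allocate and initialize a brand new page and add it to the hash. */
  static struct sp *new_sp(uint64_t gfn, uint32_t role)
  {
  	struct sp *sp = calloc(1, sizeof(*sp));

  	if (!sp)
  		return NULL;
  	sp->gfn = gfn;
  	sp->role = role;
  	sp->next = buckets[hashfn(gfn)];
  	buckets[hashfn(gfn)] = sp;
  	return sp;
  }

  /* Top-level wrapper: find an existing page, else create one. */
  static struct sp *get_sp(uint64_t gfn, uint32_t role)
  {
  	struct sp *sp = find_sp(gfn, role);

  	if (!sp)
  		sp = new_sp(gfn, role);
  	return sp;
  }

  int main(void)
  {
  	struct sp *a = get_sp(42, 1);	/* miss: allocates */
  	struct sp *b = get_sp(42, 1);	/* hit: returns the same page */

  	printf("same page: %s\n", a == b ? "yes" : "no");
  	return 0;
  }
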
 arch/x86/kvm/mmu/mmu.c         | 132 ++++++++++++++++++++++++---------
 arch/x86/kvm/mmu/paging_tmpl.h |   5 +-
 arch/x86/kvm/mmu/spte.c        |   5 +-
 3 files changed, 101 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 23c2004c6435..80dbfe07c87b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2027,16 +2027,25 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-					     union kvm_mmu_page_role role)
+/*
+ * Searches for an existing SP for the given gfn and role. Makes no attempt to
+ * sync the SP if it is marked unsync.
+ *
+ * If creating an upper-level page table, zaps unsynced pages for the same
+ * gfn and adds them to the invalid_list. It's the caller's responsibility
+ * to call kvm_mmu_commit_zap_page() on invalid_list.
+ */
+static struct kvm_mmu_page *__kvm_mmu_find_shadow_page(struct kvm *kvm,
+						       gfn_t gfn,
+						       union kvm_mmu_page_role role,
+						       struct list_head *invalid_list)
 {
 	struct hlist_head *sp_list;
 	struct kvm_mmu_page *sp;
 	int collisions = 0;
-	LIST_HEAD(invalid_list);
 
-	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
-	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
+	sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+	for_each_valid_sp(kvm, sp, sp_list) {
 		if (sp->gfn != gfn) {
 			collisions++;
 			continue;
@@ -2053,60 +2062,109 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 			 * upper-level page will be write-protected.
 			 */
 			if (role.level > PG_LEVEL_4K && sp->unsync)
-				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
-							 &invalid_list);
+				kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
+
 			continue;
 		}
 
-		/* unsync and write-flooding only apply to indirect SPs. */
-		if (sp->role.direct)
-			goto trace_get_page;
+		/* Write-flooding is only tracked for indirect SPs. */
+		if (!sp->role.direct)
+			__clear_sp_write_flooding_count(sp);
 
-		if (sp->unsync) {
-			/*
-			 * The page is good, but is stale.  kvm_sync_page does
-			 * get the latest guest state, but (unlike mmu_unsync_children)
-			 * it doesn't write-protect the page or mark it synchronized!
-			 * This way the validity of the mapping is ensured, but the
-			 * overhead of write protection is not incurred until the
-			 * guest invalidates the TLB mapping.  This allows multiple
-			 * SPs for a single gfn to be unsync.
-			 *
-			 * If the sync fails, the page is zapped.  If so, break
-			 * in order to rebuild it.
-			 */
-			if (!kvm_sync_page(vcpu, sp, &invalid_list))
-				break;
+		goto out;
+	}
 
-			WARN_ON(!list_empty(&invalid_list));
-			kvm_flush_remote_tlbs(vcpu->kvm);
-		}
+	sp = NULL;
 
-		__clear_sp_write_flooding_count(sp);
+out:
+	if (collisions > kvm->stat.max_mmu_page_hash_collisions)
+		kvm->stat.max_mmu_page_hash_collisions = collisions;
+
+	return sp;
+}
 
-trace_get_page:
-		trace_kvm_mmu_get_page(sp, false);
+/*
+ * Looks up an existing SP for the given gfn and role if one exists. The
+ * returned SP is guaranteed to be synced.
+ */
+static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
+						     gfn_t gfn,
+						     union kvm_mmu_page_role role)
+{
+	struct kvm_mmu_page *sp;
+	LIST_HEAD(invalid_list);
+
+	sp = __kvm_mmu_find_shadow_page(vcpu->kvm, gfn, role, &invalid_list);
+	if (!sp)
 		goto out;
+
+	if (sp->unsync) {
+		/*
+		 * The page is good, but is stale.  kvm_sync_page does
+		 * get the latest guest state, but (unlike mmu_unsync_children)
+		 * it doesn't write-protect the page or mark it synchronized!
+		 * This way the validity of the mapping is ensured, but the
+		 * overhead of write protection is not incurred until the
+		 * guest invalidates the TLB mapping.  This allows multiple
+		 * SPs for a single gfn to be unsync.
+		 *
+		 * If the sync fails, the page is zapped and added to the
+		 * invalid_list.
+		 */
+		if (!kvm_sync_page(vcpu, sp, &invalid_list)) {
+			sp = NULL;
+			goto out;
+		}
+
+		WARN_ON(!list_empty(&invalid_list));
+		kvm_flush_remote_tlbs(vcpu->kvm);
 	}
 
+out:
+	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+	return sp;
+}
+
+static struct kvm_mmu_page *kvm_mmu_new_shadow_page(struct kvm_vcpu *vcpu,
+						    gfn_t gfn,
+						    union kvm_mmu_page_role role)
+{
+	struct kvm_mmu_page *sp;
+	struct hlist_head *sp_list;
+
 	++vcpu->kvm->stat.mmu_cache_miss;
 
 	sp = kvm_mmu_alloc_page(vcpu, role.direct);
-
 	sp->gfn = gfn;
 	sp->role = role;
+
+	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	hlist_add_head(&sp->hash_link, sp_list);
+
 	if (!role.direct) {
 		account_shadowed(vcpu->kvm, sp);
 		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
 	}
-	trace_kvm_mmu_get_page(sp, true);
-out:
-	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
 
-	if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions)
-		vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions;
+	return sp;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+					     union kvm_mmu_page_role role)
+{
+	struct kvm_mmu_page *sp;
+	bool created = false;
+
+	sp = kvm_mmu_find_shadow_page(vcpu, gfn, role);
+	if (sp)
+		goto out;
+
+	created = true;
+	sp = kvm_mmu_new_shadow_page(vcpu, gfn, role);
+
+out:
+	trace_kvm_mmu_get_page(sp, created);
 	return sp;
 }
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index c3909a07e938..55cac59b9c9b 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -692,8 +692,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 			 * the gpte is changed from non-present to present.
 			 * Otherwise, the guest may use the wrong mapping.
 			 *
-			 * For PG_LEVEL_4K, kvm_mmu_get_page() has already
-			 * synchronized it transiently via kvm_sync_page().
+			 * For PG_LEVEL_4K, kvm_mmu_find_shadow_page() has
+			 * already synchronized it transiently via
+			 * kvm_sync_page().
 			 *
 			 * For higher level pagetable, we synchronize it via
 			 * the slower mmu_sync_children().  If it needs to
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4739b53c9734..d10189d9c877 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -150,8 +150,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		/*
 		 * Optimization: for pte sync, if spte was writable the hash
 		 * lookup is unnecessary (and expensive). Write protection
-		 * is responsibility of kvm_mmu_get_page / kvm_mmu_sync_roots.
-		 * Same reasoning can be applied to dirty page accounting.
+		 * is responsibility of kvm_mmu_create_sp() and
+		 * kvm_mmu_sync_roots(). Same reasoning can be applied to dirty
+		 * page accounting.
 		 */
 		if (is_writable_pte(old_spte))
 			goto out;
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 05/26] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Rename 3 functions:

  kvm_mmu_get_page()   -> kvm_mmu_get_shadow_page()
  kvm_mmu_alloc_page() -> kvm_mmu_alloc_shadow_page()
  kvm_mmu_free_page()  -> kvm_mmu_free_shadow_page()

This change makes it clear that these functions deal with shadow pages
rather than struct pages. Prefer "shadow_page" over the shorter "sp"
since these are core routines.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 80dbfe07c87b..b6fb50e32291 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1668,7 +1668,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
 	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
-static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
+static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
 {
 	MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
 	hlist_del(&sp->hash_link);
@@ -1706,7 +1706,8 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 	mmu_spte_clear_no_track(parent_pte);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, bool direct)
+static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
+						      bool direct)
 {
 	struct kvm_mmu_page *sp;
 
@@ -2134,7 +2135,7 @@ static struct kvm_mmu_page *kvm_mmu_new_shadow_page(struct kvm_vcpu *vcpu,
 
 	++vcpu->kvm->stat.mmu_cache_miss;
 
-	sp = kvm_mmu_alloc_page(vcpu, role.direct);
+	sp = kvm_mmu_alloc_shadow_page(vcpu, role.direct);
 	sp->gfn = gfn;
 	sp->role = role;
 
@@ -2150,8 +2151,9 @@ static struct kvm_mmu_page *kvm_mmu_new_shadow_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-					     union kvm_mmu_page_role role)
+static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+						    gfn_t gfn,
+						    union kvm_mmu_page_role role)
 {
 	struct kvm_mmu_page *sp;
 	bool created = false;
@@ -2210,7 +2212,7 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
 	union kvm_mmu_page_role role;
 
 	role = kvm_mmu_child_role(sptep, direct, access);
-	return kvm_mmu_get_page(vcpu, gfn, role);
+	return kvm_mmu_get_shadow_page(vcpu, gfn, role);
 }
 
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
@@ -2486,7 +2488,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 
 	list_for_each_entry_safe(sp, nsp, invalid_list, link) {
 		WARN_ON(!sp->role.invalid || sp->root_count);
-		kvm_mmu_free_page(sp);
+		kvm_mmu_free_shadow_page(sp);
 	}
 }
 
@@ -3417,7 +3419,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
 		role.quadrant = quadrant;
 	}
 
-	sp = kvm_mmu_get_page(vcpu, gfn, role);
+	sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
 	++sp->root_count;
 
 	return __pa(sp->spt);
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 06/26] KVM: x86/mmu: Pass memslot to kvm_mmu_new_shadow_page()
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Passing the memslot to kvm_mmu_new_shadow_page() avoids the need for the
vCPU pointer when write-protecting indirect 4k shadow pages. This moves
us closer to being able to create new shadow pages during VM ioctls for
eager page splitting, where there is no vCPU pointer.

This change does not negatively impact "Populate memory time" for ept=Y
or ept=N configurations since kvm_vcpu_gfn_to_memslot() caches the last
used slot. So even though we now look up the slot more often, it is a
very cheap check.

Opportunistically move the code to write-protect GFNs shadowed by
PG_LEVEL_4K shadow pages into account_shadowed() to reduce indentation
and consolidate the code. This also eliminates a memslot lookup.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
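The "caches the last used slot" point is the usual memoized-lookup pattern. A
rough standalone C sketch (invented slot layout and names, not the actual
kvm_vcpu_gfn_to_memslot() implementation):

  #include <stdio.h>
  #include <stdint.h>

  struct memslot {
  	uint64_t base_gfn;
  	uint64_t npages;
  };

  static struct memslot slots[] = {
  	{ .base_gfn = 0x000000, .npages = 0x80000 },
  	{ .base_gfn = 0x100000, .npages = 0x40000 },
  };

  /* Cache of the slot that satisfied the previous lookup. */
  static struct memslot *last_used;

  static int gfn_in_slot(const struct memslot *slot, uint64_t gfn)
  {
  	return gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages;
  }

  static struct memslot *gfn_to_memslot(uint64_t gfn)
  {
  	size_t i;

  	/* Fast path: repeated lookups for nearby gfns hit the cached slot. */
  	if (last_used && gfn_in_slot(last_used, gfn))
  		return last_used;

  	/* Slow path: a linear scan stands in for the real slot search. */
  	for (i = 0; i < sizeof(slots) / sizeof(slots[0]); i++) {
  		if (gfn_in_slot(&slots[i], gfn)) {
  			last_used = &slots[i];
  			return last_used;
  		}
  	}
  	return NULL;
  }

  int main(void)
  {
  	/* The second call hits the cache: gfn 0x2000 is in the same slot. */
  	printf("%#llx\n", (unsigned long long)gfn_to_memslot(0x1000)->base_gfn);
  	printf("%#llx\n", (unsigned long long)gfn_to_memslot(0x2000)->base_gfn);
  	return 0;
  }
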
 arch/x86/kvm/mmu/mmu.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b6fb50e32291..519910938478 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -793,16 +793,14 @@ void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn)
 	update_gfn_disallow_lpage_count(slot, gfn, -1);
 }
 
-static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
+static void account_shadowed(struct kvm *kvm,
+			     struct kvm_memory_slot *slot,
+			     struct kvm_mmu_page *sp)
 {
-	struct kvm_memslots *slots;
-	struct kvm_memory_slot *slot;
 	gfn_t gfn;
 
 	kvm->arch.indirect_shadow_pages++;
 	gfn = sp->gfn;
-	slots = kvm_memslots_for_spte_role(kvm, sp->role);
-	slot = __gfn_to_memslot(slots, gfn);
 
 	/* the non-leaf shadow pages are keeping readonly. */
 	if (sp->role.level > PG_LEVEL_4K)
@@ -810,6 +808,9 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 						    KVM_PAGE_TRACK_WRITE);
 
 	kvm_mmu_gfn_disallow_lpage(slot, gfn);
+
+	if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
+		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
 }
 
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
@@ -2127,6 +2128,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
 }
 
 static struct kvm_mmu_page *kvm_mmu_new_shadow_page(struct kvm_vcpu *vcpu,
+						    struct kvm_memory_slot *slot,
 						    gfn_t gfn,
 						    union kvm_mmu_page_role role)
 {
@@ -2142,11 +2144,8 @@ static struct kvm_mmu_page *kvm_mmu_new_shadow_page(struct kvm_vcpu *vcpu,
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	hlist_add_head(&sp->hash_link, sp_list);
 
-	if (!role.direct) {
-		account_shadowed(vcpu->kvm, sp);
-		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
-			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
-	}
+	if (!role.direct)
+		account_shadowed(vcpu->kvm, slot, sp);
 
 	return sp;
 }
@@ -2155,6 +2154,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
 						    gfn_t gfn,
 						    union kvm_mmu_page_role role)
 {
+	struct kvm_memory_slot *slot;
 	struct kvm_mmu_page *sp;
 	bool created = false;
 
@@ -2163,7 +2163,8 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
 		goto out;
 
 	created = true;
-	sp = kvm_mmu_new_shadow_page(vcpu, gfn, role);
+	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+	sp = kvm_mmu_new_shadow_page(vcpu, slot, gfn, role);
 
 out:
 	trace_kvm_mmu_get_page(sp, created);
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 07/26] KVM: x86/mmu: Separate shadow MMU sp allocation from initialization
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Separate the code that allocates a new shadow page (from the vCPU caches)
from the code that initializes it. This is in preparation for creating
new shadow pages from VM ioctls for eager page splitting, where we do
not have access to the vCPU caches.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
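The pattern being applied is to keep "where the memory comes from" separate
from "how the object is set up", so the same init path can later be fed by an
allocator that has no vCPU context. A minimal, non-KVM C sketch of that split:

  #include <stdio.h>
  #include <stdlib.h>

  struct page_info {
  	unsigned long gfn;
  	unsigned int role;
  };

  /* A toy per-vCPU object cache: a small pool of preallocated entries. */
  struct obj_cache {
  	struct page_info objs[8];
  	int nobjs;
  };

  /* Allocation path #1: take a preallocated object from a vCPU-style cache. */
  static struct page_info *alloc_from_cache(struct obj_cache *cache)
  {
  	if (cache->nobjs == 0)
  		return NULL;
  	return &cache->objs[--cache->nobjs];
  }

  /* Allocation path #2: no cache available, e.g. a VM ioctl context. */
  static struct page_info *alloc_plain(void)
  {
  	return calloc(1, sizeof(struct page_info));
  }

  /* Initialization is common and independent of how the object was allocated. */
  static void init_page_info(struct page_info *p, unsigned long gfn, unsigned int role)
  {
  	p->gfn = gfn;
  	p->role = role;
  }

  int main(void)
  {
  	struct obj_cache cache = { .nobjs = 8 };
  	struct page_info *a = alloc_from_cache(&cache);
  	struct page_info *b = alloc_plain();

  	if (!a || !b)
  		return 1;
  	init_page_info(a, 42, 1);
  	init_page_info(b, 43, 1);
  	printf("a->gfn=%lu b->gfn=%lu\n", a->gfn, b->gfn);
  	free(b);
  	return 0;
  }
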
 arch/x86/kvm/mmu/mmu.c | 38 ++++++++++++++++++--------------------
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 519910938478..e866e05c4ba5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1716,16 +1716,9 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
 	if (!direct)
 		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
+
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
-	/*
-	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
-	 * depends on valid pages being added to the head of the list.  See
-	 * comments in kvm_zap_obsolete_pages().
-	 */
-	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
-	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
-	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
 	return sp;
 }
 
@@ -2127,27 +2120,31 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
-static struct kvm_mmu_page *kvm_mmu_new_shadow_page(struct kvm_vcpu *vcpu,
-						    struct kvm_memory_slot *slot,
-						    gfn_t gfn,
-						    union kvm_mmu_page_role role)
+static void init_shadow_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+			     struct kvm_memory_slot *slot, gfn_t gfn,
+			     union kvm_mmu_page_role role)
 {
-	struct kvm_mmu_page *sp;
 	struct hlist_head *sp_list;
 
-	++vcpu->kvm->stat.mmu_cache_miss;
+	++kvm->stat.mmu_cache_miss;
 
-	sp = kvm_mmu_alloc_shadow_page(vcpu, role.direct);
 	sp->gfn = gfn;
 	sp->role = role;
+	sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;
 
-	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+	/*
+	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
+	 * depends on valid pages being added to the head of the list.  See
+	 * comments in kvm_zap_obsolete_pages().
+	 */
+	list_add(&sp->link, &kvm->arch.active_mmu_pages);
+	kvm_mod_used_mmu_pages(kvm, 1);
+
+	sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	hlist_add_head(&sp->hash_link, sp_list);
 
 	if (!role.direct)
-		account_shadowed(vcpu->kvm, slot, sp);
-
-	return sp;
+		account_shadowed(kvm, slot, sp);
 }
 
 static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
@@ -2164,7 +2161,8 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
 
 	created = true;
 	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-	sp = kvm_mmu_new_shadow_page(vcpu, slot, gfn, role);
+	sp = kvm_mmu_alloc_shadow_page(vcpu, role.direct);
+	init_shadow_page(vcpu->kvm, sp, slot, gfn, role);
 
 out:
 	trace_kvm_mmu_get_page(sp, created);
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 08/26] KVM: x86/mmu: Link spt to sp during allocation
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Link the shadow page table to the sp (via set_page_private()) during
allocation rather than initialization. This is a more logical place to
do it because allocation time is also where we do the reverse link
(setting sp->spt).

This creates one extra call to set_page_private(), but having multiple
calls to set_page_private() is unavoidable anyway. We either do
set_page_private() during allocation, which requires 1 per allocation
function, or we do it during initialization, which requires 1 per
initialization function.

No functional change intended.

Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
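The forward link (sp->spt) and the back link (the page's private field pointing
back at the sp) are now established side by side at allocation time. A rough
userspace analogue (invented types; the private field is modeled as an explicit
owner pointer rather than set_page_private()):

  #include <stdio.h>
  #include <stdlib.h>

  #define NR_ENTRIES 512

  struct sp;

  /* Stand-in for the page-table page; "owner" models the private field. */
  struct spt_page {
  	unsigned long long entries[NR_ENTRIES];
  	struct sp *owner;
  };

  struct sp {
  	struct spt_page *spt;
  	unsigned long gfn;
  };

  static struct sp *alloc_sp(void)
  {
  	struct sp *sp = calloc(1, sizeof(*sp));

  	if (!sp)
  		return NULL;
  	sp->spt = calloc(1, sizeof(*sp->spt));
  	if (!sp->spt) {
  		free(sp);
  		return NULL;
  	}
  	/* Both links are established in one place, at allocation time. */
  	sp->spt->owner = sp;
  	return sp;
  }

  /* Rough analogue of sptep_to_sp(): recover the owner from the table. */
  static struct sp *spt_to_sp(struct spt_page *spt)
  {
  	return spt->owner;
  }

  int main(void)
  {
  	struct sp *sp = alloc_sp();

  	if (!sp)
  		return 1;
  	printf("round trip ok: %d\n", spt_to_sp(sp->spt) == sp);
  	free(sp->spt);
  	free(sp);
  	return 0;
  }
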
 arch/x86/kvm/mmu/tdp_mmu.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index af60922906ef..eecb0215e636 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -274,6 +274,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
 	return sp;
 }
@@ -281,8 +282,6 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
 static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
 			    gfn_t gfn, union kvm_mmu_page_role role)
 {
-	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
-
 	sp->role = role;
 	sp->gfn = gfn;
 	sp->ptep = sptep;
@@ -1410,6 +1409,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
 		return NULL;
 	}
 
+	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+
 	return sp;
 }
 
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 09/26] KVM: x86/mmu: Move huge page split sp allocation code to mmu.c
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Move the code that allocates a new shadow page for splitting huge pages
into mmu.c. Currently this code is only used by the TDP MMU but it will
be reused in subsequent commits to also split huge pages mapped by the
shadow MMU.

While here, also shove the GFP complexity down into the allocation
function so that it does not have to be duplicated when the shadow MMU
needs to start allocating SPs for splitting.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
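The GFP handling follows a common shape: try a non-sleeping allocation while
the lock is held, and on failure drop the lock, allocate with blocking allowed,
and make the caller restart. A simplified userspace sketch of that shape using
pthreads (not the kernel code paths; the no-sleep attempt is forced to fail
here so the slow path runs):

  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>

  static pthread_mutex_t mmu_lock = PTHREAD_MUTEX_INITIALIZER;

  /* Stand-in for a GFP_NOWAIT allocation: may fail but never sleeps. */
  static void *alloc_nowait(size_t size)
  {
  	(void)size;
  	return NULL;	/* pretend memory is tight so this attempt fails */
  }

  /* Stand-in for a GFP_KERNEL allocation: may sleep, expected to succeed. */
  static void *alloc_blocking(size_t size)
  {
  	return calloc(1, size);
  }

  /*
   * Allocate while the lock is held. If the non-sleeping attempt fails, drop
   * the lock, allocate with blocking allowed, reacquire the lock, and report
   * via *dropped_lock that the caller must restart its iteration.
   */
  static void *alloc_for_split(size_t size, bool *dropped_lock)
  {
  	void *p;

  	*dropped_lock = false;

  	p = alloc_nowait(size);
  	if (p)
  		return p;

  	pthread_mutex_unlock(&mmu_lock);
  	p = alloc_blocking(size);
  	pthread_mutex_lock(&mmu_lock);
  	*dropped_lock = true;

  	return p;
  }

  int main(void)
  {
  	bool dropped;
  	void *p;

  	pthread_mutex_lock(&mmu_lock);
  	p = alloc_for_split(4096, &dropped);
  	pthread_mutex_unlock(&mmu_lock);

  	printf("allocated=%d lock_was_dropped=%d\n", p != NULL, dropped);
  	free(p);
  	return 0;
  }
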
 arch/x86/kvm/mmu/mmu.c          | 34 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/mmu_internal.h |  2 ++
 arch/x86/kvm/mmu/tdp_mmu.c      | 34 ++-------------------------------
 3 files changed, 38 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e866e05c4ba5..c12d5016f6dc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1722,6 +1722,40 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
+/*
+ * Allocate a new shadow page, potentially while holding the MMU lock.
+ *
+ * Huge page splitting always uses direct shadow pages since the huge page is
+ * being mapped directly with a lower level page table. Thus there's no need to
+ * allocate the gfns array.
+ */
+struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(bool locked)
+{
+	struct kvm_mmu_page *sp;
+	gfp_t gfp;
+
+	/*
+	 * If under the MMU lock, use GFP_NOWAIT to avoid direct reclaim (which
+	 * is slow) and to avoid making any filesystem callbacks (which can end
+	 * up invoking KVM MMU notifiers, resulting in a deadlock).
+	 */
+	gfp = (locked ? GFP_NOWAIT : GFP_KERNEL) | __GFP_ACCOUNT | __GFP_ZERO;
+
+	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
+	if (!sp)
+		return NULL;
+
+	sp->spt = (void *)__get_free_page(gfp);
+	if (!sp->spt) {
+		kmem_cache_free(mmu_page_header_cache, sp);
+		return NULL;
+	}
+
+	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+
+	return sp;
+}
+
 static void mark_unsync(u64 *spte);
 static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
 {
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1bff453f7cbe..a0648e7ddd33 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -171,4 +171,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
+struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(bool locked);
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index eecb0215e636..1a43f908d508 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1393,43 +1393,13 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
 	return spte_set;
 }
 
-static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
-{
-	struct kvm_mmu_page *sp;
-
-	gfp |= __GFP_ZERO;
-
-	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
-	if (!sp)
-		return NULL;
-
-	sp->spt = (void *)__get_free_page(gfp);
-	if (!sp->spt) {
-		kmem_cache_free(mmu_page_header_cache, sp);
-		return NULL;
-	}
-
-	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
-
-	return sp;
-}
-
 static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 						       struct tdp_iter *iter,
 						       bool shared)
 {
 	struct kvm_mmu_page *sp;
 
-	/*
-	 * Since we are allocating while under the MMU lock we have to be
-	 * careful about GFP flags. Use GFP_NOWAIT to avoid blocking on direct
-	 * reclaim and to avoid making any filesystem callbacks (which can end
-	 * up invoking KVM MMU notifiers, resulting in a deadlock).
-	 *
-	 * If this allocation fails we drop the lock and retry with reclaim
-	 * allowed.
-	 */
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
+	sp = kvm_mmu_alloc_direct_sp_for_split(true);
 	if (sp)
 		return sp;
 
@@ -1441,7 +1411,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 		write_unlock(&kvm->mmu_lock);
 
 	iter->yielded = true;
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
+	sp = kvm_mmu_alloc_direct_sp_for_split(false);
 
 	if (shared)
 		read_lock(&kvm->mmu_lock);
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 10/26] KVM: x86/mmu: Use common code to free kvm_mmu_page structs
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Use a common function to free kvm_mmu_page structs in the TDP MMU and
the shadow MMU. This reduces the amount of duplicate code and is needed
in subsequent commits that allocate and free kvm_mmu_pages for eager
page splitting.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 8 ++++----
 arch/x86/kvm/mmu/mmu_internal.h | 2 ++
 arch/x86/kvm/mmu/tdp_mmu.c      | 3 +--
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c12d5016f6dc..2dcafbef5ffc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1669,11 +1669,8 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
 	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
-static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
+void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
 {
-	MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
-	hlist_del(&sp->hash_link);
-	list_del(&sp->link);
 	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct)
 		free_page((unsigned long)sp->gfns);
@@ -2521,6 +2518,9 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 
 	list_for_each_entry_safe(sp, nsp, invalid_list, link) {
 		WARN_ON(!sp->role.invalid || sp->root_count);
+		MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
+		hlist_del(&sp->hash_link);
+		list_del(&sp->link);
 		kvm_mmu_free_shadow_page(sp);
 	}
 }
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index a0648e7ddd33..5f91e4d07a95 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -173,4 +173,6 @@ void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
 struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(bool locked);
 
+void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp);
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1a43f908d508..184874a82a1b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -64,8 +64,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 
 static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
 {
-	free_page((unsigned long)sp->spt);
-	kmem_cache_free(mmu_page_header_cache, sp);
+	kvm_mmu_free_shadow_page(sp);
 }
 
 /*
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 11/26] KVM: x86/mmu: Use common code to allocate kvm_mmu_page structs from vCPU caches
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Now that allocating a kvm_mmu_page struct is isolated to a helper
function, it can be re-used in the TDP MMU.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 3 +--
 arch/x86/kvm/mmu/mmu_internal.h | 1 +
 arch/x86/kvm/mmu/tdp_mmu.c      | 8 +-------
 3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2dcafbef5ffc..4c8feaeb063d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1704,8 +1704,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 	mmu_spte_clear_no_track(parent_pte);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
-						      bool direct)
+struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, bool direct)
 {
 	struct kvm_mmu_page *sp;
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 5f91e4d07a95..d4e2de5f2a6d 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -173,6 +173,7 @@ void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
 struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(bool locked);
 
+struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, bool direct);
 void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp);
 
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 184874a82a1b..f285fd76717b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -269,13 +269,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 
 static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
 {
-	struct kvm_mmu_page *sp;
-
-	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
-	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
-
-	return sp;
+	return kvm_mmu_alloc_shadow_page(vcpu, true);
 }
 
 static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 12/26] KVM: x86/mmu: Pass const memslot to rmap_add()
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

rmap_add() only uses the slot to call gfn_to_rmap(), which takes a const
memslot.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4c8feaeb063d..23c0a36ac93f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1596,7 +1596,7 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 
 #define RMAP_RECYCLE_THRESHOLD 1000
 
-static void rmap_add(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
+static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
 		     u64 *spte, gfn_t gfn)
 {
 	struct kvm_mmu_page *sp;
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 13/26] KVM: x86/mmu: Pass const memslot to init_shadow_page() and descendants
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Use a const pointer so that init_shadow_page() can be called from
contexts where only a const memslot pointer is available.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_page_track.h | 2 +-
 arch/x86/kvm/mmu/mmu.c                | 6 +++---
 arch/x86/kvm/mmu/mmu_internal.h       | 2 +-
 arch/x86/kvm/mmu/page_track.c         | 4 ++--
 arch/x86/kvm/mmu/tdp_mmu.c            | 2 +-
 arch/x86/kvm/mmu/tdp_mmu.h            | 2 +-
 6 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index eb186bc57f6a..3a2dc183ae9a 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -58,7 +58,7 @@ int kvm_page_track_create_memslot(struct kvm *kvm,
 				  unsigned long npages);
 
 void kvm_slot_page_track_add_page(struct kvm *kvm,
-				  struct kvm_memory_slot *slot, gfn_t gfn,
+				  const struct kvm_memory_slot *slot, gfn_t gfn,
 				  enum kvm_page_track_mode mode);
 void kvm_slot_page_track_remove_page(struct kvm *kvm,
 				     struct kvm_memory_slot *slot, gfn_t gfn,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 23c0a36ac93f..d7ad71be6c52 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -794,7 +794,7 @@ void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn)
 }
 
 static void account_shadowed(struct kvm *kvm,
-			     struct kvm_memory_slot *slot,
+			     const struct kvm_memory_slot *slot,
 			     struct kvm_mmu_page *sp)
 {
 	gfn_t gfn;
@@ -1373,7 +1373,7 @@ int kvm_cpu_dirty_log_size(void)
 }
 
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
-				    struct kvm_memory_slot *slot, u64 gfn,
+				    const struct kvm_memory_slot *slot, u64 gfn,
 				    int min_level)
 {
 	struct kvm_rmap_head *rmap_head;
@@ -2151,7 +2151,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
 }
 
 static void init_shadow_page(struct kvm *kvm, struct kvm_mmu_page *sp,
-			     struct kvm_memory_slot *slot, gfn_t gfn,
+			     const struct kvm_memory_slot *slot, gfn_t gfn,
 			     union kvm_mmu_page_role role)
 {
 	struct hlist_head *sp_list;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index d4e2de5f2a6d..b6e22ba9c654 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -134,7 +134,7 @@ int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
 void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
-				    struct kvm_memory_slot *slot, u64 gfn,
+				    const struct kvm_memory_slot *slot, u64 gfn,
 				    int min_level);
 void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
 					u64 start_gfn, u64 pages);
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 2e09d1b6249f..3e7901294573 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -84,7 +84,7 @@ int kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot)
 	return 0;
 }
 
-static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn,
+static void update_gfn_track(const struct kvm_memory_slot *slot, gfn_t gfn,
 			     enum kvm_page_track_mode mode, short count)
 {
 	int index, val;
@@ -112,7 +112,7 @@ static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn,
  * @mode: tracking mode, currently only write track is supported.
  */
 void kvm_slot_page_track_add_page(struct kvm *kvm,
-				  struct kvm_memory_slot *slot, gfn_t gfn,
+				  const struct kvm_memory_slot *slot, gfn_t gfn,
 				  enum kvm_page_track_mode mode)
 {
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f285fd76717b..85b7bc333302 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1768,7 +1768,7 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
  * Returns true if an SPTE was set and a TLB flush is needed.
  */
 bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
-				   struct kvm_memory_slot *slot, gfn_t gfn,
+				   const struct kvm_memory_slot *slot, gfn_t gfn,
 				   int min_level)
 {
 	struct kvm_mmu_page *root;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 54bc8118c40a..8308bfa4126b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -42,7 +42,7 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				       const struct kvm_memory_slot *slot);
 
 bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
-				   struct kvm_memory_slot *slot, gfn_t gfn,
+				   const struct kvm_memory_slot *slot, gfn_t gfn,
 				   int min_level);
 
 void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 14/26] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Allow adding new entries to the rmap and linking shadow pages without a
struct kvm_vcpu pointer by moving the implementation of rmap_add() and
link_shadow_page() into inner helper functions.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 43 +++++++++++++++++++++++++++---------------
 1 file changed, 28 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d7ad71be6c52..c57070ed157d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -725,9 +725,9 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
-static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
+static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
 {
-	return kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_list_desc_cache);
+	return kvm_mmu_memory_cache_alloc(cache);
 }
 
 static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
@@ -874,7 +874,7 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
 /*
  * Returns the number of pointers in the rmap chain, not counting the new one.
  */
-static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
+static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 			struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
@@ -885,7 +885,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 		rmap_head->val = (unsigned long)spte;
 	} else if (!(rmap_head->val & 1)) {
 		rmap_printk("%p %llx 1->many\n", spte, *spte);
-		desc = mmu_alloc_pte_list_desc(vcpu);
+		desc = mmu_alloc_pte_list_desc(cache);
 		desc->sptes[0] = (u64 *)rmap_head->val;
 		desc->sptes[1] = spte;
 		desc->spte_count = 2;
@@ -897,7 +897,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 		while (desc->spte_count == PTE_LIST_EXT) {
 			count += PTE_LIST_EXT;
 			if (!desc->more) {
-				desc->more = mmu_alloc_pte_list_desc(vcpu);
+				desc->more = mmu_alloc_pte_list_desc(cache);
 				desc = desc->more;
 				desc->spte_count = 0;
 				break;
@@ -1596,8 +1596,10 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 
 #define RMAP_RECYCLE_THRESHOLD 1000
 
-static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
-		     u64 *spte, gfn_t gfn)
+static void __rmap_add(struct kvm *kvm,
+		       struct kvm_mmu_memory_cache *cache,
+		       const struct kvm_memory_slot *slot,
+		       u64 *spte, gfn_t gfn)
 {
 	struct kvm_mmu_page *sp;
 	struct kvm_rmap_head *rmap_head;
@@ -1606,15 +1608,21 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
 	sp = sptep_to_sp(spte);
 	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
-	rmap_count = pte_list_add(vcpu, spte, rmap_head);
+	rmap_count = pte_list_add(cache, spte, rmap_head);
 
 	if (rmap_count > RMAP_RECYCLE_THRESHOLD) {
-		kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
+		kvm_unmap_rmapp(kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
 		kvm_flush_remote_tlbs_with_address(
-				vcpu->kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
+				kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
 	}
 }
 
+static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
+		     u64 *spte, gfn_t gfn)
+{
+	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn);
+}
+
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
@@ -1682,13 +1690,13 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn)
 	return hash_64(gfn, KVM_MMU_HASH_SHIFT);
 }
 
-static void mmu_page_add_parent_pte(struct kvm_vcpu *vcpu,
+static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache,
 				    struct kvm_mmu_page *sp, u64 *parent_pte)
 {
 	if (!parent_pte)
 		return;
 
-	pte_list_add(vcpu, parent_pte, &sp->parent_ptes);
+	pte_list_add(cache, parent_pte, &sp->parent_ptes);
 }
 
 static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp,
@@ -2307,8 +2315,8 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 	__shadow_walk_next(iterator, *iterator->sptep);
 }
 
-static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
-			     struct kvm_mmu_page *sp)
+static void __link_shadow_page(struct kvm_mmu_memory_cache *cache, u64 *sptep,
+			       struct kvm_mmu_page *sp)
 {
 	u64 spte;
 
@@ -2318,12 +2326,17 @@ static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
 
 	mmu_spte_set(sptep, spte);
 
-	mmu_page_add_parent_pte(vcpu, sp, sptep);
+	mmu_page_add_parent_pte(cache, sp, sptep);
 
 	if (sp->unsync_children || sp->unsync)
 		mark_unsync(sptep);
 }
 
+static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, struct kvm_mmu_page *sp)
+{
+	__link_shadow_page(&vcpu->arch.mmu_pte_list_desc_cache, sptep, sp);
+}
+
 static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 				   unsigned direct_access)
 {
-- 
2.35.1.723.g4982287a31-goog
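
The refactor above follows a simple shape: the inner helpers (__rmap_add() and
__link_shadow_page()) take the pte_list_desc cache they actually consume as an
explicit parameter, while thin wrappers keep the old vCPU-based interface. The
self-contained sketch below models only that shape with invented names, not the
KVM API:

/*
 * Sketch of decoupling a helper from the vCPU by threading the one
 * resource it needs (a memory cache) as a parameter. All names are
 * invented for the example.
 */
#include <stdio.h>

struct obj_cache {
	int objs_left;		/* stand-in for preallocated pte_list_desc objects */
};

struct vcpu_ctx {
	struct obj_cache cache;	/* per-vCPU cache, like mmu_pte_list_desc_cache */
};

/* Inner helper: callable from any context that can supply a cache. */
static int __add_entry(struct obj_cache *cache, int entry)
{
	if (cache->objs_left <= 0)
		return -1;

	cache->objs_left--;
	printf("added entry %d, %d cache objects left\n", entry, cache->objs_left);
	return 0;
}

/* Thin wrapper: preserves the old vCPU-based calling convention. */
static int add_entry(struct vcpu_ctx *vcpu, int entry)
{
	return __add_entry(&vcpu->cache, entry);
}

int main(void)
{
	struct vcpu_ctx vcpu = { .cache = { .objs_left = 2 } };
	struct obj_cache vm_cache = { .objs_left = 1 };	/* e.g. a per-VM split cache */

	add_entry(&vcpu, 1);		/* vCPU fault path */
	__add_entry(&vm_cache, 2);	/* non-vCPU path */
	return 0;
}

A caller with no vCPU at hand, such as the eager page splitting path added
later in the series, can then pass its own cache straight to the inner helper.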


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 15/26] KVM: x86/mmu: Update page stats in __rmap_add()
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Update the page stats in __rmap_add() rather than at the call site. This
will avoid having to manually update page stats when splitting huge
pages in a subsequent commit.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c57070ed157d..73a7077f9991 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1607,6 +1607,8 @@ static void __rmap_add(struct kvm *kvm,
 
 	sp = sptep_to_sp(spte);
 	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
+	kvm_update_page_stats(kvm, sp->role.level, 1);
+
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
 	rmap_count = pte_list_add(cache, spte, rmap_head);
 
@@ -2847,7 +2849,6 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 
 	if (!was_rmapped) {
 		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
-		kvm_update_page_stats(vcpu->kvm, level, 1);
 		rmap_add(vcpu, slot, sptep, gfn);
 	}
 
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 16/26] KVM: x86/mmu: Cache the access bits of shadowed translations
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

In order to split a huge page we need to know what access bits to assign
to the role of the new child page table. This can't be easily derived
from the huge page SPTE itself since KVM applies its own access policies
on top, such as for HugePage NX.

We could walk the guest page tables to determine the correct access
bits, but that is difficult to plumb outside of a vCPU fault context.
Instead, we can store the original access bits for each leaf SPTE
alongside the GFN in the gfns array. The access bits only take up 3
bits, which leaves 61 bits for the GFN, more than enough. So this
change does not require any additional memory.

In order to keep the access bit cache in sync with the guest, we have to
extend FNAME(sync_page) to also update the access bits.

Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu/mmu.c          | 32 +++++++++++++++++++-------------
 arch/x86/kvm/mmu/mmu_internal.h | 15 +++++++++++++--
 arch/x86/kvm/mmu/paging_tmpl.h  |  7 +++++--
 4 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f72e80178ffc..0f5a36772bdc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -694,7 +694,7 @@ struct kvm_vcpu_arch {
 
 	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
-	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
+	struct kvm_mmu_memory_cache mmu_shadowed_translation_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
 	/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 73a7077f9991..89a7a8d7a632 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -708,7 +708,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 	if (r)
 		return r;
 	if (maybe_indirect) {
-		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_gfn_array_cache,
+		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_translation_cache,
 					       PT64_ROOT_MAX_LEVEL);
 		if (r)
 			return r;
@@ -721,7 +721,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
-	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_translation_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
@@ -738,15 +738,17 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 {
 	if (!sp->role.direct)
-		return sp->gfns[index];
+		return sp->shadowed_translation[index].gfn;
 
 	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
 }
 
-static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
+static void kvm_mmu_page_set_gfn_access(struct kvm_mmu_page *sp, int index,
+					gfn_t gfn, u32 access)
 {
 	if (!sp->role.direct) {
-		sp->gfns[index] = gfn;
+		sp->shadowed_translation[index].gfn = gfn;
+		sp->shadowed_translation[index].access = access;
 		return;
 	}
 
@@ -1599,14 +1601,14 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 static void __rmap_add(struct kvm *kvm,
 		       struct kvm_mmu_memory_cache *cache,
 		       const struct kvm_memory_slot *slot,
-		       u64 *spte, gfn_t gfn)
+		       u64 *spte, gfn_t gfn, u32 access)
 {
 	struct kvm_mmu_page *sp;
 	struct kvm_rmap_head *rmap_head;
 	int rmap_count;
 
 	sp = sptep_to_sp(spte);
-	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
+	kvm_mmu_page_set_gfn_access(sp, spte - sp->spt, gfn, access);
 	kvm_update_page_stats(kvm, sp->role.level, 1);
 
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
@@ -1620,9 +1622,9 @@ static void __rmap_add(struct kvm *kvm,
 }
 
 static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
-		     u64 *spte, gfn_t gfn)
+		     u64 *spte, gfn_t gfn, u32 access)
 {
-	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn);
+	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn, access);
 }
 
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
@@ -1683,7 +1685,7 @@ void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
 {
 	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct)
-		free_page((unsigned long)sp->gfns);
+		free_page((unsigned long)sp->shadowed_translation);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -1720,8 +1722,12 @@ struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, bool direc
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+
+	BUILD_BUG_ON(sizeof(sp->shadowed_translation[0]) != sizeof(u64));
+
 	if (!direct)
-		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
+		sp->shadowed_translation =
+			kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadowed_translation_cache);
 
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
@@ -1733,7 +1739,7 @@ struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, bool direc
  *
  * Huge page splitting always uses direct shadow pages since the huge page is
  * being mapped directly with a lower level page table. Thus there's no need to
- * allocate the gfns array.
+ * allocate the shadowed_translation array.
  */
 struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(bool locked)
 {
@@ -2849,7 +2855,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 
 	if (!was_rmapped) {
 		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
-		rmap_add(vcpu, slot, sptep, gfn);
+		rmap_add(vcpu, slot, sptep, gfn, pte_access);
 	}
 
 	return ret;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index b6e22ba9c654..c5b8ee625df7 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -32,6 +32,11 @@ extern bool dbg;
 
 typedef u64 __rcu *tdp_ptep_t;
 
+struct shadowed_translation_entry {
+	u64 access:3;
+	u64 gfn:56;
+};
+
 struct kvm_mmu_page {
 	/*
 	 * Note, "link" through "spt" fit in a single 64 byte cache line on
@@ -53,8 +58,14 @@ struct kvm_mmu_page {
 	gfn_t gfn;
 
 	u64 *spt;
-	/* hold the gfn of each spte inside spt */
-	gfn_t *gfns;
+	/*
+	 * For indirect shadow pages, caches the result of the intermediate
+	 * guest translation being shadowed by each SPTE.
+	 *
+	 * NULL for direct shadow pages.
+	 */
+	struct shadowed_translation_entry *shadowed_translation;
+
 	/* Currently serving as active root */
 	union {
 		int root_count;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 55cac59b9c9b..128eccadf1de 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1014,7 +1014,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 }
 
 /*
- * Using the cached information from sp->gfns is safe because:
+ * Using the information in sp->shadowed_translation is safe because:
  * - The spte has a reference to the struct page, so the pfn for a given gfn
  *   can't change unless all sptes pointing to it are nuked first.
  *
@@ -1088,12 +1088,15 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		if (sync_mmio_spte(vcpu, &sp->spt[i], gfn, pte_access))
 			continue;
 
-		if (gfn != sp->gfns[i]) {
+		if (gfn != sp->shadowed_translation[i].gfn) {
 			drop_spte(vcpu->kvm, &sp->spt[i]);
 			flush = true;
 			continue;
 		}
 
+		if (pte_access != sp->shadowed_translation[i].access)
+			sp->shadowed_translation[i].access = pte_access;
+
 		sptep = &sp->spt[i];
 		spte = *sptep;
 		host_writable = spte & shadow_host_writable_mask;
-- 
2.35.1.723.g4982287a31-goog
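
To make the "does not require any additional memory" claim concrete, the
self-contained sketch below packs a 3-bit access value and a GFN into one
64-bit entry, mirroring the shadowed_translation_entry bitfield added above
(the struct name and values here are invented for the example), and checks the
size at compile time:

/*
 * Standalone illustration of packing access bits and a GFN into a
 * single 64-bit entry. The real structure is shadowed_translation_entry
 * in the patch above.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct packed_translation {
	uint64_t access:3;	/* ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK */
	uint64_t gfn:56;	/* far more than enough bits for a guest frame number */
};

/* Each entry is still exactly one u64, so the per-page array size is unchanged. */
static_assert(sizeof(struct packed_translation) == sizeof(uint64_t),
	      "entry must stay 8 bytes");

int main(void)
{
	struct packed_translation e = { .access = 0x7, .gfn = 0xabcdef };

	printf("gfn=0x%llx access=0x%llx\n",
	       (unsigned long long)e.gfn, (unsigned long long)e.access);
	return 0;
}

Both fields together use fewer than 64 bits, so each array entry stays the same
size as the gfn_t it replaces; the bitfield-on-u64 style mirrors the kernel
struct, which relies on the same compiler behaviour.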


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 16/26] KVM: x86/mmu: Cache the access bits of shadowed translations
@ 2022-03-11  0:25   ` David Matlack
  0 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	David Matlack, Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

In order to split a huge page we need to know what access bits to assign
to the role of the new child page table. This can't be easily derived
from the huge page SPTE itself since KVM applies its own access policies
on top, such as for HugePage NX.

We could walk the guest page tables to determine the correct access
bits, but that is difficult to plumb outside of a vCPU fault context.
Instead, we can store the original access bits for each leaf SPTE
alongside the GFN in the gfns array. The access bits only take up 3
bits, which leaves 61 bits left over for gfns, which is more than
enough. So this change does not require any additional memory.

In order to keep the access bit cache in sync with the guest, we have to
extend FNAME(sync_page) to also update the access bits.

Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu/mmu.c          | 32 +++++++++++++++++++-------------
 arch/x86/kvm/mmu/mmu_internal.h | 15 +++++++++++++--
 arch/x86/kvm/mmu/paging_tmpl.h  |  7 +++++--
 4 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f72e80178ffc..0f5a36772bdc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -694,7 +694,7 @@ struct kvm_vcpu_arch {
 
 	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
-	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
+	struct kvm_mmu_memory_cache mmu_shadowed_translation_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
 	/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 73a7077f9991..89a7a8d7a632 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -708,7 +708,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 	if (r)
 		return r;
 	if (maybe_indirect) {
-		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_gfn_array_cache,
+		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_translation_cache,
 					       PT64_ROOT_MAX_LEVEL);
 		if (r)
 			return r;
@@ -721,7 +721,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
-	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_translation_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
@@ -738,15 +738,17 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 {
 	if (!sp->role.direct)
-		return sp->gfns[index];
+		return sp->shadowed_translation[index].gfn;
 
 	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
 }
 
-static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
+static void kvm_mmu_page_set_gfn_access(struct kvm_mmu_page *sp, int index,
+					gfn_t gfn, u32 access)
 {
 	if (!sp->role.direct) {
-		sp->gfns[index] = gfn;
+		sp->shadowed_translation[index].gfn = gfn;
+		sp->shadowed_translation[index].access = access;
 		return;
 	}
 
@@ -1599,14 +1601,14 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 static void __rmap_add(struct kvm *kvm,
 		       struct kvm_mmu_memory_cache *cache,
 		       const struct kvm_memory_slot *slot,
-		       u64 *spte, gfn_t gfn)
+		       u64 *spte, gfn_t gfn, u32 access)
 {
 	struct kvm_mmu_page *sp;
 	struct kvm_rmap_head *rmap_head;
 	int rmap_count;
 
 	sp = sptep_to_sp(spte);
-	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
+	kvm_mmu_page_set_gfn_access(sp, spte - sp->spt, gfn, access);
 	kvm_update_page_stats(kvm, sp->role.level, 1);
 
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
@@ -1620,9 +1622,9 @@ static void __rmap_add(struct kvm *kvm,
 }
 
 static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
-		     u64 *spte, gfn_t gfn)
+		     u64 *spte, gfn_t gfn, u32 access)
 {
-	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn);
+	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn, access);
 }
 
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
@@ -1683,7 +1685,7 @@ void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
 {
 	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct)
-		free_page((unsigned long)sp->gfns);
+		free_page((unsigned long)sp->shadowed_translation);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -1720,8 +1722,12 @@ struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, bool direc
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+
+	BUILD_BUG_ON(sizeof(sp->shadowed_translation[0]) != sizeof(u64));
+
 	if (!direct)
-		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
+		sp->shadowed_translation =
+			kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadowed_translation_cache);
 
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
@@ -1733,7 +1739,7 @@ struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, bool direc
  *
  * Huge page splitting always uses direct shadow pages since the huge page is
  * being mapped directly with a lower level page table. Thus there's no need to
- * allocate the gfns array.
+ * allocate the shadowed_translation array.
  */
 struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(bool locked)
 {
@@ -2849,7 +2855,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 
 	if (!was_rmapped) {
 		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
-		rmap_add(vcpu, slot, sptep, gfn);
+		rmap_add(vcpu, slot, sptep, gfn, pte_access);
 	}
 
 	return ret;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index b6e22ba9c654..c5b8ee625df7 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -32,6 +32,11 @@ extern bool dbg;
 
 typedef u64 __rcu *tdp_ptep_t;
 
+struct shadowed_translation_entry {
+	u64 access:3;
+	u64 gfn:56;
+};
+
 struct kvm_mmu_page {
 	/*
 	 * Note, "link" through "spt" fit in a single 64 byte cache line on
@@ -53,8 +58,14 @@ struct kvm_mmu_page {
 	gfn_t gfn;
 
 	u64 *spt;
-	/* hold the gfn of each spte inside spt */
-	gfn_t *gfns;
+	/*
+	 * For indirect shadow pages, caches the result of the intermediate
+	 * guest translation being shadowed by each SPTE.
+	 *
+	 * NULL for direct shadow pages.
+	 */
+	struct shadowed_translation_entry *shadowed_translation;
+
 	/* Currently serving as active root */
 	union {
 		int root_count;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 55cac59b9c9b..128eccadf1de 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1014,7 +1014,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 }
 
 /*
- * Using the cached information from sp->gfns is safe because:
+ * Using the information in sp->shadowed_translation is safe because:
  * - The spte has a reference to the struct page, so the pfn for a given gfn
  *   can't change unless all sptes pointing to it are nuked first.
  *
@@ -1088,12 +1088,15 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		if (sync_mmio_spte(vcpu, &sp->spt[i], gfn, pte_access))
 			continue;
 
-		if (gfn != sp->gfns[i]) {
+		if (gfn != sp->shadowed_translation[i].gfn) {
 			drop_spte(vcpu->kvm, &sp->spt[i]);
 			flush = true;
 			continue;
 		}
 
+		if (pte_access != sp->shadowed_translation[i].access)
+			sp->shadowed_translation[i].access = pte_access;
+
 		sptep = &sp->spt[i];
 		spte = *sptep;
 		host_writable = spte & shadow_host_writable_mask;
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 17/26] KVM: x86/mmu: Pass access information to make_huge_page_split_spte()
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Currently make_huge_page_split_spte() assumes execute permissions can be
granted to any 4K SPTE when splitting huge pages. This is true for the
TDP MMU but is not necessarily true for the shadow MMU. Huge pages
mapped by the shadow MMU may be shadowing huge pages for which the
guest has disallowed execute permissions.

No functional change intended.
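
As a usage sketch: the TDP MMU call below is taken from this patch's
tdp_mmu.c hunk, while the shadow MMU call is a simplified form of what
a later patch in this series does with the access bits cached in the
parent SP:

  /* TDP MMU: guest permissions are always ACC_ALL, so no change. */
  sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i, ACC_ALL);

  /* Shadow MMU: pass the cached guest access bits so a mapping the
   * guest made non-executable is not made executable when split. */
  split_spte = make_huge_page_split_spte(huge_spte, split_level + 1,
                                         index, access);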

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/spte.c    | 5 +++--
 arch/x86/kvm/mmu/spte.h    | 3 ++-
 arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index d10189d9c877..7294f95464a7 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -216,7 +216,8 @@ static u64 make_spte_executable(u64 spte)
  * This is used during huge page splitting to build the SPTEs that make up the
  * new page table.
  */
-u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
+u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index,
+			      unsigned int access)
 {
 	u64 child_spte;
 	int child_level;
@@ -244,7 +245,7 @@ u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
 		 * When splitting to a 4K page, mark the page executable as the
 		 * NX hugepage mitigation no longer applies.
 		 */
-		if (is_nx_huge_page_enabled())
+		if (is_nx_huge_page_enabled() && (access & ACC_EXEC_MASK))
 			child_spte = make_spte_executable(child_spte);
 	}
 
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 73f12615416f..c7ccdd5c440d 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -415,7 +415,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
 	       bool host_writable, u64 *new_spte);
-u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index);
+u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index,
+			      unsigned int access);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
 u64 mark_spte_for_access_track(u64 spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 85b7bc333302..541b145b2df2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1430,7 +1430,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	 * not been linked in yet and thus is not reachable from any other CPU.
 	 */
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++)
-		sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i);
+		sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i, ACC_ALL);
 
 	/*
 	 * Replace the huge spte with a pointer to the populated lower level
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 18/26] KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU (i.e.
in the rmap). This is fine for now because KVM never creates
intermediate huge pages during dirty logging, i.e. a 1GiB page is never
partially split to a 2MiB page.

However, this will stop being true once the shadow MMU participates in
eager page splitting, which can in fact leave behind partially split
huge pages. In preparation for that change, make the shadow MMU iterate
over all necessary levels when zapping collapsible SPTEs.

No functional change intended.
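
For clarity, the level range used below expands as follows on x86 (the
call is from the hunk; the comment is an illustrative gloss only):

  /* PG_LEVEL_4K .. KVM_MAX_HUGEPAGE_LEVEL - 1 covers 4KiB and 2MiB
   * SPTEs; 1GiB SPTEs are already mapped at the maximum possible
   * level, so there is nothing larger to collapse them into. */
  slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
                    PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true);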

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 89a7a8d7a632..2032be3edd71 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6142,18 +6142,30 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 	return need_tlb_flush;
 }
 
+static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
+					   const struct kvm_memory_slot *slot)
+{
+	bool flush;
+
+	/*
+	 * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
+	 * pages that are already mapped at the maximum possible level.
+	 */
+	flush = slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
+				  PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1,
+				  true);
+
+	if (flush)
+		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+
+}
+
 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *slot)
 {
 	if (kvm_memslots_have_rmaps(kvm)) {
 		write_lock(&kvm->mmu_lock);
-		/*
-		 * Zap only 4k SPTEs since the legacy MMU only supports dirty
-		 * logging at a 4k granularity and never creates collapsible
-		 * 2m SPTEs during dirty logging.
-		 */
-		if (slot_handle_level_4k(kvm, slot, kvm_mmu_zap_collapsible_spte, true))
-			kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+		kvm_rmap_zap_collapsible_sptes(kvm, slot);
 		write_unlock(&kvm->mmu_lock);
 	}
 
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 19/26] KVM: x86/mmu: Refactor drop_large_spte()
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

drop_large_spte() drops a large SPTE if it exists and then flushes TLBs.
Its helper function, __drop_large_spte(), does the drop without the
flush.

In preparation for eager page splitting, which will need to sometimes
flush when dropping large SPTEs (and sometimes not), push the flushing
logic down into __drop_large_spte() and add a bool parameter to control
it.

No functional change intended.
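
A quick sketch of the two call patterns after this change (the
deferred-flush form is how a later patch in this series uses it when
splitting huge pages):

  /* vCPU fault path: drop the huge SPTE and flush immediately. */
  drop_large_spte(vcpu, sptep);

  /* Eager page splitting: defer the flush; the caller issues a single
   * flush for the memslot before dropping the MMU lock. */
  __drop_large_spte(kvm, huge_sptep, false);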

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2032be3edd71..926ddfaa9e1a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1150,28 +1150,29 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
 		rmap_remove(kvm, sptep);
 }
 
-
-static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
+static void __drop_large_spte(struct kvm *kvm, u64 *sptep, bool flush)
 {
-	if (is_large_pte(*sptep)) {
-		WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K);
-		drop_spte(kvm, sptep);
-		return true;
-	}
+	struct kvm_mmu_page *sp;
 
-	return false;
-}
+	if (!is_large_pte(*sptep))
+		return;
 
-static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
-{
-	if (__drop_large_spte(vcpu->kvm, sptep)) {
-		struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+	sp = sptep_to_sp(sptep);
+	WARN_ON(sp->role.level == PG_LEVEL_4K);
 
-		kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
+	drop_spte(kvm, sptep);
+
+	if (flush) {
+		kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
 			KVM_PAGES_PER_HPAGE(sp->role.level));
 	}
 }
 
+static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
+{
+	return __drop_large_spte(vcpu->kvm, sptep, true);
+}
+
 /*
  * Write-protect on the specified @sptep, @pt_protect indicates whether
  * spte write-protection is caused by protecting shadow page table.
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 20/26] KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Extend KVM's eager page splitting to also split huge pages that are
mapped by the shadow MMU. Specifically, walk through the rmap splitting
all 1GiB pages to 2MiB pages, and splitting all 2MiB pages to 4KiB
pages.

Splitting huge pages mapped by the shadow MMU requires dealing with some
extra complexity beyond that of the TDP MMU:

(1) The shadow MMU has a limit on the number of shadow pages that are
    allowed to be allocated. So, as a policy, Eager Page Splitting
    refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
    pages available.

(2) Huge pages may be mapped by indirect shadow pages which have the
    possibility of being unsync. As a policy we opt not to split such
    pages as their translation may no longer be valid.

(3) Splitting a huge page may end up re-using an existing lower-level
    shadow page table. This is unlike the TDP MMU, which always allocates
    new shadow page tables when splitting.  This commit does *not*
    handle such aliasing and opts not to split such huge pages.

(4) When installing the lower level SPTEs, they must be added to the
    rmap which may require allocating additional pte_list_desc structs.
    This commit does *not* handle such cases and instead opts to leave
    such lower-level SPTEs non-present. In this situation TLBs must be
    flushed before dropping the MMU lock as a portion of the huge page
    region is being unmapped.
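
The cases above map onto the code added below; as a reading aid, here
is a condensed, illustrative restatement of kvm_mmu_split_huge_page()
(the function name here is made up; cases (1) and (3) are handled in
prepare_to_split_huge_page() and kvm_mmu_get_sp_for_split() and are not
shown; error handling, tracing and lock yielding are also elided):

  static void split_one_huge_spte(struct kvm *kvm,
                                  const struct kvm_memory_slot *slot,
                                  u64 *huge_sptep,
                                  struct kvm_mmu_page *split_sp,
                                  bool *flush)
  {
          u64 huge_spte = READ_ONCE(*huge_sptep);
          unsigned int access = split_sp->role.access;
          int level = split_sp->role.level;
          int i;

          for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
                  gfn_t gfn = kvm_mmu_page_get_gfn(split_sp, i);

                  /* Case (4): no pte_list_desc available, so leave the
                   * entry non-present and flush before unlocking. */
                  if (gfn_to_rmap(gfn, level, slot)->val) {
                          *flush = true;
                          continue;
                  }

                  mmu_spte_set(&split_sp->spt[i],
                               make_huge_page_split_spte(huge_spte, level + 1,
                                                         i, access));
                  __rmap_add(kvm, NULL, slot, &split_sp->spt[i], gfn, access);
          }

          /* Swap the huge SPTE for the newly populated page table. */
          __drop_large_spte(kvm, huge_sptep, false);
          __link_shadow_page(NULL, huge_sptep, split_sp);
  }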

Suggested-by: Peter Feiner <pfeiner@google.com>
[ This commit is based off of the original implementation of Eager Page
  Splitting from Peter in Google's kernel from 2016. ]
Signed-off-by: David Matlack <dmatlack@google.com>
---
 .../admin-guide/kernel-parameters.txt         |   3 -
 arch/x86/kvm/mmu/mmu.c                        | 307 ++++++++++++++++++
 2 files changed, 307 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 05161afd7642..495f6ac53801 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2360,9 +2360,6 @@
 			the KVM_CLEAR_DIRTY ioctl, and only for the pages being
 			cleared.
 
-			Eager page splitting currently only supports splitting
-			huge pages mapped by the TDP MMU.
-
 			Default is Y (on).
 
 	kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 926ddfaa9e1a..dd56b5b9624f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -727,6 +727,11 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 
 static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
 {
+	static const gfp_t gfp_nocache = GFP_ATOMIC | __GFP_ACCOUNT | __GFP_ZERO;
+
+	if (WARN_ON_ONCE(!cache))
+		return kmem_cache_alloc(pte_list_desc_cache, gfp_nocache);
+
 	return kvm_mmu_memory_cache_alloc(cache);
 }
 
@@ -743,6 +748,28 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
 }
 
+static gfn_t sptep_to_gfn(u64 *sptep)
+{
+	struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+
+	return kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
+}
+
+static unsigned int kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
+{
+	if (!sp->role.direct)
+		return sp->shadowed_translation[index].access;
+
+	return sp->role.access;
+}
+
+static unsigned int sptep_to_access(u64 *sptep)
+{
+	struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+
+	return kvm_mmu_page_get_access(sp, sptep - sp->spt);
+}
+
 static void kvm_mmu_page_set_gfn_access(struct kvm_mmu_page *sp, int index,
 					gfn_t gfn, u32 access)
 {
@@ -912,6 +939,9 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 	return count;
 }
 
+static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
+					 const struct kvm_memory_slot *slot);
+
 static void
 pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
 			   struct pte_list_desc *desc, int i,
@@ -2125,6 +2155,23 @@ static struct kvm_mmu_page *__kvm_mmu_find_shadow_page(struct kvm *kvm,
 	return sp;
 }
 
+static struct kvm_mmu_page *kvm_mmu_find_direct_sp(struct kvm *kvm, gfn_t gfn,
+						   union kvm_mmu_page_role role)
+{
+	struct kvm_mmu_page *sp;
+	LIST_HEAD(invalid_list);
+
+	BUG_ON(!role.direct);
+
+	sp = __kvm_mmu_find_shadow_page(kvm, gfn, role, &invalid_list);
+
+	/* Direct SPs are never unsync. */
+	WARN_ON_ONCE(sp && sp->unsync);
+
+	kvm_mmu_commit_zap_page(kvm, &invalid_list);
+	return sp;
+}
+
 /*
  * Looks up an existing SP for the given gfn and role if one exists. The
  * return SP is guaranteed to be synced.
@@ -6063,12 +6110,266 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
+static int prepare_to_split_huge_page(struct kvm *kvm,
+				      const struct kvm_memory_slot *slot,
+				      u64 *huge_sptep,
+				      struct kvm_mmu_page **spp,
+				      bool *flush,
+				      bool *dropped_lock)
+{
+	int r = 0;
+
+	*dropped_lock = false;
+
+	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
+		return -ENOSPC;
+
+	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
+		goto drop_lock;
+
+	*spp = kvm_mmu_alloc_direct_sp_for_split(true);
+	if (!*spp)
+		goto drop_lock;
+
+	return 0;
+
+drop_lock:
+	if (*flush)
+		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+
+	*flush = false;
+	*dropped_lock = true;
+
+	write_unlock(&kvm->mmu_lock);
+	cond_resched();
+	*spp = kvm_mmu_alloc_direct_sp_for_split(false);
+	if (!*spp)
+		r = -ENOMEM;
+	write_lock(&kvm->mmu_lock);
+
+	return r;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
+						     const struct kvm_memory_slot *slot,
+						     u64 *huge_sptep,
+						     struct kvm_mmu_page **spp)
+{
+	struct kvm_mmu_page *split_sp;
+	union kvm_mmu_page_role role;
+	unsigned int access;
+	gfn_t gfn;
+
+	gfn = sptep_to_gfn(huge_sptep);
+	access = sptep_to_access(huge_sptep);
+
+	/*
+	 * Huge page splitting always uses direct shadow pages since we are
+	 * directly mapping the huge page GFN region with smaller pages.
+	 */
+	role = kvm_mmu_child_role(huge_sptep, true, access);
+	split_sp = kvm_mmu_find_direct_sp(kvm, gfn, role);
+
+	/*
+	 * Opt not to split if the lower-level SP already exists. This requires
+	 * more complex handling as the SP may be already partially filled in
+	 * and may need extra pte_list_desc structs to update parent_ptes.
+	 */
+	if (split_sp)
+		return NULL;
+
+	swap(split_sp, *spp);
+	init_shadow_page(kvm, split_sp, slot, gfn, role);
+	trace_kvm_mmu_get_page(split_sp, true);
+
+	return split_sp;
+}
+
+static int kvm_mmu_split_huge_page(struct kvm *kvm,
+				   const struct kvm_memory_slot *slot,
+				   u64 *huge_sptep, struct kvm_mmu_page **spp,
+				   bool *flush)
+
+{
+	struct kvm_mmu_page *split_sp;
+	u64 huge_spte, split_spte;
+	int split_level, index;
+	unsigned int access;
+	u64 *split_sptep;
+	gfn_t split_gfn;
+
+	split_sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
+	if (!split_sp)
+		return -EOPNOTSUPP;
+
+	/*
+	 * Since we did not allocate pte_list_desc structs for the split, we
+	 * cannot add a new parent SPTE to parent_ptes. This should never happen
+	 * in practice though since this is a fresh SP.
+	 *
+	 * Note, this makes it safe to pass NULL to __link_shadow_page() below.
+	 */
+	if (WARN_ON_ONCE(split_sp->parent_ptes.val))
+		return -EINVAL;
+
+	huge_spte = READ_ONCE(*huge_sptep);
+
+	split_level = split_sp->role.level;
+	access = split_sp->role.access;
+
+	for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
+		split_sptep = &split_sp->spt[index];
+		split_gfn = kvm_mmu_page_get_gfn(split_sp, index);
+
+		BUG_ON(is_shadow_present_pte(*split_sptep));
+
+		/*
+		 * Since we did not allocate pte_list_desc structs for the
+		 * split, we can't add a new SPTE that maps this GFN.
+		 * Skipping this SPTE means we're only partially mapping the
+		 * huge page, which means we'll need to flush TLBs before
+		 * dropping the MMU lock.
+		 *
+		 * Note, this makes it safe to pass NULL to __rmap_add() below.
+		 */
+		if (gfn_to_rmap(split_gfn, split_level, slot)->val) {
+			*flush = true;
+			continue;
+		}
+
+		split_spte = make_huge_page_split_spte(
+				huge_spte, split_level + 1, index, access);
+
+		mmu_spte_set(split_sptep, split_spte);
+		__rmap_add(kvm, NULL, slot, split_sptep, split_gfn, access);
+	}
+
+	/*
+	 * Replace the huge spte with a pointer to the populated lower level
+	 * page table. Since we are making this change without a TLB flush vCPUs
+	 * will see a mix of the split mappings and the original huge mapping,
+	 * depending on what's currently in their TLB. This is fine from a
+	 * correctness standpoint since the translation will either be identical
+	 * or non-present. To account for non-present mappings, the TLB will be
+	 * flushed prior to dropping the MMU lock.
+	 */
+	__drop_large_spte(kvm, huge_sptep, false);
+	__link_shadow_page(NULL, huge_sptep, split_sp);
+
+	return 0;
+}
+
+static bool should_split_huge_page(u64 *huge_sptep)
+{
+	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+
+	if (WARN_ON_ONCE(!is_large_pte(*huge_sptep)))
+		return false;
+
+	if (huge_sp->role.invalid)
+		return false;
+
+	/*
+	 * As a policy, do not split huge pages if the SP on which they reside
+	 * is unsync. Unsync means the guest is modifying the page table being
+	 * shadowed by huge_sp, so splitting may be a waste of cycles and
+	 * memory.
+	 */
+	if (huge_sp->unsync)
+		return false;
+
+	return true;
+}
+
+static bool rmap_try_split_huge_pages(struct kvm *kvm,
+				      struct kvm_rmap_head *rmap_head,
+				      const struct kvm_memory_slot *slot)
+{
+	struct kvm_mmu_page *sp = NULL;
+	struct rmap_iterator iter;
+	u64 *huge_sptep, spte;
+	bool flush = false;
+	bool dropped_lock;
+	int level;
+	gfn_t gfn;
+	int r;
+
+restart:
+	for_each_rmap_spte(rmap_head, &iter, huge_sptep) {
+		if (!should_split_huge_page(huge_sptep))
+			continue;
+
+		spte = *huge_sptep;
+		level = sptep_to_sp(huge_sptep)->role.level;
+		gfn = sptep_to_gfn(huge_sptep);
+
+		r = prepare_to_split_huge_page(kvm, slot, huge_sptep, &sp, &flush, &dropped_lock);
+		if (r) {
+			trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
+			break;
+		}
+
+		if (dropped_lock)
+			goto restart;
+
+		r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp, &flush);
+
+		trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
+
+		/*
+		 * If splitting is successful we must restart the iterator
+		 * because huge_sptep has just been removed from it.
+		 */
+		if (!r)
+			goto restart;
+	}
+
+	if (sp)
+		kvm_mmu_free_shadow_page(sp);
+
+	return flush;
+}
+
+static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
+					  const struct kvm_memory_slot *slot,
+					  gfn_t start, gfn_t end,
+					  int target_level)
+{
+	bool flush = false;
+	int level;
+
+	/*
+	 * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
+	 * down to the target level. This ensures pages are recursively split
+	 * all the way to the target level. There's no need to split pages
+	 * already at the target level.
+	 *
+	 * Note that TLB flushes must be done before dropping the MMU lock since
+	 * rmap_try_split_huge_pages() may partially split any given huge page,
+	 * i.e. it may effectively unmap (make non-present) a portion of the
+	 * huge page.
+	 */
+	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
+		flush = slot_handle_level_range(kvm, slot,
+						rmap_try_split_huge_pages,
+						level, level, start, end - 1,
+						true, flush);
+	}
+
+	if (flush)
+		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+}
+
 /* Must be called with the mmu_lock held in write-mode. */
 void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot,
 				   u64 start, u64 end,
 				   int target_level)
 {
+	if (kvm_memslots_have_rmaps(kvm))
+		kvm_rmap_try_split_huge_pages(kvm, memslot, start, end,
+					      target_level);
+
 	if (is_tdp_mmu_enabled(kvm))
 		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end,
 						 target_level, false);
@@ -6086,6 +6387,12 @@ void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
 	u64 start = memslot->base_gfn;
 	u64 end = start + memslot->npages;
 
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
+		kvm_rmap_try_split_huge_pages(kvm, memslot, start, end, target_level);
+		write_unlock(&kvm->mmu_lock);
+	}
+
 	if (is_tdp_mmu_enabled(kvm)) {
 		read_lock(&kvm->mmu_lock);
 		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 21/26] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
declaration time rather than being fixed for all declarations. This will
be used in a follow-up commit to declare a cache in x86 with a capacity
of 512+ objects without having to increase the capacity of all caches in
KVM.

This change requires each cache now specify its capacity at runtime,
since the cache struct itself no longer has a fixed capacity known at
compile time. To protect against someone accidentally defining a
kvm_mmu_memory_cache struct directly (without the extra storage), this
commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().

This change, unfortunately, adds some grottiness to
kvm_phys_addr_ioremap() in arm64, which uses a function-local (i.e.
stack-allocated) kvm_mmu_memory_cache struct. Since C does not allow
anonymous structs in functions, the new wrapper struct that contains
kvm_mmu_memory_cache and the objects pointer array must be named, which
means dealing with an outer and inner struct. The outer struct can't be
dropped since then there would be no guarantee the kvm_mmu_memory_cache
struct and objects array would be laid out consecutively on the stack.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  2 +-
 arch/arm64/kvm/arm.c              |  1 +
 arch/arm64/kvm/mmu.c              | 13 +++++++++----
 arch/mips/include/asm/kvm_host.h  |  2 +-
 arch/mips/kvm/mips.c              |  2 ++
 arch/riscv/include/asm/kvm_host.h |  2 +-
 arch/riscv/kvm/vcpu.c             |  1 +
 arch/x86/include/asm/kvm_host.h   |  8 ++++----
 arch/x86/kvm/mmu/mmu.c            |  9 +++++++++
 include/linux/kvm_types.h         | 19 +++++++++++++++++--
 virt/kvm/kvm_main.c               | 10 +++++++++-
 11 files changed, 55 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 5bc01e62c08a..1369415290dd 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -357,7 +357,7 @@ struct kvm_vcpu_arch {
 	bool pause;
 
 	/* Cache some mmu pages needed inside spinlock regions */
-	struct kvm_mmu_memory_cache mmu_page_cache;
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
 
 	/* Target CPU and feature flags */
 	int target;
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index ecc5958e27fe..5e38385be0ef 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -319,6 +319,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.target = -1;
 	bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
 
+	vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
 
 	/* Set up the timer */
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index bc2aba953299..940089ba65ad 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -765,7 +765,12 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 {
 	phys_addr_t addr;
 	int ret = 0;
-	struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
+	DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
+		.cache = {
+			.gfp_zero = __GFP_ZERO,
+			.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
+		},
+	};
 	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
 				     KVM_PGTABLE_PROT_R |
@@ -778,14 +783,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 	guest_ipa &= PAGE_MASK;
 
 	for (addr = guest_ipa; addr < guest_ipa + size; addr += PAGE_SIZE) {
-		ret = kvm_mmu_topup_memory_cache(&cache,
+		ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
 						 kvm_mmu_cache_min_pages(kvm));
 		if (ret)
 			break;
 
 		spin_lock(&kvm->mmu_lock);
 		ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
-					     &cache);
+					     &page_cache.cache);
 		spin_unlock(&kvm->mmu_lock);
 		if (ret)
 			break;
@@ -793,7 +798,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 		pa += PAGE_SIZE;
 	}
 
-	kvm_mmu_free_memory_cache(&cache);
+	kvm_mmu_free_memory_cache(&page_cache.cache);
 	return ret;
 }
 
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 717716cc51c5..935511d7fc3a 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -347,7 +347,7 @@ struct kvm_vcpu_arch {
 	unsigned long pending_exceptions_clr;
 
 	/* Cache some mmu pages needed inside spinlock regions */
-	struct kvm_mmu_memory_cache mmu_page_cache;
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
 
 	/* vcpu's vzguestid is different on each host cpu in an smp system */
 	u32 vzguestid[NR_CPUS];
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index a25e0b73ee70..45c7179144dc 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -387,6 +387,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	if (err)
 		goto out_free_gebase;
 
+	vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
+
 	return 0;
 
 out_free_gebase:
diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 99ef6a120617..5bd4902ebda3 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -186,7 +186,7 @@ struct kvm_vcpu_arch {
 	struct kvm_sbi_context sbi_context;
 
 	/* Cache pages needed to program page tables with spinlock held */
-	struct kvm_mmu_memory_cache mmu_page_cache;
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
 
 	/* VCPU power-off state */
 	bool power_off;
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 624166004e36..6a5f5aa45bac 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -94,6 +94,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 
 	/* Mark this VCPU never ran */
 	vcpu->arch.ran_atleast_once = false;
+	vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
 
 	/* Setup ISA features available to VCPU */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0f5a36772bdc..544dde11963b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -692,10 +692,10 @@ struct kvm_vcpu_arch {
 	 */
 	struct kvm_mmu *walk_mmu;
 
-	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
-	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
-	struct kvm_mmu_memory_cache mmu_shadowed_translation_cache;
-	struct kvm_mmu_memory_cache mmu_page_header_cache;
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_pte_list_desc_cache);
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadow_page_cache);
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadowed_translation_cache);
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_header_cache);
 
 	/*
 	 * QEMU userspace and the guest each have their own FPU state.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index dd56b5b9624f..24e7e053e05b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5817,12 +5817,21 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
 {
 	int ret;
 
+	vcpu->arch.mmu_pte_list_desc_cache.capacity =
+		KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
 	vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
 
+	vcpu->arch.mmu_page_header_cache.capacity =
+		KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
 	vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
 
+	vcpu->arch.mmu_shadowed_translation_cache.capacity =
+		KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
+
+	vcpu->arch.mmu_shadow_page_cache.capacity =
+		KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
 	vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
 
 	vcpu->arch.mmu = &vcpu->arch.root_mmu;
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index ac1ebb37a0ff..579cf39986ec 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -83,14 +83,29 @@ struct gfn_to_pfn_cache {
  * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
  * holding MMU locks.  Note, these caches act more like prefetch buffers than
  * classical caches, i.e. objects are not returned to the cache on being freed.
+ *
+ * The storage for the cache object pointers is laid out after the struct, to
+ * allow different declarations to choose different capacities. The capacity
+ * field defines the number of object pointers available after the struct.
  */
 struct kvm_mmu_memory_cache {
 	int nobjs;
+	int capacity;
 	gfp_t gfp_zero;
 	struct kmem_cache *kmem_cache;
-	void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
+	void *objects[];
 };
-#endif
+
+#define __DEFINE_KVM_MMU_MEMORY_CACHE(_name, _capacity)		\
+	struct {						\
+		struct kvm_mmu_memory_cache _name;		\
+		void *_name##_objects[_capacity];		\
+	}
+
+#define DEFINE_KVM_MMU_MEMORY_CACHE(_name) \
+	__DEFINE_KVM_MMU_MEMORY_CACHE(_name, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE)
+
+#endif /* KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE */
 
 #define HALT_POLL_HIST_COUNT			32
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9581a24c3d17..1d849ba9529f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -371,9 +371,17 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
 {
 	void *obj;
 
+	/*
+	 * The capacity field must be initialized since the storage for the
+	 * objects pointer array is laid out after the kvm_mmu_memory_cache
+	 * struct and not known at compile time.
+	 */
+	if (WARN_ON(mc->capacity == 0))
+		return -EINVAL;
+
 	if (mc->nobjs >= min)
 		return 0;
-	while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
+	while (mc->nobjs < mc->capacity) {
 		obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
 		if (!obj)
 			return mc->nobjs >= min ? 0 : -ENOMEM;
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 22/26] KVM: Allow GFP flags to be passed when topping up MMU caches
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

This will be used in a subsequent commit to top up MMU caches under the
MMU lock with GFP_NOWAIT as part of eager page splitting.
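
As a rough sketch of the intended call pattern (the call site here is
hypothetical; the real user is added later in the series, and "cache",
"min" and the mmu_lock handling are placeholders):

  /* Under the MMU lock: opportunistic top-up that must not sleep. */
  r = __kvm_mmu_topup_memory_cache(cache, min, GFP_NOWAIT | __GFP_ACCOUNT);
  if (r) {
          /* Fall back to a sleeping allocation outside the lock. */
          write_unlock(&kvm->mmu_lock);
          r = kvm_mmu_topup_memory_cache(cache, min);  /* GFP_KERNEL_ACCOUNT */
          write_lock(&kvm->mmu_lock);
  }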

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/kvm_main.c      | 9 +++++++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 252ee4a61b58..7d3a1f28beb2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1335,6 +1335,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
 
 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
 int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
+int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min, gfp_t gfp);
 int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
 void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1d849ba9529f..7861874af1c8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -367,7 +367,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
 		return (void *)__get_free_page(gfp_flags);
 }
 
-int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
+int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min, gfp_t gfp)
 {
 	void *obj;
 
@@ -382,7 +382,7 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
 	if (mc->nobjs >= min)
 		return 0;
 	while (mc->nobjs < mc->capacity) {
-		obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
+		obj = mmu_memory_cache_alloc_obj(mc, gfp);
 		if (!obj)
 			return mc->nobjs >= min ? 0 : -ENOMEM;
 		mc->objects[mc->nobjs++] = obj;
@@ -390,6 +390,11 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
 	return 0;
 }
 
+int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
+{
+	return __kvm_mmu_topup_memory_cache(mc, min, GFP_KERNEL_ACCOUNT);
+}
+
 int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
 {
 	return mc->nobjs;
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 23/26] KVM: x86/mmu: Fully split huge pages that require extra pte_list_desc structs
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

When splitting a huge page we need to add all of the lower level SPTEs
to the memslot rmap. The current implementation of eager page splitting
bails if adding an SPTE would require allocating an extra pte_list_desc
struct. Fix this limitation by allocating enough pte_list_desc structs
before splitting the huge page.

This eliminates the need for TLB flushing under the MMU lock because the
huge page is always entirely split (no subregion of the huge page is
unmapped).
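
Schematically, the fix is to fill the per-VM descriptor cache to its
worst-case capacity before touching any rmaps, so that no allocation can
fail mid-split (simplified from topup_huge_page_split_desc_cache() below;
"gfp" depends on whether the MMU lock is held):

  /* Worst case: one pte_list_desc per new lower level leaf SPTE. */
  r = __kvm_mmu_topup_memory_cache(&kvm->arch.huge_page_split_desc_cache,
                                   HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY, gfp);
  if (r)
          return r;

  /* Every __rmap_add() during the split now draws from this cache. */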

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  10 +++
 arch/x86/kvm/mmu/mmu.c          | 131 ++++++++++++++++++--------------
 2 files changed, 85 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 544dde11963b..00a5c0bcc2eb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1240,6 +1240,16 @@ struct kvm_arch {
 	hpa_t	hv_root_tdp;
 	spinlock_t hv_root_tdp_lock;
 #endif
+
+	/*
+	 * Memory cache used to allocate pte_list_desc structs while splitting
+	 * huge pages. In the worst case, to split one huge page we need 512
+	 * pte_list_desc structs to add each new lower level leaf sptep to the
+	 * memslot rmap.
+	 */
+#define HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY 512
+	__DEFINE_KVM_MMU_MEMORY_CACHE(huge_page_split_desc_cache,
+				      HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY);
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 24e7e053e05b..95b8e2ef562f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1765,6 +1765,16 @@ struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, bool direc
 	return sp;
 }
 
+static inline gfp_t gfp_flags_for_split(bool locked)
+{
+	/*
+	 * If under the MMU lock, use GFP_NOWAIT to avoid direct reclaim (which
+	 * is slow) and to avoid making any filesystem callbacks (which can end
+	 * up invoking KVM MMU notifiers, resulting in a deadlock).
+	 */
+	return (locked ? GFP_NOWAIT : GFP_KERNEL) | __GFP_ACCOUNT;
+}
+
 /*
  * Allocate a new shadow page, potentially while holding the MMU lock.
  *
@@ -1772,17 +1782,11 @@ struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, bool direc
  * being mapped directly with a lower level page table. Thus there's no need to
  * allocate the shadowed_translation array.
  */
-struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(bool locked)
+static struct kvm_mmu_page *__kvm_mmu_alloc_direct_sp_for_split(gfp_t gfp)
 {
 	struct kvm_mmu_page *sp;
-	gfp_t gfp;
 
-	/*
-	 * If under the MMU lock, use GFP_NOWAIT to avoid direct reclaim (which
-	 * is slow) and to avoid making any filesystem callbacks (which can end
-	 * up invoking KVM MMU notifiers, resulting in a deadlock).
-	 */
-	gfp = (locked ? GFP_NOWAIT : GFP_KERNEL) | __GFP_ACCOUNT | __GFP_ZERO;
+	gfp |= __GFP_ZERO;
 
 	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
 	if (!sp)
@@ -1799,6 +1803,13 @@ struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(bool locked)
 	return sp;
 }
 
+struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(bool locked)
+{
+	gfp_t gfp = gfp_flags_for_split(locked);
+
+	return __kvm_mmu_alloc_direct_sp_for_split(gfp);
+}
+
 static void mark_unsync(u64 *spte);
 static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
 {
@@ -5989,6 +6000,11 @@ void kvm_mmu_init_vm(struct kvm *kvm)
 	node->track_write = kvm_mmu_pte_write;
 	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
 	kvm_page_track_register_notifier(kvm, node);
+
+	kvm->arch.huge_page_split_desc_cache.capacity =
+		HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY;
+	kvm->arch.huge_page_split_desc_cache.kmem_cache = pte_list_desc_cache;
+	kvm->arch.huge_page_split_desc_cache.gfp_zero = __GFP_ZERO;
 }
 
 void kvm_mmu_uninit_vm(struct kvm *kvm)
@@ -6119,11 +6135,43 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
+static int topup_huge_page_split_desc_cache(struct kvm *kvm, gfp_t gfp)
+{
+	/*
+	 * We may need up to HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY descriptors
+	 * to split any given huge page. We could more accurately calculate how
+	 * many we actually need by inspecting all the rmaps and check which
+	 * many we actually need by inspecting all the rmaps and checking which
+	 * code complexity.
+	 */
+	return __kvm_mmu_topup_memory_cache(
+			&kvm->arch.huge_page_split_desc_cache,
+			HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY,
+			gfp);
+}
+
+static int alloc_memory_for_split(struct kvm *kvm, struct kvm_mmu_page **spp,
+				  bool locked)
+{
+	gfp_t gfp = gfp_flags_for_split(locked);
+	int r;
+
+	r = topup_huge_page_split_desc_cache(kvm, gfp);
+	if (r)
+		return r;
+
+	if (!*spp) {
+		*spp = __kvm_mmu_alloc_direct_sp_for_split(gfp);
+		r = *spp ? 0 : -ENOMEM;
+	}
+
+	return r;
+}
+
 static int prepare_to_split_huge_page(struct kvm *kvm,
 				      const struct kvm_memory_slot *slot,
 				      u64 *huge_sptep,
 				      struct kvm_mmu_page **spp,
-				      bool *flush,
 				      bool *dropped_lock)
 {
 	int r = 0;
@@ -6136,24 +6184,18 @@ static int prepare_to_split_huge_page(struct kvm *kvm,
 	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
 		goto drop_lock;
 
-	*spp = kvm_mmu_alloc_direct_sp_for_split(true);
+	r = alloc_memory_for_split(kvm, spp, true);
 	if (r)
 		goto drop_lock;
 
 	return 0;
 
 drop_lock:
-	if (*flush)
-		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
-
-	*flush = false;
 	*dropped_lock = true;
 
 	write_unlock(&kvm->mmu_lock);
 	cond_resched();
-	*spp = kvm_mmu_alloc_direct_sp_for_split(false);
-	if (!*spp)
-		r = -ENOMEM;
+	r = alloc_memory_for_split(kvm, spp, false);
 	write_lock(&kvm->mmu_lock);
 
 	return r;
@@ -6196,10 +6238,10 @@ static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
 
 static int kvm_mmu_split_huge_page(struct kvm *kvm,
 				   const struct kvm_memory_slot *slot,
-				   u64 *huge_sptep, struct kvm_mmu_page **spp,
-				   bool *flush)
+				   u64 *huge_sptep, struct kvm_mmu_page **spp)
 
 {
+	struct kvm_mmu_memory_cache *cache;
 	struct kvm_mmu_page *split_sp;
 	u64 huge_spte, split_spte;
 	int split_level, index;
@@ -6212,9 +6254,9 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
 		return -EOPNOTSUPP;
 
 	/*
-	 * Since we did not allocate pte_list_desc_structs for the split, we
-	 * cannot add a new parent SPTE to parent_ptes. This should never happen
-	 * in practice though since this is a fresh SP.
+	 * We did not allocate an extra pte_list_desc struct to add huge_sptep
+	 * to split_sp->parent_ptes. An extra pte_list_desc struct should never
+	 * be necessary in practice though since split_sp is brand new.
 	 *
 	 * Note, this makes it safe to pass NULL to __link_shadow_page() below.
 	 */
@@ -6225,6 +6267,7 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
 
 	split_level = split_sp->role.level;
 	access = split_sp->role.access;
+	cache = &kvm->arch.huge_page_split_desc_cache;
 
 	for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
 		split_sptep = &split_sp->spt[index];
@@ -6232,25 +6275,11 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
 
 		BUG_ON(is_shadow_present_pte(*split_sptep));
 
-		/*
-		 * Since we did not allocate pte_list_desc structs for the
-		 * split, we can't add a new SPTE that maps this GFN.
-		 * Skipping this SPTE means we're only partially mapping the
-		 * huge page, which means we'll need to flush TLBs before
-		 * dropping the MMU lock.
-		 *
-		 * Note, this make it safe to pass NULL to __rmap_add() below.
-		 */
-		if (gfn_to_rmap(split_gfn, split_level, slot)->val) {
-			*flush = true;
-			continue;
-		}
-
 		split_spte = make_huge_page_split_spte(
 				huge_spte, split_level + 1, index, access);
 
 		mmu_spte_set(split_sptep, split_spte);
-		__rmap_add(kvm, NULL, slot, split_sptep, split_gfn, access);
+		__rmap_add(kvm, cache, slot, split_sptep, split_gfn, access);
 	}
 
 	/*
@@ -6258,9 +6287,7 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
 	 * page table. Since we are making this change without a TLB flush vCPUs
 	 * will see a mix of the split mappings and the original huge mapping,
 	 * depending on what's currently in their TLB. This is fine from a
-	 * correctness standpoint since the translation will either be identical
-	 * or non-present. To account for non-present mappings, the TLB will be
-	 * flushed prior to dropping the MMU lock.
+	 * correctness standpoint since the translation will be identical.
 	 */
 	__drop_large_spte(kvm, huge_sptep, false);
 	__link_shadow_page(NULL, huge_sptep, split_sp);
@@ -6297,7 +6324,6 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
 	struct kvm_mmu_page *sp = NULL;
 	struct rmap_iterator iter;
 	u64 *huge_sptep, spte;
-	bool flush = false;
 	bool dropped_lock;
 	int level;
 	gfn_t gfn;
@@ -6312,7 +6338,7 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
 		level = sptep_to_sp(huge_sptep)->role.level;
 		gfn = sptep_to_gfn(huge_sptep);
 
-		r = prepare_to_split_huge_page(kvm, slot, huge_sptep, &sp, &flush, &dropped_lock);
+		r = prepare_to_split_huge_page(kvm, slot, huge_sptep, &sp, &dropped_lock);
 		if (r) {
 			trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
 			break;
@@ -6321,7 +6347,7 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
 		if (dropped_lock)
 			goto restart;
 
-		r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp, &flush);
+		r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp);
 
 		trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
 
@@ -6336,7 +6362,7 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
 	if (sp)
 		kvm_mmu_free_shadow_page(sp);
 
-	return flush;
+	return false;
 }
 
 static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
@@ -6344,7 +6370,6 @@ static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
 					  gfn_t start, gfn_t end,
 					  int target_level)
 {
-	bool flush;
 	int level;
 
 	/*
@@ -6352,21 +6377,15 @@ static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
 	 * down to the target level. This ensures pages are recursively split
 	 * all the way to the target level. There's no need to split pages
 	 * already at the target level.
-	 *
-	 * Note that TLB flushes must be done before dropping the MMU lock since
-	 * rmap_try_split_huge_pages() may partially split any given huge page,
-	 * i.e. it may effectively unmap (make non-present) a portion of the
-	 * huge page.
 	 */
 	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
-		flush = slot_handle_level_range(kvm, slot,
-						rmap_try_split_huge_pages,
-						level, level, start, end - 1,
-						true, flush);
+		slot_handle_level_range(kvm, slot,
+					rmap_try_split_huge_pages,
+					level, level, start, end - 1,
+					true, false);
 	}
 
-	if (flush)
-		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+	kvm_mmu_free_memory_cache(&kvm->arch.huge_page_split_desc_cache);
 }
 
 /* Must be called with the mmu_lock held in write-mode. */
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 24/26] KVM: x86/mmu: Split huge pages aliased by multiple SPTEs
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

The existing huge page splitting code bails if it encounters a huge page
that is aliased by another SPTE that has already been split (either due
to NX huge pages or eager page splitting). Extend the huge page
splitting code to also handle such aliases.

The thing we have to be careful about is dealing with what's already in
the lower level page table. If eager page splitting was the only
operation that split huge pages, this would be fine. However huge pages
can also be split by NX huge pages. This means the lower level page
table may only be partially filled in and may point to even lower level
page tables that are partially filled in. We can fill in the rest of the
page table but dealing with the lower level page tables would be too
complex.

To handle this we flush TLBs after dropping the huge SPTE whenever we
are about to install a lower level page table that was partially filled
in (*). We can skip the TLB flush if the lower level page table was
empty (no aliasing) or identical to what we were already going to
populate it with (aliased huge page that was just eagerly split).

(*) This TLB flush could probably be delayed until we're about to drop
the MMU lock, which would also let us batch flushes for multiple splits.
However such scenarios should be rare in practice (a huge page must be
aliased in multiple SPTEs and have been split for NX Huge Pages in only
some of them). Flushing immediately is simpler to plumb and also reduces
the chances of tripping over a CPU bug (e.g. see iTLB multi-hit).
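
Per lower-level entry, the decision reduces to the check below (this
mirrors the hunk added to kvm_mmu_split_huge_page() in this patch):

  if (is_shadow_present_pte(*split_sptep)) {
          /*
           * Present but not a leaf: the entry points at an even lower
           * level table that may be partially filled in, so part of the
           * huge mapping is effectively going away and TLBs must be
           * flushed before the new page table is linked in.
           */
          flush |= !is_last_spte(*split_sptep, split_level);
          continue;
  }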

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  5 ++-
 arch/x86/kvm/mmu/mmu.c          | 73 +++++++++++++++------------------
 2 files changed, 36 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 00a5c0bcc2eb..275d00528805 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1245,9 +1245,10 @@ struct kvm_arch {
 	 * Memory cache used to allocate pte_list_desc structs while splitting
 	 * huge pages. In the worst case, to split one huge page we need 512
 	 * pte_list_desc structs to add each new lower level leaf sptep to the
-	 * memslot rmap.
+	 * memslot rmap plus 1 to extend the parent_ptes rmap of the new lower
+	 * level page table.
 	 */
-#define HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY 512
+#define HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY 513
 	__DEFINE_KVM_MMU_MEMORY_CACHE(huge_page_split_desc_cache,
 				      HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY);
 };
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 95b8e2ef562f..68785b422a08 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6208,6 +6208,7 @@ static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
 {
 	struct kvm_mmu_page *split_sp;
 	union kvm_mmu_page_role role;
+	bool created = false;
 	unsigned int access;
 	gfn_t gfn;
 
@@ -6220,25 +6221,21 @@ static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
 	 */
 	role = kvm_mmu_child_role(huge_sptep, true, access);
 	split_sp = kvm_mmu_find_direct_sp(kvm, gfn, role);
-
-	/*
-	 * Opt not to split if the lower-level SP already exists. This requires
-	 * more complex handling as the SP may be already partially filled in
-	 * and may need extra pte_list_desc structs to update parent_ptes.
-	 */
 	if (split_sp)
-		return NULL;
+		goto out;
 
+	created = true;
 	swap(split_sp, *spp);
 	init_shadow_page(kvm, split_sp, slot, gfn, role);
-	trace_kvm_mmu_get_page(split_sp, true);
 
+out:
+	trace_kvm_mmu_get_page(split_sp, created);
 	return split_sp;
 }
 
-static int kvm_mmu_split_huge_page(struct kvm *kvm,
-				   const struct kvm_memory_slot *slot,
-				   u64 *huge_sptep, struct kvm_mmu_page **spp)
+static void kvm_mmu_split_huge_page(struct kvm *kvm,
+				    const struct kvm_memory_slot *slot,
+				    u64 *huge_sptep, struct kvm_mmu_page **spp)
 
 {
 	struct kvm_mmu_memory_cache *cache;
@@ -6246,22 +6243,11 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
 	u64 huge_spte, split_spte;
 	int split_level, index;
 	unsigned int access;
+	bool flush = false;
 	u64 *split_sptep;
 	gfn_t split_gfn;
 
 	split_sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
-	if (!split_sp)
-		return -EOPNOTSUPP;
-
-	/*
-	 * We did not allocate an extra pte_list_desc struct to add huge_sptep
-	 * to split_sp->parent_ptes. An extra pte_list_desc struct should never
-	 * be necessary in practice though since split_sp is brand new.
-	 *
-	 * Note, this makes it safe to pass NULL to __link_shadow_page() below.
-	 */
-	if (WARN_ON_ONCE(split_sp->parent_ptes.val))
-		return -EINVAL;
 
 	huge_spte = READ_ONCE(*huge_sptep);
 
@@ -6273,7 +6259,20 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
 		split_sptep = &split_sp->spt[index];
 		split_gfn = kvm_mmu_page_get_gfn(split_sp, index);
 
-		BUG_ON(is_shadow_present_pte(*split_sptep));
+		/*
+		 * split_sp may have populated page table entries if this huge
+		 * page is aliased in multiple shadow page table entries. We
+		 * know the existing SP will be mapping the same GFN->PFN
+		 * translation since this is a direct SP. However, the SPTE may
+		 * point to an even lower level page table that may only be
+		 * partially filled in (e.g. for NX huge pages). In other words,
+		 * we may be unmapping a portion of the huge page, which
+		 * requires a TLB flush.
+		 */
+		if (is_shadow_present_pte(*split_sptep)) {
+			flush |= !is_last_spte(*split_sptep, split_level);
+			continue;
+		}
 
 		split_spte = make_huge_page_split_spte(
 				huge_spte, split_level + 1, index, access);
@@ -6284,15 +6283,12 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
 
 	/*
 	 * Replace the huge spte with a pointer to the populated lower level
-	 * page table. Since we are making this change without a TLB flush vCPUs
-	 * will see a mix of the split mappings and the original huge mapping,
-	 * depending on what's currently in their TLB. This is fine from a
-	 * correctness standpoint since the translation will be identical.
+	 * page table. If the lower-level page table identically maps the huge
+	 * page, there's no need for a TLB flush. Otherwise, flush TLBs after
+	 * dropping the huge page and before installing the shadow page table.
 	 */
-	__drop_large_spte(kvm, huge_sptep, false);
-	__link_shadow_page(NULL, huge_sptep, split_sp);
-
-	return 0;
+	__drop_large_spte(kvm, huge_sptep, flush);
+	__link_shadow_page(cache, huge_sptep, split_sp);
 }
 
 static bool should_split_huge_page(u64 *huge_sptep)
@@ -6347,16 +6343,13 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
 		if (dropped_lock)
 			goto restart;
 
-		r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp);
-
-		trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
-
 		/*
-		 * If splitting is successful we must restart the iterator
-		 * because huge_sptep has just been removed from it.
+		 * After splitting we must restart the iterator because
+		 * huge_sptep has just been removed from it.
 		 */
-		if (!r)
-			goto restart;
+		kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp);
+		trace_kvm_mmu_split_huge_page(gfn, spte, level, 0);
+		goto restart;
 	}
 
 	if (sp)
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 24/26] KVM: x86/mmu: Split huge pages aliased by multiple SPTEs
@ 2022-03-11  0:25   ` David Matlack
  0 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	David Matlack, Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

The existing huge page splitting code bails if it encounters a huge page
that is aliased by another SPTE that has already been split (either due
to NX huge pages or eager page splitting). Extend the huge page
splitting code to also handle such aliases.

The thing we have to be careful about is dealing with what's already in
the lower level page table. If eager page splitting was the only
operation that split huge pages, this would be fine. However huge pages
can also be split by NX huge pages. This means the lower level page
table may only be partially filled in and may point to even lower level
page tables that are partially filled in. We can fill in the rest of the
page table but dealing with the lower level page tables would be too
complex.

To handle this we flush TLBs after dropping the huge SPTE whenever we
are about to install a lower level page table that was partially filled
in (*). We can skip the TLB flush if the lower level page table was
empty (no aliasing) or identical to what we were already going to
populate it with (aliased huge page that was just eagerly split).

(*) This TLB flush could probably be delayed until we're about to drop
the MMU lock, which would also let us batch flushes for multiple splits.
However such scenarios should be rare in practice (a huge page must be
aliased in multiple SPTEs and have been split for NX Huge Pages in only
some of them). Flushing immediately is simpler to plumb and also reduces
the chances of tripping over a CPU bug (e.g. see iTLB multi-hit).

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  5 ++-
 arch/x86/kvm/mmu/mmu.c          | 73 +++++++++++++++------------------
 2 files changed, 36 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 00a5c0bcc2eb..275d00528805 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1245,9 +1245,10 @@ struct kvm_arch {
 	 * Memory cache used to allocate pte_list_desc structs while splitting
 	 * huge pages. In the worst case, to split one huge page we need 512
 	 * pte_list_desc structs to add each new lower level leaf sptep to the
-	 * memslot rmap.
+	 * memslot rmap plus 1 to extend the parent_ptes rmap of the new lower
+	 * level page table.
 	 */
-#define HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY 512
+#define HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY 513
 	__DEFINE_KVM_MMU_MEMORY_CACHE(huge_page_split_desc_cache,
 				      HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY);
 };
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 95b8e2ef562f..68785b422a08 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6208,6 +6208,7 @@ static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
 {
 	struct kvm_mmu_page *split_sp;
 	union kvm_mmu_page_role role;
+	bool created = false;
 	unsigned int access;
 	gfn_t gfn;
 
@@ -6220,25 +6221,21 @@ static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
 	 */
 	role = kvm_mmu_child_role(huge_sptep, true, access);
 	split_sp = kvm_mmu_find_direct_sp(kvm, gfn, role);
-
-	/*
-	 * Opt not to split if the lower-level SP already exists. This requires
-	 * more complex handling as the SP may be already partially filled in
-	 * and may need extra pte_list_desc structs to update parent_ptes.
-	 */
 	if (split_sp)
-		return NULL;
+		goto out;
 
+	created = true;
 	swap(split_sp, *spp);
 	init_shadow_page(kvm, split_sp, slot, gfn, role);
-	trace_kvm_mmu_get_page(split_sp, true);
 
+out:
+	trace_kvm_mmu_get_page(split_sp, created);
 	return split_sp;
 }
 
-static int kvm_mmu_split_huge_page(struct kvm *kvm,
-				   const struct kvm_memory_slot *slot,
-				   u64 *huge_sptep, struct kvm_mmu_page **spp)
+static void kvm_mmu_split_huge_page(struct kvm *kvm,
+				    const struct kvm_memory_slot *slot,
+				    u64 *huge_sptep, struct kvm_mmu_page **spp)
 
 {
 	struct kvm_mmu_memory_cache *cache;
@@ -6246,22 +6243,11 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
 	u64 huge_spte, split_spte;
 	int split_level, index;
 	unsigned int access;
+	bool flush = false;
 	u64 *split_sptep;
 	gfn_t split_gfn;
 
 	split_sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
-	if (!split_sp)
-		return -EOPNOTSUPP;
-
-	/*
-	 * We did not allocate an extra pte_list_desc struct to add huge_sptep
-	 * to split_sp->parent_ptes. An extra pte_list_desc struct should never
-	 * be necessary in practice though since split_sp is brand new.
-	 *
-	 * Note, this makes it safe to pass NULL to __link_shadow_page() below.
-	 */
-	if (WARN_ON_ONCE(split_sp->parent_ptes.val))
-		return -EINVAL;
 
 	huge_spte = READ_ONCE(*huge_sptep);
 
@@ -6273,7 +6259,20 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
 		split_sptep = &split_sp->spt[index];
 		split_gfn = kvm_mmu_page_get_gfn(split_sp, index);
 
-		BUG_ON(is_shadow_present_pte(*split_sptep));
+		/*
+		 * split_sp may have populated page table entries if this huge
+		 * page is aliased in multiple shadow page table entries. We
+		 * know the existing SP will be mapping the same GFN->PFN
+		 * translation since this is a direct SP. However, the SPTE may
+		 * point to an even lower level page table that may only be
+		 * partially filled in (e.g. for NX huge pages). In other words,
+		 * we may be unmapping a portion of the huge page, which
+		 * requires a TLB flush.
+		 */
+		if (is_shadow_present_pte(*split_sptep)) {
+			flush |= !is_last_spte(*split_sptep, split_level);
+			continue;
+		}
 
 		split_spte = make_huge_page_split_spte(
 				huge_spte, split_level + 1, index, access);
@@ -6284,15 +6283,12 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
 
 	/*
 	 * Replace the huge spte with a pointer to the populated lower level
-	 * page table. Since we are making this change without a TLB flush vCPUs
-	 * will see a mix of the split mappings and the original huge mapping,
-	 * depending on what's currently in their TLB. This is fine from a
-	 * correctness standpoint since the translation will be identical.
+	 * page table. If the lower-level page table identically maps the huge
+	 * page, there's no need for a TLB flush. Otherwise, flush TLBs after
+	 * dropping the huge page and before installing the shadow page table.
 	 */
-	__drop_large_spte(kvm, huge_sptep, false);
-	__link_shadow_page(NULL, huge_sptep, split_sp);
-
-	return 0;
+	__drop_large_spte(kvm, huge_sptep, flush);
+	__link_shadow_page(cache, huge_sptep, split_sp);
 }
 
 static bool should_split_huge_page(u64 *huge_sptep)
@@ -6347,16 +6343,13 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
 		if (dropped_lock)
 			goto restart;
 
-		r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp);
-
-		trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
-
 		/*
-		 * If splitting is successful we must restart the iterator
-		 * because huge_sptep has just been removed from it.
+		 * After splitting we must restart the iterator because
+		 * huge_sptep has just been removed from it.
 		 */
-		if (!r)
-			goto restart;
+		kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp);
+		trace_kvm_mmu_split_huge_page(gfn, spte, level, 0);
+		goto restart;
 	}
 
 	if (sp)
-- 
2.35.1.723.g4982287a31-goog

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 25/26] KVM: x86/mmu: Drop NULL pte_list_desc_cache fallback
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Now that the eager page splitting code no longer passes in NULL cache
pointers we can get rid of the debug WARN_ON() and allocation fallback.
While here, also drop the helper function mmu_alloc_pte_list_desc() as
it no longer serves any purpose.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 68785b422a08..d2ffebb659e0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -725,16 +725,6 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
-static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
-{
-	static const gfp_t gfp_nocache = GFP_ATOMIC | __GFP_ACCOUNT | __GFP_ZERO;
-
-	if (WARN_ON_ONCE(!cache))
-		return kmem_cache_alloc(pte_list_desc_cache, gfp_nocache);
-
-	return kvm_mmu_memory_cache_alloc(cache);
-}
-
 static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 {
 	kmem_cache_free(pte_list_desc_cache, pte_list_desc);
@@ -914,7 +904,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		rmap_head->val = (unsigned long)spte;
 	} else if (!(rmap_head->val & 1)) {
 		rmap_printk("%p %llx 1->many\n", spte, *spte);
-		desc = mmu_alloc_pte_list_desc(cache);
+		desc = kvm_mmu_memory_cache_alloc(cache);
 		desc->sptes[0] = (u64 *)rmap_head->val;
 		desc->sptes[1] = spte;
 		desc->spte_count = 2;
@@ -926,7 +916,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		while (desc->spte_count == PTE_LIST_EXT) {
 			count += PTE_LIST_EXT;
 			if (!desc->more) {
-				desc->more = mmu_alloc_pte_list_desc(cache);
+				desc->more = kvm_mmu_memory_cache_alloc(cache);
 				desc = desc->more;
 				desc->spte_count = 0;
 				break;
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 25/26] KVM: x86/mmu: Drop NULL pte_list_desc_cache fallback
@ 2022-03-11  0:25   ` David Matlack
  0 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	David Matlack, Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

Now that the eager page splitting code no longer passes in NULL cache
pointers we can get rid of the debug WARN_ON() and allocation fallback.
While here, also drop the helper function mmu_alloc_pte_list_desc() as
it no longer serves any purpose.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 68785b422a08..d2ffebb659e0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -725,16 +725,6 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
-static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
-{
-	static const gfp_t gfp_nocache = GFP_ATOMIC | __GFP_ACCOUNT | __GFP_ZERO;
-
-	if (WARN_ON_ONCE(!cache))
-		return kmem_cache_alloc(pte_list_desc_cache, gfp_nocache);
-
-	return kvm_mmu_memory_cache_alloc(cache);
-}
-
 static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 {
 	kmem_cache_free(pte_list_desc_cache, pte_list_desc);
@@ -914,7 +904,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		rmap_head->val = (unsigned long)spte;
 	} else if (!(rmap_head->val & 1)) {
 		rmap_printk("%p %llx 1->many\n", spte, *spte);
-		desc = mmu_alloc_pte_list_desc(cache);
+		desc = kvm_mmu_memory_cache_alloc(cache);
 		desc->sptes[0] = (u64 *)rmap_head->val;
 		desc->sptes[1] = spte;
 		desc->spte_count = 2;
@@ -926,7 +916,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		while (desc->spte_count == PTE_LIST_EXT) {
 			count += PTE_LIST_EXT;
 			if (!desc->more) {
-				desc->more = mmu_alloc_pte_list_desc(cache);
+				desc->more = kvm_mmu_memory_cache_alloc(cache);
 				desc = desc->more;
 				desc->spte_count = 0;
 				break;
-- 
2.35.1.723.g4982287a31-goog

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 26/26] KVM: selftests: Map x86_64 guest virtual memory with huge pages
  2022-03-11  0:25 ` David Matlack
@ 2022-03-11  0:25   ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner, David Matlack

Override virt_map() in x86_64 selftests to use the largest page size
possible when mapping guest virtual memory. This enables testing eager
page splitting with shadow paging (e.g. kvm_intel.ept=N), as it allows
KVM to shadow guest memory with huge pages.
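
As a quick illustration of the stride arithmetic the new virt_map() relies
on, here is a minimal user-space sketch (assuming the selftests' enum
ordering X86_PAGE_SIZE_4K=0, _2M=1, _1G=2; each level adds 9 bits of index
on top of the 12-bit page offset):

	#include <assert.h>
	#include <stddef.h>

	enum x86_page_size { X86_PAGE_SIZE_4K, X86_PAGE_SIZE_2M, X86_PAGE_SIZE_1G };

	static inline size_t page_size_bytes(enum x86_page_size page_size)
	{
		return (size_t)1 << (page_size * 9 + 12);
	}

	int main(void)
	{
		assert(page_size_bytes(X86_PAGE_SIZE_4K) == (size_t)4096);
		assert(page_size_bytes(X86_PAGE_SIZE_2M) == (size_t)2 << 20);
		assert(page_size_bytes(X86_PAGE_SIZE_1G) == (size_t)1 << 30);
		return 0;
	}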

Signed-off-by: David Matlack <dmatlack@google.com>
---
 .../selftests/kvm/include/x86_64/processor.h  |  6 ++++
 tools/testing/selftests/kvm/lib/kvm_util.c    |  4 +--
 .../selftests/kvm/lib/x86_64/processor.c      | 31 +++++++++++++++++++
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 37db341d4cc5..efb228d2fbf7 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -470,6 +470,12 @@ enum x86_page_size {
 	X86_PAGE_SIZE_2M,
 	X86_PAGE_SIZE_1G,
 };
+
+static inline size_t page_size_bytes(enum x86_page_size page_size)
+{
+	return 1UL << (page_size * 9 + 12);
+}
+
 void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr,
 		   enum x86_page_size page_size);
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 1665a220abcb..60198587236d 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1432,8 +1432,8 @@ vm_vaddr_t vm_vaddr_alloc_page(struct kvm_vm *vm)
  * Within the VM given by @vm, creates a virtual translation for
  * @npages starting at @vaddr to the page range starting at @paddr.
  */
-void virt_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr,
-	      unsigned int npages)
+void __weak virt_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr,
+		     unsigned int npages)
 {
 	size_t page_size = vm->page_size;
 	size_t size = npages * page_size;
diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c
index 9f000dfb5594..7df84292d5de 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c
@@ -282,6 +282,37 @@ void virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr)
 	__virt_pg_map(vm, vaddr, paddr, X86_PAGE_SIZE_4K);
 }
 
+void virt_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, unsigned int npages)
+{
+	size_t size = (size_t) npages * vm->page_size;
+	size_t vend = vaddr + size;
+	enum x86_page_size page_size;
+	size_t stride;
+
+	TEST_ASSERT(vaddr + size > vaddr, "Vaddr overflow");
+	TEST_ASSERT(paddr + size > paddr, "Paddr overflow");
+
+	/*
+	 * Map the region with all 1G pages if possible, falling back to all
+	 * 2M pages, and finally all 4K pages. This could be improved to use
+	 * a mix of page sizes so that more of the region is mapped with large
+	 * pages.
+	 */
+	for (page_size = X86_PAGE_SIZE_1G; page_size >= X86_PAGE_SIZE_4K; page_size--) {
+		stride = page_size_bytes(page_size);
+
+		if (!(vaddr % stride) && !(paddr % stride) && !(size % stride))
+			break;
+	}
+
+	TEST_ASSERT(page_size >= X86_PAGE_SIZE_4K,
+		    "Cannot map unaligned region: vaddr 0x%lx paddr 0x%lx npages 0x%x\n",
+		    vaddr, paddr, npages);
+
+	for (; vaddr < vend; vaddr += stride, paddr += stride)
+		__virt_pg_map(vm, vaddr, paddr, page_size);
+}
+
 static struct pageTableEntry *_vm_get_page_table_entry(struct kvm_vm *vm, int vcpuid,
 						       uint64_t vaddr)
 {
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 26/26] KVM: selftests: Map x86_64 guest virtual memory with huge pages
@ 2022-03-11  0:25   ` David Matlack
  0 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-11  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	David Matlack, Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

Override virt_map() in x86_64 selftests to use the largest page size
possible when mapping guest virtual memory. This enables testing eager
page splitting with shadow paging (e.g. kvm_intel.ept=N), as it allows
KVM to shadow guest memory with huge pages.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 .../selftests/kvm/include/x86_64/processor.h  |  6 ++++
 tools/testing/selftests/kvm/lib/kvm_util.c    |  4 +--
 .../selftests/kvm/lib/x86_64/processor.c      | 31 +++++++++++++++++++
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 37db341d4cc5..efb228d2fbf7 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -470,6 +470,12 @@ enum x86_page_size {
 	X86_PAGE_SIZE_2M,
 	X86_PAGE_SIZE_1G,
 };
+
+static inline size_t page_size_bytes(enum x86_page_size page_size)
+{
+	return 1UL << (page_size * 9 + 12);
+}
+
 void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr,
 		   enum x86_page_size page_size);
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 1665a220abcb..60198587236d 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1432,8 +1432,8 @@ vm_vaddr_t vm_vaddr_alloc_page(struct kvm_vm *vm)
  * Within the VM given by @vm, creates a virtual translation for
  * @npages starting at @vaddr to the page range starting at @paddr.
  */
-void virt_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr,
-	      unsigned int npages)
+void __weak virt_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr,
+		     unsigned int npages)
 {
 	size_t page_size = vm->page_size;
 	size_t size = npages * page_size;
diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c
index 9f000dfb5594..7df84292d5de 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c
@@ -282,6 +282,37 @@ void virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr)
 	__virt_pg_map(vm, vaddr, paddr, X86_PAGE_SIZE_4K);
 }
 
+void virt_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, unsigned int npages)
+{
+	size_t size = (size_t) npages * vm->page_size;
+	size_t vend = vaddr + size;
+	enum x86_page_size page_size;
+	size_t stride;
+
+	TEST_ASSERT(vaddr + size > vaddr, "Vaddr overflow");
+	TEST_ASSERT(paddr + size > paddr, "Paddr overflow");
+
+	/*
+	 * Map the region with all 1G pages if possible, falling back to all
+	 * 2M pages, and finally all 4K pages. This could be improved to use
+	 * a mix of page sizes so that more of the region is mapped with large
+	 * pages.
+	 */
+	for (page_size = X86_PAGE_SIZE_1G; page_size >= X86_PAGE_SIZE_4K; page_size--) {
+		stride = page_size_bytes(page_size);
+
+		if (!(vaddr % stride) && !(paddr % stride) && !(size % stride))
+			break;
+	}
+
+	TEST_ASSERT(page_size >= X86_PAGE_SIZE_4K,
+		    "Cannot map unaligned region: vaddr 0x%lx paddr 0x%lx npages 0x%x\n",
+		    vaddr, paddr, npages);
+
+	for (; vaddr < vend; vaddr += stride, paddr += stride)
+		__virt_pg_map(vm, vaddr, paddr, page_size);
+}
+
 static struct pageTableEntry *_vm_get_page_table_entry(struct kvm_vm *vm, int vcpuid,
 						       uint64_t vaddr)
 {
-- 
2.35.1.723.g4982287a31-goog

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 01/26] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
  2022-03-11  0:25   ` David Matlack
@ 2022-03-15  7:40     ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15  7:40 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:03AM +0000, David Matlack wrote:
> Commit fb58a9c345f6 ("KVM: x86/mmu: Optimize MMU page cache lookup for
> fully direct MMUs") skipped the unsync checks and write flood clearing
> for full direct MMUs. We can extend this further and skip the checks for
> all direct shadow pages. Direct shadow pages are never marked unsynced
> or have a non-zero write-flooding count.

Nit: IMHO it's better to spell out the exact functional change, IIUC those
are the direct mapped SPs where guest uses huge pages but host uses only
small pages for the shadowing?

> 
> Checking sp->role.direct also generates better code than checking
> direct_map because, due to register pressure, direct_map has to get
> shoved onto the stack and then pulled back off.
> 
> No functional change intended.
> 
> Reviewed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 01/26] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
@ 2022-03-15  7:40     ` Peter Xu
  0 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15  7:40 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:03AM +0000, David Matlack wrote:
> Commit fb58a9c345f6 ("KVM: x86/mmu: Optimize MMU page cache lookup for
> fully direct MMUs") skipped the unsync checks and write flood clearing
> for full direct MMUs. We can extend this further and skip the checks for
> all direct shadow pages. Direct shadow pages are never marked unsynced
> or have a non-zero write-flooding count.

Nit: IMHO it's better to spell out the exact functional change, IIUC those
are the direct mapped SPs where guest uses huge pages but host uses only
small pages for the shadowing?

> 
> Checking sp->role.direct also generates better code than checking
> direct_map because, due to register pressure, direct_map has to get
> shoved onto the stack and then pulled back off.
> 
> No functional change intended.
> 
> Reviewed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 02/26] KVM: x86/mmu: Use a bool for direct
  2022-03-11  0:25   ` David Matlack
@ 2022-03-15  7:46     ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15  7:46 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:04AM +0000, David Matlack wrote:
> The parameter "direct" can either be true or false, and all of the
> callers pass in a bool variable or true/false literal, so just use the
> type bool.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

If we care about this.. how about converting another one altogether?

TRACE_EVENT(kvm_hv_stimer_expiration,
	TP_PROTO(int vcpu_id, int timer_index, int direct, int msg_send_result),
	TP_ARGS(vcpu_id, timer_index, direct, msg_send_result),
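
i.e. something like this, only sketching the prototype change here (the
rest of the tracepoint, e.g. the __field declaration, would need the same
treatment):

	TP_PROTO(int vcpu_id, int timer_index, bool direct, int msg_send_result),
	TP_ARGS(vcpu_id, timer_index, direct, msg_send_result),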

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 02/26] KVM: x86/mmu: Use a bool for direct
@ 2022-03-15  7:46     ` Peter Xu
  0 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15  7:46 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:04AM +0000, David Matlack wrote:
> The parameter "direct" can either be true or false, and all of the
> callers pass in a bool variable or true/false literal, so just use the
> type bool.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

If we care about this.. how about converting another one altogether?

TRACE_EVENT(kvm_hv_stimer_expiration,
	TP_PROTO(int vcpu_id, int timer_index, int direct, int msg_send_result),
	TP_ARGS(vcpu_id, timer_index, direct, msg_send_result),

Thanks,

-- 
Peter Xu

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 03/26] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-03-11  0:25   ` David Matlack
@ 2022-03-15  8:15     ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15  8:15 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:05AM +0000, David Matlack wrote:
> Instead of computing the shadow page role from scratch for every new
> page, we can derive most of the information from the parent shadow page.
> This avoids redundant calculations and reduces the number of parameters
> to kvm_mmu_get_page().
> 
> Preemptively split out the role calculation to a separate function for
> use in a following commit.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

Looks right..

Reviewed-by: Peter Xu <peterx@redhat.com>

Two more comments/questions below.

> +static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
> +{
> +	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
> +	union kvm_mmu_page_role role;
> +
> +	role = parent_sp->role;
> +	role.level--;
> +	role.access = access;
> +	role.direct = direct;
> +
> +	/*
> +	 * If the guest has 4-byte PTEs then that means it's using 32-bit,
> +	 * 2-level, non-PAE paging. KVM shadows such guests using 4 PAE page
> +	 * directories, each mapping 1/4 of the guest's linear address space
> +	 * (1GiB). The shadow pages for those 4 page directories are
> +	 * pre-allocated and assigned a separate quadrant in their role.
> +	 *
> +	 * Since we are allocating a child shadow page and there are only 2
> +	 * levels, this must be a PG_LEVEL_4K shadow page. Here the quadrant
> +	 * will either be 0 or 1 because it maps 1/2 of the address space mapped
> +	 * by the guest's PG_LEVEL_4K page table (or 4MiB huge page) that it
> +	 * is shadowing. In this case, the quadrant can be derived by the index
> +	 * of the SPTE that points to the new child shadow page in the page
> +	 * directory (parent_sp). Specifically, every 2 SPTEs in parent_sp
> +	 * shadow one half of a guest's page table (or 4MiB huge page) so the
> +	 * quadrant is just the parity of the index of the SPTE.
> +	 */
> +	if (role.has_4_byte_gpte) {
> +		BUG_ON(role.level != PG_LEVEL_4K);
> +		role.quadrant = (sptep - parent_sp->spt) % 2;
> +	}

This made me wonder whether role.quadrant can be dropped, because it seems
it can be calculated out of the box with has_4_byte_gpte, level and spte
offset.  I could have missed something, though..
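
For what it's worth, a minimal user-space sketch of that derivation (only
covering the PG_LEVEL_4K case from the hunk above, with simplified types
rather than the real KVM structures):

	#include <assert.h>
	#include <stdbool.h>
	#include <stdint.h>

	/*
	 * With 4-byte GPTEs, a PG_LEVEL_4K shadow page covers half of the
	 * guest page table, so the quadrant is just the parity of the
	 * parent SPTE index that points to it.
	 */
	int derived_quadrant(bool has_4_byte_gpte, const uint64_t *sptep,
			     const uint64_t *parent_spt)
	{
		if (!has_4_byte_gpte)
			return 0;
		return (sptep - parent_spt) % 2;
	}

	int main(void)
	{
		uint64_t spt[512] = { 0 };

		assert(derived_quadrant(true, &spt[0], spt) == 0);
		assert(derived_quadrant(true, &spt[7], spt) == 1);
		assert(derived_quadrant(false, &spt[7], spt) == 0);
		return 0;
	}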

> +
> +	return role;
> +}
> +
> +static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
> +						 u64 *sptep, gfn_t gfn,
> +						 bool direct, u32 access)
> +{
> +	union kvm_mmu_page_role role;
> +
> +	role = kvm_mmu_child_role(sptep, direct, access);
> +	return kvm_mmu_get_page(vcpu, gfn, role);

Nit: it looks nicer to just drop the temp var?

        return kvm_mmu_get_page(vcpu, gfn,
                                kvm_mmu_child_role(sptep, direct, access));

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 03/26] KVM: x86/mmu: Derive shadow MMU page role from parent
@ 2022-03-15  8:15     ` Peter Xu
  0 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15  8:15 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:05AM +0000, David Matlack wrote:
> Instead of computing the shadow page role from scratch for every new
> page, we can derive most of the information from the parent shadow page.
> This avoids redundant calculations and reduces the number of parameters
> to kvm_mmu_get_page().
> 
> Preemptively split out the role calculation to a separate function for
> use in a following commit.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

Looks right..

Reviewed-by: Peter Xu <peterx@redhat.com>

Two more comments/questions below.

> +static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
> +{
> +	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
> +	union kvm_mmu_page_role role;
> +
> +	role = parent_sp->role;
> +	role.level--;
> +	role.access = access;
> +	role.direct = direct;
> +
> +	/*
> +	 * If the guest has 4-byte PTEs then that means it's using 32-bit,
> +	 * 2-level, non-PAE paging. KVM shadows such guests using 4 PAE page
> +	 * directories, each mapping 1/4 of the guest's linear address space
> +	 * (1GiB). The shadow pages for those 4 page directories are
> +	 * pre-allocated and assigned a separate quadrant in their role.
> +	 *
> +	 * Since we are allocating a child shadow page and there are only 2
> +	 * levels, this must be a PG_LEVEL_4K shadow page. Here the quadrant
> +	 * will either be 0 or 1 because it maps 1/2 of the address space mapped
> +	 * by the guest's PG_LEVEL_4K page table (or 4MiB huge page) that it
> +	 * is shadowing. In this case, the quadrant can be derived by the index
> +	 * of the SPTE that points to the new child shadow page in the page
> +	 * directory (parent_sp). Specifically, every 2 SPTEs in parent_sp
> +	 * shadow one half of a guest's page table (or 4MiB huge page) so the
> +	 * quadrant is just the parity of the index of the SPTE.
> +	 */
> +	if (role.has_4_byte_gpte) {
> +		BUG_ON(role.level != PG_LEVEL_4K);
> +		role.quadrant = (sptep - parent_sp->spt) % 2;
> +	}

This made me wonder whether role.quadrant can be dropped, because it seems
it can be calculated out of the box with has_4_byte_gpte, level and spte
offset.  I could have missed something, though..

> +
> +	return role;
> +}
> +
> +static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
> +						 u64 *sptep, gfn_t gfn,
> +						 bool direct, u32 access)
> +{
> +	union kvm_mmu_page_role role;
> +
> +	role = kvm_mmu_child_role(sptep, direct, access);
> +	return kvm_mmu_get_page(vcpu, gfn, role);

Nit: it looks nicer to just drop the temp var?

        return kvm_mmu_get_page(vcpu, gfn,
                                kvm_mmu_child_role(sptep, direct, access));

Thanks,

-- 
Peter Xu

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 04/26] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  2022-03-11  0:25   ` David Matlack
@ 2022-03-15  8:50     ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15  8:50 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:06AM +0000, David Matlack wrote:
> Decompose kvm_mmu_get_page() into separate helper functions to increase
> readability and prepare for allocating shadow pages without a vcpu
> pointer.
> 
> Specifically, pull the guts of kvm_mmu_get_page() into 3 helper
> functions:
> 
> __kvm_mmu_find_shadow_page() -
>   Walks the page hash checking for any existing mmu pages that match the
>   given gfn and role. Does not attempt to synchronize the page if it is
>   unsync.
> 
> kvm_mmu_find_shadow_page() -
>   Wraps __kvm_mmu_find_shadow_page() and handles syncing if necessary.
> 
> kvm_mmu_new_shadow_page()
>   Allocates and initializes an entirely new kvm_mmu_page. This currently
>   requires a vcpu pointer for allocation and looking up the memslot but
>   that will be removed in a future commit.
> 
>   Note, kvm_mmu_new_shadow_page() is temporary and will be removed in a
>   subsequent commit. The name uses "new" rather than the more typical
>   "alloc" to avoid clashing with the existing kvm_mmu_alloc_page().
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

Looks good to me, a few nitpicks and questions below.

> ---
>  arch/x86/kvm/mmu/mmu.c         | 132 ++++++++++++++++++++++++---------
>  arch/x86/kvm/mmu/paging_tmpl.h |   5 +-
>  arch/x86/kvm/mmu/spte.c        |   5 +-
>  3 files changed, 101 insertions(+), 41 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 23c2004c6435..80dbfe07c87b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2027,16 +2027,25 @@ static void clear_sp_write_flooding_count(u64 *spte)
>  	__clear_sp_write_flooding_count(sptep_to_sp(spte));
>  }
>  
> -static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> -					     union kvm_mmu_page_role role)
> +/*
> + * Searches for an existing SP for the given gfn and role. Makes no attempt to
> + * sync the SP if it is marked unsync.
> + *
> + * If creating an upper-level page table, zaps unsynced pages for the same
> + * gfn and adds them to the invalid_list. It's the caller's responsibility
> + * to call kvm_mmu_commit_zap_page() on invalid_list.
> + */
> +static struct kvm_mmu_page *__kvm_mmu_find_shadow_page(struct kvm *kvm,
> +						       gfn_t gfn,
> +						       union kvm_mmu_page_role role,
> +						       struct list_head *invalid_list)
>  {
>  	struct hlist_head *sp_list;
>  	struct kvm_mmu_page *sp;
>  	int collisions = 0;
> -	LIST_HEAD(invalid_list);
>  
> -	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> -	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> +	sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> +	for_each_valid_sp(kvm, sp, sp_list) {
>  		if (sp->gfn != gfn) {
>  			collisions++;
>  			continue;
> @@ -2053,60 +2062,109 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>  			 * upper-level page will be write-protected.
>  			 */
>  			if (role.level > PG_LEVEL_4K && sp->unsync)
> -				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
> -							 &invalid_list);
> +				kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
> +
>  			continue;
>  		}
>  
> -		/* unsync and write-flooding only apply to indirect SPs. */
> -		if (sp->role.direct)
> -			goto trace_get_page;
> +		/* Write-flooding is only tracked for indirect SPs. */
> +		if (!sp->role.direct)
> +			__clear_sp_write_flooding_count(sp);
>  
> -		if (sp->unsync) {
> -			/*
> -			 * The page is good, but is stale.  kvm_sync_page does
> -			 * get the latest guest state, but (unlike mmu_unsync_children)
> -			 * it doesn't write-protect the page or mark it synchronized!
> -			 * This way the validity of the mapping is ensured, but the
> -			 * overhead of write protection is not incurred until the
> -			 * guest invalidates the TLB mapping.  This allows multiple
> -			 * SPs for a single gfn to be unsync.
> -			 *
> -			 * If the sync fails, the page is zapped.  If so, break
> -			 * in order to rebuild it.
> -			 */
> -			if (!kvm_sync_page(vcpu, sp, &invalid_list))
> -				break;
> +		goto out;
> +	}
>  
> -			WARN_ON(!list_empty(&invalid_list));
> -			kvm_flush_remote_tlbs(vcpu->kvm);
> -		}
> +	sp = NULL;
>  
> -		__clear_sp_write_flooding_count(sp);
> +out:
> +	if (collisions > kvm->stat.max_mmu_page_hash_collisions)
> +		kvm->stat.max_mmu_page_hash_collisions = collisions;
> +
> +	return sp;
> +}
>  
> -trace_get_page:
> -		trace_kvm_mmu_get_page(sp, false);
> +/*
> + * Looks up an existing SP for the given gfn and role if one exists. The
> + * returned SP is guaranteed to be synced.
> + */
> +static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
> +						     gfn_t gfn,
> +						     union kvm_mmu_page_role role)
> +{
> +	struct kvm_mmu_page *sp;
> +	LIST_HEAD(invalid_list);
> +
> +	sp = __kvm_mmu_find_shadow_page(vcpu->kvm, gfn, role, &invalid_list);
> +	if (!sp)
>  		goto out;
> +
> +	if (sp->unsync) {
> +		/*
> +		 * The page is good, but is stale.  kvm_sync_page does
> +		 * get the latest guest state, but (unlike mmu_unsync_children)
> +		 * it doesn't write-protect the page or mark it synchronized!
> +		 * This way the validity of the mapping is ensured, but the
> +		 * overhead of write protection is not incurred until the
> +		 * guest invalidates the TLB mapping.  This allows multiple
> +		 * SPs for a single gfn to be unsync.
> +		 *
> +		 * If the sync fails, the page is zapped and added to the
> +		 * invalid_list.
> +		 */
> +		if (!kvm_sync_page(vcpu, sp, &invalid_list)) {
> +			sp = NULL;
> +			goto out;
> +		}
> +
> +		WARN_ON(!list_empty(&invalid_list));

Not related to this patch because I think it's a pure movement here,
however I have a question on why invalid_list is guaranteed to be empty..

I'm thinking of the case where, when looking up the page, we could have
already called kvm_mmu_prepare_zap_page() there; then when we reach here
(which is the kvm_sync_page()==true case) invalid_list shouldn't be touched
by kvm_sync_page(), so it looks possible that it still contains some pages
to be committed?

> +		kvm_flush_remote_tlbs(vcpu->kvm);
>  	}
>  
> +out:

I'm wondering whether this "out" can be dropped.. with something like:

        sp = __kvm_mmu_find_shadow_page(...);

        if (sp && sp->unsync) {
                if (kvm_sync_page(vcpu, sp, &invalid_list)) {
                        ..
                } else {
                        sp = NULL;
                }
        }
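
Filled in with the hunks quoted above, that would read roughly as below
(sketch only; whatever currently follows the "out:" label stays as is):

        sp = __kvm_mmu_find_shadow_page(vcpu->kvm, gfn, role, &invalid_list);

        if (sp && sp->unsync) {
                /*
                 * kvm_sync_page() either brings the stale page up to date
                 * or zaps it onto invalid_list; drop it in the latter case.
                 */
                if (kvm_sync_page(vcpu, sp, &invalid_list)) {
                        WARN_ON(!list_empty(&invalid_list));
                        kvm_flush_remote_tlbs(vcpu->kvm);
                } else {
                        sp = NULL;
                }
        }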

[...]

> +static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> +					     union kvm_mmu_page_role role)
> +{
> +	struct kvm_mmu_page *sp;
> +	bool created = false;
> +
> +	sp = kvm_mmu_find_shadow_page(vcpu, gfn, role);
> +	if (sp)
> +		goto out;
> +
> +	created = true;
> +	sp = kvm_mmu_new_shadow_page(vcpu, gfn, role);
> +
> +out:
> +	trace_kvm_mmu_get_page(sp, created);
>  	return sp;

Same here, wondering whether we could drop the "out" by:

        sp = kvm_mmu_find_shadow_page(vcpu, gfn, role);
        if (!sp) {
                created = true;
                sp = kvm_mmu_new_shadow_page(vcpu, gfn, role);
        }

        trace_kvm_mmu_get_page(sp, created);
        return sp;

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 04/26] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
@ 2022-03-15  8:50     ` Peter Xu
  0 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15  8:50 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:06AM +0000, David Matlack wrote:
> Decompose kvm_mmu_get_page() into separate helper functions to increase
> readability and prepare for allocating shadow pages without a vcpu
> pointer.
> 
> Specifically, pull the guts of kvm_mmu_get_page() into 3 helper
> functions:
> 
> __kvm_mmu_find_shadow_page() -
>   Walks the page hash checking for any existing mmu pages that match the
>   given gfn and role. Does not attempt to synchronize the page if it is
>   unsync.
> 
> kvm_mmu_find_shadow_page() -
>   Wraps __kvm_mmu_find_shadow_page() and handles syncing if necessary.
> 
> kvm_mmu_new_shadow_page()
>   Allocates and initializes an entirely new kvm_mmu_page. This currently
>   requires a vcpu pointer for allocation and looking up the memslot but
>   that will be removed in a future commit.
> 
>   Note, kvm_mmu_new_shadow_page() is temporary and will be removed in a
>   subsequent commit. The name uses "new" rather than the more typical
>   "alloc" to avoid clashing with the existing kvm_mmu_alloc_page().
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

Looks good to me, a few nitpicks and questions below.

> ---
>  arch/x86/kvm/mmu/mmu.c         | 132 ++++++++++++++++++++++++---------
>  arch/x86/kvm/mmu/paging_tmpl.h |   5 +-
>  arch/x86/kvm/mmu/spte.c        |   5 +-
>  3 files changed, 101 insertions(+), 41 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 23c2004c6435..80dbfe07c87b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2027,16 +2027,25 @@ static void clear_sp_write_flooding_count(u64 *spte)
>  	__clear_sp_write_flooding_count(sptep_to_sp(spte));
>  }
>  
> -static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> -					     union kvm_mmu_page_role role)
> +/*
> + * Searches for an existing SP for the given gfn and role. Makes no attempt to
> + * sync the SP if it is marked unsync.
> + *
> + * If creating an upper-level page table, zaps unsynced pages for the same
> + * gfn and adds them to the invalid_list. It's the caller's responsibility
> + * to call kvm_mmu_commit_zap_page() on invalid_list.
> + */
> +static struct kvm_mmu_page *__kvm_mmu_find_shadow_page(struct kvm *kvm,
> +						       gfn_t gfn,
> +						       union kvm_mmu_page_role role,
> +						       struct list_head *invalid_list)
>  {
>  	struct hlist_head *sp_list;
>  	struct kvm_mmu_page *sp;
>  	int collisions = 0;
> -	LIST_HEAD(invalid_list);
>  
> -	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> -	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> +	sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> +	for_each_valid_sp(kvm, sp, sp_list) {
>  		if (sp->gfn != gfn) {
>  			collisions++;
>  			continue;
> @@ -2053,60 +2062,109 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>  			 * upper-level page will be write-protected.
>  			 */
>  			if (role.level > PG_LEVEL_4K && sp->unsync)
> -				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
> -							 &invalid_list);
> +				kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
> +
>  			continue;
>  		}
>  
> -		/* unsync and write-flooding only apply to indirect SPs. */
> -		if (sp->role.direct)
> -			goto trace_get_page;
> +		/* Write-flooding is only tracked for indirect SPs. */
> +		if (!sp->role.direct)
> +			__clear_sp_write_flooding_count(sp);
>  
> -		if (sp->unsync) {
> -			/*
> -			 * The page is good, but is stale.  kvm_sync_page does
> -			 * get the latest guest state, but (unlike mmu_unsync_children)
> -			 * it doesn't write-protect the page or mark it synchronized!
> -			 * This way the validity of the mapping is ensured, but the
> -			 * overhead of write protection is not incurred until the
> -			 * guest invalidates the TLB mapping.  This allows multiple
> -			 * SPs for a single gfn to be unsync.
> -			 *
> -			 * If the sync fails, the page is zapped.  If so, break
> -			 * in order to rebuild it.
> -			 */
> -			if (!kvm_sync_page(vcpu, sp, &invalid_list))
> -				break;
> +		goto out;
> +	}
>  
> -			WARN_ON(!list_empty(&invalid_list));
> -			kvm_flush_remote_tlbs(vcpu->kvm);
> -		}
> +	sp = NULL;
>  
> -		__clear_sp_write_flooding_count(sp);
> +out:
> +	if (collisions > kvm->stat.max_mmu_page_hash_collisions)
> +		kvm->stat.max_mmu_page_hash_collisions = collisions;
> +
> +	return sp;
> +}
>  
> -trace_get_page:
> -		trace_kvm_mmu_get_page(sp, false);
> +/*
> + * Looks up an existing SP for the given gfn and role if one exists. The
> + * returned SP is guaranteed to be synced.
> + */
> +static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
> +						     gfn_t gfn,
> +						     union kvm_mmu_page_role role)
> +{
> +	struct kvm_mmu_page *sp;
> +	LIST_HEAD(invalid_list);
> +
> +	sp = __kvm_mmu_find_shadow_page(vcpu->kvm, gfn, role, &invalid_list);
> +	if (!sp)
>  		goto out;
> +
> +	if (sp->unsync) {
> +		/*
> +		 * The page is good, but is stale.  kvm_sync_page does
> +		 * get the latest guest state, but (unlike mmu_unsync_children)
> +		 * it doesn't write-protect the page or mark it synchronized!
> +		 * This way the validity of the mapping is ensured, but the
> +		 * overhead of write protection is not incurred until the
> +		 * guest invalidates the TLB mapping.  This allows multiple
> +		 * SPs for a single gfn to be unsync.
> +		 *
> +		 * If the sync fails, the page is zapped and added to the
> +		 * invalid_list.
> +		 */
> +		if (!kvm_sync_page(vcpu, sp, &invalid_list)) {
> +			sp = NULL;
> +			goto out;
> +		}
> +
> +		WARN_ON(!list_empty(&invalid_list));

Not related to this patch because I think it's a pure movement here,
however I have a question on why invalid_list is guaranteed to be empty..

I'm thinking of the case where, when looking up the page, we could have
already called kvm_mmu_prepare_zap_page() there; then when we reach here
(which is the kvm_sync_page()==true case) invalid_list shouldn't be touched
by kvm_sync_page(), so it looks possible that it still contains some pages
to be committed?

> +		kvm_flush_remote_tlbs(vcpu->kvm);
>  	}
>  
> +out:

I'm wondering whether this "out" can be dropped.. with something like:

        sp = __kvm_mmu_find_shadow_page(...);

        if (sp && sp->unsync) {
                if (kvm_sync_page(vcpu, sp, &invalid_list)) {
                        ..
                } else {
                        sp = NULL;
                }
        }

[...]

> +static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> +					     union kvm_mmu_page_role role)
> +{
> +	struct kvm_mmu_page *sp;
> +	bool created = false;
> +
> +	sp = kvm_mmu_find_shadow_page(vcpu, gfn, role);
> +	if (sp)
> +		goto out;
> +
> +	created = true;
> +	sp = kvm_mmu_new_shadow_page(vcpu, gfn, role);
> +
> +out:
> +	trace_kvm_mmu_get_page(sp, created);
>  	return sp;

Same here, wondering whether we could drop the "out" by:

        sp = kvm_mmu_find_shadow_page(vcpu, gfn, role);
        if (!sp) {
                created = true;
                sp = kvm_mmu_new_shadow_page(vcpu, gfn, role);
        }

        trace_kvm_mmu_get_page(sp, created);
        return sp;

Thanks,

-- 
Peter Xu

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 05/26] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
  2022-03-11  0:25   ` David Matlack
@ 2022-03-15  8:52     ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15  8:52 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:07AM +0000, David Matlack wrote:
> Rename 3 functions:
> 
>   kvm_mmu_get_page()   -> kvm_mmu_get_shadow_page()
>   kvm_mmu_alloc_page() -> kvm_mmu_alloc_shadow_page()
>   kvm_mmu_free_page()  -> kvm_mmu_free_shadow_page()
> 
> This change makes it clear that these functions deal with shadow pages
> rather than struct pages. Prefer "shadow_page" over the shorter "sp"
> since these are core routines.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 05/26] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
@ 2022-03-15  8:52     ` Peter Xu
  0 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15  8:52 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:07AM +0000, David Matlack wrote:
> Rename 3 functions:
> 
>   kvm_mmu_get_page()   -> kvm_mmu_get_shadow_page()
>   kvm_mmu_alloc_page() -> kvm_mmu_alloc_shadow_page()
>   kvm_mmu_free_page()  -> kvm_mmu_free_shadow_page()
> 
> This change makes it clear that these functions deal with shadow pages
> rather than struct pages. Prefer "shadow_page" over the shorter "sp"
> since these are core routines.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 06/26] KVM: x86/mmu: Pass memslot to kvm_mmu_new_shadow_page()
  2022-03-11  0:25   ` David Matlack
@ 2022-03-15  9:03     ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15  9:03 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:08AM +0000, David Matlack wrote:
> Passing the memslot to kvm_mmu_new_shadow_page() avoids the need for the
> vCPU pointer when write-protecting indirect 4k shadow pages. This moves
> us closer to being able to create new shadow pages during VM ioctls for
> eager page splitting, where there is no vCPU pointer.
> 
> This change does not negatively impact "Populate memory time" for ept=Y
> or ept=N configurations since kvm_vcpu_gfn_to_memslot() caches the last
> used slot. So even though we now look up the slot more often, it is a
> very cheap check.
> 
> Opportunistically move the code to write-protect GFNs shadowed by
> PG_LEVEL_4K shadow pages into account_shadowed() to reduce indentation
> and consolidate the code. This also eliminates a memslot lookup.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 23 ++++++++++++-----------
>  1 file changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b6fb50e32291..519910938478 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -793,16 +793,14 @@ void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn)
>  	update_gfn_disallow_lpage_count(slot, gfn, -1);
>  }
>  
> -static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
> +static void account_shadowed(struct kvm *kvm,
> +			     struct kvm_memory_slot *slot,
> +			     struct kvm_mmu_page *sp)
>  {
> -	struct kvm_memslots *slots;
> -	struct kvm_memory_slot *slot;
>  	gfn_t gfn;
>  
>  	kvm->arch.indirect_shadow_pages++;
>  	gfn = sp->gfn;
> -	slots = kvm_memslots_for_spte_role(kvm, sp->role);
> -	slot = __gfn_to_memslot(slots, gfn);
>  
>  	/* the non-leaf shadow pages are keeping readonly. */
>  	if (sp->role.level > PG_LEVEL_4K)
> @@ -810,6 +808,9 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
>  						    KVM_PAGE_TRACK_WRITE);
>  
>  	kvm_mmu_gfn_disallow_lpage(slot, gfn);
> +
> +	if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
> +		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);

It's not immediately obvious in this diff, but when looking at the code
yeah it looks right to just drop the 4K check..

I also never understood why we only write-track the >1 levels but only
wr-protect the last level.  It'd be great if there's a quick answer from
anyone.. even though it's probably unrelated to the patch.

The change looks all correct:

Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 06/26] KVM: x86/mmu: Pass memslot to kvm_mmu_new_shadow_page()
@ 2022-03-15  9:03     ` Peter Xu
  0 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15  9:03 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:08AM +0000, David Matlack wrote:
> Passing the memslot to kvm_mmu_new_shadow_page() avoids the need for the
> vCPU pointer when write-protecting indirect 4k shadow pages. This moves
> us closer to being able to create new shadow pages during VM ioctls for
> eager page splitting, where there is no vCPU pointer.
> 
> This change does not negatively impact "Populate memory time" for ept=Y
> or ept=N configurations since kvm_vcpu_gfn_to_memslot() caches the last
> used slot. So even though we now look up the slot more often, it is a
> very cheap check.
> 
> Opportunistically move the code to write-protect GFNs shadowed by
> PG_LEVEL_4K shadow pages into account_shadowed() to reduce indentation
> and consolidate the code. This also eliminates a memslot lookup.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 23 ++++++++++++-----------
>  1 file changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b6fb50e32291..519910938478 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -793,16 +793,14 @@ void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn)
>  	update_gfn_disallow_lpage_count(slot, gfn, -1);
>  }
>  
> -static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
> +static void account_shadowed(struct kvm *kvm,
> +			     struct kvm_memory_slot *slot,
> +			     struct kvm_mmu_page *sp)
>  {
> -	struct kvm_memslots *slots;
> -	struct kvm_memory_slot *slot;
>  	gfn_t gfn;
>  
>  	kvm->arch.indirect_shadow_pages++;
>  	gfn = sp->gfn;
> -	slots = kvm_memslots_for_spte_role(kvm, sp->role);
> -	slot = __gfn_to_memslot(slots, gfn);
>  
>  	/* the non-leaf shadow pages are keeping readonly. */
>  	if (sp->role.level > PG_LEVEL_4K)
> @@ -810,6 +808,9 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
>  						    KVM_PAGE_TRACK_WRITE);
>  
>  	kvm_mmu_gfn_disallow_lpage(slot, gfn);
> +
> +	if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
> +		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);

It's not immediately obvious in this diff, but when looking at the code
yeah it looks right to just drop the 4K check..

I also never understood why we only write-track the >1 levels but only
wr-protect the last level.  It'll be great if there's quick answer from
anyone.. even though it's probably unrelated to the patch.

The change looks all correct:

Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks,

-- 
Peter Xu

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 07/26] KVM: x86/mmu: Separate shadow MMU sp allocation from initialization
  2022-03-11  0:25   ` David Matlack
@ 2022-03-15  9:54     ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15  9:54 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:09AM +0000, David Matlack wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 519910938478..e866e05c4ba5 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1716,16 +1716,9 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
>  	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
>  	if (!direct)
>  		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
> +
>  	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);

Trivial nit:

I read Ben's comment on the previous version and it sounds reasonable to
keep the two linkages together.  It's just a bit of a pity that we need to
set the page private manually in each allocation function.

Meanwhile we have a counter-example in the TDP MMU code
(tdp_mmu_init_sp()), so we may want to align the TDP/shadow cases at some
point..
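
(For reference, the reason that back-pointer matters: the sptep -> sp
lookup goes through page_private(), roughly what mmu_internal.h has today,
so every shadow page table page needs it set before the first
sptep_to_sp() call:

  static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
  {
          struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);

          /* Page-private field holds the kvm_mmu_page back-pointer. */
          return (struct kvm_mmu_page *)page_private(page);
  }

  static inline struct kvm_mmu_page *sptep_to_sp(u64 *sptep)
  {
          return to_shadow_page(__pa(sptep));
  }
)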

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 08/26] KVM: x86/mmu: Link spt to sp during allocation
  2022-03-11  0:25   ` David Matlack
@ 2022-03-15 10:04     ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15 10:04 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:10AM +0000, David Matlack wrote:
> Link the shadow page table to the sp (via set_page_private()) during
> allocation rather than initialization. This is a more logical place to
> do it because allocation time is also where we do the reverse link
> (setting sp->spt).
> 
> This creates one extra call to set_page_private(), but having multiple
> calls to set_page_private() is unavoidable anyway. We either do
> set_page_private() during allocation, which requires 1 per allocation
> function, or we do it during initialization, which requires 1 per
> initialization function.
> 
> No functional change intended.
> 
> Suggested-by: Ben Gardon <bgardon@google.com>
> Signed-off-by: David Matlack <dmatlack@google.com>

Ah, I should have read one more patch before commenting on the previous
one..

Personally I (slightly) prefer the other way around, since with that
reasoning we should ideally also keep the used_mmu_pages accounting in the
allocation helper:

  kvm_mod_used_mmu_pages(vcpu->kvm, 1);

But then we'd duplicate yet another line everywhere an sp is allocated.

IOW, in my opinion the helpers should primarily serve code deduplication
rather than anything else.  No strong opinion though..
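
Just to illustrate that direction, an untested sketch reusing the
allocation lines quoted in the previous patch:

  static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu, bool direct)
  {
          struct kvm_mmu_page *sp;

          sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
          sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
          if (!direct)
                  sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);

          /* Keep both the spt<->sp linkage and the accounting in one place. */
          set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
          kvm_mod_used_mmu_pages(vcpu->kvm, 1);

          return sp;
  }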

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 09/26] KVM: x86/mmu: Move huge page split sp allocation code to mmu.c
  2022-03-11  0:25   ` David Matlack
@ 2022-03-15 10:17     ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15 10:17 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:11AM +0000, David Matlack wrote:
> Move the code that allocates a new shadow page for splitting huge pages
> into mmu.c. Currently this code is only used by the TDP MMU but it will
> be reused in subsequent commits to also split huge pages mapped by the
> shadow MMU.
> 
> While here, also shove the GFP complexity down into the allocation
> function so that it does not have to be duplicated when the shadow MMU
> needs to start allocating SPs for splitting.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 10/26] KVM: x86/mmu: Use common code to free kvm_mmu_page structs
  2022-03-11  0:25   ` David Matlack
@ 2022-03-15 10:22     ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15 10:22 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:12AM +0000, David Matlack wrote:
>  static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
>  {
> -	free_page((unsigned long)sp->spt);
> -	kmem_cache_free(mmu_page_header_cache, sp);
> +	kvm_mmu_free_shadow_page(sp);
>  }

Perhaps tdp_mmu_free_sp() can be dropped altogether with this?
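
E.g. (untested), assuming the existing RCU callback stays as the entry
point, it could call the common helper directly:

  static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
  {
          struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page, rcu_head);

          /* Free the page table page and the header via the shared helper. */
          kvm_mmu_free_shadow_page(sp);
  }

and likewise for the direct caller(s).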

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 11/26] KVM: x86/mmu: Use common code to allocate kvm_mmu_page structs from vCPU caches
  2022-03-11  0:25   ` David Matlack
@ 2022-03-15 10:27     ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15 10:27 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:13AM +0000, David Matlack wrote:
>  static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
>  {
> -	struct kvm_mmu_page *sp;
> -
> -	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> -	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> -	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> -
> -	return sp;
> +	return kvm_mmu_alloc_shadow_page(vcpu, true);
>  }

Similarly I had a feeling we could drop tdp_mmu_alloc_sp() too.. anyway:

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 14/26] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  2022-03-11  0:25   ` David Matlack
@ 2022-03-15 10:37     ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15 10:37 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:16AM +0000, David Matlack wrote:
> Allow adding new entries to the rmap and linking shadow pages without a
> struct kvm_vcpu pointer by moving the implementation of rmap_add() and
> link_shadow_page() into inner helper functions.
> 
> No functional change intended.
> 
> Reviewed-by: Ben Gardon <bgardon@google.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 43 +++++++++++++++++++++++++++---------------
>  1 file changed, 28 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index d7ad71be6c52..c57070ed157d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -725,9 +725,9 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
>  	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
>  }
>  
> -static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
> +static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
>  {
> -	return kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_list_desc_cache);
> +	return kvm_mmu_memory_cache_alloc(cache);
>  }

Nit: same here, IMHO we could drop mmu_alloc_pte_list_desc() already..
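
I.e. the call sites could simply do:

  desc = kvm_mmu_memory_cache_alloc(cache);

now that the helper is a plain pass-through.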

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 15/26] KVM: x86/mmu: Update page stats in __rmap_add()
  2022-03-11  0:25   ` David Matlack
@ 2022-03-15 10:39     ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-15 10:39 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:17AM +0000, David Matlack wrote:
> Update the page stats in __rmap_add() rather than at the call site. This
> will avoid having to manually update page stats when splitting huge
> pages in a subsequent commit.
> 
> No functional change intended.
> 
> Reviewed-by: Ben Gardon <bgardon@google.com>
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 16/26] KVM: x86/mmu: Cache the access bits of shadowed translations
  2022-03-11  0:25   ` David Matlack
@ 2022-03-16  8:32     ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-16  8:32 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:18AM +0000, David Matlack wrote:
> In order to split a huge page we need to know what access bits to assign
> to the role of the new child page table. This can't be easily derived
> from the huge page SPTE itself since KVM applies its own access policies
> on top, such as for HugePage NX.
> 
> We could walk the guest page tables to determine the correct access
> bits, but that is difficult to plumb outside of a vCPU fault context.
> Instead, we can store the original access bits for each leaf SPTE
> alongside the GFN in the gfns array. The access bits only take up 3
> bits, which leaves 61 bits left over for gfns, which is more than
> enough. So this change does not require any additional memory.

I have a pure question on why eager page splitting needs to worry about
huge page NX at all..

IIUC that feature is about forbidding huge pages from being mapped
executable.  So AFAIU the only bit that could go missing if we copy over
the huge page PTEs is the executable bit.

But then?  I think we could get a page fault with fault->exec==true on the
split small page (because the exec bit is cleared when we copy over, even
though the page can actually be executable), but it should be resolved
right after that small page fault.

The thing is, IIUC this is a very rare case, IOW it should not happen in
99% of use cases?  And there's a slight penalty when it does happen, but
only perf-wise.

As I'm not really fluent with the code base, perhaps I missed something?

> 
> In order to keep the access bit cache in sync with the guest, we have to
> extend FNAME(sync_page) to also update the access bits.

Besides sync_page(), I also see that mmu_set_spte() has a path where we
will skip the rmap_add() if the SPTE was already rmapped:

	if (!was_rmapped) {
		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
		kvm_update_page_stats(vcpu->kvm, level, 1);
		rmap_add(vcpu, slot, sptep, gfn);
	}

I didn't check, but it's not obvious whether the sync_page() change here
will cover all of the cases, hence raising this too.

> 
> Now that the gfns array caches more information than just GFNs, rename
> it to shadowed_translation.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 +-
>  arch/x86/kvm/mmu/mmu.c          | 32 +++++++++++++++++++-------------
>  arch/x86/kvm/mmu/mmu_internal.h | 15 +++++++++++++--
>  arch/x86/kvm/mmu/paging_tmpl.h  |  7 +++++--
>  4 files changed, 38 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f72e80178ffc..0f5a36772bdc 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -694,7 +694,7 @@ struct kvm_vcpu_arch {
>  
>  	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
>  	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> -	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
> +	struct kvm_mmu_memory_cache mmu_shadowed_translation_cache;

I'd give it a shorter name.. :) maybe mmu_shadowed_info_cache?  No strong
opinion.

>  	struct kvm_mmu_memory_cache mmu_page_header_cache;
>  
>  	/*

[...]

> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index b6e22ba9c654..c5b8ee625df7 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -32,6 +32,11 @@ extern bool dbg;
>  
>  typedef u64 __rcu *tdp_ptep_t;
>  
> +struct shadowed_translation_entry {
> +	u64 access:3;
> +	u64 gfn:56;

Why 56?

> +};

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 17/26] KVM: x86/mmu: Pass access information to make_huge_page_split_spte()
  2022-03-11  0:25   ` David Matlack
@ 2022-03-16  8:44     ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-16  8:44 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:19AM +0000, David Matlack wrote:
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 85b7bc333302..541b145b2df2 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1430,7 +1430,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>  	 * not been linked in yet and thus is not reachable from any other CPU.
>  	 */
>  	for (i = 0; i < PT64_ENT_PER_PAGE; i++)
> -		sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i);
> +		sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i, ACC_ALL);

Pure question: is it possible that huge_spte is read-only while we pass in
ACC_ALL here (which has the write bit set)?  Would it be better to make it
a "bool exec" to be clearer?

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 18/26] KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU
  2022-03-11  0:25   ` David Matlack
@ 2022-03-16  8:49     ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-16  8:49 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:20AM +0000, David Matlack wrote:
> Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU (i.e.
> in the rmap). This is fine for now since KVM never creates intermediate huge
> pages during dirty logging, i.e. a 1GiB page is never partially split to
> a 2MiB page.
> 
> However, this will stop being true once the shadow MMU participates in
> eager page splitting, which can in fact leave behind partially split
> huge pages. In preparation for that change, change the shadow MMU to
> iterate over all necessary levels when zapping collapsible SPTEs.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++++++++-------
>  1 file changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 89a7a8d7a632..2032be3edd71 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6142,18 +6142,30 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>  	return need_tlb_flush;
>  }
>  
> +static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
> +					   const struct kvm_memory_slot *slot)
> +{
> +	bool flush;
> +
> +	/*
> +	 * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
> +	 * pages that are already mapped at the maximum possible level.
> +	 */
> +	flush = slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
> +				  PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1,
> +				  true);
> +
> +	if (flush)
> +		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> +
> +}

Reviewed-by: Peter Xu <peterx@redhat.com>

IMHO it looks cleaner to write it in the old way (drop the flush var).
Maybe even unwrap the helper?
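
E.g. (untested), just rearranging the lines from the hunk above:

	if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
			      PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true))
		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);

either kept inside the helper or open-coded at the call site.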

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 19/26] KVM: x86/mmu: Refactor drop_large_spte()
  2022-03-11  0:25   ` David Matlack
@ 2022-03-16  8:53     ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-16  8:53 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:21AM +0000, David Matlack wrote:
> drop_large_spte() drops a large SPTE if it exists and then flushes TLBs.
> Its helper function, __drop_large_spte(), does the drop without the
> flush.
> 
> In preparation for eager page splitting, which will need to sometimes
> flush when dropping large SPTEs (and sometimes not), push the flushing
> logic down into __drop_large_spte() and add a bool parameter to control
> it.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

The new helpers look much better indeed..

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 20/26] KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU
  2022-03-11  0:25   ` David Matlack
@ 2022-03-16 10:26     ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-16 10:26 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:22AM +0000, David Matlack wrote:
> Extend KVM's eager page splitting to also split huge pages that are
> mapped by the shadow MMU. Specifically, walk through the rmap splitting
> all 1GiB pages to 2MiB pages, and splitting all 2MiB pages to 4KiB
> pages.
> 
> Splitting huge pages mapped by the shadow MMU requires dealing with some
> extra complexity beyond that of the TDP MMU:
> 
> (1) The shadow MMU has a limit on the number of shadow pages that are
>     allowed to be allocated. So, as a policy, Eager Page Splitting
>     refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
>     pages available.
> 
> (2) Huge pages may be mapped by indirect shadow pages which have the
>     possibility of being unsync. As a policy we opt not to split such
>     pages as their translation may no longer be valid.
> 
> (3) Splitting a huge page may end up re-using an existing lower level
>     shadow page tables. This is unlike the TDP MMU which always allocates
>     new shadow page tables when splitting.  This commit does *not*
>     handle such aliasing and opts not to split such huge pages.
> 
> (4) When installing the lower level SPTEs, they must be added to the
>     rmap which may require allocating additional pte_list_desc structs.
>     This commit does *not* handle such cases and instead opts to leave
>     such lower-level SPTEs non-present. In this situation TLBs must be
>     flushed before dropping the MMU lock as a portion of the huge page
>     region is being unmapped.
> 
> Suggested-by: Peter Feiner <pfeiner@google.com>
> [ This commit is based off of the original implementation of Eager Page
>   Splitting from Peter in Google's kernel from 2016. ]
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |   3 -
>  arch/x86/kvm/mmu/mmu.c                        | 307 ++++++++++++++++++
>  2 files changed, 307 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 05161afd7642..495f6ac53801 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2360,9 +2360,6 @@
>  			the KVM_CLEAR_DIRTY ioctl, and only for the pages being
>  			cleared.
>  
> -			Eager page splitting currently only supports splitting
> -			huge pages mapped by the TDP MMU.
> -
>  			Default is Y (on).
>  
>  	kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 926ddfaa9e1a..dd56b5b9624f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -727,6 +727,11 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
>  
>  static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
>  {
> +	static const gfp_t gfp_nocache = GFP_ATOMIC | __GFP_ACCOUNT | __GFP_ZERO;
> +
> +	if (WARN_ON_ONCE(!cache))
> +		return kmem_cache_alloc(pte_list_desc_cache, gfp_nocache);
> +

I also think this doesn't really belong in this patch.  Maybe it would be
more suitable for the earlier rmap_add() rework patch, or maybe it can be
dropped entirely if it should never trigger at all, in which case we'd
simply crash below when dereferencing the NULL cache.

>  	return kvm_mmu_memory_cache_alloc(cache);
>  }
>  
> @@ -743,6 +748,28 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>  	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
>  }
>  
> +static gfn_t sptep_to_gfn(u64 *sptep)
> +{
> +	struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> +
> +	return kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> +}
> +
> +static unsigned int kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
> +{
> +	if (!sp->role.direct)
> +		return sp->shadowed_translation[index].access;
> +
> +	return sp->role.access;
> +}
> +
> +static unsigned int sptep_to_access(u64 *sptep)
> +{
> +	struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> +
> +	return kvm_mmu_page_get_access(sp, sptep - sp->spt);
> +}
> +
>  static void kvm_mmu_page_set_gfn_access(struct kvm_mmu_page *sp, int index,
>  					gfn_t gfn, u32 access)
>  {
> @@ -912,6 +939,9 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
>  	return count;
>  }
>  
> +static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
> +					 const struct kvm_memory_slot *slot);
> +
>  static void
>  pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
>  			   struct pte_list_desc *desc, int i,
> @@ -2125,6 +2155,23 @@ static struct kvm_mmu_page *__kvm_mmu_find_shadow_page(struct kvm *kvm,
>  	return sp;
>  }
>  
> +static struct kvm_mmu_page *kvm_mmu_find_direct_sp(struct kvm *kvm, gfn_t gfn,
> +						   union kvm_mmu_page_role role)
> +{
> +	struct kvm_mmu_page *sp;
> +	LIST_HEAD(invalid_list);
> +
> +	BUG_ON(!role.direct);
> +
> +	sp = __kvm_mmu_find_shadow_page(kvm, gfn, role, &invalid_list);
> +
> +	/* Direct SPs are never unsync. */
> +	WARN_ON_ONCE(sp && sp->unsync);
> +
> +	kvm_mmu_commit_zap_page(kvm, &invalid_list);
> +	return sp;
> +}
> +
>  /*
>   * Looks up an existing SP for the given gfn and role if one exists. The
>   * return SP is guaranteed to be synced.
> @@ -6063,12 +6110,266 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>  		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
>  }
>  
> +static int prepare_to_split_huge_page(struct kvm *kvm,
> +				      const struct kvm_memory_slot *slot,
> +				      u64 *huge_sptep,
> +				      struct kvm_mmu_page **spp,
> +				      bool *flush,
> +				      bool *dropped_lock)
> +{
> +	int r = 0;
> +
> +	*dropped_lock = false;
> +
> +	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
> +		return -ENOSPC;
> +
> +	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> +		goto drop_lock;
> +

It's not immediately clear whether there can be a case where *spp is
already set when entering this function.  Some sanity check might be nice?
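
E.g. even something as small as (untested):

	if (WARN_ON_ONCE(*spp))
		return -EINVAL;

at the top of the function would document the expectation.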

> +	*spp = kvm_mmu_alloc_direct_sp_for_split(true);
> +	if (r)
> +		goto drop_lock;
> +
> +	return 0;
> +
> +drop_lock:
> +	if (*flush)
> +		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> +
> +	*flush = false;
> +	*dropped_lock = true;
> +
> +	write_unlock(&kvm->mmu_lock);
> +	cond_resched();
> +	*spp = kvm_mmu_alloc_direct_sp_for_split(false);
> +	if (!*spp)
> +		r = -ENOMEM;
> +	write_lock(&kvm->mmu_lock);
> +
> +	return r;
> +}
> +
> +static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
> +						     const struct kvm_memory_slot *slot,
> +						     u64 *huge_sptep,
> +						     struct kvm_mmu_page **spp)
> +{
> +	struct kvm_mmu_page *split_sp;
> +	union kvm_mmu_page_role role;
> +	unsigned int access;
> +	gfn_t gfn;
> +
> +	gfn = sptep_to_gfn(huge_sptep);
> +	access = sptep_to_access(huge_sptep);
> +
> +	/*
> +	 * Huge page splitting always uses direct shadow pages since we are
> +	 * directly mapping the huge page GFN region with smaller pages.
> +	 */
> +	role = kvm_mmu_child_role(huge_sptep, true, access);
> +	split_sp = kvm_mmu_find_direct_sp(kvm, gfn, role);
> +
> +	/*
> +	 * Opt not to split if the lower-level SP already exists. This requires
> +	 * more complex handling as the SP may be already partially filled in
> +	 * and may need extra pte_list_desc structs to update parent_ptes.
> +	 */
> +	if (split_sp)
> +		return NULL;

This smells tricky..

Firstly, we're trying to look up an existing SP that already shadows this
huge page region in split form, with the access bits fetched from the
shadow cache (so without the huge page NX effect).  However, could those
pages be mapped with different permissions than the currently huge-mapped
page?

IIUC all of this is because we can't allocate pte_list_desc structs and we
want to make sure we never grow a pte list to a length >1.

But I also see that the pte_list check below...

> +
> +	swap(split_sp, *spp);
> +	init_shadow_page(kvm, split_sp, slot, gfn, role);
> +	trace_kvm_mmu_get_page(split_sp, true);
> +
> +	return split_sp;
> +}
> +
> +static int kvm_mmu_split_huge_page(struct kvm *kvm,
> +				   const struct kvm_memory_slot *slot,
> +				   u64 *huge_sptep, struct kvm_mmu_page **spp,
> +				   bool *flush)
> +
> +{
> +	struct kvm_mmu_page *split_sp;
> +	u64 huge_spte, split_spte;
> +	int split_level, index;
> +	unsigned int access;
> +	u64 *split_sptep;
> +	gfn_t split_gfn;
> +
> +	split_sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
> +	if (!split_sp)
> +		return -EOPNOTSUPP;
> +
> +	/*
> +	 * Since we did not allocate pte_list_desc_structs for the split, we
> +	 * cannot add a new parent SPTE to parent_ptes. This should never happen
> +	 * in practice though since this is a fresh SP.
> +	 *
> +	 * Note, this makes it safe to pass NULL to __link_shadow_page() below.
> +	 */
> +	if (WARN_ON_ONCE(split_sp->parent_ptes.val))
> +		return -EINVAL;
> +
> +	huge_spte = READ_ONCE(*huge_sptep);
> +
> +	split_level = split_sp->role.level;
> +	access = split_sp->role.access;
> +
> +	for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
> +		split_sptep = &split_sp->spt[index];
> +		split_gfn = kvm_mmu_page_get_gfn(split_sp, index);
> +
> +		BUG_ON(is_shadow_present_pte(*split_sptep));
> +
> +		/*
> +		 * Since we did not allocate pte_list_desc structs for the
> +		 * split, we can't add a new SPTE that maps this GFN.
> +		 * Skipping this SPTE means we're only partially mapping the
> +		 * huge page, which means we'll need to flush TLBs before
> +		 * dropping the MMU lock.
> +		 *
> +		 * Note, this make it safe to pass NULL to __rmap_add() below.
> +		 */
> +		if (gfn_to_rmap(split_gfn, split_level, slot)->val) {
> +			*flush = true;
> +			continue;
> +		}

... here.

IIUC this check should already cover all the cases, and it's precise about
the fact that we don't want to grow any rmap to a length >1.

> +
> +		split_spte = make_huge_page_split_spte(
> +				huge_spte, split_level + 1, index, access);
> +
> +		mmu_spte_set(split_sptep, split_spte);
> +		__rmap_add(kvm, NULL, slot, split_sptep, split_gfn, access);

__rmap_add() with a NULL cache pointer is weird.. same as
__link_shadow_page() below.

I'll stop here for now, I guess.. Have you considered having the rmap
allocation ready from the start, rather than taking this intermediate step
and only adding it later?  Because all of this looks a bit hackish to me..
It's also possible that I missed something important; if so, please shoot.

Thanks,

> +	}
> +
> +	/*
> +	 * Replace the huge spte with a pointer to the populated lower level
> +	 * page table. Since we are making this change without a TLB flush vCPUs
> +	 * will see a mix of the split mappings and the original huge mapping,
> +	 * depending on what's currently in their TLB. This is fine from a
> +	 * correctness standpoint since the translation will either be identical
> +	 * or non-present. To account for non-present mappings, the TLB will be
> +	 * flushed prior to dropping the MMU lock.
> +	 */
> +	__drop_large_spte(kvm, huge_sptep, false);
> +	__link_shadow_page(NULL, huge_sptep, split_sp);
> +
> +	return 0;
> +}

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 20/26] KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU
@ 2022-03-16 10:26     ` Peter Xu
  0 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-16 10:26 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, maciej.szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Fri, Mar 11, 2022 at 12:25:22AM +0000, David Matlack wrote:
> Extend KVM's eager page splitting to also split huge pages that are
> mapped by the shadow MMU. Specifically, walk through the rmap splitting
> all 1GiB pages to 2MiB pages, and splitting all 2MiB pages to 4KiB
> pages.
> 
> Splitting huge pages mapped by the shadow MMU requires dealing with some
> extra complexity beyond that of the TDP MMU:
> 
> (1) The shadow MMU has a limit on the number of shadow pages that are
>     allowed to be allocated. So, as a policy, Eager Page Splitting
>     refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
>     pages available.
> 
> (2) Huge pages may be mapped by indirect shadow pages which have the
>     possibility of being unsync. As a policy we opt not to split such
>     pages as their translation may no longer be valid.
> 
> (3) Splitting a huge page may end up re-using an existing lower level
>     shadow page table. This is unlike the TDP MMU which always allocates
>     new shadow page tables when splitting.  This commit does *not*
>     handle such aliasing and opts not to split such huge pages.
> 
> (4) When installing the lower level SPTEs, they must be added to the
>     rmap which may require allocating additional pte_list_desc structs.
>     This commit does *not* handle such cases and instead opts to leave
>     such lower-level SPTEs non-present. In this situation TLBs must be
>     flushed before dropping the MMU lock as a portion of the huge page
>     region is being unmapped.
> 
> Suggested-by: Peter Feiner <pfeiner@google.com>
> [ This commit is based off of the original implementation of Eager Page
>   Splitting from Peter in Google's kernel from 2016. ]
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |   3 -
>  arch/x86/kvm/mmu/mmu.c                        | 307 ++++++++++++++++++
>  2 files changed, 307 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 05161afd7642..495f6ac53801 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2360,9 +2360,6 @@
>  			the KVM_CLEAR_DIRTY ioctl, and only for the pages being
>  			cleared.
>  
> -			Eager page splitting currently only supports splitting
> -			huge pages mapped by the TDP MMU.
> -
>  			Default is Y (on).
>  
>  	kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 926ddfaa9e1a..dd56b5b9624f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -727,6 +727,11 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
>  
>  static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
>  {
> +	static const gfp_t gfp_nocache = GFP_ATOMIC | __GFP_ACCOUNT | __GFP_ZERO;
> +
> +	if (WARN_ON_ONCE(!cache))
> +		return kmem_cache_alloc(pte_list_desc_cache, gfp_nocache);
> +

I also think this is not appropriate to add in this patch.  Maybe it'd be
more suitable for the earlier rmap_add() rework patch, or maybe it can be
dropped directly if it should never trigger at all; then we'd die hard
below when referencing it.
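
One reading of "dropped directly ... then we'd die hard below" is simply
keeping the helper trivial, e.g. (illustrative sketch, not a proposal taken
from the thread):

static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
{
	/*
	 * No NULL-cache fallback: callers are expected to always pass a
	 * topped-up cache, and a NULL pointer here oopses immediately rather
	 * than papering over the bug with a GFP_ATOMIC allocation.
	 */
	return kvm_mmu_memory_cache_alloc(cache);
}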

>  	return kvm_mmu_memory_cache_alloc(cache);
>  }
>  
> @@ -743,6 +748,28 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>  	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
>  }
>  
> +static gfn_t sptep_to_gfn(u64 *sptep)
> +{
> +	struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> +
> +	return kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> +}
> +
> +static unsigned int kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
> +{
> +	if (!sp->role.direct)
> +		return sp->shadowed_translation[index].access;
> +
> +	return sp->role.access;
> +}
> +
> +static unsigned int sptep_to_access(u64 *sptep)
> +{
> +	struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> +
> +	return kvm_mmu_page_get_access(sp, sptep - sp->spt);
> +}
> +
>  static void kvm_mmu_page_set_gfn_access(struct kvm_mmu_page *sp, int index,
>  					gfn_t gfn, u32 access)
>  {
> @@ -912,6 +939,9 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
>  	return count;
>  }
>  
> +static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
> +					 const struct kvm_memory_slot *slot);
> +
>  static void
>  pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
>  			   struct pte_list_desc *desc, int i,
> @@ -2125,6 +2155,23 @@ static struct kvm_mmu_page *__kvm_mmu_find_shadow_page(struct kvm *kvm,
>  	return sp;
>  }
>  
> +static struct kvm_mmu_page *kvm_mmu_find_direct_sp(struct kvm *kvm, gfn_t gfn,
> +						   union kvm_mmu_page_role role)
> +{
> +	struct kvm_mmu_page *sp;
> +	LIST_HEAD(invalid_list);
> +
> +	BUG_ON(!role.direct);
> +
> +	sp = __kvm_mmu_find_shadow_page(kvm, gfn, role, &invalid_list);
> +
> +	/* Direct SPs are never unsync. */
> +	WARN_ON_ONCE(sp && sp->unsync);
> +
> +	kvm_mmu_commit_zap_page(kvm, &invalid_list);
> +	return sp;
> +}
> +
>  /*
>   * Looks up an existing SP for the given gfn and role if one exists. The
>   * return SP is guaranteed to be synced.
> @@ -6063,12 +6110,266 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>  		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
>  }
>  
> +static int prepare_to_split_huge_page(struct kvm *kvm,
> +				      const struct kvm_memory_slot *slot,
> +				      u64 *huge_sptep,
> +				      struct kvm_mmu_page **spp,
> +				      bool *flush,
> +				      bool *dropped_lock)
> +{
> +	int r = 0;
> +
> +	*dropped_lock = false;
> +
> +	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
> +		return -ENOSPC;
> +
> +	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> +		goto drop_lock;
> +

Not immediately clear whether there'll be a case where *spp is set within
the current function.  Some sanity check might be nice?
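
Something along these lines, presumably (a sketch of the suggested check, not
code from the posted patch):

	/*
	 * Sanity check: the caller is expected to pass in *spp == NULL,
	 * otherwise a previously allocated shadow page would be leaked when
	 * it is overwritten below.
	 */
	if (WARN_ON_ONCE(*spp))
		return -EINVAL;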

> +	*spp = kvm_mmu_alloc_direct_sp_for_split(true);
> +	if (r)
> +		goto drop_lock;
> +
> +	return 0;
> +
> +drop_lock:
> +	if (*flush)
> +		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> +
> +	*flush = false;
> +	*dropped_lock = true;
> +
> +	write_unlock(&kvm->mmu_lock);
> +	cond_resched();
> +	*spp = kvm_mmu_alloc_direct_sp_for_split(false);
> +	if (!*spp)
> +		r = -ENOMEM;
> +	write_lock(&kvm->mmu_lock);
> +
> +	return r;
> +}
> +
> +static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
> +						     const struct kvm_memory_slot *slot,
> +						     u64 *huge_sptep,
> +						     struct kvm_mmu_page **spp)
> +{
> +	struct kvm_mmu_page *split_sp;
> +	union kvm_mmu_page_role role;
> +	unsigned int access;
> +	gfn_t gfn;
> +
> +	gfn = sptep_to_gfn(huge_sptep);
> +	access = sptep_to_access(huge_sptep);
> +
> +	/*
> +	 * Huge page splitting always uses direct shadow pages since we are
> +	 * directly mapping the huge page GFN region with smaller pages.
> +	 */
> +	role = kvm_mmu_child_role(huge_sptep, true, access);
> +	split_sp = kvm_mmu_find_direct_sp(kvm, gfn, role);
> +
> +	/*
> +	 * Opt not to split if the lower-level SP already exists. This requires
> +	 * more complex handling as the SP may be already partially filled in
> +	 * and may need extra pte_list_desc structs to update parent_ptes.
> +	 */
> +	if (split_sp)
> +		return NULL;

This smells tricky..

Firstly we're trying to look up the existing SPs that have shadowed this huge
page in a split way, with the access bits fetched from the shadow cache (so
without the hugepage nx effect).  However, could those pages be mapped with
different permissions from the currently hugely mapped page?

IIUC all of this is because we can't allocate pte_list_desc structs and we
want to make sure we won't grow any pte list to a length >1.

But I also see that the pte_list check below...

> +
> +	swap(split_sp, *spp);
> +	init_shadow_page(kvm, split_sp, slot, gfn, role);
> +	trace_kvm_mmu_get_page(split_sp, true);
> +
> +	return split_sp;
> +}
> +
> +static int kvm_mmu_split_huge_page(struct kvm *kvm,
> +				   const struct kvm_memory_slot *slot,
> +				   u64 *huge_sptep, struct kvm_mmu_page **spp,
> +				   bool *flush)
> +
> +{
> +	struct kvm_mmu_page *split_sp;
> +	u64 huge_spte, split_spte;
> +	int split_level, index;
> +	unsigned int access;
> +	u64 *split_sptep;
> +	gfn_t split_gfn;
> +
> +	split_sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
> +	if (!split_sp)
> +		return -EOPNOTSUPP;
> +
> +	/*
> +	 * Since we did not allocate pte_list_desc structs for the split, we
> +	 * cannot add a new parent SPTE to parent_ptes. This should never happen
> +	 * in practice though since this is a fresh SP.
> +	 *
> +	 * Note, this makes it safe to pass NULL to __link_shadow_page() below.
> +	 */
> +	if (WARN_ON_ONCE(split_sp->parent_ptes.val))
> +		return -EINVAL;
> +
> +	huge_spte = READ_ONCE(*huge_sptep);
> +
> +	split_level = split_sp->role.level;
> +	access = split_sp->role.access;
> +
> +	for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
> +		split_sptep = &split_sp->spt[index];
> +		split_gfn = kvm_mmu_page_get_gfn(split_sp, index);
> +
> +		BUG_ON(is_shadow_present_pte(*split_sptep));
> +
> +		/*
> +		 * Since we did not allocate pte_list_desc structs for the
> +		 * split, we can't add a new SPTE that maps this GFN.
> +		 * Skipping this SPTE means we're only partially mapping the
> +		 * huge page, which means we'll need to flush TLBs before
> +		 * dropping the MMU lock.
> +		 *
> +		 * Note, this makes it safe to pass NULL to __rmap_add() below.
> +		 */
> +		if (gfn_to_rmap(split_gfn, split_level, slot)->val) {
> +			*flush = true;
> +			continue;
> +		}

... here.

IIUC this check should already be able to cover all the cases, and it
accurately captures the fact that we don't want to grow any rmap to a length >1.

> +
> +		split_spte = make_huge_page_split_spte(
> +				huge_spte, split_level + 1, index, access);
> +
> +		mmu_spte_set(split_sptep, split_spte);
> +		__rmap_add(kvm, NULL, slot, split_sptep, split_gfn, access);

__rmap_add() with a NULL cache pointer is weird.. same as
__link_shadow_page() below.

I'll stop here for now, I guess..  Have you considered having the rmap
allocation ready altogether, rather than making this intermediate step and
only adding it later?  Because all of this looks hackish to me..  It's also
possible that I missed something important; if so, please shoot.

Thanks,

> +	}
> +
> +	/*
> +	 * Replace the huge spte with a pointer to the populated lower level
> +	 * page table. Since we are making this change without a TLB flush vCPUs
> +	 * will see a mix of the split mappings and the original huge mapping,
> +	 * depending on what's currently in their TLB. This is fine from a
> +	 * correctness standpoint since the translation will either be identical
> +	 * or non-present. To account for non-present mappings, the TLB will be
> +	 * flushed prior to dropping the MMU lock.
> +	 */
> +	__drop_large_spte(kvm, huge_sptep, false);
> +	__link_shadow_page(NULL, huge_sptep, split_sp);
> +
> +	return 0;
> +}

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 21/26] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-03-11  0:25   ` David Matlack
@ 2022-03-19  5:27     ` Anup Patel
  -1 siblings, 0 replies; 134+ messages in thread
From: Anup Patel @ 2022-03-19  5:27 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 11, 2022 at 5:56 AM David Matlack <dmatlack@google.com> wrote:
>
> Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
> declaration time rather than being fixed for all declarations. This will
> be used in a follow-up commit to declare a cache in x86 with a capacity
> of 512+ objects without having to increase the capacity of all caches in
> KVM.
>
> This change requires that each cache now specify its capacity at runtime,
> since the cache struct itself no longer has a fixed capacity known at
> compile time. To protect against someone accidentally defining a
> kvm_mmu_memory_cache struct directly (without the extra storage), this
> commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().
>
> This change, unfortunately, adds some grottiness to
> kvm_phys_addr_ioremap() in arm64, which uses a function-local (i.e.
> stack-allocated) kvm_mmu_memory_cache struct. Since C does not allow
> anonymous structs in functions, the new wrapper struct that contains
> kvm_mmu_memory_cache and the objects pointer array, must be named, which
> means dealing with an outer and inner struct. The outer struct can't be
> dropped since then there would be no guarantee the kvm_mmu_memory_cache
> struct and objects array would be laid out consecutively on the stack.
>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/arm64/include/asm/kvm_host.h |  2 +-
>  arch/arm64/kvm/arm.c              |  1 +
>  arch/arm64/kvm/mmu.c              | 13 +++++++++----
>  arch/mips/include/asm/kvm_host.h  |  2 +-
>  arch/mips/kvm/mips.c              |  2 ++
>  arch/riscv/include/asm/kvm_host.h |  2 +-
>  arch/riscv/kvm/vcpu.c             |  1 +
>  arch/x86/include/asm/kvm_host.h   |  8 ++++----
>  arch/x86/kvm/mmu/mmu.c            |  9 +++++++++
>  include/linux/kvm_types.h         | 19 +++++++++++++++++--
>  virt/kvm/kvm_main.c               | 10 +++++++++-
>  11 files changed, 55 insertions(+), 14 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 5bc01e62c08a..1369415290dd 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -357,7 +357,7 @@ struct kvm_vcpu_arch {
>         bool pause;
>
>         /* Cache some mmu pages needed inside spinlock regions */
> -       struct kvm_mmu_memory_cache mmu_page_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
>         /* Target CPU and feature flags */
>         int target;
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index ecc5958e27fe..5e38385be0ef 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -319,6 +319,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>         vcpu->arch.target = -1;
>         bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
>
> +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>
>         /* Set up the timer */
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index bc2aba953299..940089ba65ad 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -765,7 +765,12 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>  {
>         phys_addr_t addr;
>         int ret = 0;
> -       struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> +               .cache = {
> +                       .gfp_zero = __GFP_ZERO,
> +                       .capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> +               },
> +       };
>         struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
>         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
>                                      KVM_PGTABLE_PROT_R |
> @@ -778,14 +783,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>         guest_ipa &= PAGE_MASK;
>
>         for (addr = guest_ipa; addr < guest_ipa + size; addr += PAGE_SIZE) {
> -               ret = kvm_mmu_topup_memory_cache(&cache,
> +               ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
>                                                  kvm_mmu_cache_min_pages(kvm));
>                 if (ret)
>                         break;
>
>                 spin_lock(&kvm->mmu_lock);
>                 ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
> -                                            &cache);
> +                                            &page_cache.cache);
>                 spin_unlock(&kvm->mmu_lock);
>                 if (ret)
>                         break;
> @@ -793,7 +798,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>                 pa += PAGE_SIZE;
>         }
>
> -       kvm_mmu_free_memory_cache(&cache);
> +       kvm_mmu_free_memory_cache(&page_cache.cache);
>         return ret;
>  }
>
> diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
> index 717716cc51c5..935511d7fc3a 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -347,7 +347,7 @@ struct kvm_vcpu_arch {
>         unsigned long pending_exceptions_clr;
>
>         /* Cache some mmu pages needed inside spinlock regions */
> -       struct kvm_mmu_memory_cache mmu_page_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
>         /* vcpu's vzguestid is different on each host cpu in an smp system */
>         u32 vzguestid[NR_CPUS];
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index a25e0b73ee70..45c7179144dc 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -387,6 +387,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>         if (err)
>                 goto out_free_gebase;
>
> +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> +
>         return 0;
>
>  out_free_gebase:
> diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> index 99ef6a120617..5bd4902ebda3 100644
> --- a/arch/riscv/include/asm/kvm_host.h
> +++ b/arch/riscv/include/asm/kvm_host.h
> @@ -186,7 +186,7 @@ struct kvm_vcpu_arch {
>         struct kvm_sbi_context sbi_context;
>
>         /* Cache pages needed to program page tables with spinlock held */
> -       struct kvm_mmu_memory_cache mmu_page_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
>         /* VCPU power-off state */
>         bool power_off;
> diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> index 624166004e36..6a5f5aa45bac 100644
> --- a/arch/riscv/kvm/vcpu.c
> +++ b/arch/riscv/kvm/vcpu.c
> @@ -94,6 +94,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>
>         /* Mark this VCPU never ran */
>         vcpu->arch.ran_atleast_once = false;
> +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;

There is another function, stage2_ioremap(), which also needs to change
because it creates a kvm_mmu_memory_cache on the stack.
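
Presumably the fix would mirror the kvm_phys_addr_ioremap() hunk above; a
rough sketch (the surrounding riscv code is assumed here, only the declaration
pattern comes from this patch):

	/*
	 * In stage2_ioremap(), replace the on-stack kvm_mmu_memory_cache with
	 * the wrapper struct so the objects array is guaranteed to follow it:
	 */
	DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
		.cache = {
			.gfp_zero = __GFP_ZERO,
			.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
		},
	};

	/* ...and pass &page_cache.cache to the existing topup/map/free calls. */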

Regards,
Anup

>
>         /* Setup ISA features available to VCPU */
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 0f5a36772bdc..544dde11963b 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -692,10 +692,10 @@ struct kvm_vcpu_arch {
>          */
>         struct kvm_mmu *walk_mmu;
>
> -       struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> -       struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> -       struct kvm_mmu_memory_cache mmu_shadowed_translation_cache;
> -       struct kvm_mmu_memory_cache mmu_page_header_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_pte_list_desc_cache);
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadow_page_cache);
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadowed_translation_cache);
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_header_cache);
>
>         /*
>          * QEMU userspace and the guest each have their own FPU state.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index dd56b5b9624f..24e7e053e05b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5817,12 +5817,21 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
>  {
>         int ret;
>
> +       vcpu->arch.mmu_pte_list_desc_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
>         vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
>
> +       vcpu->arch.mmu_page_header_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
>         vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> +       vcpu->arch.mmu_shadowed_translation_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> +
> +       vcpu->arch.mmu_shadow_page_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
>         vcpu->arch.mmu = &vcpu->arch.root_mmu;
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index ac1ebb37a0ff..579cf39986ec 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -83,14 +83,29 @@ struct gfn_to_pfn_cache {
>   * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
>   * holding MMU locks.  Note, these caches act more like prefetch buffers than
>   * classical caches, i.e. objects are not returned to the cache on being freed.
> + *
> + * The storage for the cache object pointers is laid out after the struct, to
> + * allow different declarations to choose different capacities. The capacity
> + * field defines the number of object pointers available after the struct.
>   */
>  struct kvm_mmu_memory_cache {
>         int nobjs;
> +       int capacity;
>         gfp_t gfp_zero;
>         struct kmem_cache *kmem_cache;
> -       void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
> +       void *objects[];
>  };
> -#endif
> +
> +#define __DEFINE_KVM_MMU_MEMORY_CACHE(_name, _capacity)                \
> +       struct {                                                \
> +               struct kvm_mmu_memory_cache _name;              \
> +               void *_name##_objects[_capacity];               \
> +       }
> +
> +#define DEFINE_KVM_MMU_MEMORY_CACHE(_name) \
> +       __DEFINE_KVM_MMU_MEMORY_CACHE(_name, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE)
> +
> +#endif /* KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE */
>
>  #define HALT_POLL_HIST_COUNT                   32
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 9581a24c3d17..1d849ba9529f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -371,9 +371,17 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
>  {
>         void *obj;
>
> +       /*
> +	 * The capacity field must be initialized since the storage for the
> +        * objects pointer array is laid out after the kvm_mmu_memory_cache
> +        * struct and not known at compile time.
> +        */
> +       if (WARN_ON(mc->capacity == 0))
> +               return -EINVAL;
> +
>         if (mc->nobjs >= min)
>                 return 0;
> -       while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
> +       while (mc->nobjs < mc->capacity) {
>                 obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
>                 if (!obj)
>                         return mc->nobjs >= min ? 0 : -ENOMEM;
> --
> 2.35.1.723.g4982287a31-goog
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 21/26] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
@ 2022-03-19  5:27     ` Anup Patel
  0 siblings, 0 replies; 134+ messages in thread
From: Anup Patel @ 2022-03-19  5:27 UTC (permalink / raw)
  To: David Matlack
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Fri, Mar 11, 2022 at 5:56 AM David Matlack <dmatlack@google.com> wrote:
>
> Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
> declaration time rather than being fixed for all declarations. This will
> be used in a follow-up commit to declare a cache in x86 with a capacity
> of 512+ objects without having to increase the capacity of all caches in
> KVM.
>
> This change requires that each cache now specify its capacity at runtime,
> since the cache struct itself no longer has a fixed capacity known at
> compile time. To protect against someone accidentally defining a
> kvm_mmu_memory_cache struct directly (without the extra storage), this
> commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().
>
> This change, unfortunately, adds some grottiness to
> kvm_phys_addr_ioremap() in arm64, which uses a function-local (i.e.
> stack-allocated) kvm_mmu_memory_cache struct. Since C does not allow
> anonymous structs in functions, the new wrapper struct that contains
> kvm_mmu_memory_cache and the objects pointer array, must be named, which
> means dealing with an outer and inner struct. The outer struct can't be
> dropped since then there would be no guarantee the kvm_mmu_memory_cache
> struct and objects array would be laid out consecutively on the stack.
>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/arm64/include/asm/kvm_host.h |  2 +-
>  arch/arm64/kvm/arm.c              |  1 +
>  arch/arm64/kvm/mmu.c              | 13 +++++++++----
>  arch/mips/include/asm/kvm_host.h  |  2 +-
>  arch/mips/kvm/mips.c              |  2 ++
>  arch/riscv/include/asm/kvm_host.h |  2 +-
>  arch/riscv/kvm/vcpu.c             |  1 +
>  arch/x86/include/asm/kvm_host.h   |  8 ++++----
>  arch/x86/kvm/mmu/mmu.c            |  9 +++++++++
>  include/linux/kvm_types.h         | 19 +++++++++++++++++--
>  virt/kvm/kvm_main.c               | 10 +++++++++-
>  11 files changed, 55 insertions(+), 14 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 5bc01e62c08a..1369415290dd 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -357,7 +357,7 @@ struct kvm_vcpu_arch {
>         bool pause;
>
>         /* Cache some mmu pages needed inside spinlock regions */
> -       struct kvm_mmu_memory_cache mmu_page_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
>         /* Target CPU and feature flags */
>         int target;
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index ecc5958e27fe..5e38385be0ef 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -319,6 +319,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>         vcpu->arch.target = -1;
>         bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
>
> +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>
>         /* Set up the timer */
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index bc2aba953299..940089ba65ad 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -765,7 +765,12 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>  {
>         phys_addr_t addr;
>         int ret = 0;
> -       struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> +               .cache = {
> +                       .gfp_zero = __GFP_ZERO,
> +                       .capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> +               },
> +       };
>         struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
>         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
>                                      KVM_PGTABLE_PROT_R |
> @@ -778,14 +783,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>         guest_ipa &= PAGE_MASK;
>
>         for (addr = guest_ipa; addr < guest_ipa + size; addr += PAGE_SIZE) {
> -               ret = kvm_mmu_topup_memory_cache(&cache,
> +               ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
>                                                  kvm_mmu_cache_min_pages(kvm));
>                 if (ret)
>                         break;
>
>                 spin_lock(&kvm->mmu_lock);
>                 ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
> -                                            &cache);
> +                                            &page_cache.cache);
>                 spin_unlock(&kvm->mmu_lock);
>                 if (ret)
>                         break;
> @@ -793,7 +798,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>                 pa += PAGE_SIZE;
>         }
>
> -       kvm_mmu_free_memory_cache(&cache);
> +       kvm_mmu_free_memory_cache(&page_cache.cache);
>         return ret;
>  }
>
> diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
> index 717716cc51c5..935511d7fc3a 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -347,7 +347,7 @@ struct kvm_vcpu_arch {
>         unsigned long pending_exceptions_clr;
>
>         /* Cache some mmu pages needed inside spinlock regions */
> -       struct kvm_mmu_memory_cache mmu_page_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
>         /* vcpu's vzguestid is different on each host cpu in an smp system */
>         u32 vzguestid[NR_CPUS];
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index a25e0b73ee70..45c7179144dc 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -387,6 +387,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>         if (err)
>                 goto out_free_gebase;
>
> +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> +
>         return 0;
>
>  out_free_gebase:
> diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> index 99ef6a120617..5bd4902ebda3 100644
> --- a/arch/riscv/include/asm/kvm_host.h
> +++ b/arch/riscv/include/asm/kvm_host.h
> @@ -186,7 +186,7 @@ struct kvm_vcpu_arch {
>         struct kvm_sbi_context sbi_context;
>
>         /* Cache pages needed to program page tables with spinlock held */
> -       struct kvm_mmu_memory_cache mmu_page_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
>         /* VCPU power-off state */
>         bool power_off;
> diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> index 624166004e36..6a5f5aa45bac 100644
> --- a/arch/riscv/kvm/vcpu.c
> +++ b/arch/riscv/kvm/vcpu.c
> @@ -94,6 +94,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>
>         /* Mark this VCPU never ran */
>         vcpu->arch.ran_atleast_once = false;
> +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;

There is another function, stage2_ioremap(), which also needs to change
because it creates a kvm_mmu_memory_cache on the stack.

Regards,
Anup

>
>         /* Setup ISA features available to VCPU */
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 0f5a36772bdc..544dde11963b 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -692,10 +692,10 @@ struct kvm_vcpu_arch {
>          */
>         struct kvm_mmu *walk_mmu;
>
> -       struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> -       struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> -       struct kvm_mmu_memory_cache mmu_shadowed_translation_cache;
> -       struct kvm_mmu_memory_cache mmu_page_header_cache;
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_pte_list_desc_cache);
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadow_page_cache);
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadowed_translation_cache);
> +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_header_cache);
>
>         /*
>          * QEMU userspace and the guest each have their own FPU state.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index dd56b5b9624f..24e7e053e05b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5817,12 +5817,21 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
>  {
>         int ret;
>
> +       vcpu->arch.mmu_pte_list_desc_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
>         vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
>
> +       vcpu->arch.mmu_page_header_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
>         vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> +       vcpu->arch.mmu_shadowed_translation_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> +
> +       vcpu->arch.mmu_shadow_page_cache.capacity =
> +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
>         vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
>         vcpu->arch.mmu = &vcpu->arch.root_mmu;
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index ac1ebb37a0ff..579cf39986ec 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -83,14 +83,29 @@ struct gfn_to_pfn_cache {
>   * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
>   * holding MMU locks.  Note, these caches act more like prefetch buffers than
>   * classical caches, i.e. objects are not returned to the cache on being freed.
> + *
> + * The storage for the cache object pointers is laid out after the struct, to
> + * allow different declarations to choose different capacities. The capacity
> + * field defines the number of object pointers available after the struct.
>   */
>  struct kvm_mmu_memory_cache {
>         int nobjs;
> +       int capacity;
>         gfp_t gfp_zero;
>         struct kmem_cache *kmem_cache;
> -       void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
> +       void *objects[];
>  };
> -#endif
> +
> +#define __DEFINE_KVM_MMU_MEMORY_CACHE(_name, _capacity)                \
> +       struct {                                                \
> +               struct kvm_mmu_memory_cache _name;              \
> +               void *_name##_objects[_capacity];               \
> +       }
> +
> +#define DEFINE_KVM_MMU_MEMORY_CACHE(_name) \
> +       __DEFINE_KVM_MMU_MEMORY_CACHE(_name, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE)
> +
> +#endif /* KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE */
>
>  #define HALT_POLL_HIST_COUNT                   32
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 9581a24c3d17..1d849ba9529f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -371,9 +371,17 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
>  {
>         void *obj;
>
> +       /*
> +	 * The capacity field must be initialized since the storage for the
> +        * objects pointer array is laid out after the kvm_mmu_memory_cache
> +        * struct and not known at compile time.
> +        */
> +       if (WARN_ON(mc->capacity == 0))
> +               return -EINVAL;
> +
>         if (mc->nobjs >= min)
>                 return 0;
> -       while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
> +       while (mc->nobjs < mc->capacity) {
>                 obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
>                 if (!obj)
>                         return mc->nobjs >= min ? 0 : -ENOMEM;
> --
> 2.35.1.723.g4982287a31-goog
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 20/26] KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU
  2022-03-16 10:26     ` Peter Xu
@ 2022-03-22  0:07       ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-22  0:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Wed, Mar 16, 2022 at 3:27 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Mar 11, 2022 at 12:25:22AM +0000, David Matlack wrote:
> > Extend KVM's eager page splitting to also split huge pages that are
> > mapped by the shadow MMU. Specifically, walk through the rmap splitting
> > all 1GiB pages to 2MiB pages, and splitting all 2MiB pages to 4KiB
> > pages.
> >
> > Splitting huge pages mapped by the shadow MMU requires dealing with some
> > extra complexity beyond that of the TDP MMU:
> >
> > (1) The shadow MMU has a limit on the number of shadow pages that are
> >     allowed to be allocated. So, as a policy, Eager Page Splitting
> >     refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
> >     pages available.
> >
> > (2) Huge pages may be mapped by indirect shadow pages which have the
> >     possibility of being unsync. As a policy we opt not to split such
> >     pages as their translation may no longer be valid.
> >
> > (3) Splitting a huge page may end up re-using an existing lower level
> >     shadow page table. This is unlike the TDP MMU which always allocates
> >     new shadow page tables when splitting.  This commit does *not*
> >     handle such aliasing and opts not to split such huge pages.
> >
> > (4) When installing the lower level SPTEs, they must be added to the
> >     rmap which may require allocating additional pte_list_desc structs.
> >     This commit does *not* handle such cases and instead opts to leave
> >     such lower-level SPTEs non-present. In this situation TLBs must be
> >     flushed before dropping the MMU lock as a portion of the huge page
> >     region is being unmapped.
> >
> > Suggested-by: Peter Feiner <pfeiner@google.com>
> > [ This commit is based off of the original implementation of Eager Page
> >   Splitting from Peter in Google's kernel from 2016. ]
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  .../admin-guide/kernel-parameters.txt         |   3 -
> >  arch/x86/kvm/mmu/mmu.c                        | 307 ++++++++++++++++++
> >  2 files changed, 307 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 05161afd7642..495f6ac53801 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -2360,9 +2360,6 @@
> >                       the KVM_CLEAR_DIRTY ioctl, and only for the pages being
> >                       cleared.
> >
> > -                     Eager page splitting currently only supports splitting
> > -                     huge pages mapped by the TDP MMU.
> > -
> >                       Default is Y (on).
> >
> >       kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 926ddfaa9e1a..dd56b5b9624f 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -727,6 +727,11 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> >
> >  static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
> >  {
> > +     static const gfp_t gfp_nocache = GFP_ATOMIC | __GFP_ACCOUNT | __GFP_ZERO;
> > +
> > +     if (WARN_ON_ONCE(!cache))
> > +             return kmem_cache_alloc(pte_list_desc_cache, gfp_nocache);
> > +
>
> I also think this is not appropriate to add in this patch.  Maybe it'd be
> more suitable for the earlier rmap_add() rework patch, or maybe it can be
> dropped directly if it should never trigger at all; then we'd die hard
> below when referencing it.
>
> >       return kvm_mmu_memory_cache_alloc(cache);
> >  }
> >
> > @@ -743,6 +748,28 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
> >       return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
> >  }
> >
> > +static gfn_t sptep_to_gfn(u64 *sptep)
> > +{
> > +     struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> > +
> > +     return kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> > +}
> > +
> > +static unsigned int kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
> > +{
> > +     if (!sp->role.direct)
> > +             return sp->shadowed_translation[index].access;
> > +
> > +     return sp->role.access;
> > +}
> > +
> > +static unsigned int sptep_to_access(u64 *sptep)
> > +{
> > +     struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> > +
> > +     return kvm_mmu_page_get_access(sp, sptep - sp->spt);
> > +}
> > +
> >  static void kvm_mmu_page_set_gfn_access(struct kvm_mmu_page *sp, int index,
> >                                       gfn_t gfn, u32 access)
> >  {
> > @@ -912,6 +939,9 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
> >       return count;
> >  }
> >
> > +static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
> > +                                      const struct kvm_memory_slot *slot);
> > +
> >  static void
> >  pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
> >                          struct pte_list_desc *desc, int i,
> > @@ -2125,6 +2155,23 @@ static struct kvm_mmu_page *__kvm_mmu_find_shadow_page(struct kvm *kvm,
> >       return sp;
> >  }
> >
> > +static struct kvm_mmu_page *kvm_mmu_find_direct_sp(struct kvm *kvm, gfn_t gfn,
> > +                                                union kvm_mmu_page_role role)
> > +{
> > +     struct kvm_mmu_page *sp;
> > +     LIST_HEAD(invalid_list);
> > +
> > +     BUG_ON(!role.direct);
> > +
> > +     sp = __kvm_mmu_find_shadow_page(kvm, gfn, role, &invalid_list);
> > +
> > +     /* Direct SPs are never unsync. */
> > +     WARN_ON_ONCE(sp && sp->unsync);
> > +
> > +     kvm_mmu_commit_zap_page(kvm, &invalid_list);
> > +     return sp;
> > +}
> > +
> >  /*
> >   * Looks up an existing SP for the given gfn and role if one exists. The
> >   * return SP is guaranteed to be synced.
> > @@ -6063,12 +6110,266 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> >               kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
> >  }
> >
> > +static int prepare_to_split_huge_page(struct kvm *kvm,
> > +                                   const struct kvm_memory_slot *slot,
> > +                                   u64 *huge_sptep,
> > +                                   struct kvm_mmu_page **spp,
> > +                                   bool *flush,
> > +                                   bool *dropped_lock)
> > +{
> > +     int r = 0;
> > +
> > +     *dropped_lock = false;
> > +
> > +     if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
> > +             return -ENOSPC;
> > +
> > +     if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> > +             goto drop_lock;
> > +
>
> Not immediately clear whether there'll be a case where *spp is set within
> the current function.  Some sanity check might be nice?
>
> > +     *spp = kvm_mmu_alloc_direct_sp_for_split(true);
> > +     if (r)
> > +             goto drop_lock;
> > +
> > +     return 0;
> > +
> > +drop_lock:
> > +     if (*flush)
> > +             kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > +
> > +     *flush = false;
> > +     *dropped_lock = true;
> > +
> > +     write_unlock(&kvm->mmu_lock);
> > +     cond_resched();
> > +     *spp = kvm_mmu_alloc_direct_sp_for_split(false);
> > +     if (!*spp)
> > +             r = -ENOMEM;
> > +     write_lock(&kvm->mmu_lock);
> > +
> > +     return r;
> > +}
> > +
> > +static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
> > +                                                  const struct kvm_memory_slot *slot,
> > +                                                  u64 *huge_sptep,
> > +                                                  struct kvm_mmu_page **spp)
> > +{
> > +     struct kvm_mmu_page *split_sp;
> > +     union kvm_mmu_page_role role;
> > +     unsigned int access;
> > +     gfn_t gfn;
> > +
> > +     gfn = sptep_to_gfn(huge_sptep);
> > +     access = sptep_to_access(huge_sptep);
> > +
> > +     /*
> > +      * Huge page splitting always uses direct shadow pages since we are
> > +      * directly mapping the huge page GFN region with smaller pages.
> > +      */
> > +     role = kvm_mmu_child_role(huge_sptep, true, access);
> > +     split_sp = kvm_mmu_find_direct_sp(kvm, gfn, role);
> > +
> > +     /*
> > +      * Opt not to split if the lower-level SP already exists. This requires
> > +      * more complex handling as the SP may be already partially filled in
> > +      * and may need extra pte_list_desc structs to update parent_ptes.
> > +      */
> > +     if (split_sp)
> > +             return NULL;
>
> This smells tricky..
>
> Firstly we're trying to look up the existing SPs that have shadowed this huge
> page in a split way, with the access bits fetched from the shadow cache (so
> without the hugepage nx effect).  However, could those pages be mapped with
> different permissions from the currently hugely mapped page?
>
> IIUC all of this is because we can't allocate pte_list_desc structs and we
> want to make sure we won't grow any pte list to a length >1.
>
> But I also see that the pte_list check below...
>
> > +
> > +     swap(split_sp, *spp);
> > +     init_shadow_page(kvm, split_sp, slot, gfn, role);
> > +     trace_kvm_mmu_get_page(split_sp, true);
> > +
> > +     return split_sp;
> > +}
> > +
> > +static int kvm_mmu_split_huge_page(struct kvm *kvm,
> > +                                const struct kvm_memory_slot *slot,
> > +                                u64 *huge_sptep, struct kvm_mmu_page **spp,
> > +                                bool *flush)
> > +
> > +{
> > +     struct kvm_mmu_page *split_sp;
> > +     u64 huge_spte, split_spte;
> > +     int split_level, index;
> > +     unsigned int access;
> > +     u64 *split_sptep;
> > +     gfn_t split_gfn;
> > +
> > +     split_sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
> > +     if (!split_sp)
> > +             return -EOPNOTSUPP;
> > +
> > +     /*
> > +      * Since we did not allocate pte_list_desc structs for the split, we
> > +      * cannot add a new parent SPTE to parent_ptes. This should never happen
> > +      * in practice though since this is a fresh SP.
> > +      *
> > +      * Note, this makes it safe to pass NULL to __link_shadow_page() below.
> > +      */
> > +     if (WARN_ON_ONCE(split_sp->parent_ptes.val))
> > +             return -EINVAL;
> > +
> > +     huge_spte = READ_ONCE(*huge_sptep);
> > +
> > +     split_level = split_sp->role.level;
> > +     access = split_sp->role.access;
> > +
> > +     for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
> > +             split_sptep = &split_sp->spt[index];
> > +             split_gfn = kvm_mmu_page_get_gfn(split_sp, index);
> > +
> > +             BUG_ON(is_shadow_present_pte(*split_sptep));
> > +
> > +             /*
> > +              * Since we did not allocate pte_list_desc structs for the
> > +              * split, we can't add a new SPTE that maps this GFN.
> > +              * Skipping this SPTE means we're only partially mapping the
> > +              * huge page, which means we'll need to flush TLBs before
> > +              * dropping the MMU lock.
> > +              *
> > +              * Note, this makes it safe to pass NULL to __rmap_add() below.
> > +              */
> > +             if (gfn_to_rmap(split_gfn, split_level, slot)->val) {
> > +                     *flush = true;
> > +                     continue;
> > +             }
>
> ... here.
>
> IIUC this check should already be able to cover all the cases, and it
> accurately captures the fact that we don't want to grow any rmap to a length >1.
>
> > +
> > +             split_spte = make_huge_page_split_spte(
> > +                             huge_spte, split_level + 1, index, access);
> > +
> > +             mmu_spte_set(split_sptep, split_spte);
> > +             __rmap_add(kvm, NULL, slot, split_sptep, split_gfn, access);
>
> __rmap_add() with a NULL cache pointer is weird.. same as
> __link_shadow_page() below.
>
> I'll stop here for now, I guess..  Have you considered having the rmap
> allocation ready altogether, rather than making this intermediate step and
> only adding it later?  Because all of this looks hackish to me..  It's also
> possible that I missed something important; if so, please shoot.

I'd be happy to do it that way. The reasons I broke it up into the
intermediate steps are:
 - At Google we only support up to and including this patch. We don't
handle the cases where the rmap or parent_ptes list needs to grow.
 - It seemed like a good way to break up the support into smaller
patches. But I think this backfired since the intermediate steps
introduce their own complexity such as passing in NULL to
__rmap_add().

>
> Thanks,
>
> > +     }
> > +
> > +     /*
> > +      * Replace the huge spte with a pointer to the populated lower level
> > +      * page table. Since we are making this change without a TLB flush vCPUs
> > +      * will see a mix of the split mappings and the original huge mapping,
> > +      * depending on what's currently in their TLB. This is fine from a
> > +      * correctness standpoint since the translation will either be identical
> > +      * or non-present. To account for non-present mappings, the TLB will be
> > +      * flushed prior to dropping the MMU lock.
> > +      */
> > +     __drop_large_spte(kvm, huge_sptep, false);
> > +     __link_shadow_page(NULL, huge_sptep, split_sp);
> > +
> > +     return 0;
> > +}
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 20/26] KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU
@ 2022-03-22  0:07       ` David Matlack
  0 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-22  0:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Wed, Mar 16, 2022 at 3:27 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Mar 11, 2022 at 12:25:22AM +0000, David Matlack wrote:
> > Extend KVM's eager page splitting to also split huge pages that are
> > mapped by the shadow MMU. Specifically, walk through the rmap splitting
> > all 1GiB pages to 2MiB pages, and splitting all 2MiB pages to 4KiB
> > pages.
> >
> > Splitting huge pages mapped by the shadow MMU requires dealing with some
> > extra complexity beyond that of the TDP MMU:
> >
> > (1) The shadow MMU has a limit on the number of shadow pages that are
> >     allowed to be allocated. So, as a policy, Eager Page Splitting
> >     refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
> >     pages available.
> >
> > (2) Huge pages may be mapped by indirect shadow pages which have the
> >     possibility of being unsync. As a policy we opt not to split such
> >     pages as their translation may no longer be valid.
> >
> > (3) Splitting a huge page may end up re-using an existing lower level
> >     shadow page table. This is unlike the TDP MMU which always allocates
> >     new shadow page tables when splitting.  This commit does *not*
> >     handle such aliasing and opts not to split such huge pages.
> >
> > (4) When installing the lower level SPTEs, they must be added to the
> >     rmap which may require allocating additional pte_list_desc structs.
> >     This commit does *not* handle such cases and instead opts to leave
> >     such lower-level SPTEs non-present. In this situation TLBs must be
> >     flushed before dropping the MMU lock as a portion of the huge page
> >     region is being unmapped.
> >
> > Suggested-by: Peter Feiner <pfeiner@google.com>
> > [ This commit is based off of the original implementation of Eager Page
> >   Splitting from Peter in Google's kernel from 2016. ]
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  .../admin-guide/kernel-parameters.txt         |   3 -
> >  arch/x86/kvm/mmu/mmu.c                        | 307 ++++++++++++++++++
> >  2 files changed, 307 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 05161afd7642..495f6ac53801 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -2360,9 +2360,6 @@
> >                       the KVM_CLEAR_DIRTY ioctl, and only for the pages being
> >                       cleared.
> >
> > -                     Eager page splitting currently only supports splitting
> > -                     huge pages mapped by the TDP MMU.
> > -
> >                       Default is Y (on).
> >
> >       kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 926ddfaa9e1a..dd56b5b9624f 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -727,6 +727,11 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> >
> >  static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
> >  {
> > +     static const gfp_t gfp_nocache = GFP_ATOMIC | __GFP_ACCOUNT | __GFP_ZERO;
> > +
> > +     if (WARN_ON_ONCE(!cache))
> > +             return kmem_cache_alloc(pte_list_desc_cache, gfp_nocache);
> > +
>
> I also think this is not appropriate to add in this patch.  Maybe it'd be
> more suitable for the earlier rmap_add() rework patch, or maybe it can be
> dropped directly if it should never trigger at all; then we'd die hard
> below when referencing it.
>
> >       return kvm_mmu_memory_cache_alloc(cache);
> >  }
> >
> > @@ -743,6 +748,28 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
> >       return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
> >  }
> >
> > +static gfn_t sptep_to_gfn(u64 *sptep)
> > +{
> > +     struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> > +
> > +     return kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> > +}
> > +
> > +static unsigned int kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
> > +{
> > +     if (!sp->role.direct)
> > +             return sp->shadowed_translation[index].access;
> > +
> > +     return sp->role.access;
> > +}
> > +
> > +static unsigned int sptep_to_access(u64 *sptep)
> > +{
> > +     struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> > +
> > +     return kvm_mmu_page_get_access(sp, sptep - sp->spt);
> > +}
> > +
> >  static void kvm_mmu_page_set_gfn_access(struct kvm_mmu_page *sp, int index,
> >                                       gfn_t gfn, u32 access)
> >  {
> > @@ -912,6 +939,9 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
> >       return count;
> >  }
> >
> > +static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
> > +                                      const struct kvm_memory_slot *slot);
> > +
> >  static void
> >  pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
> >                          struct pte_list_desc *desc, int i,
> > @@ -2125,6 +2155,23 @@ static struct kvm_mmu_page *__kvm_mmu_find_shadow_page(struct kvm *kvm,
> >       return sp;
> >  }
> >
> > +static struct kvm_mmu_page *kvm_mmu_find_direct_sp(struct kvm *kvm, gfn_t gfn,
> > +                                                union kvm_mmu_page_role role)
> > +{
> > +     struct kvm_mmu_page *sp;
> > +     LIST_HEAD(invalid_list);
> > +
> > +     BUG_ON(!role.direct);
> > +
> > +     sp = __kvm_mmu_find_shadow_page(kvm, gfn, role, &invalid_list);
> > +
> > +     /* Direct SPs are never unsync. */
> > +     WARN_ON_ONCE(sp && sp->unsync);
> > +
> > +     kvm_mmu_commit_zap_page(kvm, &invalid_list);
> > +     return sp;
> > +}
> > +
> >  /*
> >   * Looks up an existing SP for the given gfn and role if one exists. The
> >   * return SP is guaranteed to be synced.
> > @@ -6063,12 +6110,266 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> >               kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
> >  }
> >
> > +static int prepare_to_split_huge_page(struct kvm *kvm,
> > +                                   const struct kvm_memory_slot *slot,
> > +                                   u64 *huge_sptep,
> > +                                   struct kvm_mmu_page **spp,
> > +                                   bool *flush,
> > +                                   bool *dropped_lock)
> > +{
> > +     int r = 0;
> > +
> > +     *dropped_lock = false;
> > +
> > +     if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
> > +             return -ENOSPC;
> > +
> > +     if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> > +             goto drop_lock;
> > +
>
> It's not immediately clear whether there's a case where *spp is already set
> within the current function.  Some sanity check might be nice?
>
> > +     *spp = kvm_mmu_alloc_direct_sp_for_split(true);
> > +     if (r)
> > +             goto drop_lock;
> > +
> > +     return 0;
> > +
> > +drop_lock:
> > +     if (*flush)
> > +             kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > +
> > +     *flush = false;
> > +     *dropped_lock = true;
> > +
> > +     write_unlock(&kvm->mmu_lock);
> > +     cond_resched();
> > +     *spp = kvm_mmu_alloc_direct_sp_for_split(false);
> > +     if (!*spp)
> > +             r = -ENOMEM;
> > +     write_lock(&kvm->mmu_lock);
> > +
> > +     return r;
> > +}
> > +
> > +static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
> > +                                                  const struct kvm_memory_slot *slot,
> > +                                                  u64 *huge_sptep,
> > +                                                  struct kvm_mmu_page **spp)
> > +{
> > +     struct kvm_mmu_page *split_sp;
> > +     union kvm_mmu_page_role role;
> > +     unsigned int access;
> > +     gfn_t gfn;
> > +
> > +     gfn = sptep_to_gfn(huge_sptep);
> > +     access = sptep_to_access(huge_sptep);
> > +
> > +     /*
> > +      * Huge page splitting always uses direct shadow pages since we are
> > +      * directly mapping the huge page GFN region with smaller pages.
> > +      */
> > +     role = kvm_mmu_child_role(huge_sptep, true, access);
> > +     split_sp = kvm_mmu_find_direct_sp(kvm, gfn, role);
> > +
> > +     /*
> > +      * Opt not to split if the lower-level SP already exists. This requires
> > +      * more complex handling as the SP may already be partially filled in
> > +      * and may need extra pte_list_desc structs to update parent_ptes.
> > +      */
> > +     if (split_sp)
> > +             return NULL;
>
> This smells tricky..
>
> Firstly we're trying to look up any existing SPs that already shadow this
> huge page in a split way, with the access bits fetched from the shadow
> cache (so without the huge-page NX effect).  However, could those pages be
> mapped with permissions different from those of the current huge mapping?
>
> IIUC all of this is because we can't allocate pte_list_desc structs here and
> we want to make sure we never grow a pte list to a length greater than 1.
>
> But I also see that the pte_list check below...
>
> > +
> > +     swap(split_sp, *spp);
> > +     init_shadow_page(kvm, split_sp, slot, gfn, role);
> > +     trace_kvm_mmu_get_page(split_sp, true);
> > +
> > +     return split_sp;
> > +}
> > +
> > +static int kvm_mmu_split_huge_page(struct kvm *kvm,
> > +                                const struct kvm_memory_slot *slot,
> > +                                u64 *huge_sptep, struct kvm_mmu_page **spp,
> > +                                bool *flush)
> > +
> > +{
> > +     struct kvm_mmu_page *split_sp;
> > +     u64 huge_spte, split_spte;
> > +     int split_level, index;
> > +     unsigned int access;
> > +     u64 *split_sptep;
> > +     gfn_t split_gfn;
> > +
> > +     split_sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
> > +     if (!split_sp)
> > +             return -EOPNOTSUPP;
> > +
> > +     /*
> > +      * Since we did not allocate pte_list_desc structs for the split, we
> > +      * cannot add a new parent SPTE to parent_ptes. This should never happen
> > +      * in practice though since this is a fresh SP.
> > +      *
> > +      * Note, this makes it safe to pass NULL to __link_shadow_page() below.
> > +      */
> > +     if (WARN_ON_ONCE(split_sp->parent_ptes.val))
> > +             return -EINVAL;
> > +
> > +     huge_spte = READ_ONCE(*huge_sptep);
> > +
> > +     split_level = split_sp->role.level;
> > +     access = split_sp->role.access;
> > +
> > +     for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
> > +             split_sptep = &split_sp->spt[index];
> > +             split_gfn = kvm_mmu_page_get_gfn(split_sp, index);
> > +
> > +             BUG_ON(is_shadow_present_pte(*split_sptep));
> > +
> > +             /*
> > +              * Since we did not allocate pte_list_desc structs for the
> > +              * split, we can't add a new SPTE that maps this GFN.
> > +              * Skipping this SPTE means we're only partially mapping the
> > +              * huge page, which means we'll need to flush TLBs before
> > +              * dropping the MMU lock.
> > +              *
> > +              * Note, this makes it safe to pass NULL to __rmap_add() below.
> > +              */
> > +             if (gfn_to_rmap(split_gfn, split_level, slot)->val) {
> > +                     *flush = true;
> > +                     continue;
> > +             }
>
> ... here.
>
> IIUC this check alone should already cover all the cases, and it accurately
> captures the fact that we don't want to grow any rmap beyond a length of 1.
>
> > +
> > +             split_spte = make_huge_page_split_spte(
> > +                             huge_spte, split_level + 1, index, access);
> > +
> > +             mmu_spte_set(split_sptep, split_spte);
> > +             __rmap_add(kvm, NULL, slot, split_sptep, split_gfn, access);
>
> __rmap_add() with a NULL cache pointer is weird.. same as
> __link_shadow_page() below.
>
> I'll stop here for now I guess.. Have you considered having the rmap
> allocation ready from the start, rather than adding this intermediate step
> and only completing it later?  All of this looks hackish to me..  It's also
> possible that I missed something important; if so, please shoot.

I'd be happy to do it that way. The reasons I broke it up into
intermediate steps are:
 - At Google we only support up to and including this patch. We don't
handle the cases where the rmap or parent_ptes list needs to grow.
 - It seemed like a good way to break the support up into smaller
patches. But I think this backfired since the intermediate steps
introduce their own complexity, such as passing NULL into
__rmap_add() (a toy illustration of the partial-split bookkeeping
follows below).
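
For readers following along, here is a minimal standalone sketch (plain
user-space C, not KVM code; ENTRIES_PER_PAGE, rmap_len[] and need_flush are
invented for the illustration) of the partial-split bookkeeping described
above: entries whose rmap already has an entry are skipped, and skipping
anything forces a TLB flush before the MMU lock is dropped.

#include <stdbool.h>
#include <stdio.h>

#define ENTRIES_PER_PAGE 512

/*
 * Toy model: rmap_len[i] > 0 means the 4K gfn covered by entry i already
 * has an rmap entry, so installing a new SPTE there would require growing
 * the rmap, which can't be done without allocating a pte_list_desc.
 */
static int rmap_len[ENTRIES_PER_PAGE];

int main(void)
{
        bool need_flush = false;
        int installed = 0;
        int i;

        rmap_len[7] = 1;        /* pretend one gfn is already mapped */

        for (i = 0; i < ENTRIES_PER_PAGE; i++) {
                if (rmap_len[i]) {
                        /* Skip: the huge page ends up only partially split. */
                        need_flush = true;
                        continue;
                }
                installed++;    /* stand-in for mmu_spte_set() + __rmap_add() */
        }

        printf("installed %d of %d entries, need_flush=%d\n",
               installed, ENTRIES_PER_PAGE, need_flush);
        return 0;
}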

>
> Thanks,
>
> > +     }
> > +
> > +     /*
> > +      * Replace the huge spte with a pointer to the populated lower level
> > +      * page table. Since we are making this change without a TLB flush, vCPUs
> > +      * will see a mix of the split mappings and the original huge mapping,
> > +      * depending on what's currently in their TLB. This is fine from a
> > +      * correctness standpoint since the translation will either be identical
> > +      * or non-present. To account for non-present mappings, the TLB will be
> > +      * flushed prior to dropping the MMU lock.
> > +      */
> > +     __drop_large_spte(kvm, huge_sptep, false);
> > +     __link_shadow_page(NULL, huge_sptep, split_sp);
> > +
> > +     return 0;
> > +}
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 01/26] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
  2022-03-15  7:40     ` Peter Xu
@ 2022-03-22 18:16       ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-22 18:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Tue, Mar 15, 2022 at 12:40 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Mar 11, 2022 at 12:25:03AM +0000, David Matlack wrote:
> > Commit fb58a9c345f6 ("KVM: x86/mmu: Optimize MMU page cache lookup for
> > fully direct MMUs") skipped the unsync checks and write flood clearing
> > for fully direct MMUs. We can extend this further and skip the checks for
> > all direct shadow pages: direct shadow pages are never marked unsync and
> > never have a non-zero write-flooding count.
>
> Nit: IMHO it's better to spell out the exact functional change.  IIUC those
> are the direct-mapped SPs where the guest uses huge pages but the host uses
> only small pages for the shadowing?

Yes that's correct. I'll include that in the commit message in the next version.

>
> >
> > Checking sp->role.direct also generates better code than checking
> > direct_map because, due to register pressure, direct_map has to get
> > shoved onto the stack and then pulled back off.
> >
> > No functional change intended.
> >
> > Reviewed-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: David Matlack <dmatlack@google.com>
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 02/26] KVM: x86/mmu: Use a bool for direct
  2022-03-15  7:46     ` Peter Xu
@ 2022-03-22 18:21       ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-22 18:21 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Tue, Mar 15, 2022 at 12:46 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Mar 11, 2022 at 12:25:04AM +0000, David Matlack wrote:
> > The parameter "direct" can either be true or false, and all of the
> > callers pass in a bool variable or true/false literal, so just use the
> > type bool.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
>
> If we care about this.. how about converting another one altogether?
>
> TRACE_EVENT(kvm_hv_stimer_expiration,
>         TP_PROTO(int vcpu_id, int timer_index, int direct, int msg_send_result),
>         TP_ARGS(vcpu_id, timer_index, direct, msg_send_result),

My preference would be to keep this commit specific to uses of
"direct" that are related to shadow pages.

The parameter `direct` in trace_kvm_hv_stimer_expiration() looks like
it could be converted as well, but is a different concept altogether
despite having the same variable name.

>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 03/26] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-03-15  8:15     ` Peter Xu
@ 2022-03-22 18:30       ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-22 18:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Tue, Mar 15, 2022 at 1:15 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Mar 11, 2022 at 12:25:05AM +0000, David Matlack wrote:
> > Instead of computing the shadow page role from scratch for every new
> > page, we can derive most of the information from the parent shadow page.
> > This avoids redundant calculations and reduces the number of parameters
> > to kvm_mmu_get_page().
> >
> > Preemptively split out the role calculation to a separate function for
> > use in a following commit.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
>
> Looks right..
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> Two more comments/questions below.
>
> > +static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
> > +{
> > +     struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
> > +     union kvm_mmu_page_role role;
> > +
> > +     role = parent_sp->role;
> > +     role.level--;
> > +     role.access = access;
> > +     role.direct = direct;
> > +
> > +     /*
> > +      * If the guest has 4-byte PTEs then that means it's using 32-bit,
> > +      * 2-level, non-PAE paging. KVM shadows such guests using 4 PAE page
> > +      * directories, each mapping 1/4 of the guest's linear address space
> > +      * (1GiB). The shadow pages for those 4 page directories are
> > +      * pre-allocated and assigned a separate quadrant in their role.
> > +      *
> > +      * Since we are allocating a child shadow page and there are only 2
> > +      * levels, this must be a PG_LEVEL_4K shadow page. Here the quadrant
> > +      * will either be 0 or 1 because it maps 1/2 of the address space mapped
> > +      * by the guest's PG_LEVEL_4K page table (or 4MiB huge page) that it
> > +      * is shadowing. In this case, the quadrant can be derived by the index
> > +      * of the SPTE that points to the new child shadow page in the page
> > +      * directory (parent_sp). Specifically, every 2 SPTEs in parent_sp
> > +      * shadow one half of a guest's page table (or 4MiB huge page) so the
> > +      * quadrant is just the parity of the index of the SPTE.
> > +      */
> > +     if (role.has_4_byte_gpte) {
> > +             BUG_ON(role.level != PG_LEVEL_4K);
> > +             role.quadrant = (sptep - parent_sp->spt) % 2;
> > +     }
>
> This made me wonder whether role.quadrant can be dropped, because it seems
> it can be calculated on the fly from has_4_byte_gpte, the level, and the
> spte offset.  I could have missed something, though..

I think you're right that we could compute it on the fly. But it'd be
non-trivial to remove since it's currently used to ensure that sp->role
and sp->gfn uniquely identify each shadow page (e.g. when checking
for collisions in the mmu_page_hash).
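
To make the parity point concrete, a small standalone sketch (plain C with a
made-up toy_role struct, not the kernel's kvm_mmu_page_role) of deriving a
child role from its parent and computing the quadrant from the SPTE index:

#include <stdio.h>

/* Toy stand-in for kvm_mmu_page_role; only the fields discussed here. */
struct toy_role {
        int level;
        int has_4_byte_gpte;
        int quadrant;
};

/*
 * With 4-byte guest PTEs a guest page table has 1024 entries (4MiB),
 * while a shadow PAE page table has 512 entries (2MiB), so two adjacent
 * SPTEs in the shadow page directory shadow the two halves of one guest
 * page table, and the child's quadrant is the parity of the SPTE index.
 */
static struct toy_role child_role(struct toy_role parent, int sptep_index)
{
        struct toy_role role = parent;

        role.level--;
        if (role.has_4_byte_gpte)
                role.quadrant = sptep_index % 2;

        return role;
}

int main(void)
{
        struct toy_role pd = { .level = 2, .has_4_byte_gpte = 1 };
        int i;

        for (i = 0; i < 4; i++)
                printf("SPTE index %d -> child level %d, quadrant %d\n",
                       i, child_role(pd, i).level, child_role(pd, i).quadrant);
        return 0;
}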

>
> > +
> > +     return role;
> > +}
> > +
> > +static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
> > +                                              u64 *sptep, gfn_t gfn,
> > +                                              bool direct, u32 access)
> > +{
> > +     union kvm_mmu_page_role role;
> > +
> > +     role = kvm_mmu_child_role(sptep, direct, access);
> > +     return kvm_mmu_get_page(vcpu, gfn, role);
>
> Nit: it looks nicer to just drop the temp var?
>
>         return kvm_mmu_get_page(vcpu, gfn,
>                                 kvm_mmu_child_role(sptep, direct, access));

Yeah that's simpler. I just have an aversion to line wrapping :)

>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 05/26] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
  2022-03-15  8:52     ` Peter Xu
@ 2022-03-22 21:35       ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-22 21:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Tue, Mar 15, 2022 at 1:52 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Mar 11, 2022 at 12:25:07AM +0000, David Matlack wrote:
> > Rename 3 functions:
> >
> >   kvm_mmu_get_page()   -> kvm_mmu_get_shadow_page()
> >   kvm_mmu_alloc_page() -> kvm_mmu_alloc_shadow_page()
> >   kvm_mmu_free_page()  -> kvm_mmu_free_shadow_page()
> >
> > This change makes it clear that these functions deal with shadow pages
> > rather than struct pages. Prefer "shadow_page" over the shorter "sp"
> > since these are core routines.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
>
> Acked-by: Peter Xu <peterx@redhat.com>

What's the reason to use Acked-by for this patch but Reviewed-by for others?


>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 06/26] KVM: x86/mmu: Pass memslot to kvm_mmu_new_shadow_page()
  2022-03-15  9:03     ` Peter Xu
@ 2022-03-22 22:05       ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-22 22:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Tue, Mar 15, 2022 at 2:04 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Mar 11, 2022 at 12:25:08AM +0000, David Matlack wrote:
> > Passing the memslot to kvm_mmu_new_shadow_page() avoids the need for the
> > vCPU pointer when write-protecting indirect 4k shadow pages. This moves
> > us closer to being able to create new shadow pages during VM ioctls for
> > eager page splitting, where there is no vCPU pointer.
> >
> > This change does not negatively impact "Populate memory time" for ept=Y
> > or ept=N configurations since kvm_vcpu_gfn_to_memslot() caches the last
> > use slot. So even though we now look up the slot more often, it is a
> > very cheap check.
> >
> > Opportunistically move the code to write-protect GFNs shadowed by
> > PG_LEVEL_4K shadow pages into account_shadowed() to reduce indentation
> > and consolidate the code. This also eliminates a memslot lookup.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 23 ++++++++++++-----------
> >  1 file changed, 12 insertions(+), 11 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index b6fb50e32291..519910938478 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -793,16 +793,14 @@ void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn)
> >       update_gfn_disallow_lpage_count(slot, gfn, -1);
> >  }
> >
> > -static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
> > +static void account_shadowed(struct kvm *kvm,
> > +                          struct kvm_memory_slot *slot,
> > +                          struct kvm_mmu_page *sp)
> >  {
> > -     struct kvm_memslots *slots;
> > -     struct kvm_memory_slot *slot;
> >       gfn_t gfn;
> >
> >       kvm->arch.indirect_shadow_pages++;
> >       gfn = sp->gfn;
> > -     slots = kvm_memslots_for_spte_role(kvm, sp->role);
> > -     slot = __gfn_to_memslot(slots, gfn);
> >
> >       /* the non-leaf shadow pages are keeping readonly. */
> >       if (sp->role.level > PG_LEVEL_4K)
> > @@ -810,6 +808,9 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
> >                                                   KVM_PAGE_TRACK_WRITE);
> >
> >       kvm_mmu_gfn_disallow_lpage(slot, gfn);
> > +
> > +     if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
> > +             kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
>
> It's not immediately obvious in this diff, but when looking at the code
> yeah it looks right to just drop the 4K check..

Yeah it's a bit subtle but (as you probably noticed) account_shadowed()
returns early if the level is above PG_LEVEL_4K.
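
A tiny standalone sketch of that control flow (toy C, not the kernel
function; the printf calls stand in for the page-track and write-protect
helpers) showing why no extra PG_LEVEL_4K check is needed around the
write-protect once the early return is in place:

#include <stdio.h>

#define PG_LEVEL_4K 1

static void toy_account_shadowed(int level)
{
        if (level > PG_LEVEL_4K) {
                printf("level %d: write-track the gfn and return\n", level);
                return;
        }

        /* Only reachable for 4K shadow pages, so no extra level check. */
        printf("level %d: write-protect the gfn and flush TLBs\n", level);
}

int main(void)
{
        int level;

        for (level = 1; level <= 4; level++)
                toy_account_shadowed(level);
        return 0;
}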


>
> I also never understood why we write-track the levels above 1 but only
> wr-protect the last level.  It'd be great if someone has a quick answer..
> even though it's probably unrelated to the patch.
>
> The change looks all correct:
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 04/26] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  2022-03-15  8:50     ` Peter Xu
@ 2022-03-22 22:09       ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-22 22:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Tue, Mar 15, 2022 at 1:51 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Mar 11, 2022 at 12:25:06AM +0000, David Matlack wrote:
> > Decompose kvm_mmu_get_page() into separate helper functions to increase
> > readability and prepare for allocating shadow pages without a vcpu
> > pointer.
> >
> > Specifically, pull the guts of kvm_mmu_get_page() into 3 helper
> > functions:
> >
> > __kvm_mmu_find_shadow_page() -
> >   Walks the page hash checking for any existing mmu pages that match the
> >   given gfn and role. Does not attempt to synchronize the page if it is
> >   unsync.
> >
> > kvm_mmu_find_shadow_page() -
> >   Wraps __kvm_mmu_find_shadow_page() and handles syncing if necessary.
> >
> > kvm_mmu_new_shadow_page()
> >   Allocates and initializes an entirely new kvm_mmu_page. This currently
> >   requires a vcpu pointer for allocation and looking up the memslot but
> >   that will be removed in a future commit.
> >
> >   Note, kvm_mmu_new_shadow_page() is temporary and will be removed in a
> >   subsequent commit. The name uses "new" rather than the more typical
> >   "alloc" to avoid clashing with the existing kvm_mmu_alloc_page().
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
>
> Looks good to me, a few nitpicks and questions below.
>
> > ---
> >  arch/x86/kvm/mmu/mmu.c         | 132 ++++++++++++++++++++++++---------
> >  arch/x86/kvm/mmu/paging_tmpl.h |   5 +-
> >  arch/x86/kvm/mmu/spte.c        |   5 +-
> >  3 files changed, 101 insertions(+), 41 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 23c2004c6435..80dbfe07c87b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2027,16 +2027,25 @@ static void clear_sp_write_flooding_count(u64 *spte)
> >       __clear_sp_write_flooding_count(sptep_to_sp(spte));
> >  }
> >
> > -static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> > -                                          union kvm_mmu_page_role role)
> > +/*
> > + * Searches for an existing SP for the given gfn and role. Makes no attempt to
> > + * sync the SP if it is marked unsync.
> > + *
> > + * If creating an upper-level page table, zaps unsynced pages for the same
> > + * gfn and adds them to the invalid_list. It's the caller's responsibility
> > + * to call kvm_mmu_commit_zap_page() on invalid_list.
> > + */
> > +static struct kvm_mmu_page *__kvm_mmu_find_shadow_page(struct kvm *kvm,
> > +                                                    gfn_t gfn,
> > +                                                    union kvm_mmu_page_role role,
> > +                                                    struct list_head *invalid_list)
> >  {
> >       struct hlist_head *sp_list;
> >       struct kvm_mmu_page *sp;
> >       int collisions = 0;
> > -     LIST_HEAD(invalid_list);
> >
> > -     sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> > -     for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> > +     sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> > +     for_each_valid_sp(kvm, sp, sp_list) {
> >               if (sp->gfn != gfn) {
> >                       collisions++;
> >                       continue;
> > @@ -2053,60 +2062,109 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> >                        * upper-level page will be write-protected.
> >                        */
> >                       if (role.level > PG_LEVEL_4K && sp->unsync)
> > -                             kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
> > -                                                      &invalid_list);
> > +                             kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
> > +
> >                       continue;
> >               }
> >
> > -             /* unsync and write-flooding only apply to indirect SPs. */
> > -             if (sp->role.direct)
> > -                     goto trace_get_page;
> > +             /* Write-flooding is only tracked for indirect SPs. */
> > +             if (!sp->role.direct)
> > +                     __clear_sp_write_flooding_count(sp);
> >
> > -             if (sp->unsync) {
> > -                     /*
> > -                      * The page is good, but is stale.  kvm_sync_page does
> > -                      * get the latest guest state, but (unlike mmu_unsync_children)
> > -                      * it doesn't write-protect the page or mark it synchronized!
> > -                      * This way the validity of the mapping is ensured, but the
> > -                      * overhead of write protection is not incurred until the
> > -                      * guest invalidates the TLB mapping.  This allows multiple
> > -                      * SPs for a single gfn to be unsync.
> > -                      *
> > -                      * If the sync fails, the page is zapped.  If so, break
> > -                      * in order to rebuild it.
> > -                      */
> > -                     if (!kvm_sync_page(vcpu, sp, &invalid_list))
> > -                             break;
> > +             goto out;
> > +     }
> >
> > -                     WARN_ON(!list_empty(&invalid_list));
> > -                     kvm_flush_remote_tlbs(vcpu->kvm);
> > -             }
> > +     sp = NULL;
> >
> > -             __clear_sp_write_flooding_count(sp);
> > +out:
> > +     if (collisions > kvm->stat.max_mmu_page_hash_collisions)
> > +             kvm->stat.max_mmu_page_hash_collisions = collisions;
> > +
> > +     return sp;
> > +}
> >
> > -trace_get_page:
> > -             trace_kvm_mmu_get_page(sp, false);
> > +/*
> > + * Looks up an existing SP for the given gfn and role if one exists. The
> > + * returned SP is guaranteed to be synced.
> > + */
> > +static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
> > +                                                  gfn_t gfn,
> > +                                                  union kvm_mmu_page_role role)
> > +{
> > +     struct kvm_mmu_page *sp;
> > +     LIST_HEAD(invalid_list);
> > +
> > +     sp = __kvm_mmu_find_shadow_page(vcpu->kvm, gfn, role, &invalid_list);
> > +     if (!sp)
> >               goto out;
> > +
> > +     if (sp->unsync) {
> > +             /*
> > +              * The page is good, but is stale.  kvm_sync_page does
> > +              * get the latest guest state, but (unlike mmu_unsync_children)
> > +              * it doesn't write-protect the page or mark it synchronized!
> > +              * This way the validity of the mapping is ensured, but the
> > +              * overhead of write protection is not incurred until the
> > +              * guest invalidates the TLB mapping.  This allows multiple
> > +              * SPs for a single gfn to be unsync.
> > +              *
> > +              * If the sync fails, the page is zapped and added to the
> > +              * invalid_list.
> > +              */
> > +             if (!kvm_sync_page(vcpu, sp, &invalid_list)) {
> > +                     sp = NULL;
> > +                     goto out;
> > +             }
> > +
> > +             WARN_ON(!list_empty(&invalid_list));
>
> Not related to this patch since I think it's a pure code movement here,
> but I have a question on why invalid_list is guaranteed to be empty..
>
> I'm thinking of the case where, while looking up the page, we could have
> already called kvm_mmu_prepare_zap_page(); then when we reach here (the
> kvm_sync_page()==true case) invalid_list isn't touched by kvm_sync_page(),
> so it looks possible that it still contains some pages to be committed?

I also had this question when I was re-organizing this code but
haven't had the time to look into it yet.

>
> > +             kvm_flush_remote_tlbs(vcpu->kvm);
> >       }
> >
> > +out:
>
> I'm wondering whether this "out" can be dropped.. with something like:
>
>         sp = __kvm_mmu_find_shadow_page(...);
>
>         if (sp && sp->unsync) {
>                 if (kvm_sync_page(vcpu, sp, &invalid_list)) {
>                         ..
>                 } else {
>                         sp = NULL;
>                 }
>         }

Sure will do. I used the goto to reduce the amount of indentation, but
I can definitely get rid of it.

>
> [...]
>
> > +static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> > +                                          union kvm_mmu_page_role role)
> > +{
> > +     struct kvm_mmu_page *sp;
> > +     bool created = false;
> > +
> > +     sp = kvm_mmu_find_shadow_page(vcpu, gfn, role);
> > +     if (sp)
> > +             goto out;
> > +
> > +     created = true;
> > +     sp = kvm_mmu_new_shadow_page(vcpu, gfn, role);
> > +
> > +out:
> > +     trace_kvm_mmu_get_page(sp, created);
> >       return sp;
>
> Same here, wondering whether we could drop the "out" by:
>
>         sp = kvm_mmu_find_shadow_page(vcpu, gfn, role);
>         if (!sp) {
>                 created = true;
>                 sp = kvm_mmu_new_shadow_page(vcpu, gfn, role);
>         }
>
>         trace_kvm_mmu_get_page(sp, created);
>         return sp;

Ditto.
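
For what it's worth, a tiny standalone toy (plain C; toy_find(), toy_alloc()
and toy_trace() are invented stubs, not KVM code) of the goto-free
find-or-create shape being suggested here:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_sp { int id; };

/* Pretend lookup: returns an existing object or NULL. */
static struct toy_sp *toy_find(bool exists)
{
        return exists ? calloc(1, sizeof(struct toy_sp)) : NULL;
}

static struct toy_sp *toy_alloc(void)
{
        return calloc(1, sizeof(struct toy_sp));
}

static void toy_trace(struct toy_sp *sp, bool created)
{
        printf("sp=%p created=%d\n", (void *)sp, created);
}

/* Same find-or-create shape as suggested above: no "out" label needed. */
static struct toy_sp *toy_get(bool exists)
{
        struct toy_sp *sp = toy_find(exists);
        bool created = false;

        if (!sp) {
                created = true;
                sp = toy_alloc();
        }

        toy_trace(sp, created);
        return sp;
}

int main(void)
{
        free(toy_get(true));
        free(toy_get(false));
        return 0;
}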

>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 08/26] KVM: x86/mmu: Link spt to sp during allocation
  2022-03-15 10:04     ` Peter Xu
@ 2022-03-22 22:30       ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-22 22:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Tue, Mar 15, 2022 at 3:04 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Mar 11, 2022 at 12:25:10AM +0000, David Matlack wrote:
> > Link the shadow page table to the sp (via set_page_private()) during
> > allocation rather than initialization. This is a more logical place to
> > do it because allocation time is also where we do the reverse link
> > (setting sp->spt).
> >
> > This creates one extra call to set_page_private(), but having multiple
> > calls to set_page_private() is unavoidable anyway. We either do
> > set_page_private() during allocation, which requires 1 per allocation
> > function, or we do it during initialization, which requires 1 per
> > initialization function.
> >
> > No functional change intended.
> >
> > Suggested-by: Ben Gardon <bgardon@google.com>
> > Signed-off-by: David Matlack <dmatlack@google.com>
>
> Ah I should have read one more patch before commenting on the previous one..
>
> Personally I (a little bit) prefer the other way around, since with this in
> mind ideally we would also keep the used-MMU-pages accounting in the
> allocation helper:
>
>   kvm_mod_used_mmu_pages(vcpu->kvm, 1);

The TDP MMU doesn't call kvm_mod_used_mmu_pages() when it allocates
SPs. So that would prevent sharing kvm_mmu_alloc_shadow_page() with
the TDP MMU in patch 11.

Ben pointed out that we link the page to sp->spt during
allocation, so it makes sense to do the reverse link at the same time.
Also, the set_page_private() call is common between the TDP MMU and
shadow MMU, so it makes sense to do it in the SP allocation code since
the allocation functions are shared between the two MMUs.
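
For reference, here is a rough sketch of what the shared allocation
helper ends up looking like with this approach (names follow the
series; the indirect-only setup is omitted, so treat this as a sketch
rather than the exact patch):

static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
                                                      bool direct)
{
        struct kvm_mmu_page *sp;

        sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
        sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);

        /*
         * The forward link (sp->spt, above) and the reverse link (the page's
         * private data pointing back at sp) are both set up at allocation
         * time, for the TDP MMU and the shadow MMU alike.
         */
        set_page_private(virt_to_page(sp->spt), (unsigned long)sp);

        return sp;
}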

>
> But then we dup yet another line to all elsewheres as long as sp allocated.
>
> IOW, in my opinion the helpers should service 1st on code deduplications
> rather than else.  No strong opinion though..

>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 10/26] KVM: x86/mmu: Use common code to free kvm_mmu_page structs
  2022-03-15 10:22     ` Peter Xu
@ 2022-03-22 22:33       ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-22 22:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Tue, Mar 15, 2022 at 3:23 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Mar 11, 2022 at 12:25:12AM +0000, David Matlack wrote:
> >  static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
> >  {
> > -     free_page((unsigned long)sp->spt);
> > -     kmem_cache_free(mmu_page_header_cache, sp);
> > +     kvm_mmu_free_shadow_page(sp);
> >  }
>
> Perhaps tdp_mmu_free_sp() can be dropped altogether with this?

It certainly can but I prefer to keep it for 2 reasons:
 - Smaller diff.
 - It mirrors tdp_mmu_alloc_sp(), which I prefer to keep as well but
I'll explain that in the next patch.

>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 11/26] KVM: x86/mmu: Use common code to allocate kvm_mmu_page structs from vCPU caches
  2022-03-15 10:27     ` Peter Xu
@ 2022-03-22 22:35       ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-22 22:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Tue, Mar 15, 2022 at 3:27 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Mar 11, 2022 at 12:25:13AM +0000, David Matlack wrote:
> >  static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> >  {
> > -     struct kvm_mmu_page *sp;
> > -
> > -     sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > -     sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> > -     set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> > -
> > -     return sp;
> > +     return kvm_mmu_alloc_shadow_page(vcpu, true);
> >  }
>
> Similarly I had a feeling we could drop tdp_mmu_alloc_sp() too.. anyway:

Certainly, but I think it simplifies the TDP MMU code to keep it. It abstracts
away the implementation detail that a TDP MMU shadow page is allocated
the same way as a shadow MMU shadow page with direct=true.
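
To illustrate, after patches 10 and 11 the TDP MMU is left with a thin,
mirrored pair of wrappers (both bodies are exactly what the diffs in
this patch and the previous one show):

static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
{
        /* A TDP MMU shadow page is just a "direct" shadow page. */
        return kvm_mmu_alloc_shadow_page(vcpu, true);
}

static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
{
        kvm_mmu_free_shadow_page(sp);
}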


>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 16/26] KVM: x86/mmu: Cache the access bits of shadowed translations
  2022-03-16  8:32     ` Peter Xu
@ 2022-03-22 22:51       ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-22 22:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Wed, Mar 16, 2022 at 1:32 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Mar 11, 2022 at 12:25:18AM +0000, David Matlack wrote:
> > In order to split a huge page we need to know what access bits to assign
> > to the role of the new child page table. This can't be easily derived
> > from the huge page SPTE itself since KVM applies its own access policies
> > on top, such as for HugePage NX.
> >
> > We could walk the guest page tables to determine the correct access
> > bits, but that is difficult to plumb outside of a vCPU fault context.
> > Instead, we can store the original access bits for each leaf SPTE
> > alongside the GFN in the gfns array. The access bits only take up 3
> > bits, which leaves 61 bits left over for gfns, which is more than
> > enough. So this change does not require any additional memory.
>
> I have a pure question on why eager page split needs to worry on hugepage
> nx..
>
> IIUC that was about forbidden huge page being mapped as executable.  So
> afaiu the only missing bit that could happen if we copy over the huge page
> ptes is the executable bit.
>
> But then?  I think we could get a page fault on fault->exec==true on the
> split small page (because when we copy over it's cleared, even though the
> page can actually be executable), but it should be well resolved right
> after that small page fault.
>
> The thing is IIUC this is a very rare case, IOW, it should mostly not
> happen in 99% of the use case?  And there's a slight penalty when it
> happens, but only perf-wise.
>
> As I'm not really fluent with the code base, perhaps I missed something?

You're right that we could get away with not knowing the shadowed
access permissions to assign to the child SPTEs when splitting a huge
SPTE. We could just copy the huge SPTE access permissions and then let
the execute bit be repaired on fault (although those faults would be a
performance cost).

But the access permissions are also needed to lookup an existing
shadow page (or create a new shadow page) to use to split the huge
page. For example, let's say we are going to split a huge page that
does not have execute permissions enabled. That could be because NX
HugePages are enabled or because we are shadowing a guest translation
that does not allow execution (or both). We wouldn't want to propagate
the no-execute permission into the child SP role.access if the
shadowed translation really does allow execution (and vice versa).
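
Concretely, the cached bits are what feed into the child SP role when
splitting, roughly like this (a sketch only: the helper name below is
an illustrative placeholder for "read the shadowed access bits cached
alongside the gfn", not necessarily what the final patch uses):

/* Sketch: build the role for the child SP used to split a huge page. */
static union kvm_mmu_page_role child_role_for_split(struct kvm_mmu_page *huge_sp,
                                                    int index)
{
        union kvm_mmu_page_role role = huge_sp->role;

        role.level--;
        /* The guest-allowed permissions, not the huge SPTE's effective ones. */
        role.access = kvm_mmu_page_get_access(huge_sp, index);

        return role;
}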

>
> >
> > In order to keep the access bit cache in sync with the guest, we have to
> > extend FNAME(sync_page) to also update the access bits.
>
> Besides sync_page(), I also see that in mmu_set_spte() there's a path that
> we will skip the rmap_add() if existed:
>
>         if (!was_rmapped) {
>                 WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
>                 kvm_update_page_stats(vcpu->kvm, level, 1);
>                 rmap_add(vcpu, slot, sptep, gfn);
>         }
>
> I didn't check, but it's not obvious whether the sync_page() change here
> will cover all of the cases, hence raise this up too.

Good catch. I will need to dig into this more to confirm but I think
you might be right.

>
> >
> > Now that the gfns array caches more information than just GFNs, rename
> > it to shadowed_translation.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  2 +-
> >  arch/x86/kvm/mmu/mmu.c          | 32 +++++++++++++++++++-------------
> >  arch/x86/kvm/mmu/mmu_internal.h | 15 +++++++++++++--
> >  arch/x86/kvm/mmu/paging_tmpl.h  |  7 +++++--
> >  4 files changed, 38 insertions(+), 18 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index f72e80178ffc..0f5a36772bdc 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -694,7 +694,7 @@ struct kvm_vcpu_arch {
> >
> >       struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> >       struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > -     struct kvm_mmu_memory_cache mmu_gfn_array_cache;
> > +     struct kvm_mmu_memory_cache mmu_shadowed_translation_cache;
>
> I'd called it with a shorter name.. :) maybe mmu_shadowed_info_cache?  No
> strong opinion.
>
> >       struct kvm_mmu_memory_cache mmu_page_header_cache;
> >
> >       /*
>
> [...]
>
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index b6e22ba9c654..c5b8ee625df7 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -32,6 +32,11 @@ extern bool dbg;
> >
> >  typedef u64 __rcu *tdp_ptep_t;
> >
> > +struct shadowed_translation_entry {
> > +     u64 access:3;
> > +     u64 gfn:56;
>
> Why 56?

I was going for the theoretical maximum number of bits for a GFN. But
that would be 64 - 12 = 52... so I'm not sure what I was thinking
here.

I'll switch it to 52 and add a comment.
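
i.e. something along these lines for v3:

struct shadowed_translation_entry {
        /* Guest-allowed access bits (ACC_*) for the shadowed PTE. */
        u64 access:3;
        /*
         * 52 bits is the architectural maximum for a GFN: a 64-bit physical
         * address minus the 12-bit page offset.
         */
        u64 gfn:52;
};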

>
> > +};
>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 17/26] KVM: x86/mmu: Pass access information to make_huge_page_split_spte()
  2022-03-16  8:44     ` Peter Xu
@ 2022-03-22 23:08       ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-22 23:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Wed, Mar 16, 2022 at 1:44 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Mar 11, 2022 at 12:25:19AM +0000, David Matlack wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 85b7bc333302..541b145b2df2 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1430,7 +1430,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> >        * not been linked in yet and thus is not reachable from any other CPU.
> >        */
> >       for (i = 0; i < PT64_ENT_PER_PAGE; i++)
> > -             sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i);
> > +             sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i, ACC_ALL);
>
> Pure question: is it possible that huge_spte is RO while we passed in
> ACC_ALL here (which has the write bit set)?

Yes that is possible, but only if KVM maps the page read-only due to
host-side policies (e.g. RO memslot or RO VMA). "access" here is the
guest-allowed access permissions, similar to the pte_access parameter
to mmu_set_spte(). e.g. notice how the TDP MMU passes ACC_ALL to
make_spte().

> Would it be better if we make it a "bool exec" to be clearer?

But all that being said, the ACC_ALL stuff is confusing for exactly
the reason you pointed out, so it doesn't make sense to duplicate it
further. I agree it would make more sense to pass in bool exec.
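
i.e. the call site would become something like this (sketch; how exec
gets derived and plumbed through is still to be worked out):

        for (i = 0; i < PT64_ENT_PER_PAGE; i++)
                sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i, exec);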

>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 18/26] KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU
  2022-03-16  8:49     ` Peter Xu
@ 2022-03-22 23:11       ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-22 23:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Wed, Mar 16, 2022 at 1:49 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Mar 11, 2022 at 12:25:20AM +0000, David Matlack wrote:
> > Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU (i.e.
> > in the rmap). This is fine for now since KVM never creates intermediate huge
> > pages during dirty logging, i.e. a 1GiB page is never partially split to
> > a 2MiB page.
> >
> > However, this will stop being true once the shadow MMU participates in
> > eager page splitting, which can in fact leave behind partially split
> > huge pages. In preparation for that change, change the shadow MMU to
> > iterate over all necessary levels when zapping collapsible SPTEs.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++++++++-------
> >  1 file changed, 19 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 89a7a8d7a632..2032be3edd71 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6142,18 +6142,30 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> >       return need_tlb_flush;
> >  }
> >
> > +static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
> > +                                        const struct kvm_memory_slot *slot)
> > +{
> > +     bool flush;
> > +
> > +     /*
> > +      * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
> > +      * pages that are already mapped at the maximum possible level.
> > +      */
> > +     flush = slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
> > +                               PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1,
> > +                               true);
> > +
> > +     if (flush)
> > +             kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > +
> > +}
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> IMHO it looks cleaner to write it in the old way (drop the flush var).
> Maybe even unwrap the helper?

Unwrapping the helper and dropping the flush var makes the if
condition quite long and hard to read. But I think a compromise would be
to have kvm_rmap_zap_collapsible_sptes() return flush and leave the
flush call in kvm_mmu_zap_collapsible_sptes().
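
i.e. something like this (sketch based on the diff above):

static bool kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
                                           const struct kvm_memory_slot *slot)
{
        /*
         * Use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap pages
         * that are already mapped at the maximum possible level.
         */
        return slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
                                 PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true);
}

and then kvm_mmu_zap_collapsible_sptes() does:

        if (kvm_rmap_zap_collapsible_sptes(kvm, slot))
                kvm_arch_flush_remote_tlbs_memslot(kvm, slot);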

>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 21/26] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-03-19  5:27     ` Anup Patel
@ 2022-03-22 23:13       ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-22 23:13 UTC (permalink / raw)
  To: Anup Patel
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
	Andrew Jones, Ben Gardon, Peter Xu, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Fri, Mar 18, 2022 at 10:27 PM Anup Patel <anup@brainfault.org> wrote:
>
> On Fri, Mar 11, 2022 at 5:56 AM David Matlack <dmatlack@google.com> wrote:
> >
> > Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
> > declaration time rather than being fixed for all declarations. This will
> > be used in a follow-up commit to declare a cache in x86 with a capacity
> > of 512+ objects without having to increase the capacity of all caches in
> > KVM.
> >
> > This change requires each cache now specify its capacity at runtime,
> > since the cache struct itself no longer has a fixed capacity known at
> > compile time. To protect against someone accidentally defining a
> > kvm_mmu_memory_cache struct directly (without the extra storage), this
> > commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().
> >
> > This change, unfortunately, adds some grottiness to
> > kvm_phys_addr_ioremap() in arm64, which uses a function-local (i.e.
> > stack-allocated) kvm_mmu_memory_cache struct. Since C does not allow
> > anonymous structs in functions, the new wrapper struct that contains
> > kvm_mmu_memory_cache and the objects pointer array, must be named, which
> > means dealing with an outer and inner struct. The outer struct can't be
> > dropped since then there would be no guarantee the kvm_mmu_memory_cache
> > struct and objects array would be laid out consecutively on the stack.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_host.h |  2 +-
> >  arch/arm64/kvm/arm.c              |  1 +
> >  arch/arm64/kvm/mmu.c              | 13 +++++++++----
> >  arch/mips/include/asm/kvm_host.h  |  2 +-
> >  arch/mips/kvm/mips.c              |  2 ++
> >  arch/riscv/include/asm/kvm_host.h |  2 +-
> >  arch/riscv/kvm/vcpu.c             |  1 +
> >  arch/x86/include/asm/kvm_host.h   |  8 ++++----
> >  arch/x86/kvm/mmu/mmu.c            |  9 +++++++++
> >  include/linux/kvm_types.h         | 19 +++++++++++++++++--
> >  virt/kvm/kvm_main.c               | 10 +++++++++-
> >  11 files changed, 55 insertions(+), 14 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 5bc01e62c08a..1369415290dd 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -357,7 +357,7 @@ struct kvm_vcpu_arch {
> >         bool pause;
> >
> >         /* Cache some mmu pages needed inside spinlock regions */
> > -       struct kvm_mmu_memory_cache mmu_page_cache;
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
> >
> >         /* Target CPU and feature flags */
> >         int target;
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index ecc5958e27fe..5e38385be0ef 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -319,6 +319,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >         vcpu->arch.target = -1;
> >         bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
> >
> > +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> >
> >         /* Set up the timer */
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index bc2aba953299..940089ba65ad 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -765,7 +765,12 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> >  {
> >         phys_addr_t addr;
> >         int ret = 0;
> > -       struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> > +               .cache = {
> > +                       .gfp_zero = __GFP_ZERO,
> > +                       .capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> > +               },
> > +       };
> >         struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> >         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> >                                      KVM_PGTABLE_PROT_R |
> > @@ -778,14 +783,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> >         guest_ipa &= PAGE_MASK;
> >
> >         for (addr = guest_ipa; addr < guest_ipa + size; addr += PAGE_SIZE) {
> > -               ret = kvm_mmu_topup_memory_cache(&cache,
> > +               ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
> >                                                  kvm_mmu_cache_min_pages(kvm));
> >                 if (ret)
> >                         break;
> >
> >                 spin_lock(&kvm->mmu_lock);
> >                 ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
> > -                                            &cache);
> > +                                            &page_cache.cache);
> >                 spin_unlock(&kvm->mmu_lock);
> >                 if (ret)
> >                         break;
> > @@ -793,7 +798,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> >                 pa += PAGE_SIZE;
> >         }
> >
> > -       kvm_mmu_free_memory_cache(&cache);
> > +       kvm_mmu_free_memory_cache(&page_cache.cache);
> >         return ret;
> >  }
> >
> > diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
> > index 717716cc51c5..935511d7fc3a 100644
> > --- a/arch/mips/include/asm/kvm_host.h
> > +++ b/arch/mips/include/asm/kvm_host.h
> > @@ -347,7 +347,7 @@ struct kvm_vcpu_arch {
> >         unsigned long pending_exceptions_clr;
> >
> >         /* Cache some mmu pages needed inside spinlock regions */
> > -       struct kvm_mmu_memory_cache mmu_page_cache;
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
> >
> >         /* vcpu's vzguestid is different on each host cpu in an smp system */
> >         u32 vzguestid[NR_CPUS];
> > diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> > index a25e0b73ee70..45c7179144dc 100644
> > --- a/arch/mips/kvm/mips.c
> > +++ b/arch/mips/kvm/mips.c
> > @@ -387,6 +387,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >         if (err)
> >                 goto out_free_gebase;
> >
> > +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> > +
> >         return 0;
> >
> >  out_free_gebase:
> > diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> > index 99ef6a120617..5bd4902ebda3 100644
> > --- a/arch/riscv/include/asm/kvm_host.h
> > +++ b/arch/riscv/include/asm/kvm_host.h
> > @@ -186,7 +186,7 @@ struct kvm_vcpu_arch {
> >         struct kvm_sbi_context sbi_context;
> >
> >         /* Cache pages needed to program page tables with spinlock held */
> > -       struct kvm_mmu_memory_cache mmu_page_cache;
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
> >
> >         /* VCPU power-off state */
> >         bool power_off;
> > diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> > index 624166004e36..6a5f5aa45bac 100644
> > --- a/arch/riscv/kvm/vcpu.c
> > +++ b/arch/riscv/kvm/vcpu.c
> > @@ -94,6 +94,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >
> >         /* Mark this VCPU never ran */
> >         vcpu->arch.ran_atleast_once = false;
> > +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>
> There another function stage2_ioremap() which also needs to change
> because this function creates a kvm_mmu_memory_cache on stack.

Thanks for catching that. Will fix in v3.
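
For the record, the fix will mirror the arm64 hunk above, i.e. roughly
(sketch, not tested):

        /* stage2_ioremap() in arch/riscv/kvm/mmu.c */
        DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
                .cache = {
                        .gfp_zero = __GFP_ZERO,
                        .capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
                },
        };

with every use of the old on-stack cache switched to &page_cache.cache.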

>
> Regards,
> Anup
>
> >
> >         /* Setup ISA features available to VCPU */
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 0f5a36772bdc..544dde11963b 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -692,10 +692,10 @@ struct kvm_vcpu_arch {
> >          */
> >         struct kvm_mmu *walk_mmu;
> >
> > -       struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> > -       struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > -       struct kvm_mmu_memory_cache mmu_shadowed_translation_cache;
> > -       struct kvm_mmu_memory_cache mmu_page_header_cache;
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_pte_list_desc_cache);
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadow_page_cache);
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadowed_translation_cache);
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_header_cache);
> >
> >         /*
> >          * QEMU userspace and the guest each have their own FPU state.
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index dd56b5b9624f..24e7e053e05b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5817,12 +5817,21 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> >  {
> >         int ret;
> >
> > +       vcpu->arch.mmu_pte_list_desc_cache.capacity =
> > +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> >         vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> >
> > +       vcpu->arch.mmu_page_header_cache.capacity =
> > +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> >         vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> >
> > +       vcpu->arch.mmu_shadowed_translation_cache.capacity =
> > +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> > +
> > +       vcpu->arch.mmu_shadow_page_cache.capacity =
> > +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> >
> >         vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index ac1ebb37a0ff..579cf39986ec 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -83,14 +83,29 @@ struct gfn_to_pfn_cache {
> >   * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
> >   * holding MMU locks.  Note, these caches act more like prefetch buffers than
> >   * classical caches, i.e. objects are not returned to the cache on being freed.
> > + *
> > + * The storage for the cache object pointers is laid out after the struct, to
> > + * allow different declarations to choose different capacities. The capacity
> > + * field defines the number of object pointers available after the struct.
> >   */
> >  struct kvm_mmu_memory_cache {
> >         int nobjs;
> > +       int capacity;
> >         gfp_t gfp_zero;
> >         struct kmem_cache *kmem_cache;
> > -       void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
> > +       void *objects[];
> >  };
> > -#endif
> > +
> > +#define __DEFINE_KVM_MMU_MEMORY_CACHE(_name, _capacity)                \
> > +       struct {                                                \
> > +               struct kvm_mmu_memory_cache _name;              \
> > +               void *_name##_objects[_capacity];               \
> > +       }
> > +
> > +#define DEFINE_KVM_MMU_MEMORY_CACHE(_name) \
> > +       __DEFINE_KVM_MMU_MEMORY_CACHE(_name, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE)
> > +
> > +#endif /* KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE */
> >
> >  #define HALT_POLL_HIST_COUNT                   32
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 9581a24c3d17..1d849ba9529f 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -371,9 +371,17 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> >  {
> >         void *obj;
> >
> > +       /*
> > +        * The capacity field must be initialized since the storage for the
> > +        * objects pointer array is laid out after the kvm_mmu_memory_cache
> > +        * struct and not known at compile time.
> > +        */
> > +       if (WARN_ON(mc->capacity == 0))
> > +               return -EINVAL;
> > +
> >         if (mc->nobjs >= min)
> >                 return 0;
> > -       while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
> > +       while (mc->nobjs < mc->capacity) {
> >                 obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
> >                 if (!obj)
> >                         return mc->nobjs >= min ? 0 : -ENOMEM;
> > --
> > 2.35.1.723.g4982287a31-goog
> >

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 21/26] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
@ 2022-03-22 23:13       ` David Matlack
  0 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-22 23:13 UTC (permalink / raw)
  To: Anup Patel
  Cc: Albert Ou, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Marc Zyngier, Huacai Chen,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Fri, Mar 18, 2022 at 10:27 PM Anup Patel <anup@brainfault.org> wrote:
>
> On Fri, Mar 11, 2022 at 5:56 AM David Matlack <dmatlack@google.com> wrote:
> >
> > Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
> > declaration time rather than being fixed for all declarations. This will
> > be used in a follow-up commit to declare an cache in x86 with a capacity
> > of 512+ objects without having to increase the capacity of all caches in
> > KVM.
> >
> > This change requires each cache now specify its capacity at runtime,
> > since the cache struct itself no longer has a fixed capacity known at
> > compile time. To protect against someone accidentally defining a
> > kvm_mmu_memory_cache struct directly (without the extra storage), this
> > commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().
> >
> > This change, unfortunately, adds some grottiness to
> > kvm_phys_addr_ioremap() in arm64, which uses a function-local (i.e.
> > stack-allocated) kvm_mmu_memory_cache struct. Since C does not allow
> > anonymous structs in functions, the new wrapper struct that contains
> > kvm_mmu_memory_cache and the objects pointer array, must be named, which
> > means dealing with an outer and inner struct. The outer struct can't be
> > dropped since then there would be no guarantee the kvm_mmu_memory_cache
> > struct and objects array would be laid out consecutively on the stack.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_host.h |  2 +-
> >  arch/arm64/kvm/arm.c              |  1 +
> >  arch/arm64/kvm/mmu.c              | 13 +++++++++----
> >  arch/mips/include/asm/kvm_host.h  |  2 +-
> >  arch/mips/kvm/mips.c              |  2 ++
> >  arch/riscv/include/asm/kvm_host.h |  2 +-
> >  arch/riscv/kvm/vcpu.c             |  1 +
> >  arch/x86/include/asm/kvm_host.h   |  8 ++++----
> >  arch/x86/kvm/mmu/mmu.c            |  9 +++++++++
> >  include/linux/kvm_types.h         | 19 +++++++++++++++++--
> >  virt/kvm/kvm_main.c               | 10 +++++++++-
> >  11 files changed, 55 insertions(+), 14 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 5bc01e62c08a..1369415290dd 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -357,7 +357,7 @@ struct kvm_vcpu_arch {
> >         bool pause;
> >
> >         /* Cache some mmu pages needed inside spinlock regions */
> > -       struct kvm_mmu_memory_cache mmu_page_cache;
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
> >
> >         /* Target CPU and feature flags */
> >         int target;
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index ecc5958e27fe..5e38385be0ef 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -319,6 +319,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >         vcpu->arch.target = -1;
> >         bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
> >
> > +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> >
> >         /* Set up the timer */
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index bc2aba953299..940089ba65ad 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -765,7 +765,12 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> >  {
> >         phys_addr_t addr;
> >         int ret = 0;
> > -       struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> > +               .cache = {
> > +                       .gfp_zero = __GFP_ZERO,
> > +                       .capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> > +               },
> > +       };
> >         struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> >         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> >                                      KVM_PGTABLE_PROT_R |
> > @@ -778,14 +783,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> >         guest_ipa &= PAGE_MASK;
> >
> >         for (addr = guest_ipa; addr < guest_ipa + size; addr += PAGE_SIZE) {
> > -               ret = kvm_mmu_topup_memory_cache(&cache,
> > +               ret = kvm_mmu_topup_memory_cache(&page_cache.cache,
> >                                                  kvm_mmu_cache_min_pages(kvm));
> >                 if (ret)
> >                         break;
> >
> >                 spin_lock(&kvm->mmu_lock);
> >                 ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
> > -                                            &cache);
> > +                                            &page_cache.cache);
> >                 spin_unlock(&kvm->mmu_lock);
> >                 if (ret)
> >                         break;
> > @@ -793,7 +798,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> >                 pa += PAGE_SIZE;
> >         }
> >
> > -       kvm_mmu_free_memory_cache(&cache);
> > +       kvm_mmu_free_memory_cache(&page_cache.cache);
> >         return ret;
> >  }
> >
> > diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
> > index 717716cc51c5..935511d7fc3a 100644
> > --- a/arch/mips/include/asm/kvm_host.h
> > +++ b/arch/mips/include/asm/kvm_host.h
> > @@ -347,7 +347,7 @@ struct kvm_vcpu_arch {
> >         unsigned long pending_exceptions_clr;
> >
> >         /* Cache some mmu pages needed inside spinlock regions */
> > -       struct kvm_mmu_memory_cache mmu_page_cache;
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
> >
> >         /* vcpu's vzguestid is different on each host cpu in an smp system */
> >         u32 vzguestid[NR_CPUS];
> > diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> > index a25e0b73ee70..45c7179144dc 100644
> > --- a/arch/mips/kvm/mips.c
> > +++ b/arch/mips/kvm/mips.c
> > @@ -387,6 +387,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >         if (err)
> >                 goto out_free_gebase;
> >
> > +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> > +
> >         return 0;
> >
> >  out_free_gebase:
> > diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> > index 99ef6a120617..5bd4902ebda3 100644
> > --- a/arch/riscv/include/asm/kvm_host.h
> > +++ b/arch/riscv/include/asm/kvm_host.h
> > @@ -186,7 +186,7 @@ struct kvm_vcpu_arch {
> >         struct kvm_sbi_context sbi_context;
> >
> >         /* Cache pages needed to program page tables with spinlock held */
> > -       struct kvm_mmu_memory_cache mmu_page_cache;
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
> >
> >         /* VCPU power-off state */
> >         bool power_off;
> > diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> > index 624166004e36..6a5f5aa45bac 100644
> > --- a/arch/riscv/kvm/vcpu.c
> > +++ b/arch/riscv/kvm/vcpu.c
> > @@ -94,6 +94,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >
> >         /* Mark this VCPU never ran */
> >         vcpu->arch.ran_atleast_once = false;
> > +       vcpu->arch.mmu_page_cache.capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>
> There another function stage2_ioremap() which also needs to change
> because this function creates a kvm_mmu_memory_cache on stack.

Thanks for catching that. Will fix in v3.

>
> Regards,
> Anup
>
> >
> >         /* Setup ISA features available to VCPU */
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 0f5a36772bdc..544dde11963b 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -692,10 +692,10 @@ struct kvm_vcpu_arch {
> >          */
> >         struct kvm_mmu *walk_mmu;
> >
> > -       struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> > -       struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > -       struct kvm_mmu_memory_cache mmu_shadowed_translation_cache;
> > -       struct kvm_mmu_memory_cache mmu_page_header_cache;
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_pte_list_desc_cache);
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadow_page_cache);
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadowed_translation_cache);
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_header_cache);
> >
> >         /*
> >          * QEMU userspace and the guest each have their own FPU state.
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index dd56b5b9624f..24e7e053e05b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5817,12 +5817,21 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> >  {
> >         int ret;
> >
> > +       vcpu->arch.mmu_pte_list_desc_cache.capacity =
> > +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> >         vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> >
> > +       vcpu->arch.mmu_page_header_cache.capacity =
> > +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> >         vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> >
> > +       vcpu->arch.mmu_shadowed_translation_cache.capacity =
> > +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> > +
> > +       vcpu->arch.mmu_shadow_page_cache.capacity =
> > +               KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
> >         vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> >
> >         vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index ac1ebb37a0ff..579cf39986ec 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -83,14 +83,29 @@ struct gfn_to_pfn_cache {
> >   * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
> >   * holding MMU locks.  Note, these caches act more like prefetch buffers than
> >   * classical caches, i.e. objects are not returned to the cache on being freed.
> > + *
> > + * The storage for the cache object pointers is laid out after the struct, to
> > + * allow different declarations to choose different capacities. The capacity
> > + * field defines the number of object pointers available after the struct.
> >   */
> >  struct kvm_mmu_memory_cache {
> >         int nobjs;
> > +       int capacity;
> >         gfp_t gfp_zero;
> >         struct kmem_cache *kmem_cache;
> > -       void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
> > +       void *objects[];
> >  };
> > -#endif
> > +
> > +#define __DEFINE_KVM_MMU_MEMORY_CACHE(_name, _capacity)                \
> > +       struct {                                                \
> > +               struct kvm_mmu_memory_cache _name;              \
> > +               void *_name##_objects[_capacity];               \
> > +       }
> > +
> > +#define DEFINE_KVM_MMU_MEMORY_CACHE(_name) \
> > +       __DEFINE_KVM_MMU_MEMORY_CACHE(_name, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE)
> > +
> > +#endif /* KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE */
> >
> >  #define HALT_POLL_HIST_COUNT                   32
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 9581a24c3d17..1d849ba9529f 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -371,9 +371,17 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> >  {
> >         void *obj;
> >
> > +       /*
> > +        * The capacity field must be initialized since the storage for the
> > +        * objects pointer array is laid out after the kvm_mmu_memory_cache
> > +        * struct and not known at compile time.
> > +        */
> > +       if (WARN_ON(mc->capacity == 0))
> > +               return -EINVAL;
> > +
> >         if (mc->nobjs >= min)
> >                 return 0;
> > -       while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
> > +       while (mc->nobjs < mc->capacity) {
> >                 obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
> >                 if (!obj)
> >                         return mc->nobjs >= min ? 0 : -ENOMEM;
> > --
> > 2.35.1.723.g4982287a31-goog
> >
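
As an aside, the flexible-array layout that DEFINE_KVM_MMU_MEMORY_CACHE() builds
on can be sketched in a few lines of userspace C. Everything below (the struct
and function names, the calloc() stand-in for kmem_cache_alloc()) is a
simplified illustration rather than the kernel code, and embedding a struct
that ends in a flexible array member relies on the same compiler extension the
patch does:

#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for struct kvm_mmu_memory_cache. */
struct memory_cache {
	int nobjs;
	int capacity;		/* number of slots in objects[] */
	void *objects[];	/* storage is laid out after the struct */
};

/*
 * Declare a cache together with the backing storage for its object
 * pointers, mirroring __DEFINE_KVM_MMU_MEMORY_CACHE() above.
 */
#define DEFINE_MEMORY_CACHE(_name, _capacity)		\
	struct {					\
		struct memory_cache _name;		\
		void *_name##_objects[_capacity];	\
	}

static int topup(struct memory_cache *mc, int min)
{
	/* capacity must be set by the caller; it is not known at compile time. */
	if (mc->capacity == 0)
		return -1;

	while (mc->nobjs < mc->capacity) {
		void *obj = calloc(1, 64);	/* stand-in allocation */

		if (!obj)
			return mc->nobjs >= min ? 0 : -1;
		mc->objects[mc->nobjs++] = obj;
	}
	return 0;
}

int main(void)
{
	static DEFINE_MEMORY_CACHE(cache, 40) c;

	c.cache.capacity = 40;	/* must match the declared storage above */
	topup(&c.cache, 5);
	printf("filled %d of %d slots\n", c.cache.nobjs, c.cache.capacity);
	return 0;	/* objects intentionally not freed in this tiny demo */
}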

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 20/26] KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU
  2022-03-16 10:26     ` Peter Xu
@ 2022-03-22 23:58       ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-22 23:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Wed, Mar 16, 2022 at 3:27 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Mar 11, 2022 at 12:25:22AM +0000, David Matlack wrote:
> > Extend KVM's eager page splitting to also split huge pages that are
> > mapped by the shadow MMU. Specifically, walk through the rmap splitting
> > all 1GiB pages to 2MiB pages, and splitting all 2MiB pages to 4KiB
> > pages.
> >
> > Splitting huge pages mapped by the shadow MMU requires dealing with some
> > extra complexity beyond that of the TDP MMU:
> >
> > (1) The shadow MMU has a limit on the number of shadow pages that are
> >     allowed to be allocated. So, as a policy, Eager Page Splitting
> >     refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
> >     pages available.
> >
> > (2) Huge pages may be mapped by indirect shadow pages which have the
> >     possibility of being unsync. As a policy we opt not to split such
> >     pages as their translation may no longer be valid.
> >
> > (3) Splitting a huge page may end up re-using an existing lower level
> >     shadow page table. This is unlike the TDP MMU which always allocates
> >     new shadow page tables when splitting.  This commit does *not*
> >     handle such aliasing and opts not to split such huge pages.
> >
> > (4) When installing the lower level SPTEs, they must be added to the
> >     rmap which may require allocating additional pte_list_desc structs.
> >     This commit does *not* handle such cases and instead opts to leave
> >     such lower-level SPTEs non-present. In this situation TLBs must be
> >     flushed before dropping the MMU lock as a portion of the huge page
> >     region is being unmapped.
> >
> > Suggested-by: Peter Feiner <pfeiner@google.com>
> > [ This commit is based off of the original implementation of Eager Page
> >   Splitting from Peter in Google's kernel from 2016. ]
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  .../admin-guide/kernel-parameters.txt         |   3 -
> >  arch/x86/kvm/mmu/mmu.c                        | 307 ++++++++++++++++++
> >  2 files changed, 307 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 05161afd7642..495f6ac53801 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -2360,9 +2360,6 @@
> >                       the KVM_CLEAR_DIRTY ioctl, and only for the pages being
> >                       cleared.
> >
> > -                     Eager page splitting currently only supports splitting
> > -                     huge pages mapped by the TDP MMU.
> > -
> >                       Default is Y (on).
> >
> >       kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 926ddfaa9e1a..dd56b5b9624f 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -727,6 +727,11 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> >
> >  static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
> >  {
> > +     static const gfp_t gfp_nocache = GFP_ATOMIC | __GFP_ACCOUNT | __GFP_ZERO;
> > +
> > +     if (WARN_ON_ONCE(!cache))
> > +             return kmem_cache_alloc(pte_list_desc_cache, gfp_nocache);
> > +
>
> I also think this doesn't belong in this patch.  Maybe it'd
> be more suitable for the earlier rmap_add() rework patch, or maybe it
> can be dropped entirely if it should never trigger at all. Then we'd die hard
> below when dereferencing it.

I can drop this, Ben suggested the same. cache should really never be
NULL so there's no need for this backup code.

>
> >       return kvm_mmu_memory_cache_alloc(cache);
> >  }
> >
> > @@ -743,6 +748,28 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
> >       return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
> >  }
> >
> > +static gfn_t sptep_to_gfn(u64 *sptep)
> > +{
> > +     struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> > +
> > +     return kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> > +}
> > +
> > +static unsigned int kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
> > +{
> > +     if (!sp->role.direct)
> > +             return sp->shadowed_translation[index].access;
> > +
> > +     return sp->role.access;
> > +}
> > +
> > +static unsigned int sptep_to_access(u64 *sptep)
> > +{
> > +     struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> > +
> > +     return kvm_mmu_page_get_access(sp, sptep - sp->spt);
> > +}
> > +
> >  static void kvm_mmu_page_set_gfn_access(struct kvm_mmu_page *sp, int index,
> >                                       gfn_t gfn, u32 access)
> >  {
> > @@ -912,6 +939,9 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
> >       return count;
> >  }
> >
> > +static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
> > +                                      const struct kvm_memory_slot *slot);
> > +
> >  static void
> >  pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
> >                          struct pte_list_desc *desc, int i,
> > @@ -2125,6 +2155,23 @@ static struct kvm_mmu_page *__kvm_mmu_find_shadow_page(struct kvm *kvm,
> >       return sp;
> >  }
> >
> > +static struct kvm_mmu_page *kvm_mmu_find_direct_sp(struct kvm *kvm, gfn_t gfn,
> > +                                                union kvm_mmu_page_role role)
> > +{
> > +     struct kvm_mmu_page *sp;
> > +     LIST_HEAD(invalid_list);
> > +
> > +     BUG_ON(!role.direct);
> > +
> > +     sp = __kvm_mmu_find_shadow_page(kvm, gfn, role, &invalid_list);
> > +
> > +     /* Direct SPs are never unsync. */
> > +     WARN_ON_ONCE(sp && sp->unsync);
> > +
> > +     kvm_mmu_commit_zap_page(kvm, &invalid_list);
> > +     return sp;
> > +}
> > +
> >  /*
> >   * Looks up an existing SP for the given gfn and role if one exists. The
> >   * return SP is guaranteed to be synced.
> > @@ -6063,12 +6110,266 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> >               kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
> >  }
> >
> > +static int prepare_to_split_huge_page(struct kvm *kvm,
> > +                                   const struct kvm_memory_slot *slot,
> > +                                   u64 *huge_sptep,
> > +                                   struct kvm_mmu_page **spp,
> > +                                   bool *flush,
> > +                                   bool *dropped_lock)
> > +{
> > +     int r = 0;
> > +
> > +     *dropped_lock = false;
> > +
> > +     if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
> > +             return -ENOSPC;
> > +
> > +     if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> > +             goto drop_lock;
> > +
>
> Not immediately clear on whether there'll be case that *spp is set within
> the current function.  Some sanity check might be nice?

Sorry I'm not sure what you mean here. What kind of sanity check did
you have in mind?

>
> > +     *spp = kvm_mmu_alloc_direct_sp_for_split(true);
> > +     if (r)
> > +             goto drop_lock;
> > +
> > +     return 0;
> > +
> > +drop_lock:
> > +     if (*flush)
> > +             kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > +
> > +     *flush = false;
> > +     *dropped_lock = true;
> > +
> > +     write_unlock(&kvm->mmu_lock);
> > +     cond_resched();
> > +     *spp = kvm_mmu_alloc_direct_sp_for_split(false);
> > +     if (!*spp)
> > +             r = -ENOMEM;
> > +     write_lock(&kvm->mmu_lock);
> > +
> > +     return r;
> > +}
> > +
> > +static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
> > +                                                  const struct kvm_memory_slot *slot,
> > +                                                  u64 *huge_sptep,
> > +                                                  struct kvm_mmu_page **spp)
> > +{
> > +     struct kvm_mmu_page *split_sp;
> > +     union kvm_mmu_page_role role;
> > +     unsigned int access;
> > +     gfn_t gfn;
> > +
> > +     gfn = sptep_to_gfn(huge_sptep);
> > +     access = sptep_to_access(huge_sptep);
> > +
> > +     /*
> > +      * Huge page splitting always uses direct shadow pages since we are
> > +      * directly mapping the huge page GFN region with smaller pages.
> > +      */
> > +     role = kvm_mmu_child_role(huge_sptep, true, access);
> > +     split_sp = kvm_mmu_find_direct_sp(kvm, gfn, role);
> > +
> > +     /*
> > +      * Opt not to split if the lower-level SP already exists. This requires
> > +      * more complex handling as the SP may be already partially filled in
> > +      * and may need extra pte_list_desc structs to update parent_ptes.
> > +      */
> > +     if (split_sp)
> > +             return NULL;
>
> This smells tricky..
>
> Firstly we're trying to look up the existing SPs that have shadowed this huge
> page in a split way, with the access bits fetched from the shadow cache (so
> without the hugepage nx effect).

Yeah this is tricky for sure.

For direct shadow pages, sp->role.access is always the guest access
permissions being shadowed (or ACC_ALL for situations where there is
no shadowing, e.g. __direct_map() and the TDP MMU). That's why we use
the shadow translation cache to look up an existing SP or create a new
SP, rather than taking the access permissions from the huge SPTE
itself (which may have KVM-specific policies applied such as HugePage
NX, access tracking, etc.). In other words, we want to look up
existing SPs in the same exact way that the fault handler looks them
up.

> However could the pages be mapped with
> different permissions from the currently hugely mapped page?

Yes, I think there can be some differences, such as:

 - The child SPTEs may have execute permission granted due to HugePage
NX while the huge page does not.
 - The child SPTEs may be in a different access tracking state than
the huge page.

There may be others. But no matter what, the same differences are
possible when we split a huge page during a fault, which leads me to
conclude it is safe.

>
> IIUC all of this is due to the fact that we can't allocate pte_list_desc and we
> want to make sure we won't grow any pte list to a length >1.
>
> But I also see that the pte_list check below...
>
> > +
> > +     swap(split_sp, *spp);
> > +     init_shadow_page(kvm, split_sp, slot, gfn, role);
> > +     trace_kvm_mmu_get_page(split_sp, true);
> > +
> > +     return split_sp;
> > +}
> > +
> > +static int kvm_mmu_split_huge_page(struct kvm *kvm,
> > +                                const struct kvm_memory_slot *slot,
> > +                                u64 *huge_sptep, struct kvm_mmu_page **spp,
> > +                                bool *flush)
> > +
> > +{
> > +     struct kvm_mmu_page *split_sp;
> > +     u64 huge_spte, split_spte;
> > +     int split_level, index;
> > +     unsigned int access;
> > +     u64 *split_sptep;
> > +     gfn_t split_gfn;
> > +
> > +     split_sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
> > +     if (!split_sp)
> > +             return -EOPNOTSUPP;
> > +
> > +     /*
> >      * Since we did not allocate pte_list_desc structs for the split, we
> > +      * cannot add a new parent SPTE to parent_ptes. This should never happen
> > +      * in practice though since this is a fresh SP.
> > +      *
> > +      * Note, this makes it safe to pass NULL to __link_shadow_page() below.
> > +      */
> > +     if (WARN_ON_ONCE(split_sp->parent_ptes.val))
> > +             return -EINVAL;
> > +
> > +     huge_spte = READ_ONCE(*huge_sptep);
> > +
> > +     split_level = split_sp->role.level;
> > +     access = split_sp->role.access;
> > +
> > +     for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
> > +             split_sptep = &split_sp->spt[index];
> > +             split_gfn = kvm_mmu_page_get_gfn(split_sp, index);
> > +
> > +             BUG_ON(is_shadow_present_pte(*split_sptep));
> > +
> > +             /*
> > +              * Since we did not allocate pte_list_desc structs for the
> > +              * split, we can't add a new SPTE that maps this GFN.
> > +              * Skipping this SPTE means we're only partially mapping the
> > +              * huge page, which means we'll need to flush TLBs before
> > +              * dropping the MMU lock.
> > +              *
> >              * Note, this makes it safe to pass NULL to __rmap_add() below.
> > +              */
> > +             if (gfn_to_rmap(split_gfn, split_level, slot)->val) {
> > +                     *flush = true;
> > +                     continue;
> > +             }
>
> ... here.
>
> IIUC this check should already be able to cover all the cases, and it's
> accurate given that we don't want to grow any rmap to a length >1.
>
> > +
> > +             split_spte = make_huge_page_split_spte(
> > +                             huge_spte, split_level + 1, index, access);
> > +
> > +             mmu_spte_set(split_sptep, split_spte);
> > +             __rmap_add(kvm, NULL, slot, split_sptep, split_gfn, access);
>
> __rmap_add() with a NULL cache pointer is weird.. same as
> __link_shadow_page() below.
>
> I'll stop here for now I guess.. Have you considered having the rmap allocation
> ready up front, rather than making this an intermediate step and only adding
> that later?  Because all of this looks hackish to me..  It's also possible
> that I missed something important; if so please shoot.
>
> Thanks,
>
> > +     }
> > +
> > +     /*
> > +      * Replace the huge spte with a pointer to the populated lower level
> > +      * page table. Since we are making this change without a TLB flush vCPUs
> > +      * will see a mix of the split mappings and the original huge mapping,
> > +      * depending on what's currently in their TLB. This is fine from a
> > +      * correctness standpoint since the translation will either be identical
> > +      * or non-present. To account for non-present mappings, the TLB will be
> > +      * flushed prior to dropping the MMU lock.
> > +      */
> > +     __drop_large_spte(kvm, huge_sptep, false);
> > +     __link_shadow_page(NULL, huge_sptep, split_sp);
> > +
> > +     return 0;
> > +}
>
> --
> Peter Xu
>
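
For readers not familiar with the rmap layout that the gfn_to_rmap(...)->val
check above relies on: roughly, the rmap head is a single word that is 0 when
the gfn has no SPTEs, holds a pointer to the one SPTE when there is exactly
one, and points (with the low bit set) at a pte_list_desc once there is more
than one. That is why the split path above can only install a child SPTE when
the rmap is still empty. A simplified, self-contained sketch of that encoding
follows; the names and the descriptor size are illustrative, paraphrased from
how pte_list_add() treats rmap_head->val, not copied from the kernel:

#include <stdio.h>

/* Illustrative stand-ins for KVM's rmap structures. */
struct pte_list_desc {
	unsigned long long *sptes[14];
	struct pte_list_desc *more;
};

struct rmap_head {
	unsigned long val;	/* 0, a single SPTE pointer, or desc | 1 */
};

static const char *rmap_state(const struct rmap_head *head)
{
	if (!head->val)
		return "empty: eager splitting may install a new SPTE here";
	if (!(head->val & 1))
		return "one SPTE: adding another would need a pte_list_desc";
	return "already a desc list: adding needs (or extends) a desc";
}

int main(void)
{
	unsigned long long spte = 0;
	struct rmap_head empty = { 0 };
	struct rmap_head single = { (unsigned long)&spte };

	printf("%s\n%s\n", rmap_state(&empty), rmap_state(&single));
	return 0;
}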

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 03/26] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-03-22 18:30       ` David Matlack
@ 2022-03-30 14:25         ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-30 14:25 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Tue, Mar 22, 2022 at 11:30:07AM -0700, David Matlack wrote:
> > > +static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
> > > +{
> > > +     struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
> > > +     union kvm_mmu_page_role role;
> > > +
> > > +     role = parent_sp->role;
> > > +     role.level--;
> > > +     role.access = access;
> > > +     role.direct = direct;
> > > +
> > > +     /*
> > > +      * If the guest has 4-byte PTEs then that means it's using 32-bit,
> > > +      * 2-level, non-PAE paging. KVM shadows such guests using 4 PAE page
> > > +      * directories, each mapping 1/4 of the guest's linear address space
> > > +      * (1GiB). The shadow pages for those 4 page directories are
> > > +      * pre-allocated and assigned a separate quadrant in their role.
> > > +      *
> > > +      * Since we are allocating a child shadow page and there are only 2
> > > +      * levels, this must be a PG_LEVEL_4K shadow page. Here the quadrant
> > > +      * will either be 0 or 1 because it maps 1/2 of the address space mapped
> > > +      * by the guest's PG_LEVEL_4K page table (or 4MiB huge page) that it
> > > +      * is shadowing. In this case, the quadrant can be derived by the index
> > > +      * of the SPTE that points to the new child shadow page in the page
> > > +      * directory (parent_sp). Specifically, every 2 SPTEs in parent_sp
> > > +      * shadow one half of a guest's page table (or 4MiB huge page) so the
> > > +      * quadrant is just the parity of the index of the SPTE.
> > > +      */
> > > +     if (role.has_4_byte_gpte) {
> > > +             BUG_ON(role.level != PG_LEVEL_4K);
> > > +             role.quadrant = (sptep - parent_sp->spt) % 2;
> > > +     }
> >
> > This made me wonder whether role.quadrant can be dropped, because it seems
> > it can be calculated out of the box with has_4_byte_gpte, level and spte
> > offset.  I could have missed something, though..
> 
> I think you're right that we could compute it on-the-fly. But it'd be
> non-trivial to remove since it's currently used to ensure the sp->role
> and sp->gfn uniquely identifies each shadow page (e.g. when checking
> for collisions in the mmu_page_hash).

Makes sense.

-- 
Peter Xu
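
To picture the on-the-fly computation being discussed, here is a hypothetical,
self-contained sketch; the toy_sp struct and derived_quadrant() helper are
illustrative stand-ins rather than KVM code, and nothing here proposes an
actual change to the series:

#include <stdio.h>

#define SPTES_PER_SP 512

/* Toy stand-in for the fields of kvm_mmu_page used in the discussion. */
struct toy_sp {
	unsigned long long spt[SPTES_PER_SP];
	int level;		/* 1 == 4K, 2 == 2M-sized page table */
	int has_4_byte_gpte;
};

/*
 * With 4-byte guest PTEs, a 4K-level child shadow page covers half of the
 * guest page table it shadows, so its quadrant is just the parity of the
 * SPTE's index in the parent; otherwise the quadrant would be 0.
 */
static int derived_quadrant(const struct toy_sp *parent,
			    const unsigned long long *sptep)
{
	int index = (int)(sptep - parent->spt);

	if (parent->has_4_byte_gpte && parent->level == 2)
		return index % 2;
	return 0;
}

int main(void)
{
	struct toy_sp parent = { .level = 2, .has_4_byte_gpte = 1 };

	printf("quadrant of index 5 = %d\n",
	       derived_quadrant(&parent, &parent.spt[5]));
	return 0;
}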


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 05/26] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
  2022-03-22 21:35       ` David Matlack
@ 2022-03-30 14:28         ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-30 14:28 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Tue, Mar 22, 2022 at 02:35:25PM -0700, David Matlack wrote:
> On Tue, Mar 15, 2022 at 1:52 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Mar 11, 2022 at 12:25:07AM +0000, David Matlack wrote:
> > > Rename 3 functions:
> > >
> > >   kvm_mmu_get_page()   -> kvm_mmu_get_shadow_page()
> > >   kvm_mmu_alloc_page() -> kvm_mmu_alloc_shadow_page()
> > >   kvm_mmu_free_page()  -> kvm_mmu_free_shadow_page()
> > >
> > > This change makes it clear that these functions deal with shadow pages
> > > rather than struct pages. Prefer "shadow_page" over the shorter "sp"
> > > since these are core routines.
> > >
> > > Signed-off-by: David Matlack <dmatlack@google.com>
> >
> > Acked-by: Peter Xu <peterx@redhat.com>
> 
> What's the reason to use Acked-by for this patch but Reviewed-by for others?

A weak version of r-b?  I normally don't do renames even when necessary (and
I'm pretty poor at naming..), and in this case I don't have a strong opinion.
I should have left nothing, then it would be less confusing. :)

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 16/26] KVM: x86/mmu: Cache the access bits of shadowed translations
  2022-03-22 22:51       ` David Matlack
@ 2022-03-30 18:30         ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-30 18:30 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Tue, Mar 22, 2022 at 03:51:54PM -0700, David Matlack wrote:
> On Wed, Mar 16, 2022 at 1:32 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Mar 11, 2022 at 12:25:18AM +0000, David Matlack wrote:
> > > In order to split a huge page we need to know what access bits to assign
> > > to the role of the new child page table. This can't be easily derived
> > > from the huge page SPTE itself since KVM applies its own access policies
> > > on top, such as for HugePage NX.
> > >
> > > We could walk the guest page tables to determine the correct access
> > > bits, but that is difficult to plumb outside of a vCPU fault context.
> > > Instead, we can store the original access bits for each leaf SPTE
> > > alongside the GFN in the gfns array. The access bits only take up 3
> > > bits, which leaves 61 bits left over for gfns, which is more than
> > > enough. So this change does not require any additional memory.
> >
> > I have a pure question on why eager page split needs to worry about hugepage
> > nx..
> >
> > IIUC that was about forbidding a huge page from being mapped as executable.  So
> > afaiu the only bit that could go missing if we copy over the huge page
> > ptes is the executable bit.
> >
> > But then?  I think we could get a page fault on fault->exec==true on the
> > split small page (because when we copy over it's cleared, even though the
> > page can actually be executable), but it should be well resolved right
> > after that small page fault.
> >
> > The thing is IIUC this is a very rare case, IOW, it should mostly not
> > happen in 99% of the use case?  And there's a slight penalty when it
> > happens, but only perf-wise.
> >
> > As I'm not really fluent with the code base, perhaps I missed something?
> 
> You're right that we could get away with not knowing the shadowed
> access permissions to assign to the child SPTEs when splitting a huge
> SPTE. We could just copy the huge SPTE access permissions and then let
> the execute bit be repaired on fault (although those faults would be a
> performance cost).
> 
> But the access permissions are also needed to lookup an existing
> shadow page (or create a new shadow page) to use to split the huge
> page. For example, let's say we are going to split a huge page that
> does not have execute permissions enabled. That could be because NX
> HugePages are enabled or because we are shadowing a guest translation
> that does not allow execution (or both). We wouldn't want to propagate
> the no-execute permission into the child SP role.access if the
> shadowed translation really does allow execution (and vice versa).

Then the follow-up (pure) question is: what will happen if we simply
propagate the no-exec permission into the child SP?

I think that only happens with direct sptes where guest used huge pages
because that's where the shadow page will be huge, so IIUC that's checked
here when the small page fault triggers:

static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
				   unsigned direct_access)
{
	if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
		struct kvm_mmu_page *child;

		/*
		 * For the direct sp, if the guest pte's dirty bit
		 * changed form clean to dirty, it will corrupt the
		 * sp's access: allow writable in the read-only sp,
		 * so we should update the spte at this point to get
		 * a new sp with the correct access.
		 */
		child = to_shadow_page(*sptep & PT64_BASE_ADDR_MASK);
		if (child->role.access == direct_access)
			return;

		drop_parent_pte(child, sptep);
		kvm_flush_remote_tlbs_with_address(vcpu->kvm, child->gfn, 1);
	}
}

Due to the missing EXEC bit, the role.access check will not match the direct
access, which is the guest pgtable value that has EXEC set.  Then IIUC
we'll simply drop the no-exec SP and replace it with a new one with exec
permission.  The question is, would that ultimately work too?

Even if that works, I agree this sounds tricky because we're potentially
caching a fake sp.role conditionally, and it seems we've never done that
before.  It's just that the other option you proposed here seems to add a
different kind of complexity by caching spte permission information, which
kvm doesn't do today either.  IMHO we need to see which is the better
trade-off.

I could have missed something else, though.

Thanks,

-- 
Peter Xu
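
As a side note, the 61-bit gfn / 3-bit access packing mentioned in the quoted
commit message can be pictured with the short sketch below; the struct layout
and names are illustrative, not the actual patch:

#include <stdio.h>
#include <stdint.h>

/*
 * Illustrative packing of one shadowed translation: 61 bits are far more
 * than a gfn needs, and 3 bits hold an ACC_{EXEC,WRITE,USER}-style mask,
 * so the pair still fits in the same 8 bytes the gfn alone used to take.
 */
struct shadowed_translation {
	uint64_t gfn    : 61;
	uint64_t access : 3;
};

int main(void)
{
	struct shadowed_translation t = { .gfn = 0x123456, .access = 0x7 };

	printf("sizeof = %zu bytes, gfn = %#llx, access = %#llx\n",
	       sizeof(t),
	       (unsigned long long)t.gfn,
	       (unsigned long long)t.access);
	return 0;
}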


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 20/26] KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU
  2022-03-22 23:58       ` David Matlack
@ 2022-03-30 18:34         ` Peter Xu
  -1 siblings, 0 replies; 134+ messages in thread
From: Peter Xu @ 2022-03-30 18:34 UTC (permalink / raw)
  To: David Matlack
  Cc: Marc Zyngier, Albert Ou,
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Huacai Chen, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Aleksandar Markovic, Palmer Dabbelt,
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Paul Walmsley, Ben Gardon, Paolo Bonzini, Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	Peter Feiner

On Tue, Mar 22, 2022 at 04:58:08PM -0700, David Matlack wrote:
> > > +static int prepare_to_split_huge_page(struct kvm *kvm,
> > > +                                   const struct kvm_memory_slot *slot,
> > > +                                   u64 *huge_sptep,
> > > +                                   struct kvm_mmu_page **spp,
> > > +                                   bool *flush,
> > > +                                   bool *dropped_lock)
> > > +{
> > > +     int r = 0;
> > > +
> > > +     *dropped_lock = false;
> > > +
> > > +     if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
> > > +             return -ENOSPC;
> > > +
> > > +     if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> > > +             goto drop_lock;
> > > +
> >
> > Not immediately clear on whether there'll be case that *spp is set within
> > the current function.  Some sanity check might be nice?
> 
> Sorry I'm not sure what you mean here. What kind of sanity check did
> you have in mind?

Something like "WARN_ON_ONCE(*spp);"?

> 
> >
> > > +     *spp = kvm_mmu_alloc_direct_sp_for_split(true);
> > > +     if (r)
> > > +             goto drop_lock;
> > > +
> > > +     return 0;

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 20/26] KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU
  2022-03-30 18:34         ` Peter Xu
@ 2022-03-31 19:57           ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-31 19:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Wed, Mar 30, 2022 at 11:34 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Mar 22, 2022 at 04:58:08PM -0700, David Matlack wrote:
> > > > +static int prepare_to_split_huge_page(struct kvm *kvm,
> > > > +                                   const struct kvm_memory_slot *slot,
> > > > +                                   u64 *huge_sptep,
> > > > +                                   struct kvm_mmu_page **spp,
> > > > +                                   bool *flush,
> > > > +                                   bool *dropped_lock)
> > > > +{
> > > > +     int r = 0;
> > > > +
> > > > +     *dropped_lock = false;
> > > > +
> > > > +     if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
> > > > +             return -ENOSPC;
> > > > +
> > > > +     if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> > > > +             goto drop_lock;
> > > > +
> > >
> > > Not immediately clear on whether there could be a case where *spp is
> > > already set when entering this function.  Some sanity check might be nice?
> >
> > Sorry I'm not sure what you mean here. What kind of sanity check did
> > you have in mind?
>
> Something like "WARN_ON_ONCE(*spp);"?

Ah I see. I was confused because the previous version of this code
checked if *spp is already set and, if so, skipped the allocation. But
I accidentally introduced a memory leak here when I implemented Ben's
suggestion to defer alloc_memory_for_split() to a subsequent commit.
I'll fix this in v3.

>
> >
> > >
> > > > +     *spp = kvm_mmu_alloc_direct_sp_for_split(true);
> > > > +     if (r)
> > > > +             goto drop_lock;
> > > > +
> > > > +     return 0;
>
> Thanks,
>
> --
> Peter Xu
>
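
For context, a minimal sketch of the allocation path follows, with the sanity
check Peter suggested and with the allocation failure detected via *spp rather
than the unused 'r'.  This is only an illustration of the shape of the change
being discussed, not the actual v3 code: it assumes
kvm_mmu_alloc_direct_sp_for_split() returns NULL on allocation failure, and the
-EBUSY placeholder in the elided drop_lock path is likewise an assumption.

/*
 * Illustrative sketch only -- not the v3 patch.  It mirrors the helper
 * quoted above, adds the WARN_ON_ONCE(*spp) sanity check suggested in
 * the review, and checks the allocation result directly instead of the
 * never-assigned 'r'.
 */
static int prepare_to_split_huge_page(struct kvm *kvm,
				      const struct kvm_memory_slot *slot,
				      u64 *huge_sptep,
				      struct kvm_mmu_page **spp,
				      bool *flush,
				      bool *dropped_lock)
{
	*dropped_lock = false;

	/* Policy: refuse to split when the shadow MMU is low on pages. */
	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
		return -ENOSPC;

	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
		goto drop_lock;

	/* The caller is expected to pass in an empty slot for the new SP. */
	WARN_ON_ONCE(*spp);

	*spp = kvm_mmu_alloc_direct_sp_for_split(true);
	if (!*spp)
		goto drop_lock;

	return 0;

drop_lock:
	/*
	 * The quoted excerpt elides this path; presumably it flushes if
	 * needed, drops mmu_lock to allocate or reschedule, and has the
	 * caller retry.  -EBUSY is only a placeholder for that behavior.
	 */
	*dropped_lock = true;
	return -EBUSY;
}

Whether v3 keeps the WARN or goes back to tolerating a pre-populated *spp is up
to the actual fix; the sketch just makes the failure handling explicit.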


* Re: [PATCH v2 16/26] KVM: x86/mmu: Cache the access bits of shadowed translations
  2022-03-30 18:30         ` Peter Xu
@ 2022-03-31 21:40           ` David Matlack
  -1 siblings, 0 replies; 134+ messages in thread
From: David Matlack @ 2022-03-31 21:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Sean Christopherson, Andrew Jones, Ben Gardon,
	Maciej S. Szmigiero,
	moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv),
	Peter Feiner

On Wed, Mar 30, 2022 at 11:30 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Mar 22, 2022 at 03:51:54PM -0700, David Matlack wrote:
> > On Wed, Mar 16, 2022 at 1:32 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Fri, Mar 11, 2022 at 12:25:18AM +0000, David Matlack wrote:
> > > > In order to split a huge page we need to know what access bits to assign
> > > > to the role of the new child page table. This can't be easily derived
> > > > from the huge page SPTE itself since KVM applies its own access policies
> > > > on top, such as for HugePage NX.
> > > >
> > > > We could walk the guest page tables to determine the correct access
> > > > bits, but that is difficult to plumb outside of a vCPU fault context.
> > > > Instead, we can store the original access bits for each leaf SPTE
> > > > alongside the GFN in the gfns array. The access bits only take up 3
> > > > bits, which leaves 61 bits left over for gfns, which is more than
> > > > enough. So this change does not require any additional memory.
> > >
> > > I have a pure question on why eager page splitting needs to worry about
> > > NX hugepages.
> > >
> > > IIUC that feature is about forbidding a huge page from being mapped as
> > > executable.  So AFAIU the only bit that could go missing if we copy over
> > > the huge page PTEs is the executable bit.
> > >
> > > But then?  I think we could get a page fault on fault->exec==true on the
> > > split small page (because when we copy over it's cleared, even though the
> > > page can actually be executable), but it should be well resolved right
> > > after that small page fault.
> > >
> > > The thing is, IIUC this is a very rare case; IOW, it should not happen
> > > in 99% of use cases?  And there's a slight penalty when it happens, but
> > > only performance-wise.
> > >
> > > As I'm not really fluent with the code base, perhaps I missed something?
> >
> > You're right that we could get away with not knowing the shadowed
> > access permissions to assign to the child SPTEs when splitting a huge
> > SPTE. We could just copy the huge SPTE access permissions and then let
> > the execute bit be repaired on fault (although those faults would be a
> > performance cost).
> >
> > But the access permissions are also needed to lookup an existing
> > shadow page (or create a new shadow page) to use to split the huge
> > page. For example, let's say we are going to split a huge page that
> > does not have execute permissions enabled. That could be because NX
> > HugePages are enabled or because we are shadowing a guest translation
> > that does not allow execution (or both). We wouldn't want to propagate
> > the no-execute permission into the child SP role.access if the
> > shadowed translation really does allow execution (and vice versa).
>
> Then the follow-up (pure) question is: what will happen if we simply
> propagate the no-exec permission into the child SP?
>
> I think that only happens with direct SPTEs where the guest used huge pages,
> because that's where the shadow page will be huge, so IIUC that's checked
> here when the small-page fault triggers:
>
> static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>                                    unsigned direct_access)
> {
>         if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
>                 struct kvm_mmu_page *child;
>
>                 /*
>                  * For the direct sp, if the guest pte's dirty bit
>                  * changed from clean to dirty, it will corrupt the
>                  * sp's access: allow writable in the read-only sp,
>                  * so we should update the spte at this point to get
>                  * a new sp with the correct access.
>                  */
>                 child = to_shadow_page(*sptep & PT64_BASE_ADDR_MASK);
>                 if (child->role.access == direct_access)
>                         return;
>
>                 drop_parent_pte(child, sptep);
>                 kvm_flush_remote_tlbs_with_address(vcpu->kvm, child->gfn, 1);
>         }
> }
>
> Due to the missing EXEC bit, the role.access check will not match the direct
> access, which is the guest pgtable value that has EXEC set.  Then IIUC
> we'll simply drop the no-exec SP and replace it with a new one with exec
> permission.  The question is, would that ultimately work too?
>
> Even if that works, I agree this sounds tricky, because we'd potentially be
> caching a fake sp.role conditionally, and it seems we have never done that
> before.  It's just that the other option you proposed adds a different kind
> of complexity by caching SPTE permission information, which KVM doesn't do
> today either.  IMHO we need to see which is the better trade-off.

Clever! I think you're right that it would work correctly.

This approach avoids the need for caching access bits, but comes with downsides:
 - Performance impact from the extra faults needed to drop the SP and
repair the execute permission bit.
 - Some amount of memory overhead from KVM allocating new SPs when it
could re-use existing SPs.

Given the relative simplicity of access caching (and the fact that it
requires no additional memory), I'm inclined to stick with it rather
than taking the access permissions from the huge page.

>
> I could have missed something else, though.
>
> Thanks,
>
> --
> Peter Xu
>
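
As an aside on the access-caching approach David refers to above: the scheme
from the patch 16/26 changelog (3 access bits stored next to a 61-bit GFN) can
be pictured with the small sketch below.  The helper and macro names are made
up for illustration and are not from the patch; only the "3 + 61 bits share a
u64" property comes from the changelog, and gfn_t is the usual KVM typedef.

/*
 * Illustration only: shows that a shadowed translation's 3-bit ACC_*
 * access mask and its GFN fit together in a single u64, so caching the
 * access bits alongside the GFN needs no additional memory.
 */
#define CACHED_ACCESS_BITS	3
#define CACHED_ACCESS_MASK	((1ULL << CACHED_ACCESS_BITS) - 1)

static inline u64 pack_shadowed_translation(gfn_t gfn, unsigned int access)
{
	/* GFNs comfortably fit in the remaining 61 bits. */
	return ((u64)gfn << CACHED_ACCESS_BITS) | (access & CACHED_ACCESS_MASK);
}

static inline gfn_t unpack_shadowed_gfn(u64 entry)
{
	return (gfn_t)(entry >> CACHED_ACCESS_BITS);
}

static inline unsigned int unpack_shadowed_access(u64 entry)
{
	return (unsigned int)(entry & CACHED_ACCESS_MASK);
}

When eager page splitting later needs a child shadow page, the cached access
bits can feed sp->role.access directly, which is what lets KVM look up or
create the right SP without walking the guest page tables or taking a vCPU
fault.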


end of thread (newest message: ~2022-04-01 16:24 UTC)

Thread overview: 134+ messages
2022-03-11  0:25 [PATCH v2 00/26] Extend Eager Page Splitting to the shadow MMU David Matlack
2022-03-11  0:25 ` David Matlack
2022-03-11  0:25 ` [PATCH v2 01/26] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-15  7:40   ` Peter Xu
2022-03-15  7:40     ` Peter Xu
2022-03-22 18:16     ` David Matlack
2022-03-22 18:16       ` David Matlack
2022-03-11  0:25 ` [PATCH v2 02/26] KVM: x86/mmu: Use a bool for direct David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-15  7:46   ` Peter Xu
2022-03-15  7:46     ` Peter Xu
2022-03-22 18:21     ` David Matlack
2022-03-22 18:21       ` David Matlack
2022-03-11  0:25 ` [PATCH v2 03/26] KVM: x86/mmu: Derive shadow MMU page role from parent David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-15  8:15   ` Peter Xu
2022-03-15  8:15     ` Peter Xu
2022-03-22 18:30     ` David Matlack
2022-03-22 18:30       ` David Matlack
2022-03-30 14:25       ` Peter Xu
2022-03-30 14:25         ` Peter Xu
2022-03-11  0:25 ` [PATCH v2 04/26] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-15  8:50   ` Peter Xu
2022-03-15  8:50     ` Peter Xu
2022-03-22 22:09     ` David Matlack
2022-03-22 22:09       ` David Matlack
2022-03-11  0:25 ` [PATCH v2 05/26] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-15  8:52   ` Peter Xu
2022-03-15  8:52     ` Peter Xu
2022-03-22 21:35     ` David Matlack
2022-03-22 21:35       ` David Matlack
2022-03-30 14:28       ` Peter Xu
2022-03-30 14:28         ` Peter Xu
2022-03-11  0:25 ` [PATCH v2 06/26] KVM: x86/mmu: Pass memslot to kvm_mmu_new_shadow_page() David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-15  9:03   ` Peter Xu
2022-03-15  9:03     ` Peter Xu
2022-03-22 22:05     ` David Matlack
2022-03-22 22:05       ` David Matlack
2022-03-11  0:25 ` [PATCH v2 07/26] KVM: x86/mmu: Separate shadow MMU sp allocation from initialization David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-15  9:54   ` Peter Xu
2022-03-15  9:54     ` Peter Xu
2022-03-11  0:25 ` [PATCH v2 08/26] KVM: x86/mmu: Link spt to sp during allocation David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-15 10:04   ` Peter Xu
2022-03-15 10:04     ` Peter Xu
2022-03-22 22:30     ` David Matlack
2022-03-22 22:30       ` David Matlack
2022-03-11  0:25 ` [PATCH v2 09/26] KVM: x86/mmu: Move huge page split sp allocation code to mmu.c David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-15 10:17   ` Peter Xu
2022-03-15 10:17     ` Peter Xu
2022-03-11  0:25 ` [PATCH v2 10/26] KVM: x86/mmu: Use common code to free kvm_mmu_page structs David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-15 10:22   ` Peter Xu
2022-03-15 10:22     ` Peter Xu
2022-03-22 22:33     ` David Matlack
2022-03-22 22:33       ` David Matlack
2022-03-11  0:25 ` [PATCH v2 11/26] KVM: x86/mmu: Use common code to allocate kvm_mmu_page structs from vCPU caches David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-15 10:27   ` Peter Xu
2022-03-15 10:27     ` Peter Xu
2022-03-22 22:35     ` David Matlack
2022-03-22 22:35       ` David Matlack
2022-03-11  0:25 ` [PATCH v2 12/26] KVM: x86/mmu: Pass const memslot to rmap_add() David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-11  0:25 ` [PATCH v2 13/26] KVM: x86/mmu: Pass const memslot to init_shadow_page() and descendants David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-11  0:25 ` [PATCH v2 14/26] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-15 10:37   ` Peter Xu
2022-03-15 10:37     ` Peter Xu
2022-03-11  0:25 ` [PATCH v2 15/26] KVM: x86/mmu: Update page stats in __rmap_add() David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-15 10:39   ` Peter Xu
2022-03-15 10:39     ` Peter Xu
2022-03-11  0:25 ` [PATCH v2 16/26] KVM: x86/mmu: Cache the access bits of shadowed translations David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-16  8:32   ` Peter Xu
2022-03-16  8:32     ` Peter Xu
2022-03-22 22:51     ` David Matlack
2022-03-22 22:51       ` David Matlack
2022-03-30 18:30       ` Peter Xu
2022-03-30 18:30         ` Peter Xu
2022-03-31 21:40         ` David Matlack
2022-03-31 21:40           ` David Matlack
2022-03-11  0:25 ` [PATCH v2 17/26] KVM: x86/mmu: Pass access information to make_huge_page_split_spte() David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-16  8:44   ` Peter Xu
2022-03-16  8:44     ` Peter Xu
2022-03-22 23:08     ` David Matlack
2022-03-22 23:08       ` David Matlack
2022-03-11  0:25 ` [PATCH v2 18/26] KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-16  8:49   ` Peter Xu
2022-03-16  8:49     ` Peter Xu
2022-03-22 23:11     ` David Matlack
2022-03-22 23:11       ` David Matlack
2022-03-11  0:25 ` [PATCH v2 19/26] KVM: x86/mmu: Refactor drop_large_spte() David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-16  8:53   ` Peter Xu
2022-03-16  8:53     ` Peter Xu
2022-03-11  0:25 ` [PATCH v2 20/26] KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-16 10:26   ` Peter Xu
2022-03-16 10:26     ` Peter Xu
2022-03-22  0:07     ` David Matlack
2022-03-22  0:07       ` David Matlack
2022-03-22 23:58     ` David Matlack
2022-03-22 23:58       ` David Matlack
2022-03-30 18:34       ` Peter Xu
2022-03-30 18:34         ` Peter Xu
2022-03-31 19:57         ` David Matlack
2022-03-31 19:57           ` David Matlack
2022-03-11  0:25 ` [PATCH v2 21/26] KVM: Allow for different capacities in kvm_mmu_memory_cache structs David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-19  5:27   ` Anup Patel
2022-03-19  5:27     ` Anup Patel
2022-03-22 23:13     ` David Matlack
2022-03-22 23:13       ` David Matlack
2022-03-11  0:25 ` [PATCH v2 22/26] KVM: Allow GFP flags to be passed when topping up MMU caches David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-11  0:25 ` [PATCH v2 23/26] KVM: x86/mmu: Fully split huge pages that require extra pte_list_desc structs David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-11  0:25 ` [PATCH v2 24/26] KVM: x86/mmu: Split huge pages aliased by multiple SPTEs David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-11  0:25 ` [PATCH v2 25/26] KVM: x86/mmu: Drop NULL pte_list_desc_cache fallback David Matlack
2022-03-11  0:25   ` David Matlack
2022-03-11  0:25 ` [PATCH v2 26/26] KVM: selftests: Map x86_64 guest virtual memory with huge pages David Matlack
2022-03-11  0:25   ` David Matlack
