linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v4 00/16] KVM TDX: TDP MMU: large page support
@ 2023-07-25 22:23 isaku.yamahata
  2023-07-25 22:23 ` [RFC PATCH v4 01/16] KVM: TDP_MMU: Go to next level if smaller private mapping exists isaku.yamahata
                   ` (15 more replies)
  0 siblings, 16 replies; 21+ messages in thread
From: isaku.yamahata @ 2023-07-25 22:23 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang

From: Isaku Yamahata <isaku.yamahata@intel.com>

Changes from v3:
- Rebased to v15 TDX KVM v6.5-rc1 base

Changes from v2:
- implemented page merging path
- rebased to TDX KVM v11

Changes from v1:
- implemented page merging path
- rebased to UPM v10
- rebased to TDX KVM v10
- rebased to kvm.git queue + v6.1-rc8

---
This patch series is based on "v15 KVM TDX: basic feature support".  It
implements large page support for the TDP MMU by allowing a large page to be
populated and splitting it when necessary.

Feedback on options for merging sub-pages into a large page is welcome.

Remaining TODOs
===============
* Make nx recovery thread use TDH.MEM.RANGE.BLOCK instead of zapping EPT entry.
* Record that the entry is blocked by introducing a bit in spte.  On EPT
  violation, check if the entry is blocked or not.  If the EPT violation is
  caused by a blocked Secure-EPT entry, trigger the page merge logic.

Splitting large pages when necessary
====================================
* KVM already tracks whether a GFN is private or shared.  When that changes,
  update lpage_info to prevent a large page.
* TDX provides the page level on a Secure EPT violation.  Pass the page level
  around to the lower-level functions that need it.
* Even if the page is a large page in the host, only some sub-pages may be
  mapped at the EPT level.  In such cases, give up mapping a large page and
  step into the sub-page level, unlike the conventional EPT.
* When zapping an spte that maps a large page, split it and then zap, unlike
  the conventional EPT, because otherwise the protected page contents will be
  lost.

Merging small pages into a large page if possible
=================================================
On a normal EPT violation, after mapping the page, check whether the pages can
be merged into a large page.

TDX operation
=============
The following describes the TDX operation procedures.

* EPT violation trick
The conventional trick (zapping the EPT entry to trigger an EPT violation)
doesn't work for TDX.  For TDX, zapping a page loses the contents of the
protected page because the protected guest page is disassociated from the guest
TD.  Instead, TDX provides a different way to trigger an EPT violation without
losing the page contents, so that the VMM can detect guest TD activity:
blocking/unblocking a Secure-EPT entry with TDH.MEM.RANGE.BLOCK and
TDH.MEM.RANGE.UNBLOCK.  They correspond to clearing/setting the present bit in
an EPT entry with the page contents still kept.  With TDH.MEM.RANGE.BLOCK and a
TLB shoot down, the VMM can cause the guest TD to trigger an EPT violation.
After that, the VMM can unblock the entry with TDH.MEM.RANGE.UNBLOCK and resume
guest TD execution.  The procedure is as follows.

  - Block Secure-EPT entry by TDH.MEM.RANGE.BLOCK.
  - TLB shoot down.
  - Wait for guest TD to trigger EPT violation.
  - Unblock Secure-EPT entry by TDH.MEM.RANGE.UNBLOCK to resume the guest TD.

* Merging sub-pages into a large page
The following steps are needed.
- Ensure that all sub-pages are mapped.
- TLB shoot down.
- Merge sub-pages into a large page (TDH.MEM.PAGE.PROMOTE).
  This requires that all sub-pages are mapped.
- Cache flush the Secure EPT page that was used to map the sub-pages.
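
The precondition in the first step can be sketched as follows.  This is a
minimal userspace sketch, assuming a 2MB page made of 512 4KB sub-pages; the
array-of-bools representation and the name model_can_promote() are
hypothetical and are not taken from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SUBPAGES_PER_2M 512	/* 2MB / 4KB */

/*
 * Hypothetical check: merging is only attempted when every 4KB sub-page
 * of the 2MB range is already mapped.
 */
static bool model_can_promote(const bool mapped[MODEL_SUBPAGES_PER_2M])
{
	for (size_t i = 0; i < MODEL_SUBPAGES_PER_2M; i++) {
		if (!mapped[i])
			return false;
	}
	return true;
}
```

A single unmapped sub-page is enough to make the check fail, in which case the
merge would have to wait until that sub-page is mapped.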

Thanks,

Isaku Yamahata (4):
  KVM: x86/tdp_mmu: Allocate private page table for large page split
  KVM: x86/tdp_mmu: Try to merge pages into a large page
  KVM: x86/tdp_mmu: TDX: Implement merge pages into a large page
  KVM: x86/mmu: Make kvm fault handler aware of large page of private
    memslot

Xiaoyao Li (12):
  KVM: TDP_MMU: Go to next level if smaller private mapping exists
  KVM: TDX: Pass page level to cache flush before TDX SEAMCALL
  KVM: TDX: Pass KVM page level to tdh_mem_page_add() and
    tdh_mem_page_aug()
  KVM: TDX: Pass size to tdx_measure_page()
  KVM: TDX: Pass size to reclaim_page()
  KVM: TDX: Update tdx_sept_{set,drop}_private_spte() to support large
    page
  KVM: MMU: Introduce level info in PFERR code
  KVM: TDX: Pin pages via get_page() right before ADD/AUG'ed to TDs
  KVM: TDX: Pass desired page level in err code for page fault handler
  KVM: x86/tdp_mmu: Split the large page when zap leaf
  KVM: x86/tdp_mmu, TDX: Split a large page when 4KB page within it
    converted to shared
  KVM: TDX: Allow 2MB large page for TD GUEST

 arch/x86/include/asm/kvm-x86-ops.h |   3 +
 arch/x86/include/asm/kvm_host.h    |  11 ++
 arch/x86/kvm/Kconfig               |   1 +
 arch/x86/kvm/mmu/mmu.c             |  40 +++--
 arch/x86/kvm/mmu/mmu_internal.h    |  35 +++-
 arch/x86/kvm/mmu/tdp_iter.c        |  37 +++-
 arch/x86/kvm/mmu/tdp_iter.h        |   2 +
 arch/x86/kvm/mmu/tdp_mmu.c         | 279 ++++++++++++++++++++++++++---
 arch/x86/kvm/vmx/common.h          |   6 +-
 arch/x86/kvm/vmx/tdx.c             | 227 +++++++++++++++++------
 arch/x86/kvm/vmx/tdx_arch.h        |  21 +++
 arch/x86/kvm/vmx/tdx_errno.h       |   2 +
 arch/x86/kvm/vmx/tdx_ops.h         |  50 ++++--
 arch/x86/kvm/vmx/vmx.c             |   2 +-
 14 files changed, 601 insertions(+), 115 deletions(-)


base-commit: bfa3037d828050896ae52f6467b6ca2489ae6fb1
prerequisite-patch-id: 3bd3037b3803e2d84f0ef98bb6c678be44eddd08
prerequisite-patch-id: b474cbf4f0ea21cf945036271f5286017e0efc84
prerequisite-patch-id: bd96a89fafe51956a55fdfc08a3ea2a37a2e55e4
prerequisite-patch-id: f15d178f9000430e0089c546756ab1d8d29341a7
prerequisite-patch-id: 5b34829d7433fa81ed574d724ee476b9cc2e6a50
prerequisite-patch-id: bf75388851ee37a83b37bfa7cb0084f27301f6bc
prerequisite-patch-id: 9d77fb0e8ce8c8c21e22ff3f26bd168eb5446df0
prerequisite-patch-id: 7152514149d4b4525a0057e3460ff78861e162f5
prerequisite-patch-id: a1d688257a210564ebeb23b1eef4b9ad1f5d7be3
prerequisite-patch-id: 0b1e771c370a03e1588ed97ee77cb0493d9304f4
prerequisite-patch-id: 313219882d617e4d4cb226760d1f071f52b3f882
prerequisite-patch-id: a8ebe373e3913fd0e0a55c57f55690f432975ec0
prerequisite-patch-id: 8b06f2333214e355b145113e33c65ade85d7eac4
prerequisite-patch-id: e739dd58995d35b0f888d02a6bf4ea144476f264
prerequisite-patch-id: 0e93d19cb59f3a052a377a56ff0a4399046818aa
prerequisite-patch-id: 4e0839abbfb8885154e278b4b0071a760199ad46
prerequisite-patch-id: be193bb3393ad8a16ea376a530df20a145145259
prerequisite-patch-id: 301dbdf8448175ea609664c890a3694750ecf740
prerequisite-patch-id: ba8e6068bcef7865bb5523065e19edd49fbc02de
prerequisite-patch-id: 81b25d13169b3617c12992dce85613a2730b0e1b
prerequisite-patch-id: b4526dee5b5a95da0a13116ae0c73d4e69efa3c6
prerequisite-patch-id: 8c62bacc52a75d4a9038a3f597fe436c50e07de3
prerequisite-patch-id: 5618d2414a1ef641b4c247b5e28076f67a765b24
prerequisite-patch-id: 022b4620f6ff729eca842192259e986d126e7fa6
prerequisite-patch-id: 73ebc581a3ce9a51167785d273fe69406ccccaed
prerequisite-patch-id: 1225df90aeae430a74354bc5ad0ddf508d0707db
prerequisite-patch-id: 1e38df398ee370ad7e457f4890d6e4457e8a83fa
prerequisite-patch-id: b8812b613f5674351565ea28354e91a756efd56e
prerequisite-patch-id: e231eff2baba07c2de984dd6cf83ad1a31b792b8
prerequisite-patch-id: 4c3e874f5a81d8faa87f1552c4f66c335e51b10b
prerequisite-patch-id: fa77e23cb08f647a81c8a2d6e15b71d0d9d73d3f
prerequisite-patch-id: 358d933f6d6fafba8fcf363673e4aeaa3175bffa
prerequisite-patch-id: 4b529f51e850c2ae205ccebf06c506a2ceda2352
prerequisite-patch-id: e611ed11739866ed5863c10893447d18f7362793
prerequisite-patch-id: 8d3716956281a5bd4024343c7a6538c635bc4512
prerequisite-patch-id: 5c1099652396c3020b2af559ed2a12cf2725f2fe
prerequisite-patch-id: 554e6bd542b845c1a556f7da4db9c7ac33fe396e
prerequisite-patch-id: 38461b84a4c6292b81a97424f9834f693065c794
prerequisite-patch-id: 5d05b55188360da9737f9cf52a7b888b1393e03f
prerequisite-patch-id: c4b6a6cb6ecd44b4ccb4fd0bd29d3df14ad2df2d
prerequisite-patch-id: 3c93e412ef811eb92d0c9e7442108e57f4c0161d
prerequisite-patch-id: 144982ee3761b30264328bf97f75ad8511c92ef1
prerequisite-patch-id: 2e1bfaa6f636431c64be30567b6ab29612ab667b
prerequisite-patch-id: dbbafc93f22c632974ac4f0f7723dff031f58b44
prerequisite-patch-id: 23844e3aeb137c15225bd1e00e36ff3e28ecf3a4
prerequisite-patch-id: 1df0c588530996d9ed78592aef25a1c28290511d
prerequisite-patch-id: 676e4f00026f36d11a56a09306800f9bbdfdf418
prerequisite-patch-id: c1f6a4380640607966d2574d828e20444fdec82c
prerequisite-patch-id: 2d7d9e53916d8ae7098b81d16c37f8fa36d49ac0
prerequisite-patch-id: 4df02112a774adec078d579304355e665e812c97
prerequisite-patch-id: bf078bcc88a3fa417dcaa3ff284fd9b13dc3c88b
prerequisite-patch-id: 93919b210b5255c8225ba651b64f5a251674dacb
prerequisite-patch-id: 3986d23cd0b46ed5a836d91ff0578b4afd190e39
prerequisite-patch-id: 46449476658cfd8715ff04822508694f64f0e047
prerequisite-patch-id: c0d872fbfe9cf24cb69f93e4d84f39a1fc9cec2d
prerequisite-patch-id: d10a2f5ee80095ddd8ada0a5f524bbc50c2782a9
prerequisite-patch-id: 9612e4f0609b6680bf40c94cbf41f7898b7149b0
prerequisite-patch-id: aa6ebca29f326ee57123b49992584ac1e71cd0c1
prerequisite-patch-id: ebab5bff65b7583b9257849e93b67f71c964630b
prerequisite-patch-id: b0bf2eaba4e53f01e6316780b80cb1e29ac74ee0
prerequisite-patch-id: f4e97d679570433a549ec7c7a9ff87df57adc41c
prerequisite-patch-id: 13625ac5fc2522e74b1c1639ac511206b43256c7
prerequisite-patch-id: be4911c0d255be1706205f3b825630e14dec3398
prerequisite-patch-id: d95adb5a77af86847f0e20fd99f081db3d880827
prerequisite-patch-id: 4dd00540050377ff852c0a939682d5894513444c
prerequisite-patch-id: e21ebef42f8bd94ab61b74eb3d5fe633843e4c8d
prerequisite-patch-id: 2478168dda83b9aef49a4d9131107308f512e9df
prerequisite-patch-id: 6ea8a145d7db1e3287a1d26b63876b6d6b4ea2b5
prerequisite-patch-id: 241e8fd9a85fa5be259cd24b89d94e1555a825b3
prerequisite-patch-id: 1d447fc520ee61b7d136cbfede32a6343ecd8526
prerequisite-patch-id: 5f00989dd01aab4d7f49520347e7427506c40685
prerequisite-patch-id: 3759b94086e48ee7e4853c85af9ea0949b0b18c8
prerequisite-patch-id: 85c853ced4c8d23497b2ff5a31808da64c09798b
prerequisite-patch-id: 3d9be44c3c51aacf9038b143d55acade6e9de323
prerequisite-patch-id: c0f5a5252a1fdbae95b38e4fab27630cb812911e
prerequisite-patch-id: bfabdd06ab51e3eebed14762bb6bc3821bc17f9a
prerequisite-patch-id: 70d7f5c228a85077fa1f31c92e2857cec5687247
prerequisite-patch-id: ff4f6fbd4f9f14ed19e802bebad71f63a1aeb1b9
prerequisite-patch-id: 1f31c884a75ce61edadd32933e9c3f6b5a63c4af
prerequisite-patch-id: 10bdbd913c6b5ad27adabbdc21bb7b4c061b76be
prerequisite-patch-id: 5264da83b82ea2063556180f1789908e7d7bfe20
prerequisite-patch-id: 73063491abd192b08ea6f15d8fad2235646fb01e
prerequisite-patch-id: f80283dc2b5127bb5ab84863ac361269598a96d0
prerequisite-patch-id: 7fc7eea3ca9110f14ca7d07573df087de66ea567
prerequisite-patch-id: ca6b6209f762701a28d03f97a9d849b7194bd8f9
prerequisite-patch-id: 1f3dfb4bd31957ec8845205bb978a42ba38f23d5
prerequisite-patch-id: f8fd0c8378f1dfecc6d1f19735a7387e1a321feb
prerequisite-patch-id: 62fb4481c6d53b0cd695964c7aea05ab5672073f
prerequisite-patch-id: 530de5c34ee8c595960e370a5a55096d014c7cce
prerequisite-patch-id: c6b3e3ed6b86d3309fcd14b59ba5a946ba7038ad
prerequisite-patch-id: 09213e6b300e1390e5b97eec4e75d51242af5e06
prerequisite-patch-id: 367ec96f0c582eab2a6919dc7092cc543620b7ea
prerequisite-patch-id: 2d8eb76da2d77cf4d87b83cf5a4dff319aa8ac97
prerequisite-patch-id: 28e65c507a95947415faa47990764882736c67d7
prerequisite-patch-id: 57d920b1c15a2567736ce1ff5f1214f6d07ab32d
prerequisite-patch-id: 6e05c9b91082ac21b0149152f581b54fd4c0005f
prerequisite-patch-id: f4d423ed7166ae9e05d0eab7f1389c1f028c0593
prerequisite-patch-id: 77814640692b8e3eddcc239c0398388fe0aa10e7
prerequisite-patch-id: 9d6b4a62ebcddfb7f1fb702f9c0194e7e7ba87aa
prerequisite-patch-id: de149d85ce42652e66df50669584efb9cbe6d72b
prerequisite-patch-id: fa40e1fcec3342b520c8ebb2f6ed6a6ffe2f759b
prerequisite-patch-id: f06db517c56a36cac49beff9ff52b69199f33a13
prerequisite-patch-id: 9d644a738f3495192c808647bab0109cad18057a
prerequisite-patch-id: 94617faf0d666a146549c9d8e9936d61a1c8ba15
prerequisite-patch-id: bb093494e96b0c9797745fd430f4e1e1f570bf0b
prerequisite-patch-id: 37154e5673b94164fc805c9e2d1ba0b0669eeb74
prerequisite-patch-id: a509a202c02795fd8439a9feeefa1d118372a73f
prerequisite-patch-id: 1839ee0ce1af0fcc7eed4645c53b54310829890d
prerequisite-patch-id: 83efa503b3cd5b6ddae29e4a7e7144e29b82bfa1
prerequisite-patch-id: 56c1760c611de7fac797b3ebf260a1cc33ced72b
prerequisite-patch-id: d6bf265d34ca38d5b8d7c6344972e68e77ec3fcb
prerequisite-patch-id: 693bd8abe58facc5c1f4d6a2f0a2ad4d5ea86624
prerequisite-patch-id: 677d0e9b5f60aff0e474fa4b4079eb04ac78f244
prerequisite-patch-id: 34cd31ba63d23b556193abbbab3978e967a6b846
prerequisite-patch-id: 712b77a9c15e351a3d8815e699025e6deff71909
prerequisite-patch-id: 1e1fa8306edd877ce406fc1cf5b082b4717950cd
prerequisite-patch-id: d0e48a96c3d2a81b7f7298cbe46b96166fab16a5
prerequisite-patch-id: 4967cee4416a7680e536d93a27ccbcd75c3cdc2a
prerequisite-patch-id: f65606364c47459f0e60f84764203b25cef3897e
prerequisite-patch-id: 6301eb6d3877015632c6bd2fb4d4d11c9504ef30
prerequisite-patch-id: 86a5cd1f1a740f5a331fdefe6c1565ab5042624f
prerequisite-patch-id: c73eec4d8640da745c1bedf9be0557e469c39f86
prerequisite-patch-id: 8193b68fa56b6a3915aa31526c5129322985571e
prerequisite-patch-id: 859dd8422af62bd19d80016f320afb015ac7bc7d
prerequisite-patch-id: 7d633c1131be2afdd191691bd6c76a760651364f
prerequisite-patch-id: a377fa7042b735378a792e28f27b1cc43e11e722
prerequisite-patch-id: a9a34487c048d6c62bd48e2101756525fc58805f
prerequisite-patch-id: c43bb7a0638d6b08585c35896d4962ce19402aa9
prerequisite-patch-id: 023bc742cee689f6db55e0083dde2c44f3edbc77
prerequisite-patch-id: 56fab84ef1691ac9b9629c8ee7571479cbe981ea
prerequisite-patch-id: eb0163d5e486161b8e22991ce3087932794a4bcc
prerequisite-patch-id: 316db61898881a8fcd0b57aab6533e74422e2aca
prerequisite-patch-id: 0d55fc17e3d799c548f8579b01aec271ac579f1d
prerequisite-patch-id: 71d4e5a705089a386f42c6812093d55d0d9f44ec
prerequisite-patch-id: 0bbe899852935ed5b97eb40758706fd1e774e79b
prerequisite-patch-id: 830d049555ccbf661dac2bfe0825de2e6fffc19b
prerequisite-patch-id: 6e1403460c80e7d94f2e0401d878886228213d0a
prerequisite-patch-id: f01d987576425d116d39c3dc0a195fd814ed864a
prerequisite-patch-id: 497371f3094c51a7f2242581930b28d319a6dd93
prerequisite-patch-id: 0555a722fe13d69665526d85ec1dff477d57e572
prerequisite-patch-id: 62d2965e2392d7ab81e38fc5a43161a0238ec1d6
prerequisite-patch-id: fbdabf6fe4f5836f4e0015bd4809c4cf8defeae5
prerequisite-patch-id: d49c285afa37417a1eda4983c80cffecbebcb226
prerequisite-patch-id: 713afe4090cc6a5d03e0505798299e8fe10c67a8
prerequisite-patch-id: f4605a97b041af597616cc4e334b1ce2c96fda89
prerequisite-patch-id: 0220f1418bed8aacc8cf01793e0b4fbc658d9834
prerequisite-patch-id: 50f0f5ec56a8d625f2bbd2a1d9aed6dd19eb9e23
prerequisite-patch-id: f155615a1520b421844626f829bfeb6460838ee7
prerequisite-patch-id: 22d072c6b3af7d055b85108512214877e4792a9c
prerequisite-patch-id: c7c625c835a1590c2f695191f40efd753d415e06
prerequisite-patch-id: c202c186c8cf1813f495a28bc9d0dbeb11a8f26a
prerequisite-patch-id: 117ca93acf7e39b5801fec569c9abe6adbb7e500
prerequisite-patch-id: 852a1c72b7a52fbfd0bb14b56755a4fbf79a6ec1
prerequisite-patch-id: 8f3b3f7f3449f1da247ea89e2027e88065ba2a94
prerequisite-patch-id: 738de5d8a181e3c4763b69c09b70bb88e1e93f85
prerequisite-patch-id: fe791ffefa55b6c271c0cc3a1a658a3c5e3eda03
prerequisite-patch-id: abf39bd923baf524b1071ad00578ccd3471f3318
prerequisite-patch-id: d7a5b549d80f33d8d4a00673c827cffcbbba7d02
prerequisite-patch-id: 31f1343d72fec48432f71b23987d5ac482d4d0e3
prerequisite-patch-id: 561815c250fd5407f45f791fdb63d10b9b5293dc
prerequisite-patch-id: f37372210572bda07b63dce1dddf73922e7c0cb1
prerequisite-patch-id: 840f8215c2ae5264fc3038ff2c7ed83fe6c35fe1
prerequisite-patch-id: bea612f77b1faac60f2a581ef10f33a3312e8435
prerequisite-patch-id: 604c6430b0d02b780c1c8abf158ef99b8911c4ad
prerequisite-patch-id: cf30ebef3b11b8d6654e891f69bf3a1e81050c2a
prerequisite-patch-id: c7a8f9838ccc2d9c8301002ac2d0c5808e089acd
prerequisite-patch-id: 14b93caad07ac2c1098336d4a5fd5e0c7d208011
prerequisite-patch-id: 9f8ebe703f436be49f00ebf0ab13ee7329215923
prerequisite-patch-id: 6109b6886b4fb6c0241548b5639e48ff69f306c7
prerequisite-patch-id: 1318b00e6c377677044a1bb51ad0fe46ac2e2e3c
prerequisite-patch-id: e06f12602791f56bd0204097189151d19de05816
prerequisite-patch-id: 4e4ea15874d5788aea716da17202242355d70b60
prerequisite-patch-id: f5f508eb2ddef431ea6e2141c3a8092355e988b3
prerequisite-patch-id: 0f1d953414afda95278954d9cded2fdd64ed9089
prerequisite-patch-id: 45472dffa449747d7eb3513b365c78f44123732d
prerequisite-patch-id: e7e98c1056fc94ba4387a7ef0b49412958ffd68f
prerequisite-patch-id: 5b2530fd85758847f420540a5ffa8a0cd4e32eda
prerequisite-patch-id: ab980f9533bc7668f3a820103d9c0ec6866ea805
prerequisite-patch-id: e57e3302ce47641cd78db9705815dbf20169746a
prerequisite-patch-id: 6dae9a9de4ebaf8727f0f7f543d882735b180cc3
prerequisite-patch-id: c438562a90dd917471892d82f5257c71c1fbbb37
prerequisite-patch-id: e9ceb3031dec327501e012ca3b53fadfea8676b5
-- 
2.25.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH v4 01/16] KVM: TDP_MMU: Go to next level if smaller private mapping exists
  2023-07-25 22:23 [RFC PATCH v4 00/16] KVM TDX: TDP MMU: large page support isaku.yamahata
@ 2023-07-25 22:23 ` isaku.yamahata
  2023-09-05  8:10   ` Binbin Wu
  2023-07-25 22:23 ` [RFC PATCH v4 02/16] KVM: TDX: Pass page level to cache flush before TDX SEAMCALL isaku.yamahata
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 21+ messages in thread
From: isaku.yamahata @ 2023-07-25 22:23 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang, Xiaoyao Li

From: Xiaoyao Li <xiaoyao.li@intel.com>

A private page cannot be mapped as a large page if any smaller mapping exists.

KVM has to wait until all the not-yet-mapped smaller pages are mapped and
then promote them to a larger mapping.
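
The rule can be illustrated with a simplified userspace model.  It is only a
sketch of the idea, not the logic of disallowed_hugepage_adjust(): the types
and the name model_adjust_goal_level() are hypothetical, and the model assumes
that a present, non-leaf entry at the goal level means smaller mappings
already exist below it.

```c
#include <assert.h>
#include <stdbool.h>

enum model_level { MODEL_4K = 1, MODEL_2M = 2 };

/* Hypothetical, reduced view of an existing page table entry. */
struct model_spte {
	bool present;	/* entry is populated */
	bool leaf;	/* entry maps a page rather than a lower table */
};

/*
 * If the walk reaches the goal level and finds an entry that points to a
 * lower-level table, step the goal down one level instead of installing
 * a large mapping over the existing smaller ones.
 */
static int model_adjust_goal_level(int goal_level, int cur_level,
				   struct model_spte spte)
{
	if (cur_level > MODEL_4K && cur_level == goal_level &&
	    spte.present && !spte.leaf)
		return goal_level - 1;
	return goal_level;
}
```

With a 2MB goal and an existing page table at the 2MB level, the model falls
back to 4KB; an empty entry leaves the goal untouched.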

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 95ba78944712..a9f0f4ade2d0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1293,7 +1293,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) {
 		int r;
 
-		if (fault->nx_huge_page_workaround_enabled)
+		if (fault->nx_huge_page_workaround_enabled ||
+		    kvm_gfn_shared_mask(vcpu->kvm))
 			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
 
 		/*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH v4 02/16] KVM: TDX: Pass page level to cache flush before TDX SEAMCALL
  2023-07-25 22:23 [RFC PATCH v4 00/16] KVM TDX: TDP MMU: large page support isaku.yamahata
  2023-07-25 22:23 ` [RFC PATCH v4 01/16] KVM: TDP_MMU: Go to next level if smaller private mapping exists isaku.yamahata
@ 2023-07-25 22:23 ` isaku.yamahata
  2023-07-25 22:23 ` [RFC PATCH v4 03/16] KVM: TDX: Pass KVM page level to tdh_mem_page_add() and tdh_mem_page_aug() isaku.yamahata
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: isaku.yamahata @ 2023-07-25 22:23 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang, Xiaoyao Li

From: Xiaoyao Li <xiaoyao.li@intel.com>

tdh_mem_page_aug() will support 2MB large pages in the near future.  The cache
flush then also needs to cover 2MB instead of 4KB.  Introduce a helper function
that flushes the cache using page size info, in preparation for large pages.
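
As a rough illustration of how the flush size follows from the page level, the
userspace sketch below mirrors the usual x86 arithmetic of 9 address bits per
level on top of a 4KB base page.  The name model_hpage_size() and the constant
are stand-ins chosen for this example; the level numbering (1 = 4KB, 2 = 2MB,
3 = 1GB) is assumed to match enum pg_level.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* 4KB base page */

/* Size in bytes covered by one mapping at the given level (1-based). */
static inline uint64_t model_hpage_size(int level)
{
	return 1ULL << (MODEL_PAGE_SHIFT + (level - 1) * 9);
}
```

Under these assumptions, level 1 gives 4096 bytes and level 2 gives 2097152
bytes, which is the range a 2MB flush would need to cover.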

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx_ops.h | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index c7819abd61b0..3d0968c98437 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -6,6 +6,7 @@
 
 #include <linux/compiler.h>
 
+#include <asm/pgtable_types.h>
 #include <asm/archrandom.h>
 #include <asm/cacheflush.h>
 #include <asm/asm.h>
@@ -46,6 +47,11 @@ static inline u64 tdx_seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,
 void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_output *out);
 #endif
 
+static inline void tdx_clflush_page(hpa_t addr, enum pg_level level)
+{
+	clflush_cache_range(__va(addr), KVM_HPAGE_SIZE(level));
+}
+
 /*
  * TDX module acquires its internal lock for resources.  It doesn't spin to get
  * locks because of its restrictions of allowed execution time.  Instead, it
@@ -78,21 +84,21 @@ static inline u64 tdx_seamcall_sept(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,
 
 static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
 {
-	clflush_cache_range(__va(addr), PAGE_SIZE);
+	tdx_clflush_page(addr, PG_LEVEL_4K);
 	return tdx_seamcall(TDH_MNG_ADDCX, addr, tdr, 0, 0, NULL);
 }
 
 static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source,
 				   struct tdx_module_output *out)
 {
-	clflush_cache_range(__va(hpa), PAGE_SIZE);
+	tdx_clflush_page(hpa, PG_LEVEL_4K);
 	return tdx_seamcall_sept(TDH_MEM_PAGE_ADD, gpa, tdr, hpa, source, out);
 }
 
 static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
 				   struct tdx_module_output *out)
 {
-	clflush_cache_range(__va(page), PAGE_SIZE);
+	tdx_clflush_page(page, PG_LEVEL_4K);
 	return tdx_seamcall_sept(TDH_MEM_SEPT_ADD, gpa | level, tdr, page, 0, out);
 }
 
@@ -104,21 +110,21 @@ static inline u64 tdh_mem_sept_remove(hpa_t tdr, gpa_t gpa, int level,
 
 static inline u64 tdh_vp_addcx(hpa_t tdvpr, hpa_t addr)
 {
-	clflush_cache_range(__va(addr), PAGE_SIZE);
+	tdx_clflush_page(addr, PG_LEVEL_4K);
 	return tdx_seamcall(TDH_VP_ADDCX, addr, tdvpr, 0, 0, NULL);
 }
 
 static inline u64 tdh_mem_page_relocate(hpa_t tdr, gpa_t gpa, hpa_t hpa,
 					struct tdx_module_output *out)
 {
-	clflush_cache_range(__va(hpa), PAGE_SIZE);
+	tdx_clflush_page(hpa, PG_LEVEL_4K);
 	return tdx_seamcall(TDH_MEM_PAGE_RELOCATE, gpa, tdr, hpa, 0, out);
 }
 
 static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
 				   struct tdx_module_output *out)
 {
-	clflush_cache_range(__va(hpa), PAGE_SIZE);
+	tdx_clflush_page(hpa, PG_LEVEL_4K);
 	return tdx_seamcall_sept(TDH_MEM_PAGE_AUG, gpa, tdr, hpa, 0, out);
 }
 
@@ -135,13 +141,13 @@ static inline u64 tdh_mng_key_config(hpa_t tdr)
 
 static inline u64 tdh_mng_create(hpa_t tdr, int hkid)
 {
-	clflush_cache_range(__va(tdr), PAGE_SIZE);
+	tdx_clflush_page(tdr, PG_LEVEL_4K);
 	return tdx_seamcall(TDH_MNG_CREATE, tdr, hkid, 0, 0, NULL);
 }
 
 static inline u64 tdh_vp_create(hpa_t tdr, hpa_t tdvpr)
 {
-	clflush_cache_range(__va(tdvpr), PAGE_SIZE);
+	tdx_clflush_page(tdvpr, PG_LEVEL_4K);
 	return tdx_seamcall(TDH_VP_CREATE, tdvpr, tdr, 0, 0, NULL);
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH v4 03/16] KVM: TDX: Pass KVM page level to tdh_mem_page_add() and tdh_mem_page_aug()
  2023-07-25 22:23 [RFC PATCH v4 00/16] KVM TDX: TDP MMU: large page support isaku.yamahata
  2023-07-25 22:23 ` [RFC PATCH v4 01/16] KVM: TDP_MMU: Go to next level if smaller private mapping exists isaku.yamahata
  2023-07-25 22:23 ` [RFC PATCH v4 02/16] KVM: TDX: Pass page level to cache flush before TDX SEAMCALL isaku.yamahata
@ 2023-07-25 22:23 ` isaku.yamahata
  2023-07-25 22:23 ` [RFC PATCH v4 04/16] KVM: TDX: Pass size to tdx_measure_page() isaku.yamahata
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: isaku.yamahata @ 2023-07-25 22:23 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang, Xiaoyao Li

From: Xiaoyao Li <xiaoyao.li@intel.com>

Level info is needed in tdx_clflush_page() to generate the correct page
size.

Also pass the level info to the SEAMCALL explicitly instead of assuming
it is zero, so that it works naturally when 2MB support lands.
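
The level handling can be sketched in userspace as below.  The +1 conversion is
the one the diff adds as tdx_sept_level_to_pg_level(); the -1 direction is
inferred here as its inverse, and all model_* names are hypothetical.  The
sketch also assumes the GPA is aligned to the mapping size, so the low bits
are free to carry the level.

```c
#include <assert.h>
#include <stdint.h>

/* Inferred inverse of the conversion below: KVM level 1 (4KB) -> TDX level 0. */
static inline int model_pg_level_to_tdx_level(int pg_level)
{
	return pg_level - 1;
}

/* Same arithmetic as tdx_sept_level_to_pg_level() in the diff. */
static inline int model_tdx_level_to_pg_level(int tdx_level)
{
	return tdx_level + 1;
}

/* Combine an aligned GPA with the TDX level, as in "gpa | level". */
static inline uint64_t model_sept_gpa_arg(uint64_t gpa, int tdx_level)
{
	return gpa | (uint64_t)tdx_level;
}
```

For a 4KB mapping the TDX level is 0, so the combined argument equals the GPA,
which is why the earlier code could get away with passing the GPA alone.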

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c     |  7 ++++---
 arch/x86/kvm/vmx/tdx_ops.h | 19 ++++++++++++-------
 2 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1a8a3fa92303..f3a8ae3e81bd 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1365,6 +1365,7 @@ static void tdx_unpin(struct kvm *kvm, kvm_pfn_t pfn)
 static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 				     enum pg_level level, kvm_pfn_t pfn)
 {
+	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	hpa_t hpa = pfn_to_hpa(pfn);
 	gpa_t gpa = gfn_to_gpa(gfn);
@@ -1389,7 +1390,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 		if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
 			return -EINVAL;
 
-		err = tdh_mem_page_aug(kvm_tdx->tdr_pa, gpa, hpa, &out);
+		err = tdh_mem_page_aug(kvm_tdx->tdr_pa, gpa, tdx_level, hpa, &out);
 		if (unlikely(err == TDX_ERROR_SEPT_BUSY)) {
 			tdx_unpin(kvm, pfn);
 			return -EAGAIN;
@@ -1428,8 +1429,8 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	kvm_tdx->source_pa = INVALID_PAGE;
 
 	do {
-		err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, hpa, source_pa,
-				       &out);
+		err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, tdx_level, hpa,
+				       source_pa, &out);
 		/*
 		 * This path is executed during populating initial guest memory
 		 * image. i.e. before running any vcpu.  Race is rare.
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index 3d0968c98437..e3d7e19e5324 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -47,6 +47,11 @@ static inline u64 tdx_seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,
 void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_output *out);
 #endif
 
+static inline enum pg_level tdx_sept_level_to_pg_level(int tdx_level)
+{
+	return tdx_level + 1;
+}
+
 static inline void tdx_clflush_page(hpa_t addr, enum pg_level level)
 {
 	clflush_cache_range(__va(addr), KVM_HPAGE_SIZE(level));
@@ -88,11 +93,11 @@ static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
 	return tdx_seamcall(TDH_MNG_ADDCX, addr, tdr, 0, 0, NULL);
 }
 
-static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source,
-				   struct tdx_module_output *out)
+static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, int level, hpa_t hpa,
+				   hpa_t source, struct tdx_module_output *out)
 {
-	tdx_clflush_page(hpa, PG_LEVEL_4K);
-	return tdx_seamcall_sept(TDH_MEM_PAGE_ADD, gpa, tdr, hpa, source, out);
+	tdx_clflush_page(hpa, tdx_sept_level_to_pg_level(level));
+	return tdx_seamcall_sept(TDH_MEM_PAGE_ADD, gpa | level, tdr, hpa, source, out);
 }
 
 static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
@@ -121,11 +126,11 @@ static inline u64 tdh_mem_page_relocate(hpa_t tdr, gpa_t gpa, hpa_t hpa,
 	return tdx_seamcall(TDH_MEM_PAGE_RELOCATE, gpa, tdr, hpa, 0, out);
 }
 
-static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
+static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, int level, hpa_t hpa,
 				   struct tdx_module_output *out)
 {
-	tdx_clflush_page(hpa, PG_LEVEL_4K);
-	return tdx_seamcall_sept(TDH_MEM_PAGE_AUG, gpa, tdr, hpa, 0, out);
+	tdx_clflush_page(hpa, tdx_sept_level_to_pg_level(level));
+	return tdx_seamcall_sept(TDH_MEM_PAGE_AUG, gpa | level, tdr, hpa, 0, out);
 }
 
 static inline u64 tdh_mem_range_block(hpa_t tdr, gpa_t gpa, int level,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH v4 04/16] KVM: TDX: Pass size to tdx_measure_page()
  2023-07-25 22:23 [RFC PATCH v4 00/16] KVM TDX: TDP MMU: large page support isaku.yamahata
                   ` (2 preceding siblings ...)
  2023-07-25 22:23 ` [RFC PATCH v4 03/16] KVM: TDX: Pass KVM page level to tdh_mem_page_add() and tdh_mem_page_aug() isaku.yamahata
@ 2023-07-25 22:23 ` isaku.yamahata
  2023-07-25 22:23 ` [RFC PATCH v4 05/16] KVM: TDX: Pass size to reclaim_page() isaku.yamahata
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: isaku.yamahata @ 2023-07-25 22:23 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang, Xiaoyao Li

From: Xiaoyao Li <xiaoyao.li@intel.com>

Extend tdx_measure_page() to take size info so that it can measure a
large page as well.
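
To give a feel for the cost, the sketch below counts how many chunk-sized
measurement calls the loop would make for a given size.  It assumes a chunk
size of 256 bytes for TDX_EXTENDMR_CHUNKSIZE, which is not shown in this patch;
the constant and the function name are stand-ins for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_EXTENDMR_CHUNKSIZE 256	/* assumed chunk size in bytes */

/* Count loop iterations, one per chunk, the same way the patched loop steps. */
static size_t model_count_extend_calls(size_t size)
{
	size_t calls = 0;

	for (size_t i = 0; i < size; i += MODEL_EXTENDMR_CHUNKSIZE)
		calls++;
	return calls;
}
```

Under that assumption, a 4KB page takes 16 calls and a 2MB page takes 8192, so
measuring a large page is 512 times the work of measuring a small one.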

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f3a8ae3e81bd..3522ee232eda 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1340,13 +1340,15 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
 }
 
-static void tdx_measure_page(struct kvm_tdx *kvm_tdx, hpa_t gpa)
+static void tdx_measure_page(struct kvm_tdx *kvm_tdx, hpa_t gpa, int size)
 {
 	struct tdx_module_output out;
 	u64 err;
 	int i;
 
-	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
+	WARN_ON_ONCE(size % TDX_EXTENDMR_CHUNKSIZE);
+
+	for (i = 0; i < size; i += TDX_EXTENDMR_CHUNKSIZE) {
 		err = tdh_mr_extend(kvm_tdx->tdr_pa, gpa + i, &out);
 		if (KVM_BUG_ON(err, &kvm_tdx->kvm)) {
 			pr_tdx_error(TDH_MR_EXTEND, err, &out);
@@ -1441,7 +1443,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 		tdx_unpin(kvm, pfn);
 		return -EIO;
 	} else if (measure)
-		tdx_measure_page(kvm_tdx, gpa);
+		tdx_measure_page(kvm_tdx, gpa, KVM_HPAGE_SIZE(level));
 
 	return 0;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH v4 05/16] KVM: TDX: Pass size to reclaim_page()
  2023-07-25 22:23 [RFC PATCH v4 00/16] KVM TDX: TDP MMU: large page support isaku.yamahata
                   ` (3 preceding siblings ...)
  2023-07-25 22:23 ` [RFC PATCH v4 04/16] KVM: TDX: Pass size to tdx_measure_page() isaku.yamahata
@ 2023-07-25 22:23 ` isaku.yamahata
  2023-09-06  1:48   ` Binbin Wu
  2023-07-25 22:23 ` [RFC PATCH v4 06/16] KVM: TDX: Update tdx_sept_{set,drop}_private_spte() to support large page isaku.yamahata
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 21+ messages in thread
From: isaku.yamahata @ 2023-07-25 22:23 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang, Xiaoyao Li

From: Xiaoyao Li <xiaoyao.li@intel.com>

A 2MB large page can be tdh_mem_page_aug()'ed to a TD directly. In this case,
the page needs to be reclaimed and cleared at 2MB size.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3522ee232eda..86cfbf435671 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -198,12 +198,13 @@ static void tdx_disassociate_vp_on_cpu(struct kvm_vcpu *vcpu)
 	smp_call_function_single(cpu, tdx_disassociate_vp_arg, vcpu, 1);
 }
 
-static void tdx_clear_page(unsigned long page_pa)
+static void tdx_clear_page(unsigned long page_pa, int size)
 {
 	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
 	void *page = __va(page_pa);
 	unsigned long i;
 
+	WARN_ON_ONCE(size % PAGE_SIZE);
 	/*
 	 * When re-assign one page from old keyid to a new keyid, MOVDIR64B is
 	 * required to clear/write the page with new keyid to prevent integrity
@@ -212,7 +213,7 @@ static void tdx_clear_page(unsigned long page_pa)
 	 * clflush doesn't flush cache with HKID set.  The cache line could be
 	 * poisoned (even without MKTME-i), clear the poison bit.
 	 */
-	for (i = 0; i < PAGE_SIZE; i += 64)
+	for (i = 0; i < size; i += 64)
 		movdir64b(page + i, zero_page);
 	/*
 	 * MOVDIR64B store uses WC buffer.  Prevent following memory reads
@@ -221,7 +222,8 @@ static void tdx_clear_page(unsigned long page_pa)
 	__mb();
 }
 
-static int tdx_reclaim_page(hpa_t pa, bool do_wb, u16 hkid)
+static int tdx_reclaim_page(hpa_t pa, enum pg_level level,
+			    bool do_wb, u16 hkid)
 {
 	struct tdx_module_output out;
 	u64 err;
@@ -239,8 +241,10 @@ static int tdx_reclaim_page(hpa_t pa, bool do_wb, u16 hkid)
 		pr_tdx_error(TDH_PHYMEM_PAGE_RECLAIM, err, &out);
 		return -EIO;
 	}
+	/* out.r8 == tdx sept page level */
+	WARN_ON_ONCE(out.r8 != pg_level_to_tdx_sept_level(level));
 
-	if (do_wb) {
+	if (do_wb && level == PG_LEVEL_4K) {
 		/*
 		 * Only TDR page gets into this path.  No contention is expected
 		 * because of the last page of TD.
@@ -252,7 +256,7 @@ static int tdx_reclaim_page(hpa_t pa, bool do_wb, u16 hkid)
 		}
 	}
 
-	tdx_clear_page(pa);
+	tdx_clear_page(pa, KVM_HPAGE_SIZE(level));
 	return 0;
 }
 
@@ -266,7 +270,7 @@ static void tdx_reclaim_td_page(unsigned long td_page_pa)
 	 * was already flushed by TDH.PHYMEM.CACHE.WB before here, So
 	 * cache doesn't need to be flushed again.
 	 */
-	if (tdx_reclaim_page(td_page_pa, false, 0))
+	if (tdx_reclaim_page(td_page_pa, PG_LEVEL_4K, false, 0))
 		/*
 		 * Leak the page on failure:
 		 * tdx_reclaim_page() returns an error if and only if there's an
@@ -474,7 +478,7 @@ void tdx_vm_free(struct kvm *kvm)
 	 * while operating on TD (Especially reclaiming TDCS).  Cache flush with
 	 * TDX global HKID is needed.
 	 */
-	if (tdx_reclaim_page(kvm_tdx->tdr_pa, true, tdx_global_keyid))
+	if (tdx_reclaim_page(kvm_tdx->tdr_pa, PG_LEVEL_4K, true, tdx_global_keyid))
 		return;
 
 	free_page((unsigned long)__va(kvm_tdx->tdr_pa));
@@ -1468,7 +1472,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 		 * The HKID assigned to this TD was already freed and cache
 		 * was already flushed. We don't have to flush again.
 		 */
-		err = tdx_reclaim_page(hpa, false, 0);
+		err = tdx_reclaim_page(hpa, level, false, 0);
 		if (KVM_BUG_ON(err, kvm))
 			return -EIO;
 		tdx_unpin(kvm, pfn);
@@ -1501,7 +1505,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
 		return -EIO;
 	}
-	tdx_clear_page(hpa);
+	tdx_clear_page(hpa, PAGE_SIZE);
 	tdx_unpin(kvm, pfn);
 	return 0;
 }
@@ -1612,7 +1616,7 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
 	 * already flushed. We don't have to flush again.
 	 */
 	if (!is_hkid_assigned(kvm_tdx))
-		return tdx_reclaim_page(__pa(private_spt), false, 0);
+		return tdx_reclaim_page(__pa(private_spt), PG_LEVEL_4K, false, 0);
 
 	/*
 	 * free_private_spt() is (obviously) called when a shadow page is being
-- 
2.25.1



* [RFC PATCH v4 06/16] KVM: TDX: Update tdx_sept_{set,drop}_private_spte() to support large page
  2023-07-25 22:23 [RFC PATCH v4 00/16] KVM TDX: TDP MMU: large page support isaku.yamahata
                   ` (4 preceding siblings ...)
  2023-07-25 22:23 ` [RFC PATCH v4 05/16] KVM: TDX: Pass size to reclaim_page() isaku.yamahata
@ 2023-07-25 22:23 ` isaku.yamahata
  2023-07-25 22:23 ` [RFC PATCH v4 07/16] KVM: MMU: Introduce level info in PFERR code isaku.yamahata
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: isaku.yamahata @ 2023-07-25 22:23 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang, Xiaoyao Li

From: Xiaoyao Li <xiaoyao.li@intel.com>

Allow large page level AUG and REMOVE for TDX pages.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
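Reader's note, not part of the patch: the pin/unpin symmetry that the diff
below relies on can be sketched in user space as follows.  It assumes one
reference per 4KB page and 512 4KB pages per 2MB mapping.  The refcount
array and the names pages_per_hpage(), pin_range() and unpin_range() are
hypothetical stand-ins for struct page and get_page()/put_page().

```c
#include <assert.h>
#include <stddef.h>

/* Number of 4KB pages covered by one mapping at `level` (1 = 4KB, 2 = 2MB). */
static size_t pages_per_hpage(int level)
{
	assert(level >= 1 && level <= 3);
	return (size_t)1 << (9 * (level - 1));
}

/* Take one reference on every 4KB page backing the mapping. */
static void pin_range(int *refcount, size_t pfn, int level)
{
	for (size_t i = 0; i < pages_per_hpage(level); i++)
		refcount[pfn + i]++;
}

/* Drop the references taken by pin_range() for the same pfn and level. */
static void unpin_range(int *refcount, size_t pfn, int level)
{
	for (size_t i = 0; i < pages_per_hpage(level); i++)
		refcount[pfn + i]--;
}
```

The point of the sketch is that every error path must unpin with the same
level that was used for pinning, otherwise references leak or underflow.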
 arch/x86/kvm/vmx/tdx.c | 66 ++++++++++++++++++++++--------------------
 1 file changed, 34 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 86cfbf435671..9d6c951bb625 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1361,11 +1361,12 @@ static void tdx_measure_page(struct kvm_tdx *kvm_tdx, hpa_t gpa, int size)
 	}
 }
 
-static void tdx_unpin(struct kvm *kvm, kvm_pfn_t pfn)
+static void tdx_unpin(struct kvm *kvm, kvm_pfn_t pfn, int level)
 {
-	struct page *page = pfn_to_page(pfn);
+	int i;
 
-	put_page(page);
+	for (i = 0; i < KVM_PAGES_PER_HPAGE(level); i++)
+		put_page(pfn_to_page(pfn + i));
 }
 
 static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
@@ -1379,6 +1380,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	hpa_t source_pa;
 	bool measure;
 	u64 err;
+	int i;
 
 	/*
 	 * Because restricted mem doesn't support page migration with
@@ -1388,22 +1390,19 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	 * TODO: Once restricted mem introduces callback on page migration,
 	 * implement it and remove get_page/put_page().
 	 */
-	get_page(pfn_to_page(pfn));
+	for (i = 0; i < KVM_PAGES_PER_HPAGE(level); i++)
+		get_page(pfn_to_page(pfn + i));
 
 	/* Build-time faults are induced and handled via TDH_MEM_PAGE_ADD. */
 	if (likely(is_td_finalized(kvm_tdx))) {
-		/* TODO: handle large pages. */
-		if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
-			return -EINVAL;
-
 		err = tdh_mem_page_aug(kvm_tdx->tdr_pa, gpa, tdx_level, hpa, &out);
 		if (unlikely(err == TDX_ERROR_SEPT_BUSY)) {
-			tdx_unpin(kvm, pfn);
+			tdx_unpin(kvm, pfn, level);
 			return -EAGAIN;
 		}
 		if (KVM_BUG_ON(err, kvm)) {
 			pr_tdx_error(TDH_MEM_PAGE_AUG, err, &out);
-			tdx_unpin(kvm, pfn);
+			tdx_unpin(kvm, pfn, level);
 			return -EIO;
 		}
 		return 0;
@@ -1426,7 +1425,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	 * always uses vcpu 0's page table and protected by vcpu->mutex).
 	 */
 	if (KVM_BUG_ON(kvm_tdx->source_pa == INVALID_PAGE, kvm)) {
-		tdx_unpin(kvm, pfn);
+		tdx_unpin(kvm, pfn, level);
 		return -EINVAL;
 	}
 
@@ -1444,7 +1443,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	} while (unlikely(err == TDX_ERROR_SEPT_BUSY));
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error(TDH_MEM_PAGE_ADD, err, &out);
-		tdx_unpin(kvm, pfn);
+		tdx_unpin(kvm, pfn, level);
 		return -EIO;
 	} else if (measure)
 		tdx_measure_page(kvm_tdx, gpa, KVM_HPAGE_SIZE(level));
@@ -1461,11 +1460,9 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 	gpa_t gpa = gfn_to_gpa(gfn);
 	hpa_t hpa = pfn_to_hpa(pfn);
 	hpa_t hpa_with_hkid;
+	int r = 0;
 	u64 err;
-
-	/* TODO: handle large pages. */
-	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
-		return -EINVAL;
+	int i;
 
 	if (unlikely(!is_hkid_assigned(kvm_tdx))) {
 		/*
@@ -1475,7 +1472,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 		err = tdx_reclaim_page(hpa, level, false, 0);
 		if (KVM_BUG_ON(err, kvm))
 			return -EIO;
-		tdx_unpin(kvm, pfn);
+		tdx_unpin(kvm, pfn, level);
 		return 0;
 	}
 
@@ -1492,22 +1489,27 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 		return -EIO;
 	}
 
-	hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
-	do {
-		/*
-		 * TDX_OPERAND_BUSY can happen on locking PAMT entry.  Because
-		 * this page was removed above, other thread shouldn't be
-		 * repeatedly operating on this page.  Just retry loop.
-		 */
-		err = tdh_phymem_page_wbinvd(hpa_with_hkid);
-	} while (unlikely(err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX)));
-	if (KVM_BUG_ON(err, kvm)) {
-		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
-		return -EIO;
+	for (i = 0; i < KVM_PAGES_PER_HPAGE(level); i++) {
+		hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
+		do {
+			/*
+			 * TDX_OPERAND_BUSY can happen on locking PAMT entry.
+			 * Because this page was removed above, other thread
+			 * shouldn't be repeatedly operating on this page.
+			 * Simple retry should work.
+			 */
+			err = tdh_phymem_page_wbinvd(hpa_with_hkid);
+		} while (unlikely(err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX)));
+		if (KVM_BUG_ON(err, kvm)) {
+			pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
+			r = -EIO;
+		} else {
+			tdx_clear_page(hpa, PAGE_SIZE);
+			tdx_unpin(kvm, pfn + i, PG_LEVEL_4K);
+		}
+		hpa += PAGE_SIZE;
 	}
-	tdx_clear_page(hpa, PAGE_SIZE);
-	tdx_unpin(kvm, pfn);
-	return 0;
+	return r;
 }
 
 static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
-- 
2.25.1



* [RFC PATCH v4 07/16] KVM: MMU: Introduce level info in PFERR code
  2023-07-25 22:23 [RFC PATCH v4 00/16] KVM TDX: TDP MMU: large page support isaku.yamahata
                   ` (5 preceding siblings ...)
  2023-07-25 22:23 ` [RFC PATCH v4 06/16] KVM: TDX: Update tdx_sept_{set,drop}_private_spte() to support large page isaku.yamahata
@ 2023-07-25 22:23 ` isaku.yamahata
  2023-07-25 22:23 ` [RFC PATCH v4 08/16] KVM: TDX: Pin pages via get_page() right before ADD/AUG'ed to TDs isaku.yamahata
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: isaku.yamahata @ 2023-07-25 22:23 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang, Xiaoyao Li

From: Xiaoyao Li <xiaoyao.li@intel.com>

For TDX, an EPT violation can happen when the TD guest executes
TDG.MEM.PAGE.ACCEPT, and TDG.MEM.PAGE.ACCEPT carries the page level that the
TD guest wants to accept.

1. KVM can map the page as 4KB while the TD guest wants to accept 2MB.

  The TD guest gets TDX_PAGE_SIZE_MISMATCH and should retry the accept at
  4KB size.

2. KVM can map the page as 2MB while the TD guest wants to accept 4KB.

  KVM needs to honor the guest's request because
  a) there is no way to tell the guest that KVM maps it at 2MB size, and
  b) the guest accepts it at 4KB size because it knows that some other 4KB
     page in the same 2MB range will be used as a shared page.

For case 2, the desired page level needs to be passed to the MMU's page
fault handler.  Use bits 29:31 of the KVM PF error code for this purpose.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
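Reader's note, not part of the patch: packing a page level into bits 29:31
of a 64-bit error code, and clamping the maximum mapping level with it, can
be sketched as below.  This is a user-space sketch with hypothetical names
(encode_level(), decode_level(), clamp_max_level()); it mirrors the intent
of PFERR_LEVEL_MASK but is not the kernel code.  The cast to uint64_t before
shifting is a choice made in this sketch so that no value is shifted into
the sign bit of an int.

```c
#include <assert.h>
#include <stdint.h>

#define LEVEL_START_BIT	29
#define LEVEL_END_BIT	31
/* Bits 31:29 set, i.e. 0xe0000000. */
#define LEVEL_MASK \
	(((UINT64_C(1) << (LEVEL_END_BIT - LEVEL_START_BIT + 1)) - 1) << LEVEL_START_BIT)

/* Store a non-zero level in bits 29:31; level 0 means "no level requested". */
static uint64_t encode_level(uint64_t error_code, int level)
{
	assert(level >= 0 && level <= 7);
	if (level > 0)
		error_code |= ((uint64_t)level << LEVEL_START_BIT) & LEVEL_MASK;
	return error_code;
}

/* Extract the level from bits 29:31; returns 0 if none was encoded. */
static int decode_level(uint64_t error_code)
{
	return (int)((error_code & LEVEL_MASK) >> LEVEL_START_BIT);
}

/* Lower max_level to the requested level, if one was requested. */
static int clamp_max_level(int max_level, uint64_t error_code)
{
	int req = decode_level(error_code);

	return (req && req < max_level) ? req : max_level;
}
```

With this encoding, a guest request for 4KB (level 1) lowers a 2MB maximum
to 4KB, while an error code without a level leaves the maximum unchanged.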
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/mmu/mmu.c          |  5 +++++
 arch/x86/kvm/vmx/common.h       |  6 +++++-
 arch/x86/kvm/vmx/tdx.c          | 15 ++++++++++++++-
 arch/x86/kvm/vmx/tdx.h          | 19 +++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c          |  2 +-
 6 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 304c01945115..2326e48a8fcb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -253,6 +253,8 @@ enum x86_intercept_stage;
 #define PFERR_FETCH_BIT 4
 #define PFERR_PK_BIT 5
 #define PFERR_SGX_BIT 15
+#define PFERR_LEVEL_START_BIT 29
+#define PFERR_LEVEL_END_BIT 31
 #define PFERR_GUEST_FINAL_BIT 32
 #define PFERR_GUEST_PAGE_BIT 33
 #define PFERR_GUEST_ENC_BIT 34
@@ -265,6 +267,7 @@ enum x86_intercept_stage;
 #define PFERR_FETCH_MASK	BIT(PFERR_FETCH_BIT)
 #define PFERR_PK_MASK		BIT(PFERR_PK_BIT)
 #define PFERR_SGX_MASK		BIT(PFERR_SGX_BIT)
+#define PFERR_LEVEL_MASK	GENMASK_ULL(PFERR_LEVEL_END_BIT, PFERR_LEVEL_START_BIT)
 #define PFERR_GUEST_FINAL_MASK	BIT_ULL(PFERR_GUEST_FINAL_BIT)
 #define PFERR_GUEST_PAGE_MASK	BIT_ULL(PFERR_GUEST_PAGE_BIT)
 #define PFERR_GUEST_ENC_MASK	BIT_ULL(PFERR_GUEST_ENC_BIT)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2b65e8beee17..bf4f23129ad0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4642,6 +4642,11 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
+	u8 err_level = (fault->error_code & PFERR_LEVEL_MASK) >> PFERR_LEVEL_START_BIT;
+
+	if (err_level)
+		fault->max_level = min(fault->max_level, err_level);
+
 	/*
 	 * If the guest's MTRRs may be used to compute the "real" memtype,
 	 * restrict the mapping level to ensure KVM uses a consistent memtype
diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index e4fec792a3ae..90db5ca45925 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -67,7 +67,8 @@ static inline void vmx_handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
 }
 
 static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
-					     unsigned long exit_qualification)
+					     unsigned long exit_qualification,
+					     int err_page_level)
 {
 	u64 error_code;
 
@@ -90,6 +91,9 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
 	if (kvm_is_private_gpa(vcpu->kvm, gpa))
 		error_code |= PFERR_GUEST_ENC_MASK;
 
+	if (err_page_level > 0)
+		error_code |= (err_page_level << PFERR_LEVEL_START_BIT) & PFERR_LEVEL_MASK;
+
 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 9d6c951bb625..c122160142fd 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1679,7 +1679,20 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 
 static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 {
+	union tdx_ext_exit_qualification ext_exit_qual;
 	unsigned long exit_qual;
+	int err_page_level = 0;
+
+	ext_exit_qual.full = tdexit_ext_exit_qual(vcpu);
+
+	if (ext_exit_qual.type >= NUM_EXT_EXIT_QUAL) {
+		pr_err("EPT violation at gpa 0x%lx, with invalid ext exit qualification type 0x%x\n",
+			tdexit_gpa(vcpu), ext_exit_qual.type);
+		kvm_vm_bugged(vcpu->kvm);
+		return 0;
+	} else if (ext_exit_qual.type == EXT_EXIT_QUAL_ACCEPT) {
+		err_page_level = tdx_sept_level_to_pg_level(ext_exit_qual.req_sept_level);
+	}
 
 	if (kvm_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu))) {
 		/*
@@ -1706,7 +1719,7 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 	}
 
 	trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
-	return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
+	return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual, err_page_level);
 }
 
 static int tdx_handle_ept_misconfig(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index aff740a775bd..117a81d69cb4 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -72,6 +72,25 @@ union tdx_exit_reason {
 	u64 full;
 };
 
+union tdx_ext_exit_qualification {
+	struct {
+		u64 type		: 4;
+		u64 reserved0		: 28;
+		u64 req_sept_level	: 3;
+		u64 err_sept_level	: 3;
+		u64 err_sept_state	: 8;
+		u64 err_sept_is_leaf	: 1;
+		u64 reserved1		: 17;
+	};
+	u64 full;
+};
+
+enum tdx_ext_exit_qualification_type {
+	EXT_EXIT_QUAL_NONE,
+	EXT_EXIT_QUAL_ACCEPT,
+	NUM_EXT_EXIT_QUAL,
+};
+
 struct vcpu_tdx {
 	struct kvm_vcpu	vcpu;
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index cc0234fed7b5..ac36c3618325 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5725,7 +5725,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
 		return kvm_emulate_instruction(vcpu, 0);
 
-	return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
+	return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification, 0);
 }
 
 static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
-- 
2.25.1



* [RFC PATCH v4 08/16] KVM: TDX: Pin pages via get_page() right before ADD/AUG'ed to TDs
  2023-07-25 22:23 [RFC PATCH v4 00/16] KVM TDX: TDP MMU: large page support isaku.yamahata
                   ` (6 preceding siblings ...)
  2023-07-25 22:23 ` [RFC PATCH v4 07/16] KVM: MMU: Introduce level info in PFERR code isaku.yamahata
@ 2023-07-25 22:23 ` isaku.yamahata
  2023-09-07  5:26   ` Binbin Wu
  2023-07-25 22:23 ` [RFC PATCH v4 09/16] KVM: TDX: Pass desired page level in err code for page fault handler isaku.yamahata
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 21+ messages in thread
From: isaku.yamahata @ 2023-07-25 22:23 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang, Xiaoyao Li

From: Xiaoyao Li <xiaoyao.li@intel.com>

When kvm_faultin_pfn() runs, it has no information about the page level at
which the gfn will be mapped.  Hence it cannot know whether to pin a 4K page
or a 2M page.

Move the pinning of guest private pages to right before
TDH_MEM_PAGE_ADD/AUG(), since the page level is known at that point.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c122160142fd..bd1582e6b693 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1361,7 +1361,8 @@ static void tdx_measure_page(struct kvm_tdx *kvm_tdx, hpa_t gpa, int size)
 	}
 }
 
-static void tdx_unpin(struct kvm *kvm, kvm_pfn_t pfn, int level)
+static void tdx_unpin(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
+		      enum pg_level level)
 {
 	int i;
 
@@ -1397,12 +1398,12 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (likely(is_td_finalized(kvm_tdx))) {
 		err = tdh_mem_page_aug(kvm_tdx->tdr_pa, gpa, tdx_level, hpa, &out);
 		if (unlikely(err == TDX_ERROR_SEPT_BUSY)) {
-			tdx_unpin(kvm, pfn, level);
+			tdx_unpin(kvm, gfn, pfn, level);
 			return -EAGAIN;
 		}
 		if (KVM_BUG_ON(err, kvm)) {
 			pr_tdx_error(TDH_MEM_PAGE_AUG, err, &out);
-			tdx_unpin(kvm, pfn, level);
+			tdx_unpin(kvm, gfn, pfn, level);
 			return -EIO;
 		}
 		return 0;
@@ -1425,7 +1426,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	 * always uses vcpu 0's page table and protected by vcpu->mutex).
 	 */
 	if (KVM_BUG_ON(kvm_tdx->source_pa == INVALID_PAGE, kvm)) {
-		tdx_unpin(kvm, pfn, level);
+		tdx_unpin(kvm, gfn, pfn, level);
 		return -EINVAL;
 	}
 
@@ -1443,7 +1444,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	} while (unlikely(err == TDX_ERROR_SEPT_BUSY));
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error(TDH_MEM_PAGE_ADD, err, &out);
-		tdx_unpin(kvm, pfn, level);
+		tdx_unpin(kvm, gfn, pfn, level);
 		return -EIO;
 	} else if (measure)
 		tdx_measure_page(kvm_tdx, gpa, KVM_HPAGE_SIZE(level));
@@ -1472,7 +1473,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 		err = tdx_reclaim_page(hpa, level, false, 0);
 		if (KVM_BUG_ON(err, kvm))
 			return -EIO;
-		tdx_unpin(kvm, pfn, level);
+		tdx_unpin(kvm, gfn, pfn, level);
 		return 0;
 	}
 
@@ -1505,7 +1506,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 			r = -EIO;
 		} else {
 			tdx_clear_page(hpa, PAGE_SIZE);
-			tdx_unpin(kvm, pfn + i, PG_LEVEL_4K);
+			tdx_unpin(kvm, gfn + i, pfn + i, PG_LEVEL_4K);
 		}
 		hpa += PAGE_SIZE;
 	}
-- 
2.25.1



* [RFC PATCH v4 09/16] KVM: TDX: Pass desired page level in err code for page fault handler
  2023-07-25 22:23 [RFC PATCH v4 00/16] KVM TDX: TDP MMU: large page support isaku.yamahata
                   ` (7 preceding siblings ...)
  2023-07-25 22:23 ` [RFC PATCH v4 08/16] KVM: TDX: Pin pages via get_page() right before ADD/AUG'ed to TDs isaku.yamahata
@ 2023-07-25 22:23 ` isaku.yamahata
  2023-07-25 22:23 ` [RFC PATCH v4 10/16] KVM: x86/tdp_mmu: Allocate private page table for large page split isaku.yamahata
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: isaku.yamahata @ 2023-07-25 22:23 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang, Xiaoyao Li

From: Xiaoyao Li <xiaoyao.li@intel.com>

For TDX, an EPT violation can happen when the TD guest executes
TDG.MEM.PAGE.ACCEPT, and TDG.MEM.PAGE.ACCEPT carries the page level that the
TD guest wants to accept.

1. KVM can map the page as 4KB while the TD guest wants to accept 2MB.

  The TD guest gets TDX_PAGE_SIZE_MISMATCH and should retry the accept at
  4KB size.

2. KVM can map the page as 2MB while the TD guest wants to accept 4KB.

  KVM needs to honor the guest's request because
  a) there is no way to tell the guest that KVM maps it at 2MB size, and
  b) the guest accepts it at 4KB size because it knows that some other 4KB
     page in the same 2MB range will be used as a shared page.

For case 2, the desired page level needs to be passed to the MMU's page
fault handler.  Use bits 29:31 of the KVM PF error code for this purpose.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/vmx/common.h       |  2 +-
 arch/x86/kvm/vmx/tdx.c          |  7 ++++++-
 arch/x86/kvm/vmx/tdx.h          | 19 -------------------
 arch/x86/kvm/vmx/tdx_arch.h     | 19 +++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c          |  2 +-
 6 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2326e48a8fcb..97c9a0d5a9e3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -277,6 +277,8 @@ enum x86_intercept_stage;
 				 PFERR_WRITE_MASK |		\
 				 PFERR_PRESENT_MASK)
 
+#define PFERR_LEVEL(err_code)	(((err_code) & PFERR_LEVEL_MASK) >> PFERR_LEVEL_START_BIT)
+
 /* apic attention bits */
 #define KVM_APIC_CHECK_VAPIC	0
 /*
diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 90db5ca45925..5ffcd4c64053 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -91,7 +91,7 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
 	if (kvm_is_private_gpa(vcpu->kvm, gpa))
 		error_code |= PFERR_GUEST_ENC_MASK;
 
-	if (err_page_level > 0)
+	if (err_page_level > PG_LEVEL_NONE)
 		error_code |= (err_page_level << PFERR_LEVEL_START_BIT) & PFERR_LEVEL_MASK;
 
 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index bd1582e6b693..d6d5a9020f99 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2596,6 +2596,7 @@ static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	struct kvm_tdx_init_mem_region region;
 	struct kvm_vcpu *vcpu;
 	struct page *page;
+	u64 error_code;
 	int idx, ret = 0;
 	bool added = false;
 
@@ -2652,7 +2653,11 @@ static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 		kvm_tdx->source_pa = pfn_to_hpa(page_to_pfn(page)) |
 				     (cmd->flags & KVM_TDX_MEASURE_MEMORY_REGION);
 
-		ret = kvm_mmu_map_tdp_page(vcpu, region.gpa, TDX_SEPT_PFERR,
+		/* TODO: large page support. */
+		error_code = TDX_SEPT_PFERR;
+		error_code |= (PG_LEVEL_4K << PFERR_LEVEL_START_BIT) &
+			PFERR_LEVEL_MASK;
+		ret = kvm_mmu_map_tdp_page(vcpu, region.gpa, error_code,
 					   PG_LEVEL_4K);
 		put_page(page);
 		if (ret)
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 117a81d69cb4..aff740a775bd 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -72,25 +72,6 @@ union tdx_exit_reason {
 	u64 full;
 };
 
-union tdx_ext_exit_qualification {
-	struct {
-		u64 type		: 4;
-		u64 reserved0		: 28;
-		u64 req_sept_level	: 3;
-		u64 err_sept_level	: 3;
-		u64 err_sept_state	: 8;
-		u64 err_sept_is_leaf	: 1;
-		u64 reserved1		: 17;
-	};
-	u64 full;
-};
-
-enum tdx_ext_exit_qualification_type {
-	EXT_EXIT_QUAL_NONE,
-	EXT_EXIT_QUAL_ACCEPT,
-	NUM_EXT_EXIT_QUAL,
-};
-
 struct vcpu_tdx {
 	struct kvm_vcpu	vcpu;
 
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index 8860c7571b1f..73fa33e7c943 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -167,4 +167,23 @@ struct td_params {
 #define TDX_MIN_TSC_FREQUENCY_KHZ		(100 * 1000)
 #define TDX_MAX_TSC_FREQUENCY_KHZ		(10 * 1000 * 1000)
 
+union tdx_ext_exit_qualification {
+	struct {
+		u64 type		:  4;
+		u64 reserved0		: 28;
+		u64 req_sept_level	:  3;
+		u64 err_sept_level	:  3;
+		u64 err_sept_state	:  8;
+		u64 err_sept_is_leaf	:  1;
+		u64 reserved1		: 17;
+	};
+	u64 full;
+};
+
+enum tdx_ext_exit_qualification_type {
+	EXT_EXIT_QUAL_NONE = 0,
+	EXT_EXIT_QUAL_ACCEPT,
+	NUM_EXT_EXIT_QUAL,
+};
+
 #endif /* __KVM_X86_TDX_ARCH_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ac36c3618325..3605366317a2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5725,7 +5725,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
 		return kvm_emulate_instruction(vcpu, 0);
 
-	return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification, 0);
+	return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification, PG_LEVEL_NONE);
 }
 
 static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
-- 
2.25.1



* [RFC PATCH v4 10/16] KVM: x86/tdp_mmu: Allocate private page table for large page split
  2023-07-25 22:23 [RFC PATCH v4 00/16] KVM TDX: TDP MMU: large page support isaku.yamahata
                   ` (8 preceding siblings ...)
  2023-07-25 22:23 ` [RFC PATCH v4 09/16] KVM: TDX: Pass desired page level in err code for page fault handler isaku.yamahata
@ 2023-07-25 22:23 ` isaku.yamahata
  2023-07-25 22:23 ` [RFC PATCH v4 11/16] KVM: x86/tdp_mmu: Split the large page when zap leaf isaku.yamahata
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: isaku.yamahata @ 2023-07-25 22:23 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang

From: Isaku Yamahata <isaku.yamahata@intel.com>

Make __tdp_mmu_alloc_sp_for_split() aware of private page tables.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
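Reader's note, not part of the patch: the two-step allocation with rollback
that the diff below adds can be sketched in user space as follows.  The
struct sp_sketch, the alloc_fn callback and the function names are
hypothetical, and malloc()/free() stand in for the page allocator.  As a
choice of this sketch, the second allocation is attempted only when the
first one succeeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define PAGE_SZ 4096

struct sp_sketch {
	void *spt;		/* ordinary page table page */
	void *private_spt;	/* extra page, only for private roles */
};

typedef void *(*alloc_fn)(size_t);

/* Allocate both pages for a private role, or neither: all-or-nothing. */
static struct sp_sketch *alloc_sp_for_split(bool is_private, alloc_fn alloc_private)
{
	struct sp_sketch *sp = calloc(1, sizeof(*sp));

	assert(alloc_private != NULL);
	if (!sp)
		return NULL;

	sp->spt = malloc(PAGE_SZ);
	if (sp->spt && is_private) {
		sp->private_spt = alloc_private(PAGE_SZ);
		if (!sp->private_spt) {
			/* Roll back the first allocation on failure. */
			free(sp->spt);
			sp->spt = NULL;
		}
	}
	if (!sp->spt) {
		free(sp);
		return NULL;
	}
	return sp;
}

/* Allocator that always fails, used to exercise the rollback path. */
static void *always_fail(size_t n)
{
	(void)n;
	return NULL;
}

static void free_sp(struct sp_sketch *sp)
{
	free(sp->private_spt);
	free(sp->spt);
	free(sp);
}
```

Passing always_fail as the private allocator exercises the rollback path:
the function then returns NULL without leaving a half-initialized object.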
 arch/x86/kvm/mmu/mmu_internal.h | 14 ++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c      |  8 ++++++--
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index d65324d87a17..2dc733b15c39 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -201,6 +201,15 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_m
 	}
 }
 
+static inline int kvm_alloc_private_spt_for_split(struct kvm_mmu_page *sp, gfp_t gfp)
+{
+	gfp &= ~__GFP_ZERO;
+	sp->private_spt = (void *)__get_free_page(gfp);
+	if (!sp->private_spt)
+		return -ENOMEM;
+	return 0;
+}
+
 static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp)
 {
 	if (sp->private_spt)
@@ -229,6 +238,11 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_m
 {
 }
 
+static inline int kvm_alloc_private_spt_for_split(struct kvm_mmu_page *sp, gfp_t gfp)
+{
+	return -ENOMEM;
+}
+
 static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp)
 {
 }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a9f0f4ade2d0..548b559280d7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1585,8 +1585,12 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp, union kvm_mm
 
 	sp->role = role;
 	sp->spt = (void *)__get_free_page(gfp);
-	/* TODO: large page support for private GPA. */
-	WARN_ON_ONCE(kvm_mmu_page_role_is_private(role));
+	if (kvm_mmu_page_role_is_private(role)) {
+		if (kvm_alloc_private_spt_for_split(sp, gfp)) {
+			free_page((unsigned long)sp->spt);
+			sp->spt = NULL;
+		}
+	}
 	if (!sp->spt) {
 		kmem_cache_free(mmu_page_header_cache, sp);
 		return NULL;
-- 
2.25.1



* [RFC PATCH v4 11/16] KVM: x86/tdp_mmu: Split the large page when zap leaf
  2023-07-25 22:23 [RFC PATCH v4 00/16] KVM TDX: TDP MMU: large page support isaku.yamahata
                   ` (9 preceding siblings ...)
  2023-07-25 22:23 ` [RFC PATCH v4 10/16] KVM: x86/tdp_mmu: Allocate private page table for large page split isaku.yamahata
@ 2023-07-25 22:23 ` isaku.yamahata
  2023-07-25 22:23 ` [RFC PATCH v4 12/16] KVM: x86/tdp_mmu, TDX: Split a large page when 4KB page within it converted to shared isaku.yamahata
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: isaku.yamahata @ 2023-07-25 22:23 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang, Xiaoyao Li

From: Xiaoyao Li <xiaoyao.li@intel.com>

When TDX is enabled, a large page cannot be zapped if it contains mixed
pages.  In that case, the large page has to be split.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
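Reader's note, not part of the patch: one way to read "mixed" is that the
4KB pages inside the large page do not all share the same private/shared
attribute.  The sketch below checks that directly over a per-page attribute
array.  This is an assumption made for illustration: the kernel tracks the
condition with KVM_LPAGE_MIXED_FLAG in lpage_info and does not scan pages
like this, and range_is_mixed() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if the pages [start, start + npages) do not all have the
 * same attribute as the first page in the range.
 */
static bool range_is_mixed(const bool *is_private, size_t start, size_t npages)
{
	assert(npages > 0);
	for (size_t i = 1; i < npages; i++)
		if (is_private[start + i] != is_private[start])
			return true;
	return false;
}
```

A 2MB range with a single page converted to shared is mixed under this
definition, so it could not be handled as one large mapping.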
 arch/x86/kvm/Kconfig            |  1 +
 arch/x86/kvm/mmu/mmu.c          |  6 +--
 arch/x86/kvm/mmu/mmu_internal.h |  9 +++++
 arch/x86/kvm/mmu/tdp_mmu.c      | 68 +++++++++++++++++++++++++++++++--
 4 files changed, 78 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index c7cb060c4ddc..47613ad41220 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -93,6 +93,7 @@ config KVM_INTEL
 	tristate "KVM for Intel (and compatible) processors support"
 	depends on KVM && IA32_FEAT_CTL
 	select KVM_SW_PROTECTED_VM if INTEL_TDX_HOST
+	select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
 	select KVM_PRIVATE_MEM if INTEL_TDX_HOST
 	help
 	  Provides support for KVM on processors equipped with Intel's VT
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bf4f23129ad0..949ef2fa8264 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7503,8 +7503,8 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 }
 
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
-static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
-				int level)
+bool kvm_hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+			     int level)
 {
 	return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_MIXED_FLAG;
 }
@@ -7563,7 +7563,7 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
 		return range_has_attrs(kvm, start, end, attrs);
 
 	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
-		if (hugepage_test_mixed(slot, gfn, level - 1) ||
+		if (kvm_hugepage_test_mixed(slot, gfn, level - 1) ||
 		    attrs != kvm_get_memory_attributes(kvm, gfn))
 			return false;
 	}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 2dc733b15c39..bc3d38762ace 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -464,4 +464,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+bool kvm_hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn, int level);
+#else
+static inline bool kvm_hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn, int level)
+{
+	return false;
+}
+#endif
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 548b559280d7..e1169082c68c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1004,6 +1004,14 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 	return true;
 }
 
+
+static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
+						       struct tdp_iter *iter,
+						       bool shared);
+
+static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
+				   struct kvm_mmu_page *sp, bool shared);
+
 /*
  * If can_yield is true, will release the MMU lock and reschedule if the
  * scheduler needs the CPU or there is contention on the MMU lock. If this
@@ -1015,13 +1023,15 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 			      gfn_t start, gfn_t end, bool can_yield, bool flush,
 			      bool zap_private)
 {
+	bool is_private = is_private_sp(root);
+	struct kvm_mmu_page *split_sp = NULL;
 	struct tdp_iter iter;
 
 	end = min(end, tdp_mmu_max_gfn_exclusive());
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
-	WARN_ON_ONCE(zap_private && !is_private_sp(root));
+	WARN_ON_ONCE(zap_private && !is_private);
 	if (!zap_private && is_private_sp(root))
 		return false;
 
@@ -1046,12 +1056,66 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
 
+		if (is_private && kvm_gfn_shared_mask(kvm) &&
+		    is_large_pte(iter.old_spte)) {
+			gfn_t gfn = iter.gfn & ~kvm_gfn_shared_mask(kvm);
+			gfn_t mask = KVM_PAGES_PER_HPAGE(iter.level) - 1;
+			struct kvm_memory_slot *slot;
+			struct kvm_mmu_page *sp;
+
+			slot = gfn_to_memslot(kvm, gfn);
+			if (kvm_hugepage_test_mixed(slot, gfn, iter.level) ||
+			    (gfn & mask) < start ||
+			    end < (gfn & mask) + KVM_PAGES_PER_HPAGE(iter.level)) {
+				WARN_ON_ONCE(!can_yield);
+				if (split_sp) {
+					sp = split_sp;
+					split_sp = NULL;
+					sp->role = tdp_iter_child_role(&iter);
+				} else {
+					WARN_ON(iter.yielded);
+					if (flush && can_yield) {
+						kvm_flush_remote_tlbs(kvm);
+						flush = false;
+					}
+					sp = tdp_mmu_alloc_sp_for_split(kvm, &iter, false);
+					if (iter.yielded) {
+						split_sp = sp;
+						continue;
+					}
+				}
+				KVM_BUG_ON(!sp, kvm);
+
+				tdp_mmu_init_sp(sp, iter.sptep, iter.gfn);
+				if (tdp_mmu_split_huge_page(kvm, &iter, sp, false)) {
+					kvm_flush_remote_tlbs(kvm);
+					flush = false;
+					/* force retry on this gfn. */
+					iter.yielded = true;
+				} else
+					flush = true;
+				continue;
+			}
+		}
+
 		tdp_mmu_iter_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
 		flush = true;
 	}
 
 	rcu_read_unlock();
 
+	if (split_sp) {
+		WARN_ON(!can_yield);
+		if (flush) {
+			kvm_flush_remote_tlbs(kvm);
+			flush = false;
+		}
+
+		write_unlock(&kvm->mmu_lock);
+		tdp_mmu_free_sp(split_sp);
+		write_lock(&kvm->mmu_lock);
+	}
+
 	/*
 	 * Because this flow zaps _only_ leaf SPTEs, the caller doesn't need
 	 * to provide RCU protection as no 'struct kvm_mmu_page' will be freed.
@@ -1608,8 +1672,6 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 
 	KVM_BUG_ON(kvm_mmu_page_role_is_private(role) !=
 		   is_private_sptep(iter->sptep), kvm);
-	/* TODO: Large page isn't supported for private SPTE yet. */
-	KVM_BUG_ON(kvm_mmu_page_role_is_private(role), kvm);
 
 	/*
 	 * Since we are allocating while under the MMU lock we have to be
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH v4 12/16] KVM: x86/tdp_mmu, TDX: Split a large page when 4KB page within it converted to shared
  2023-07-25 22:23 [RFC PATCH v4 00/16] KVM TDX: TDP MMU: large page support isaku.yamahata
                   ` (10 preceding siblings ...)
  2023-07-25 22:23 ` [RFC PATCH v4 11/16] KVM: x86/tdp_mmu: Split the large page when zap leaf isaku.yamahata
@ 2023-07-25 22:23 ` isaku.yamahata
  2023-07-25 22:23 ` [RFC PATCH v4 13/16] KVM: x86/tdp_mmu: Try to merge pages into a large page isaku.yamahata
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: isaku.yamahata @ 2023-07-25 22:23 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang, Xiaoyao Li

From: Xiaoyao Li <xiaoyao.li@intel.com>

When mapping a shared page for TDX, KVM needs to zap the private alias.

If the private page is mapped as a large page (2MB), it can be removed
directly only when the whole 2MB range is converted to shared.  Otherwise,
the 2MB page has to be split into 512 4KB pages, and only the pages that
are converted to shared are removed.

When a present large leaf SPTE changes to a present non-leaf SPTE, TDX needs
to split the corresponding Secure-EPT page to reflect the change.
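
For illustration only, a minimal sketch of how the SPTE transition selects
the Secure-EPT operation, modelled on the checks this patch adds to
__set_private_spte_present().  The enum and classify_transition() are
hypothetical names, the new SPTE is assumed to be present, and level 1
stands for PG_LEVEL_4K.

```c
#include <assert.h>
#include <stdbool.h>

enum private_spte_op {
	OP_SET_LEAF,	/* map a private page at this level */
	OP_LINK_SPT,	/* link a lower-level Secure-EPT page */
	OP_SPLIT,	/* block the range, flush, then demote the large page */
	OP_INVALID,	/* present -> present without a leaf/non-leaf change */
};

/* Hypothetical helper: the new SPTE is assumed to be present. */
static enum private_spte_op classify_transition(int level, bool was_present,
						bool was_leaf, bool is_leaf)
{
	/* An old leaf implies that the old SPTE was present. */
	assert(!was_leaf || was_present);

	if (was_present && was_leaf == is_leaf)
		return OP_INVALID;
	if (level > 1 && was_leaf && !is_leaf)
		return OP_SPLIT;
	return is_leaf ? OP_SET_LEAF : OP_LINK_SPT;
}
```

A present 2MB leaf replaced by a present non-leaf SPTE yields OP_SPLIT; a
previously non-present entry yields OP_SET_LEAF or OP_LINK_SPT.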

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  2 ++
 arch/x86/kvm/mmu/tdp_mmu.c         | 26 +++++++++++++++++++++-----
 arch/x86/kvm/vmx/tdx.c             | 25 +++++++++++++++++++++++--
 arch/x86/kvm/vmx/tdx_arch.h        |  1 +
 arch/x86/kvm/vmx/tdx_ops.h         |  7 +++++++
 6 files changed, 55 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index aaa7db45d809..5989503112c6 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -102,6 +102,7 @@ KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
 KVM_X86_OP_OPTIONAL(link_private_spt)
 KVM_X86_OP_OPTIONAL(free_private_spt)
+KVM_X86_OP_OPTIONAL(split_private_spt)
 KVM_X86_OP_OPTIONAL(set_private_spte)
 KVM_X86_OP_OPTIONAL(remove_private_spte)
 KVM_X86_OP_OPTIONAL(zap_private_spte)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 97c9a0d5a9e3..7fe85b2d9a38 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1715,6 +1715,8 @@ struct kvm_x86_ops {
 				void *private_spt);
 	int (*free_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 				void *private_spt);
+	int (*split_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				  void *private_spt);
 	int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 				 kvm_pfn_t pfn);
 	int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e1169082c68c..c3963002722c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -648,23 +648,39 @@ static int __must_check __set_private_spte_present(struct kvm *kvm, tdp_ptep_t s
 {
 	bool was_present = is_shadow_present_pte(old_spte);
 	bool is_present = is_shadow_present_pte(new_spte);
+	bool was_leaf = was_present && is_last_spte(old_spte, level);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
 	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
+	void *private_spt;
 	int ret = 0;
 
 	lockdep_assert_held(&kvm->mmu_lock);
-	/* TDP MMU doesn't change present -> present */
-	KVM_BUG_ON(was_present, kvm);
+	/*
+	 * TDP MMU doesn't change present -> present, except when a large page
+	 * is split or merged.
+	 */
+	KVM_BUG_ON(was_present && (was_leaf == is_leaf), kvm);
 
 	/*
 	 * Use different call to either set up middle level
 	 * private page table, or leaf.
 	 */
-	if (is_leaf)
+	if (level > PG_LEVEL_4K && was_leaf && !is_leaf) {
+		/*
+		 * Split a large page into 4KB pages.
+		 * tdp_mmu_split_huge_page() => tdp_mmu_link_sp()
+		 */
+		private_spt = get_private_spt(gfn, new_spte, level);
+		KVM_BUG_ON(!private_spt, kvm);
+		ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level);
+		kvm_flush_remote_tlbs(kvm);
+		if (!ret)
+			ret = static_call(kvm_x86_split_private_spt)(kvm, gfn,
+								     level, private_spt);
+	} else if (is_leaf)
 		ret = static_call(kvm_x86_set_private_spte)(kvm, gfn, level, new_pfn);
 	else {
-		void *private_spt = get_private_spt(gfn, new_spte, level);
-
+		private_spt = get_private_spt(gfn, new_spte, level);
 		KVM_BUG_ON(!private_spt, kvm);
 		ret = static_call(kvm_x86_link_private_spt)(kvm, gfn, level, private_spt);
 	}
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index d6d5a9020f99..f2f1b40d9ae8 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1534,6 +1534,28 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
 	return 0;
 }
 
+static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn,
+				      enum pg_level level, void *private_spt)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
+	hpa_t hpa = __pa(private_spt);
+	struct tdx_module_output out;
+	u64 err;
+
+	/* See comment in tdx_sept_set_private_spte() */
+	err = tdh_mem_page_demote(kvm_tdx->tdr_pa, gpa, tdx_level, hpa, &out);
+	if (unlikely(err == TDX_ERROR_SEPT_BUSY))
+		return -EAGAIN;
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_MEM_PAGE_DEMOTE, err, &out);
+		return -EIO;
+	}
+
+	return 0;
+}
+
 static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 				      enum pg_level level)
 {
@@ -1547,8 +1569,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (unlikely(!is_hkid_assigned(kvm_tdx)))
 		return 0;
 
-	/* For now large page isn't supported yet. */
-	WARN_ON_ONCE(level != PG_LEVEL_4K);
 	err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
 	if (unlikely(err == TDX_ERROR_SEPT_BUSY))
 		return -EAGAIN;
@@ -3052,6 +3072,7 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
 
 	x86_ops->link_private_spt = tdx_sept_link_private_spt;
 	x86_ops->free_private_spt = tdx_sept_free_private_spt;
+	x86_ops->split_private_spt = tdx_sept_split_private_spt;
 	x86_ops->set_private_spte = tdx_sept_set_private_spte;
 	x86_ops->remove_private_spte = tdx_sept_remove_private_spte;
 	x86_ops->zap_private_spte = tdx_sept_zap_private_spte;
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index 73fa33e7c943..dd5e5981b39e 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -21,6 +21,7 @@
 #define TDH_MNG_CREATE			9
 #define TDH_VP_CREATE			10
 #define TDH_MNG_RD			11
+#define TDH_MEM_PAGE_DEMOTE		15
 #define TDH_MR_EXTEND			16
 #define TDH_MR_FINALIZE			17
 #define TDH_VP_FLUSH			18
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index e3d7e19e5324..739c67af849b 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -161,6 +161,13 @@ static inline u64 tdh_mng_rd(hpa_t tdr, u64 field, struct tdx_module_output *out
 	return tdx_seamcall(TDH_MNG_RD, tdr, field, 0, 0, out);
 }
 
+static inline u64 tdh_mem_page_demote(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
+				      struct tdx_module_output *out)
+{
+	tdx_clflush_page(page, PG_LEVEL_4K);
+	return tdx_seamcall_sept(TDH_MEM_PAGE_DEMOTE, gpa | level, tdr, page, 0, out);
+}
+
 static inline u64 tdh_mr_extend(hpa_t tdr, gpa_t gpa,
 				struct tdx_module_output *out)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH v4 13/16] KVM: x86/tdp_mmu: Try to merge pages into a large page
  2023-07-25 22:23 [RFC PATCH v4 00/16] KVM TDX: TDP MMU: large page support isaku.yamahata
                   ` (11 preceding siblings ...)
  2023-07-25 22:23 ` [RFC PATCH v4 12/16] KVM: x86/tdp_mmu, TDX: Split a large page when 4KB page within it converted to shared isaku.yamahata
@ 2023-07-25 22:23 ` isaku.yamahata
  2023-08-14 20:35   ` Isaku Yamahata
  2023-07-25 22:24 ` [RFC PATCH v4 14/16] KVM: x86/tdp_mmu: TDX: Implement " isaku.yamahata
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 21+ messages in thread
From: isaku.yamahata @ 2023-07-25 22:23 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang

From: Isaku Yamahata <isaku.yamahata@intel.com>

When a large page is passed to the KVM page fault handler and some of its
sub-pages are already populated, try to merge the sub-pages into a large
page.  This situation can happen when the guest converts small pages to
shared and then converts them back to private.

When a large page is passed to the KVM MMU page fault handler and the SPTE
corresponding to the page is non-leaf (one or more sub-pages are already
populated at a lower page level), the current KVM MMU zaps the non-leaf SPTE
at the large page level and populates a leaf SPTE at that level.  Thus small
pages are converted into a large page.  However, this doesn't work for TDX
because zapping and re-populating results in zeroing the page content.
Instead, populate all small pages and merge them into a large page.

Merging pages into a large page can fail when some sub-pages are accepted
and some are not.  In that case, assuming that the guest tries to accept at
large page size for performance when possible, don't try to be smart about
identifying which pages are still pending; map all pages at the lower page
level and let the vCPU re-execute.
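
For illustration only, a minimal user-space model of the "populate all small
pages" step.  struct child_entry and populate_for_merge() are hypothetical
and only model the bookkeeping; the real code additionally freezes the parent
SPTE and uses atomic SPTE updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ENTRIES_PER_TABLE 512

/* Hypothetical stand-in for one 4KB entry of the child page table. */
struct child_entry {
	bool present;
	uint64_t pfn;
};

/*
 * Fill every non-present 4KB entry with the page frame it would have inside
 * the large page that starts at base_pfn.  Returns false if an entry that was
 * already present points elsewhere, in which case the pages cannot be merged
 * without relocating them.
 */
static bool populate_for_merge(struct child_entry *table, uint64_t base_pfn)
{
	bool mergeable = true;

	assert(table);
	for (int i = 0; i < ENTRIES_PER_TABLE; i++) {
		if (table[i].present) {
			if (table[i].pfn != base_pfn + i)
				mergeable = false;
			continue;
		}
		table[i].present = true;
		table[i].pfn = base_pfn + i;
	}
	return mergeable;
}
```

Only when this step succeeds for all 512 entries does it make sense to ask
the TDX module to promote the range to a 2MB page.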

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |   2 +
 arch/x86/include/asm/kvm_host.h    |   4 +
 arch/x86/kvm/mmu/tdp_iter.c        |  37 +++++--
 arch/x86/kvm/mmu/tdp_iter.h        |   2 +
 arch/x86/kvm/mmu/tdp_mmu.c         | 163 ++++++++++++++++++++++++++++-
 5 files changed, 198 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 5989503112c6..378f15a4a1e9 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -103,9 +103,11 @@ KVM_X86_OP(load_mmu_pgd)
 KVM_X86_OP_OPTIONAL(link_private_spt)
 KVM_X86_OP_OPTIONAL(free_private_spt)
 KVM_X86_OP_OPTIONAL(split_private_spt)
+KVM_X86_OP_OPTIONAL(merge_private_spt)
 KVM_X86_OP_OPTIONAL(set_private_spte)
 KVM_X86_OP_OPTIONAL(remove_private_spte)
 KVM_X86_OP_OPTIONAL(zap_private_spte)
+KVM_X86_OP_OPTIONAL(unzap_private_spte)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7fe85b2d9a38..fc4c0b75d496 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -138,6 +138,7 @@
 #define KVM_MAX_HUGEPAGE_LEVEL	PG_LEVEL_1G
 #define KVM_NR_PAGE_SIZES	(KVM_MAX_HUGEPAGE_LEVEL - PG_LEVEL_4K + 1)
 #define KVM_HPAGE_GFN_SHIFT(x)	(((x) - 1) * 9)
+#define KVM_HPAGE_GFN_MASK(x)	(~((1UL << KVM_HPAGE_GFN_SHIFT(x)) - 1))
 #define KVM_HPAGE_SHIFT(x)	(PAGE_SHIFT + KVM_HPAGE_GFN_SHIFT(x))
 #define KVM_HPAGE_SIZE(x)	(1UL << KVM_HPAGE_SHIFT(x))
 #define KVM_HPAGE_MASK(x)	(~(KVM_HPAGE_SIZE(x) - 1))
@@ -1717,11 +1718,14 @@ struct kvm_x86_ops {
 				void *private_spt);
 	int (*split_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 				  void *private_spt);
+	int (*merge_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				 void *private_spt);
 	int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 				 kvm_pfn_t pfn);
 	int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 				    kvm_pfn_t pfn);
 	int (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level);
+	int (*unzap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level);
 
 	bool (*has_wbinvd_exit)(void);
 
diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index d2eb0d4f8710..736508833260 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -70,6 +70,14 @@ tdp_ptep_t spte_to_child_pt(u64 spte, int level)
 	return (tdp_ptep_t)__va(spte_to_pfn(spte) << PAGE_SHIFT);
 }
 
+static void step_down(struct tdp_iter *iter, tdp_ptep_t child_pt)
+{
+	iter->level--;
+	iter->pt_path[iter->level - 1] = child_pt;
+	iter->gfn = gfn_round_for_level(iter->next_last_level_gfn, iter->level);
+	tdp_iter_refresh_sptep(iter);
+}
+
 /*
  * Steps down one level in the paging structure towards the goal GFN. Returns
  * true if the iterator was able to step down a level, false otherwise.
@@ -91,14 +99,28 @@ static bool try_step_down(struct tdp_iter *iter)
 	if (!child_pt)
 		return false;
 
-	iter->level--;
-	iter->pt_path[iter->level - 1] = child_pt;
-	iter->gfn = gfn_round_for_level(iter->next_last_level_gfn, iter->level);
-	tdp_iter_refresh_sptep(iter);
-
+	step_down(iter, child_pt);
 	return true;
 }
 
+/* Steps down for a frozen spte.  Don't re-read sptep because it was frozen. */
+void tdp_iter_step_down(struct tdp_iter *iter, tdp_ptep_t child_pt)
+{
+	WARN_ON_ONCE(!child_pt);
+	WARN_ON_ONCE(iter->yielded);
+	WARN_ON_ONCE(iter->level == iter->min_level);
+
+	step_down(iter, child_pt);
+}
+
+void tdp_iter_step_side(struct tdp_iter *iter)
+{
+	iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
+	iter->next_last_level_gfn = iter->gfn;
+	iter->sptep++;
+	iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
+}
+
 /*
  * Steps to the next entry in the current page table, at the current page table
  * level. The next entry could point to a page backing guest memory or another
@@ -116,10 +138,7 @@ static bool try_step_side(struct tdp_iter *iter)
 	    (SPTE_ENT_PER_PAGE - 1))
 		return false;
 
-	iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
-	iter->next_last_level_gfn = iter->gfn;
-	iter->sptep++;
-	iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
+	tdp_iter_step_side(iter);
 
 	return true;
 }
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index a9c9cd0db20a..ca00db799a50 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -134,6 +134,8 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
 		    int min_level, gfn_t next_last_level_gfn);
 void tdp_iter_next(struct tdp_iter *iter);
 void tdp_iter_restart(struct tdp_iter *iter);
+void tdp_iter_step_side(struct tdp_iter *iter);
+void tdp_iter_step_down(struct tdp_iter *iter, tdp_ptep_t child_pt);
 
 static inline union kvm_mmu_page_role tdp_iter_child_role(struct tdp_iter *iter)
 {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c3963002722c..612fcaac600d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1242,6 +1242,167 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm, bool skip_private)
 	rcu_read_unlock();
 }
 
+static void tdp_mmu_iter_step_side(int i, struct tdp_iter *iter)
+{
+	/*
+	 * The caller increments i first: when i == SPTE_ENT_PER_PAGE, stepping
+	 * sideways would read beyond the last entry of the page table.
+	 */
+	if (i < SPTE_ENT_PER_PAGE)
+		tdp_iter_step_side(iter);
+}
+
+static int tdp_mmu_merge_private_spt(struct kvm_vcpu *vcpu,
+				     struct kvm_page_fault *fault,
+				     struct tdp_iter *iter, u64 new_spte)
+{
+	u64 *sptep = rcu_dereference(iter->sptep);
+	struct kvm_mmu_page *child_sp;
+	struct kvm *kvm = vcpu->kvm;
+	struct tdp_iter child_iter;
+	bool ret_pf_retry = false;
+	int level = iter->level;
+	gfn_t gfn = iter->gfn;
+	u64 old_spte = *sptep;
+	tdp_ptep_t child_pt;
+	u64 child_spte;
+	int ret = 0;
+	int i;
+
+	/*
+	 * TDX KVM supports only 2MB large page.  It's not supported to merge
+	 * 2MB pages into 1GB page at the moment.
+	 */
+	WARN_ON_ONCE(fault->goal_level != PG_LEVEL_2M);
+	WARN_ON_ONCE(iter->level != PG_LEVEL_2M);
+	WARN_ON_ONCE(!is_large_pte(new_spte));
+
+	/* Freeze the spte to prevent other threads from working on the spte. */
+	if (!try_cmpxchg64(sptep, &iter->old_spte, REMOVED_SPTE))
+		return -EBUSY;
+
+	/*
+	 * Step down to the child spte.  Because tdp_iter_next() assumes the
+	 * parent spte isn't frozen, do it manually.
+	 */
+	child_pt = spte_to_child_pt(iter->old_spte, iter->level);
+	child_sp = sptep_to_sp(child_pt);
+	WARN_ON_ONCE(child_sp->role.level != PG_LEVEL_4K);
+	WARN_ON_ONCE(!kvm_mmu_page_role_is_private(child_sp->role));
+
+	/* Don't modify iter as the caller will use iter after this function. */
+	child_iter = *iter;
+	/* Adjust the target gfn to the head gfn of the large page. */
+	child_iter.next_last_level_gfn &= -KVM_PAGES_PER_HPAGE(level);
+	tdp_iter_step_down(&child_iter, child_pt);
+
+	/*
+	 * All child pages are required to be populated for merging them into a
+	 * large page.  Populate all child spte.
+	 */
+	for (i = 0; i < SPTE_ENT_PER_PAGE; i++, tdp_mmu_iter_step_side(i, &child_iter)) {
+		WARN_ON_ONCE(child_iter.level != PG_LEVEL_4K);
+		if (is_shadow_present_pte(child_iter.old_spte)) {
+			/* TODO: relocate page for huge page. */
+			if (WARN_ON_ONCE(spte_to_pfn(child_iter.old_spte) !=
+					 spte_to_pfn(new_spte) + i)) {
+				ret = -EAGAIN;
+				ret_pf_retry = true;
+			}
+			/*
+			 * When SEPT_VE_DISABLE=true and the page state is
+			 * pending, this case can happen.  Just resume the vcpu
+			 * again with the expectation for other vcpu to accept
+			 * this page.
+			 */
+			if (child_iter.gfn == fault->gfn) {
+				ret = -EAGAIN;
+				ret_pf_retry = true;
+				break;
+			}
+			continue;
+		}
+
+		WARN_ON_ONCE(spte_to_pfn(child_iter.old_spte) != spte_to_pfn(new_spte) + i);
+		child_spte = make_huge_page_split_spte(kvm, new_spte, child_sp->role, i);
+		/*
+		 * Because another thread may have started to operate on this
+		 * spte before the parent spte was frozen, use the atomic
+		 * version to prevent a race.
+		 */
+		ret = tdp_mmu_set_spte_atomic(vcpu->kvm, &child_iter, child_spte);
+		if (ret == -EBUSY || ret == -EAGAIN)
+			/*
+			 * There was a race condition.  Populate remaining 4K
+			 * spte to resolve fault->gfn to guarantee the forward
+			 * progress.
+			 */
+			ret_pf_retry = true;
+		else if (ret)
+			goto out;
+
+	}
+	if (ret_pf_retry)
+		goto out;
+
+	/* Prevent the Secure-EPT entry from being used. */
+	ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level);
+	if (ret)
+		goto out;
+	kvm_flush_remote_tlbs_range(kvm, gfn & KVM_HPAGE_GFN_MASK(level),
+				    KVM_PAGES_PER_HPAGE(level));
+
+	/* Merge pages into a large page. */
+	ret = static_call(kvm_x86_merge_private_spt)(kvm, gfn, level,
+						     kvm_mmu_private_spt(child_sp));
+	/*
+	 * Failed to merge pages because some pages are accepted and some are
+	 * pending.  Since the child page was mapped above, let vcpu run.
+	 */
+	if (ret) {
+		if (static_call(kvm_x86_unzap_private_spte)(kvm, gfn, level))
+			old_spte = SHADOW_NONPRESENT_VALUE |
+				(spte_to_pfn(old_spte) << PAGE_SHIFT) |
+				PT_PAGE_SIZE_MASK;
+		goto out;
+	}
+
+	/* Unfreeze spte. */
+	__kvm_tdp_mmu_write_spte(sptep, new_spte);
+
+	/*
+	 * Free unused child sp.  Secure-EPT page was already freed at TDX level
+	 * by kvm_x86_merge_private_spt().
+	 */
+	tdp_unaccount_mmu_page(kvm, child_sp);
+	tdp_mmu_free_sp(child_sp);
+	return -EAGAIN;
+
+out:
+	__kvm_tdp_mmu_write_spte(sptep, old_spte);
+	return ret;
+}
+
+static int __tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
+					     struct kvm_page_fault *fault,
+					     struct tdp_iter *iter, u64 new_spte)
+{
+	/*
+	 * The private page has smaller-size pages.  For example, the child
+	 * pages were converted from shared to private, and now they can be
+	 * mapped as a large page.  Try to merge small pages into a large page.
+	 */
+	if (fault->slot &&
+	    kvm_gfn_shared_mask(vcpu->kvm) &&
+	    iter->level > PG_LEVEL_4K &&
+	    kvm_is_private_gpa(vcpu->kvm, fault->addr) &&
+	    is_shadow_present_pte(iter->old_spte) &&
+	    !is_large_pte(iter->old_spte))
+		return tdp_mmu_merge_private_spt(vcpu, fault, iter, new_spte);
+
+	return tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte);
+}
+
 /*
  * Installs a last-level SPTE to handle a TDP page fault.
  * (NPT/EPT violation/misconfiguration)
@@ -1276,7 +1437,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 
 	if (new_spte == iter->old_spte)
 		ret = RET_PF_SPURIOUS;
-	else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
+	else if (__tdp_mmu_map_handle_target_level(vcpu, fault, iter, new_spte))
 		return RET_PF_RETRY;
 	else if (is_shadow_present_pte(iter->old_spte) &&
 		 !is_last_spte(iter->old_spte, iter->level))
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH v4 14/16] KVM: x86/tdp_mmu: TDX: Implement merge pages into a large page
  2023-07-25 22:23 [RFC PATCH v4 00/16] KVM TDX: TDP MMU: large page support isaku.yamahata
                   ` (12 preceding siblings ...)
  2023-07-25 22:23 ` [RFC PATCH v4 13/16] KVM: x86/tdp_mmu: Try to merge pages into a large page isaku.yamahata
@ 2023-07-25 22:24 ` isaku.yamahata
  2023-07-25 22:24 ` [RFC PATCH v4 15/16] KVM: x86/mmu: Make kvm fault handler aware of large page of private memslot isaku.yamahata
  2023-07-25 22:24 ` [RFC PATCH v4 16/16] KVM: TDX: Allow 2MB large page for TD GUEST isaku.yamahata
  15 siblings, 0 replies; 21+ messages in thread
From: isaku.yamahata @ 2023-07-25 22:24 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang

From: Isaku Yamahata <isaku.yamahata@intel.com>

Implement the merge_private_spt and unzap_private_spte callbacks.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c       | 72 ++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx_arch.h  |  1 +
 arch/x86/kvm/vmx/tdx_errno.h |  2 +
 arch/x86/kvm/vmx/tdx_ops.h   |  6 +++
 4 files changed, 81 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f2f1b40d9ae8..2f375e0e45aa 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1556,6 +1556,49 @@ static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn,
 	return 0;
 }
 
+static int tdx_sept_merge_private_spt(struct kvm *kvm, gfn_t gfn,
+				      enum pg_level level, void *private_spt)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct tdx_module_output out;
+	gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
+	u64 err;
+
+	/* See comment in tdx_sept_set_private_spte() */
+	err = tdh_mem_page_promote(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
+	if (unlikely(err == TDX_ERROR_SEPT_BUSY))
+		return -EAGAIN;
+	if (unlikely(err == (TDX_EPT_INVALID_PROMOTE_CONDITIONS |
+			     TDX_OPERAND_ID_RCX)))
+		/*
+		 * Some pages are accepted, some pending.  Need to wait for TD
+		 * to accept all pages.  Tell the caller.
+		 */
+		return -EAGAIN;
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_MEM_PAGE_PROMOTE, err, &out);
+		return -EIO;
+	}
+	WARN_ON_ONCE(out.rcx != __pa(private_spt));
+
+	/*
+	 * TDH.MEM.PAGE.PROMOTE frees the Secure-EPT page for the lower level.
+	 * Flush cache for reuse.
+	 */
+	do {
+		err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(__pa(private_spt),
+							     to_kvm_tdx(kvm)->hkid));
+	} while (unlikely(err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX)));
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
+		return -EIO;
+	}
+
+	tdx_clear_page(__pa(private_spt), PAGE_SIZE);
+	return 0;
+}
+
 static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 				      enum pg_level level)
 {
@@ -1629,6 +1672,33 @@ static void tdx_track(struct kvm_tdx *kvm_tdx)
 
 }
 
+static int tdx_sept_unzap_private_spte(struct kvm *kvm, gfn_t gfn,
+				       enum pg_level level)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
+	struct tdx_module_output out;
+	u64 err;
+
+	do {
+		err = tdh_mem_range_unblock(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
+
+		/*
+		 * tdh_mem_range_block() is accompanied by tdx_track() via kvm
+		 * remote tlb flush.  Wait for the caller of
+		 * tdh_mem_range_block() to complete TDX track.
+		 */
+	} while (err == (TDX_TLB_TRACKING_NOT_DONE | TDX_OPERAND_ID_SEPT));
+	if (unlikely(err == TDX_ERROR_SEPT_BUSY))
+		return -EAGAIN;
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_MEM_RANGE_UNBLOCK, err, &out);
+		return -EIO;
+	}
+	return 0;
+}
+
 static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
 				     enum pg_level level, void *private_spt)
 {
@@ -3073,9 +3143,11 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
 	x86_ops->link_private_spt = tdx_sept_link_private_spt;
 	x86_ops->free_private_spt = tdx_sept_free_private_spt;
 	x86_ops->split_private_spt = tdx_sept_split_private_spt;
+	x86_ops->merge_private_spt = tdx_sept_merge_private_spt;
 	x86_ops->set_private_spte = tdx_sept_set_private_spte;
 	x86_ops->remove_private_spte = tdx_sept_remove_private_spte;
 	x86_ops->zap_private_spte = tdx_sept_zap_private_spte;
+	x86_ops->unzap_private_spte = tdx_sept_unzap_private_spte;
 
 	return 0;
 
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index dd5e5981b39e..0828a35dc4e6 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -29,6 +29,7 @@
 #define TDH_MNG_KEY_FREEID		20
 #define TDH_MNG_INIT			21
 #define TDH_VP_INIT			22
+#define TDH_MEM_PAGE_PROMOTE		23
 #define TDH_VP_RD			26
 #define TDH_MNG_KEY_RECLAIMID		27
 #define TDH_PHYMEM_PAGE_RECLAIM		28
diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
index 53dc14ba9107..f1a050cae05c 100644
--- a/arch/x86/kvm/vmx/tdx_errno.h
+++ b/arch/x86/kvm/vmx/tdx_errno.h
@@ -21,6 +21,8 @@
 #define TDX_KEY_CONFIGURED			0x0000081500000000ULL
 #define TDX_NO_HKID_READY_TO_WBCACHE		0x0000082100000000ULL
 #define TDX_EPT_WALK_FAILED			0xC0000B0000000000ULL
+#define TDX_TLB_TRACKING_NOT_DONE		0xC0000B0800000000ULL
+#define TDX_EPT_INVALID_PROMOTE_CONDITIONS	0xC0000B0900000000ULL
 
 /*
  * TDG.VP.VMCALL Status Codes (returned in R10)
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index 739c67af849b..df41ab8f4ff7 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -168,6 +168,12 @@ static inline u64 tdh_mem_page_demote(hpa_t tdr, gpa_t gpa, int level, hpa_t pag
 	return tdx_seamcall_sept(TDH_MEM_PAGE_DEMOTE, gpa | level, tdr, page, 0, out);
 }
 
+static inline u64 tdh_mem_page_promote(hpa_t tdr, gpa_t gpa, int level,
+				       struct tdx_module_output *out)
+{
+	return tdx_seamcall_sept(TDH_MEM_PAGE_PROMOTE, gpa | level, tdr, 0, 0, out);
+}
+
 static inline u64 tdh_mr_extend(hpa_t tdr, gpa_t gpa,
 				struct tdx_module_output *out)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH v4 15/16] KVM: x86/mmu: Make kvm fault handler aware of large page of private memslot
  2023-07-25 22:23 [RFC PATCH v4 00/16] KVM TDX: TDP MMU: large page support isaku.yamahata
                   ` (13 preceding siblings ...)
  2023-07-25 22:24 ` [RFC PATCH v4 14/16] KVM: x86/tdp_mmu: TDX: Implement " isaku.yamahata
@ 2023-07-25 22:24 ` isaku.yamahata
  2023-07-25 22:24 ` [RFC PATCH v4 16/16] KVM: TDX: Allow 2MB large page for TD GUEST isaku.yamahata
  15 siblings, 0 replies; 21+ messages in thread
From: isaku.yamahata @ 2023-07-25 22:24 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang

From: Isaku Yamahata <isaku.yamahata@intel.com>

struct kvm_page_fault.req_level is the page level that governs the size of
the faulted-in page.  For now it is calculated only for conventional kvm
memslots, by host_pfn_mapping_level(), which walks the host page table.

However, host_pfn_mapping_level() cannot be used for a private kvm memslot
because pages of a private kvm memslot aren't mapped into the user virtual
address space.  Instead, the page order is given when getting the pfn.
Remember it in struct kvm_page_fault and use it.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/mmu.c          | 29 +++++++++++++++--------------
 arch/x86/kvm/mmu/mmu_internal.h | 12 +++++++++++-
 arch/x86/kvm/mmu/tdp_mmu.c      |  2 +-
 3 files changed, 27 insertions(+), 16 deletions(-)
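
The level selection described in the changelog can be sketched in user
space as follows.  This is a simplified model, not the kernel code: the
model_* names and the walked_level parameter (standing in for the result of
host_pfn_mapping_level()) are assumptions made for illustration, and the
lpage_info loop is left out.

```c
/* Simplified user-space model of the level selection; not kernel code. */
enum pg_level { PG_LEVEL_NONE, PG_LEVEL_4K, PG_LEVEL_2M, PG_LEVEL_1G };

static int model_min(int a, int b)
{
	return a < b ? a : b;
}

/*
 * max_level:    upper bound coming from the fault and the memslot.
 * host_level:   level derived from the page order returned when the private
 *               pfn was obtained; PG_LEVEL_NONE for a shared fault.
 * is_private:   non-zero for a private fault on a private-capable slot.
 * walked_level: hypothetical stand-in for host_pfn_mapping_level(), which
 *               in the kernel walks the host page table.
 */
static int model_max_mapping_level(int max_level, int host_level,
				   int is_private, int walked_level)
{
	if (max_level == PG_LEVEL_4K)
		return PG_LEVEL_4K;

	/* Shared memory: the host page table walk decides the level. */
	if (!is_private)
		host_level = walked_level;

	/* Private memory: the remembered page order decides the level. */
	return model_min(host_level, max_level);
}
```

With these definitions, a private fault backed by a 2MB allocation is capped
at PG_LEVEL_2M even when max_level allows 1GB, while a shared fault ignores
host_level and uses the walked level instead.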

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 949ef2fa8264..bb828eb2b1e3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3190,10 +3190,10 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
 
 static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
 				       const struct kvm_memory_slot *slot,
-				       gfn_t gfn, int max_level, bool is_private)
+				       gfn_t gfn, int max_level, int host_level,
+				       bool is_private)
 {
 	struct kvm_lpage_info *linfo;
-	int host_level;
 
 	max_level = min(max_level, max_huge_page_level);
 	for ( ; max_level > PG_LEVEL_4K; max_level--) {
@@ -3202,24 +3202,23 @@ static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
 			break;
 	}
 
-	if (is_private)
-		return max_level;
-
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
-	host_level = host_pfn_mapping_level(kvm, gfn, slot);
+	if (!is_private) {
+		WARN_ON_ONCE(host_level != PG_LEVEL_NONE);
+		host_level = host_pfn_mapping_level(kvm, gfn, slot);
+	}
+	WARN_ON_ONCE(host_level == PG_LEVEL_NONE);
 	return min(host_level, max_level);
 }
 
 int kvm_mmu_max_mapping_level(struct kvm *kvm,
 			      const struct kvm_memory_slot *slot, gfn_t gfn,
-			      int max_level)
+			      int max_level, bool faultin_private)
 {
-	bool is_private = kvm_slot_can_be_private(slot) &&
-			  kvm_mem_is_private(kvm, gfn);
-
-	return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, is_private);
+	return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level,
+					   PG_LEVEL_NONE, faultin_private);
 }
 
 void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
@@ -3244,7 +3243,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	 */
 	fault->req_level = __kvm_mmu_max_mapping_level(vcpu->kvm, slot,
 						       fault->gfn, fault->max_level,
-						       fault->is_private);
+						       fault->host_level,
+						       kvm_is_faultin_private(fault));
 	if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
 		return;
 
@@ -4391,6 +4391,7 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
 		return r;
 	}
 
+	fault->host_level = max_level;
 	fault->max_level = min(max_level, fault->max_level);
 	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
 	return RET_PF_CONTINUE;
@@ -4440,7 +4441,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 			return kvm_do_memory_fault_exit(vcpu, fault);
 	}
 
-	if (fault->is_private && kvm_slot_can_be_private(slot))
+	if (kvm_is_faultin_private(fault))
 		return kvm_faultin_pfn_private(vcpu, fault);
 
 	if (fault->is_private && !kvm_slot_can_be_private(fault->slot))
@@ -6922,7 +6923,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 		 */
 		if (sp->role.direct &&
 		    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
-							       PG_LEVEL_NUM)) {
+							       PG_LEVEL_NUM, false)) {
 			kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
 
 			if (kvm_available_flush_remote_tlbs_range())
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index bc3d38762ace..556c6ceec15f 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -359,6 +359,9 @@ struct kvm_page_fault {
 	 * is changing its own translation in the guest page tables.
 	 */
 	bool write_fault_to_shadow_pgtable;
+
+	/* valid only for private memslot && private gfn */
+	enum pg_level host_level;
 };
 
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
@@ -455,7 +458,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 
 int kvm_mmu_max_mapping_level(struct kvm *kvm,
 			      const struct kvm_memory_slot *slot, gfn_t gfn,
-			      int max_level);
+			      int max_level, bool faultin_private);
 void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
 
@@ -473,4 +476,11 @@ static inline bool kvm_hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t g
 }
 #endif
 
+static inline bool kvm_is_faultin_private(const struct kvm_page_fault *fault)
+{
+	if (IS_ENABLED(CONFIG_KVM_GENERIC_PRIVATE_MEM))
+		return fault->is_private && kvm_slot_can_be_private(fault->slot);
+	return false;
+}
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 612fcaac600d..6f22e38e3973 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -2176,7 +2176,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 			continue;
 
 		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
-							      iter.gfn, PG_LEVEL_NUM);
+							      iter.gfn, PG_LEVEL_NUM, false);
 		if (max_mapping_level < iter.level)
 			continue;
 
-- 
2.25.1



* [RFC PATCH v4 16/16] KVM: TDX: Allow 2MB large page for TD GUEST
  2023-07-25 22:23 [RFC PATCH v4 00/16] KVM TDX: TDP MMU: large page support isaku.yamahata
                   ` (14 preceding siblings ...)
  2023-07-25 22:24 ` [RFC PATCH v4 15/16] KVM: x86/mmu: Make kvm fault handler aware of large page of private memslot isaku.yamahata
@ 2023-07-25 22:24 ` isaku.yamahata
  15 siblings, 0 replies; 21+ messages in thread
From: isaku.yamahata @ 2023-07-25 22:24 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang, Xiaoyao Li

From: Xiaoyao Li <xiaoyao.li@intel.com>

Now that everything is in place to support 2MB pages for TD guests, and
because the TDX module's TDH.MEM.PAGE.AUG supports both 4KB and 2MB pages, set
struct kvm_arch.tdp_max_page_level to the 2MB page level.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 9 ++-------
 arch/x86/kvm/vmx/tdx.c     | 4 ++--
 2 files changed, 4 insertions(+), 9 deletions(-)
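
For reference, the arithmetic behind the 2MB level can be sketched as below.
This is a stand-alone illustration under the usual x86 assumptions (4KB base
pages, 9 address bits per page-table level); the MODEL_* macros and model_*
helpers are local to this sketch and only mirror the shape of the kernel's
KVM_HPAGE_SIZE() and KVM_PAGES_PER_HPAGE().

```c
#include <stdint.h>

/* Page levels as numbered by KVM on x86. */
enum pg_level { PG_LEVEL_NONE, PG_LEVEL_4K, PG_LEVEL_2M, PG_LEVEL_1G };

#define MODEL_PAGE_SHIFT		12
/* Each level above 4K adds 9 bits of address space. */
#define MODEL_HPAGE_GFN_SHIFT(level)	(((level) - 1) * 9)
#define MODEL_HPAGE_SIZE(level) \
	(UINT64_C(1) << (MODEL_PAGE_SHIFT + MODEL_HPAGE_GFN_SHIFT(level)))
#define MODEL_PAGES_PER_HPAGE(level) \
	(MODEL_HPAGE_SIZE(level) / MODEL_HPAGE_SIZE(PG_LEVEL_4K))

/* Clamp a requested mapping level to the per-VM maximum. */
static int model_clamp_level(int requested, int tdp_max_page_level)
{
	return requested < tdp_max_page_level ? requested : tdp_max_page_level;
}

/* First gfn of the large page that contains gfn at the given level. */
static uint64_t model_head_gfn(uint64_t gfn, int level)
{
	return gfn & ~(MODEL_PAGES_PER_HPAGE(level) - 1);
}
```

With tdp_max_page_level set to PG_LEVEL_2M, a request for a 1GB mapping is
clamped to 2MB, and one 2MB mapping covers 512 4KB pages.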

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6f22e38e3973..d78174a6c69b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1562,14 +1562,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
 
-		if (is_shadow_present_pte(iter.old_spte)) {
-			/*
-			 * TODO: large page support.
-			 * Doesn't support large page for TDX now
-			 */
-			KVM_BUG_ON(is_private_sptep(iter.sptep), vcpu->kvm);
+		if (is_shadow_present_pte(iter.old_spte))
 			r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
-		} else
+		else
 			r = tdp_mmu_link_sp(kvm, &iter, sp, true);
 
 		/*
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 2f375e0e45aa..0625c172cbb9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -526,8 +526,8 @@ int tdx_vm_init(struct kvm *kvm)
 	 */
 	kvm_mmu_set_mmio_spte_value(kvm, 0);
 
-	/* TODO: Enable 2mb and 1gb large page support. */
-	kvm->arch.tdp_max_page_level = PG_LEVEL_4K;
+	/* TDH.MEM.PAGE.AUG supports up to 2MB page. */
+	kvm->arch.tdp_max_page_level = PG_LEVEL_2M;
 
 	/*
 	 * This function initializes only KVM software construct.  It doesn't
-- 
2.25.1



* Re: [RFC PATCH v4 13/16] KVM: x86/tdp_mmu: Try to merge pages into a large page
  2023-07-25 22:23 ` [RFC PATCH v4 13/16] KVM: x86/tdp_mmu: Try to merge pages into a large page isaku.yamahata
@ 2023-08-14 20:35   ` Isaku Yamahata
  0 siblings, 0 replies; 21+ messages in thread
From: Isaku Yamahata @ 2023-08-14 20:35 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang

On Tue, Jul 25, 2023 at 03:23:59PM -0700,
isaku.yamahata@intel.com wrote:

> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> When a large page is passed to the KVM page fault handler and some of its
> sub-pages are already populated, try to merge the sub-pages into a large
> page.  This situation can happen when the guest converts small pages to
> shared and then converts them back to private.
> 
> When a large page is passed to the KVM MMU page fault handler and the spte
> corresponding to the page is non-leaf (one or more of the sub-pages are
> already populated at a lower page level), the current KVM MMU zaps the
> non-leaf spte at the large page level and populates a leaf spte at that
> level.  Thus small pages are converted into a large page.  However, this
> doesn't work for TDX because zapping and re-populating results in zeroing
> the page content.  Instead, populate all small pages and merge them into a
> large page.
> 
> Merging pages into a large page can fail when some sub-pages are accepted
> and some are not.  In such a case, on the assumption that the guest tries
> to accept at large page size for performance when possible, don't try to be
> smart about identifying which page is still pending; map all pages at the
> lower page level and let the vcpu re-execute.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/kvm-x86-ops.h |   2 +
>  arch/x86/include/asm/kvm_host.h    |   4 +
>  arch/x86/kvm/mmu/tdp_iter.c        |  37 +++++--
>  arch/x86/kvm/mmu/tdp_iter.h        |   2 +
>  arch/x86/kvm/mmu/tdp_mmu.c         | 163 ++++++++++++++++++++++++++++-
>  5 files changed, 198 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index c3963002722c..612fcaac600d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1242,6 +1242,167 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm, bool skip_private)
>  	rcu_read_unlock();
>  }
>  
> +static void tdp_mmu_iter_step_side(int i, struct tdp_iter *iter)
> +{
> +	/*
> +	 * if i = SPTE_ENT_PER_PAGE - 1, tdp_iter_step_side() results
> +	 * in reading the entry beyond the last entry.
> +	 */
> +	if (i < SPTE_ENT_PER_PAGE)
> +		tdp_iter_step_side(iter);
> +}
> +
> +static int tdp_mmu_merge_private_spt(struct kvm_vcpu *vcpu,
> +				     struct kvm_page_fault *fault,
> +				     struct tdp_iter *iter, u64 new_spte)
> +{
> +	u64 *sptep = rcu_dereference(iter->sptep);
> +	struct kvm_mmu_page *child_sp;
> +	struct kvm *kvm = vcpu->kvm;
> +	struct tdp_iter child_iter;
> +	bool ret_pf_retry = false;
> +	int level = iter->level;
> +	gfn_t gfn = iter->gfn;
> +	u64 old_spte = *sptep;
> +	tdp_ptep_t child_pt;
> +	u64 child_spte;
> +	int ret = 0;
> +	int i;
> +
> +	/*
> +	 * TDX KVM supports only 2MB large page.  It's not supported to merge
> +	 * 2MB pages into 1GB page at the moment.
> +	 */
> +	WARN_ON_ONCE(fault->goal_level != PG_LEVEL_2M);
> +	WARN_ON_ONCE(iter->level != PG_LEVEL_2M);
> +	WARN_ON_ONCE(!is_large_pte(new_spte));
> +
> +	/* Freeze the spte to prevent other threads from working spte. */
> +	if (!try_cmpxchg64(sptep, &iter->old_spte, REMOVED_SPTE))
> +		return -EBUSY;
> +
> +	/*
> +	 * Step down to the child spte.  Because tdp_iter_next() assumes the
> +	 * parent spte isn't freezed, do it manually.
> +	 */
> +	child_pt = spte_to_child_pt(iter->old_spte, iter->level);
> +	child_sp = sptep_to_sp(child_pt);
> +	WARN_ON_ONCE(child_sp->role.level != PG_LEVEL_4K);
> +	WARN_ON_ONCE(!kvm_mmu_page_role_is_private(child_sp->role));
> +
> +	/* Don't modify iter as the caller will use iter after this function. */
> +	child_iter = *iter;
> +	/* Adjust the target gfn to the head gfn of the large page. */
> +	child_iter.next_last_level_gfn &= -KVM_PAGES_PER_HPAGE(level);
> +	tdp_iter_step_down(&child_iter, child_pt);
> +
> +	/*
> +	 * All child pages are required to be populated for merging them into a
> +	 * large page.  Populate all child spte.
> +	 */
> +	for (i = 0; i < SPTE_ENT_PER_PAGE; i++, tdp_mmu_iter_step_side(i, &child_iter)) {
> +		WARN_ON_ONCE(child_iter.level != PG_LEVEL_4K);
> +		if (is_shadow_present_pte(child_iter.old_spte)) {
> +			/* TODO: relocate page for huge page. */
> +			if (WARN_ON_ONCE(spte_to_pfn(child_iter.old_spte) !=
> +					 spte_to_pfn(new_spte) + i)) {
> +				ret = -EAGAIN;
> +				ret_pf_retry = true;
> +			}
> +			/*
> +			 * When SEPT_VE_DISABLE=true and the page state is
> +			 * pending, this case can happen.  Just resume the vcpu
> +			 * again with the expectation for other vcpu to accept
> +			 * this page.
> +			 */
> +			if (child_iter.gfn == fault->gfn) {
> +				ret = -EAGAIN;
> +				ret_pf_retry = true;
> +				break;
> +			}
> +			continue;
> +		}
> +
> +		WARN_ON_ONCE(spte_to_pfn(child_iter.old_spte) != spte_to_pfn(new_spte) + i);
> +		child_spte = make_huge_page_split_spte(kvm, new_spte, child_sp->role, i);
> +		/*
> +		 * Because other thread may have started to operate on this spte
> +		 * before freezing the parent spte,  Use atomic version to
> +		 * prevent race.
> +		 */
> +		ret = tdp_mmu_set_spte_atomic(vcpu->kvm, &child_iter, child_spte);
> +		if (ret == -EBUSY || ret == -EAGAIN)
> +			/*
> +			 * There was a race condition.  Populate remaining 4K
> +			 * spte to resolve fault->gfn to guarantee the forward
> +			 * progress.
> +			 */
> +			ret_pf_retry = true;
> +		else if (ret)
> +			goto out;
> +
> +	}
> +	if (ret_pf_retry)
> +		goto out;
> +
> +	/* Prevent the Secure-EPT entry from being used. */
> +	ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level);
> +	if (ret)
> +		goto out;
> +	kvm_flush_remote_tlbs_range(kvm, gfn & KVM_HPAGE_GFN_MASK(level),
> +				    KVM_PAGES_PER_HPAGE(level));
> +
> +	/* Merge pages into a large page. */
> +	ret = static_call(kvm_x86_merge_private_spt)(kvm, gfn, level,
> +						     kvm_mmu_private_spt(child_sp));
> +	/*
> +	 * Failed to merge pages because some pages are accepted and some are
> +	 * pending.  Since the child page was mapped above, let vcpu run.
> +	 */
> +	if (ret) {
> +		if (static_call(kvm_x86_unzap_private_spte)(kvm, gfn, level))
> +			old_spte = SHADOW_NONPRESENT_VALUE |
> +				(spte_to_pfn(old_spte) << PAGE_SHIFT) |
> +				PT_PAGE_SIZE_MASK;
> +		goto out;
> +	}
> +
> +	/* Unfreeze spte. */
> +	__kvm_tdp_mmu_write_spte(sptep, new_spte);
> +
> +	/*
> +	 * Free unused child sp.  Secure-EPT page was already freed at TDX level
> +	 * by kvm_x86_merge_private_spt().
> +	 */
> +	tdp_unaccount_mmu_page(kvm, child_sp);
> +	tdp_mmu_free_sp(child_sp);
> +	return -EAGAIN;
> +
> +out:
> +	__kvm_tdp_mmu_write_spte(sptep, old_spte);
> +	return ret;
> +}
> +
> +static int __tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> +					     struct kvm_page_fault *fault,
> +					     struct tdp_iter *iter, u64 new_spte)
> +{
> +	/*
> +	 * The private page has smaller-size pages.  For example, the child
> +	 * pages was converted from shared to page, and now it can be mapped as
> +	 * a large page.  Try to merge small pages into a large page.
> +	 */
> +	if (fault->slot &&
> +	    kvm_gfn_shared_mask(vcpu->kvm) &&
> +	    iter->level > PG_LEVEL_4K &&
> +	    kvm_is_private_gpa(vcpu->kvm, fault->addr) &&
> +	    is_shadow_present_pte(iter->old_spte) &&
> +	    !is_large_pte(iter->old_spte))
> +		return tdp_mmu_merge_private_spt(vcpu, fault, iter, new_spte);
> +
> +	return tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte);
> +}
> +
>  /*
>   * Installs a last-level SPTE to handle a TDP page fault.
>   * (NPT/EPT violation/misconfiguration)
> @@ -1276,7 +1437,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>  
>  	if (new_spte == iter->old_spte)
>  		ret = RET_PF_SPURIOUS;
> -	else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
> +	else if (__tdp_mmu_map_handle_target_level(vcpu, fault, iter, new_spte))
>  		return RET_PF_RETRY;
>  	else if (is_shadow_present_pte(iter->old_spte) &&
>  		 !is_last_spte(iter->old_spte, iter->level))
> -- 
> 2.25.1
> 


I missed a race condition and had a wrong WARN_ON_ONCE().  I think it's hard
to hit because
- In most cases, we take the if (is_shadow_present_pte()) branch because the
  page is mapped as a large page, split on mapgpa(to-shared), and merged on
  mapgpa(to-private-again).
  To reach the other path, we need an independent mapgpa sequence on a
  different GPA, but within the same 2M range.

- To hit the removed case, we need a race between 2 vcpus in addition to the
  above.

Anyway, this fix will be included in the next respin.
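
To illustrate the removed case, here is a minimal user-space sketch of the
freeze-by-compare-and-exchange pattern, written with C11 atomics rather than
the kernel's try_cmpxchg64().  MODEL_REMOVED_SPTE is a placeholder value
chosen for this sketch (the kernel's REMOVED_SPTE encoding differs), and the
model_* helpers are hypothetical.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Placeholder marker; not the kernel's REMOVED_SPTE encoding. */
#define MODEL_REMOVED_SPTE	UINT64_C(0x5a0)

static int model_is_removed_spte(uint64_t spte)
{
	return spte == MODEL_REMOVED_SPTE;
}

/*
 * Freeze an entry by swapping in the removed marker.  This succeeds only if
 * the entry still holds the value the caller last observed; on failure,
 * *expected is updated to the value currently stored in the entry.
 */
static int model_freeze_spte(_Atomic uint64_t *sptep, uint64_t *expected)
{
	if (!atomic_compare_exchange_strong(sptep, expected,
					    MODEL_REMOVED_SPTE))
		return -EBUSY;
	return 0;
}

/*
 * What the merge loop has to do with a child entry another vcpu has frozen:
 * back off and retry instead of treating it as a non-present entry.
 */
static int model_visit_child(uint64_t child_spte)
{
	if (model_is_removed_spte(child_spte))
		return -EAGAIN;
	return 0;
}
```

A second thread that raced on the same entry with a stale snapshot gets
-EBUSY from model_freeze_spte() and sees the removed marker in *expected,
which is the state the added is_removed_spte() check is meant to handle.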

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 70051dd863a8..4ccfbd04fb27 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1306,6 +1306,13 @@ static int tdp_mmu_merge_private_spt(struct kvm_vcpu *vcpu,
 	 */
 	for (i = 0; i < SPTE_ENT_PER_PAGE; i = tdp_mmu_iter_step_side(i, &child_iter)) {
 		WARN_ON_ONCE(child_iter.level != PG_LEVEL_4K);
+
+		if (is_removed_spte(child_iter.old_spte)) {
+			ret = -EAGAIN;
+			ret_pf_retry = true;
+			continue;
+		}
+
 		if (is_shadow_present_pte(child_iter.old_spte)) {
 			/* TODO: relocate page for huge page. */
 			if (WARN_ON_ONCE(spte_to_pfn(child_iter.old_spte) !=
@@ -1327,7 +1334,6 @@ static int tdp_mmu_merge_private_spt(struct kvm_vcpu *vcpu,
 			continue;
 		}
 
-		WARN_ON_ONCE(spte_to_pfn(child_iter.old_spte) != spte_to_pfn(new_spte) + i);
 		child_spte = make_huge_page_split_spte(kvm, new_spte, child_sp->role, i);
 		/*
 		 * Because other thread may have started to operate on this spte
-- 
2.25.1

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>


* Re: [RFC PATCH v4 01/16] KVM: TDP_MMU: Go to next level if smaller private mapping exists
  2023-07-25 22:23 ` [RFC PATCH v4 01/16] KVM: TDP_MMU: Go to next level if smaller private mapping exists isaku.yamahata
@ 2023-09-05  8:10   ` Binbin Wu
  0 siblings, 0 replies; 21+ messages in thread
From: Binbin Wu @ 2023-09-05  8:10 UTC (permalink / raw)
  To: isaku.yamahata, Xiaoyao Li
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang



On 7/26/2023 6:23 AM, isaku.yamahata@intel.com wrote:
> From: Xiaoyao Li <xiaoyao.li@intel.com>
>
> A private page cannot be mapped as a large page if any smaller mapping
> exists.
>
> It has to wait for all the not-yet-mapped smaller pages to be mapped and
> then be promoted to a larger mapping.
>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/mmu/tdp_mmu.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 95ba78944712..a9f0f4ade2d0 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1293,7 +1293,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   	tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) {
>   		int r;
>   
> -		if (fault->nx_huge_page_workaround_enabled)
> +		if (fault->nx_huge_page_workaround_enabled ||
> +		    kvm_gfn_shared_mask(vcpu->kvm))
>   			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
>   
>   		/*
The implementation of disallowed_hugepage_adjust() is as follows:

void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level)
{
     if (cur_level > PG_LEVEL_4K &&
         cur_level == fault->goal_level &&
         is_shadow_present_pte(spte) &&
         !is_large_pte(spte) &&
         spte_to_child_sp(spte)->nx_huge_page_disallowed) {
             ...
     }
}

One of the conditions is that spte_to_child_sp(spte)->nx_huge_page_disallowed
must be true for the goal level of the fault to be decreased.
Does this condition make the change in this patch ineffective?
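
The concern can be made concrete with a small user-space model of the
predicate quoted above.  The struct layouts and model_* names are
assumptions for illustration only, and the elided body is reduced to the
goal-level decrement mentioned above.

```c
#include <stdbool.h>

enum pg_level { PG_LEVEL_NONE, PG_LEVEL_4K, PG_LEVEL_2M, PG_LEVEL_1G };

/* Hypothetical flattened view of an spte and of its child shadow page. */
struct model_spte {
	bool present;
	bool large;
	bool child_nx_huge_page_disallowed;
};

struct model_fault {
	int goal_level;
};

/* Mirrors only the condition and the goal-level decrement. */
static void model_disallowed_hugepage_adjust(struct model_fault *fault,
					     const struct model_spte *spte,
					     int cur_level)
{
	if (cur_level > PG_LEVEL_4K &&
	    cur_level == fault->goal_level &&
	    spte->present &&
	    !spte->large &&
	    spte->child_nx_huge_page_disallowed)
		fault->goal_level--;
}
```

In this model, a present non-leaf spte whose child does not have
nx_huge_page_disallowed set leaves goal_level at PG_LEVEL_2M, so merely
calling the function for TDX would not by itself step down to the next level.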



* Re: [RFC PATCH v4 05/16] KVM: TDX: Pass size to reclaim_page()
  2023-07-25 22:23 ` [RFC PATCH v4 05/16] KVM: TDX: Pass size to reclaim_page() isaku.yamahata
@ 2023-09-06  1:48   ` Binbin Wu
  0 siblings, 0 replies; 21+ messages in thread
From: Binbin Wu @ 2023-09-06  1:48 UTC (permalink / raw)
  To: isaku.yamahata, Xiaoyao Li
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang



On 7/26/2023 6:23 AM, isaku.yamahata@intel.com wrote:
> From: Xiaoyao Li <xiaoyao.li@intel.com>
>
> A 2MB large page can be tdh_mem_page_aug()'ed to a TD directly.  In this case,
> the page needs to be reclaimed and cleared at 2MB size.
>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 24 ++++++++++++++----------
>   1 file changed, 14 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 3522ee232eda..86cfbf435671 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -198,12 +198,13 @@ static void tdx_disassociate_vp_on_cpu(struct kvm_vcpu *vcpu)
>   	smp_call_function_single(cpu, tdx_disassociate_vp_arg, vcpu, 1);
>   }
>   
> -static void tdx_clear_page(unsigned long page_pa)
> +static void tdx_clear_page(unsigned long page_pa, int size)
>   {
>   	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
>   	void *page = __va(page_pa);
>   	unsigned long i;
>   
> +	WARN_ON_ONCE(size % PAGE_SIZE);
>   	/*
>   	 * When re-assign one page from old keyid to a new keyid, MOVDIR64B is
>   	 * required to clear/write the page with new keyid to prevent integrity
> @@ -212,7 +213,7 @@ static void tdx_clear_page(unsigned long page_pa)
>   	 * clflush doesn't flush cache with HKID set.  The cache line could be
>   	 * poisoned (even without MKTME-i), clear the poison bit.
>   	 */
> -	for (i = 0; i < PAGE_SIZE; i += 64)
> +	for (i = 0; i < size; i += 64)
>   		movdir64b(page + i, zero_page);
>   	/*
>   	 * MOVDIR64B store uses WC buffer.  Prevent following memory reads
> @@ -221,7 +222,8 @@ static void tdx_clear_page(unsigned long page_pa)
>   	__mb();
>   }
>   
> -static int tdx_reclaim_page(hpa_t pa, bool do_wb, u16 hkid)
> +static int tdx_reclaim_page(hpa_t pa, enum pg_level level,
> +			    bool do_wb, u16 hkid)
>   {
>   	struct tdx_module_output out;
>   	u64 err;
> @@ -239,8 +241,10 @@ static int tdx_reclaim_page(hpa_t pa, bool do_wb, u16 hkid)
>   		pr_tdx_error(TDH_PHYMEM_PAGE_RECLAIM, err, &out);
>   		return -EIO;
>   	}
> +	/* out.r8 == tdx sept page level */
> +	WARN_ON_ONCE(out.r8 != pg_level_to_tdx_sept_level(level));
>   
> -	if (do_wb) {
> +	if (do_wb && level == PG_LEVEL_4K) {
I was wondering whether it would be better to add a WARN_ON_ONCE() to ensure
level is PG_LEVEL_4K instead of skipping it silently.  But then I found that
the warning comparing out.r8 and level already guarantees that there will be
a warning if there is a mismatch between do_wb and level.
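
A small user-space model of that reasoning, for illustration only.  It
assumes that the TDX Secure-EPT level is the KVM page level minus one (0 for
4KB, 1 for 2MB) and that the reclaimed TDR page is reported as 4KB in out.r8;
the model_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum pg_level { PG_LEVEL_NONE, PG_LEVEL_4K, PG_LEVEL_2M, PG_LEVEL_1G };

/* Assumption: TDX SEPT level = KVM level - 1. */
static int model_pg_level_to_tdx_sept_level(int level)
{
	return level - 1;
}

/* True when WARN_ON_ONCE(out.r8 != pg_level_to_tdx_sept_level(level)) fires. */
static bool model_level_mismatch(uint64_t out_r8, int level)
{
	return out_r8 != (uint64_t)model_pg_level_to_tdx_sept_level(level);
}

/* The write-back step runs only for a 4KB page with do_wb set. */
static bool model_needs_wb(bool do_wb, int level)
{
	return do_wb && level == PG_LEVEL_4K;
}
```

Under those assumptions, a caller that passes do_wb=true together with
PG_LEVEL_2M skips the write-back, but the level check has already warned
because the reclaimed page is reported as 4KB.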

>   		/*
>   		 * Only TDR page gets into this path.  No contention is expected
>   		 * because of the last page of TD.
> @@ -252,7 +256,7 @@ static int tdx_reclaim_page(hpa_t pa, bool do_wb, u16 hkid)
>   		}
>   	}
>   
> -	tdx_clear_page(pa);
> +	tdx_clear_page(pa, KVM_HPAGE_SIZE(level));
>   	return 0;
>   }
>   
> @@ -266,7 +270,7 @@ static void tdx_reclaim_td_page(unsigned long td_page_pa)
>   	 * was already flushed by TDH.PHYMEM.CACHE.WB before here, So
>   	 * cache doesn't need to be flushed again.
>   	 */
> -	if (tdx_reclaim_page(td_page_pa, false, 0))
> +	if (tdx_reclaim_page(td_page_pa, PG_LEVEL_4K, false, 0))
>   		/*
>   		 * Leak the page on failure:
>   		 * tdx_reclaim_page() returns an error if and only if there's an
> @@ -474,7 +478,7 @@ void tdx_vm_free(struct kvm *kvm)
>   	 * while operating on TD (Especially reclaiming TDCS).  Cache flush with
>   	 * TDX global HKID is needed.
>   	 */
> -	if (tdx_reclaim_page(kvm_tdx->tdr_pa, true, tdx_global_keyid))
> +	if (tdx_reclaim_page(kvm_tdx->tdr_pa, PG_LEVEL_4K, true, tdx_global_keyid))
>   		return;
>   
>   	free_page((unsigned long)__va(kvm_tdx->tdr_pa));
> @@ -1468,7 +1472,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>   		 * The HKID assigned to this TD was already freed and cache
>   		 * was already flushed. We don't have to flush again.
>   		 */
> -		err = tdx_reclaim_page(hpa, false, 0);
> +		err = tdx_reclaim_page(hpa, level, false, 0);
>   		if (KVM_BUG_ON(err, kvm))
>   			return -EIO;
>   		tdx_unpin(kvm, pfn);
> @@ -1501,7 +1505,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>   		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
>   		return -EIO;
>   	}
> -	tdx_clear_page(hpa);
> +	tdx_clear_page(hpa, PAGE_SIZE);
>   	tdx_unpin(kvm, pfn);
>   	return 0;
>   }
> @@ -1612,7 +1616,7 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
>   	 * already flushed. We don't have to flush again.
>   	 */
>   	if (!is_hkid_assigned(kvm_tdx))
> -		return tdx_reclaim_page(__pa(private_spt), false, 0);
> +		return tdx_reclaim_page(__pa(private_spt), PG_LEVEL_4K, false, 0);
>   
>   	/*
>   	 * free_private_spt() is (obviously) called when a shadow page is being



* Re: [RFC PATCH v4 08/16] KVM: TDX: Pin pages via get_page() right before ADD/AUG'ed to TDs
  2023-07-25 22:23 ` [RFC PATCH v4 08/16] KVM: TDX: Pin pages via get_page() right before ADD/AUG'ed to TDs isaku.yamahata
@ 2023-09-07  5:26   ` Binbin Wu
  0 siblings, 0 replies; 21+ messages in thread
From: Binbin Wu @ 2023-09-07  5:26 UTC (permalink / raw)
  To: isaku.yamahata, Xiaoyao Li
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, hang.yuan, tina.zhang



On 7/26/2023 6:23 AM, isaku.yamahata@intel.com wrote:
> From: Xiaoyao Li <xiaoyao.li@intel.com>
>
> When kvm_faultin_pfn() runs, it doesn't know at which page level the gfn
> will be mapped.  Hence it doesn't know whether to pin a 4K page or a
> 2M page.
>
> Move the guest private page pinning logic to right before
> TDH_MEM_PAGE_ADD/AUG(), since the page level is known at that time.
The code change in this patch doesn't match the changelog: the diff only adds
a gfn parameter to tdx_unpin() and updates its callers.

>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 15 ++++++++-------
>   1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index c122160142fd..bd1582e6b693 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1361,7 +1361,8 @@ static void tdx_measure_page(struct kvm_tdx *kvm_tdx, hpa_t gpa, int size)
>   	}
>   }
>   
> -static void tdx_unpin(struct kvm *kvm, kvm_pfn_t pfn, int level)
> +static void tdx_unpin(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> +		      enum pg_level level)
>   {
>   	int i;
>   
> @@ -1397,12 +1398,12 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>   	if (likely(is_td_finalized(kvm_tdx))) {
>   		err = tdh_mem_page_aug(kvm_tdx->tdr_pa, gpa, tdx_level, hpa, &out);
>   		if (unlikely(err == TDX_ERROR_SEPT_BUSY)) {
> -			tdx_unpin(kvm, pfn, level);
> +			tdx_unpin(kvm, gfn, pfn, level);
>   			return -EAGAIN;
>   		}
>   		if (KVM_BUG_ON(err, kvm)) {
>   			pr_tdx_error(TDH_MEM_PAGE_AUG, err, &out);
> -			tdx_unpin(kvm, pfn, level);
> +			tdx_unpin(kvm, gfn, pfn, level);
>   			return -EIO;
>   		}
>   		return 0;
> @@ -1425,7 +1426,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>   	 * always uses vcpu 0's page table and protected by vcpu->mutex).
>   	 */
>   	if (KVM_BUG_ON(kvm_tdx->source_pa == INVALID_PAGE, kvm)) {
> -		tdx_unpin(kvm, pfn, level);
> +		tdx_unpin(kvm, gfn, pfn, level);
>   		return -EINVAL;
>   	}
>   
> @@ -1443,7 +1444,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>   	} while (unlikely(err == TDX_ERROR_SEPT_BUSY));
>   	if (KVM_BUG_ON(err, kvm)) {
>   		pr_tdx_error(TDH_MEM_PAGE_ADD, err, &out);
> -		tdx_unpin(kvm, pfn, level);
> +		tdx_unpin(kvm, gfn, pfn, level);
>   		return -EIO;
>   	} else if (measure)
>   		tdx_measure_page(kvm_tdx, gpa, KVM_HPAGE_SIZE(level));
> @@ -1472,7 +1473,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>   		err = tdx_reclaim_page(hpa, level, false, 0);
>   		if (KVM_BUG_ON(err, kvm))
>   			return -EIO;
> -		tdx_unpin(kvm, pfn, level);
> +		tdx_unpin(kvm, gfn, pfn, level);
>   		return 0;
>   	}
>   
> @@ -1505,7 +1506,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>   			r = -EIO;
>   		} else {
>   			tdx_clear_page(hpa, PAGE_SIZE);
> -			tdx_unpin(kvm, pfn + i, PG_LEVEL_4K);
> +			tdx_unpin(kvm, gfn + i, pfn + i, PG_LEVEL_4K);
>   		}
>   		hpa += PAGE_SIZE;
>   	}



end of thread, other threads:[~2023-09-07  5:26 UTC | newest]

Thread overview: 21+ messages
2023-07-25 22:23 [RFC PATCH v4 00/16] KVM TDX: TDP MMU: large page support isaku.yamahata
2023-07-25 22:23 ` [RFC PATCH v4 01/16] KVM: TDP_MMU: Go to next level if smaller private mapping exists isaku.yamahata
2023-09-05  8:10   ` Binbin Wu
2023-07-25 22:23 ` [RFC PATCH v4 02/16] KVM: TDX: Pass page level to cache flush before TDX SEAMCALL isaku.yamahata
2023-07-25 22:23 ` [RFC PATCH v4 03/16] KVM: TDX: Pass KVM page level to tdh_mem_page_add() and tdh_mem_page_aug() isaku.yamahata
2023-07-25 22:23 ` [RFC PATCH v4 04/16] KVM: TDX: Pass size to tdx_measure_page() isaku.yamahata
2023-07-25 22:23 ` [RFC PATCH v4 05/16] KVM: TDX: Pass size to reclaim_page() isaku.yamahata
2023-09-06  1:48   ` Binbin Wu
2023-07-25 22:23 ` [RFC PATCH v4 06/16] KVM: TDX: Update tdx_sept_{set,drop}_private_spte() to support large page isaku.yamahata
2023-07-25 22:23 ` [RFC PATCH v4 07/16] KVM: MMU: Introduce level info in PFERR code isaku.yamahata
2023-07-25 22:23 ` [RFC PATCH v4 08/16] KVM: TDX: Pin pages via get_page() right before ADD/AUG'ed to TDs isaku.yamahata
2023-09-07  5:26   ` Binbin Wu
2023-07-25 22:23 ` [RFC PATCH v4 09/16] KVM: TDX: Pass desired page level in err code for page fault handler isaku.yamahata
2023-07-25 22:23 ` [RFC PATCH v4 10/16] KVM: x86/tdp_mmu: Allocate private page table for large page split isaku.yamahata
2023-07-25 22:23 ` [RFC PATCH v4 11/16] KVM: x86/tdp_mmu: Split the large page when zap leaf isaku.yamahata
2023-07-25 22:23 ` [RFC PATCH v4 12/16] KVM: x86/tdp_mmu, TDX: Split a large page when 4KB page within it converted to shared isaku.yamahata
2023-07-25 22:23 ` [RFC PATCH v4 13/16] KVM: x86/tdp_mmu: Try to merge pages into a large page isaku.yamahata
2023-08-14 20:35   ` Isaku Yamahata
2023-07-25 22:24 ` [RFC PATCH v4 14/16] KVM: x86/tdp_mmu: TDX: Implement " isaku.yamahata
2023-07-25 22:24 ` [RFC PATCH v4 15/16] KVM: x86/mmu: Make kvm fault handler aware of large page of private memslot isaku.yamahata
2023-07-25 22:24 ` [RFC PATCH v4 16/16] KVM: TDX: Allow 2MB large page for TD GUEST isaku.yamahata
