* [PATCH 0/6] xen/x86: PV boot speedup
@ 2021-09-30 12:33 Jan Beulich
  2021-09-30 12:34 ` [PATCH 1/6] xen/x86: streamline set_pte_mfn() Jan Beulich
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: Jan Beulich @ 2021-09-30 12:33 UTC (permalink / raw)
  To: Juergen Gross, Boris Ostrovsky; +Cc: Stefano Stabellini, lkml, xen-devel

The observed (by the human eye) performance difference of early boot
between native and PV-on-Xen was just too large to not look into. As
it turns out, gaining performance back wasn't all that difficult.

While the series (re)introduces a small number of PTWR emulations on
the boot path (from phys_pte_init()), there has been a much larger
number of them post-boot. Hence I think if this was of concern, the
post-boot instances would want eliminating first.

Some of the later changes aren't directly related to the main goal of
the series; these address aspects noticed while doing the investigation.

1: streamline set_pte_mfn()
2: restore (fix) xen_set_pte_init() behavior
3: adjust xen_set_fixmap()
4: adjust handling of the L3 user vsyscall special page table
5: there's no highmem anymore in PV mode
6: restrict PV Dom0 identity mapping

Jan



* [PATCH 1/6] xen/x86: streamline set_pte_mfn()
  2021-09-30 12:33 [PATCH 0/6] xen/x86: PV boot speedup Jan Beulich
@ 2021-09-30 12:34 ` Jan Beulich
  2021-09-30 12:35 ` [PATCH 2/6] xen/x86: restore (fix) xen_set_pte_init() behavior Jan Beulich
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Jan Beulich @ 2021-09-30 12:34 UTC (permalink / raw)
  To: Juergen Gross, Boris Ostrovsky; +Cc: Stefano Stabellini, lkml, xen-devel

In preparation for restoring xen_set_pte_init()'s original behavior of
avoiding hypercalls, make set_pte_mfn() no longer use the standard
set_pte() code path. That one is more complicated than the alternative
of simply using an available hypercall directly. This way we can avoid
introducing a fair number (2k on my test system) of cases where the
hypervisor would trap-and-emulate page table updates.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -241,9 +241,11 @@ static void xen_set_pmd(pmd_t *ptr, pmd_
  * Associate a virtual page frame with a given physical page frame
  * and protection flags for that frame.
  */
-void set_pte_mfn(unsigned long vaddr, unsigned long mfn, pgprot_t flags)
+void __init set_pte_mfn(unsigned long vaddr, unsigned long mfn, pgprot_t flags)
 {
-	set_pte_vaddr(vaddr, mfn_pte(mfn, flags));
+	if (HYPERVISOR_update_va_mapping(vaddr, mfn_pte(mfn, flags),
+					 UVMF_INVLPG))
+		BUG();
 }
 
 static bool xen_batched_set_pte(pte_t *ptep, pte_t pteval)
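
For illustration, a minimal sketch (the helper name is invented; this is
not part of the patch) of the pattern adopted above: the replaced
set_pte_vaddr() path walks pgd/p4d/pud/pmd and ends in a set_pte(),
which Xen has to trap and emulate once the table is known to it, while
a single update_va_mapping hypercall both updates the PTE and flushes
the one TLB entry.

/* Sketch only: map one machine frame at vaddr in a single hypercall. */
static void __init sketch_map_frame(unsigned long vaddr, unsigned long mfn)
{
	/* One transition into Xen: PTE update plus single-entry TLB flush. */
	if (HYPERVISOR_update_va_mapping(vaddr, mfn_pte(mfn, PAGE_KERNEL),
					 UVMF_INVLPG))
		BUG();	/* early boot - no sensible way to recover */
}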



* [PATCH 2/6] xen/x86: restore (fix) xen_set_pte_init() behavior
  2021-09-30 12:33 [PATCH 0/6] xen/x86: PV boot speedup Jan Beulich
  2021-09-30 12:34 ` [PATCH 1/6] xen/x86: streamline set_pte_mfn() Jan Beulich
@ 2021-09-30 12:35 ` Jan Beulich
  2021-09-30 12:35 ` [PATCH 3/6] xen/x86: adjust xen_set_fixmap() Jan Beulich
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Jan Beulich @ 2021-09-30 12:35 UTC (permalink / raw)
  To: Juergen Gross, Boris Ostrovsky; +Cc: Stefano Stabellini, lkml, xen-devel

Commit f7c90c2aa400 ("x86/xen: don't write ptes directly in 32-bit PV
guests") needlessly (and heavily) penalized 64-bit guests here: The
majority of the early page table updates are to writable pages (which get
converted to r/o only after all the writes are done), in particular
those involved in building the direct map (which consists of all 4k
mappings in PV). On my test system this accounts for almost 16 million
hypercalls when each could simply have been a plain memory write.

Switch back to using native_set_pte(), except for updates of early
ioremap tables (where a suitable accessor exists to recognize them).
With 32-bit PV support gone, this doesn't need to be further
conditionalized (albeit backports thereof may need adjustment).

To avoid a fair number (almost 256k on my test system) of trap-and-
emulate cases appearing as a result, switch the hook in
xen_pagetable_init().

Finally, commit d6b186c1e2d8 ("x86/xen: avoid m2p lookup when setting
early page table entries") inserted a function ahead of
xen_set_pte_init(), separating it from its comment (which may have been
part of the reason why the performance regression wasn't anticipated /
recognized while coding / reviewing the change mentioned further up).
Move the function up and adjust that comment to describe the new
behavior.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1194,6 +1194,13 @@ static void __init xen_pagetable_p2m_set
 
 static void __init xen_pagetable_init(void)
 {
+	/*
+	 * The majority of further PTE writes is to pagetables already
+	 * announced as such to Xen. Hence it is more efficient to use
+	 * hypercalls for these updates.
+	 */
+	pv_ops.mmu.set_pte = __xen_set_pte;
+
 	paging_init();
 	xen_post_allocator_init();
 
@@ -1422,10 +1429,18 @@ static void xen_pgd_free(struct mm_struc
  *
  * Many of these PTE updates are done on unpinned and writable pages
  * and doing a hypercall for these is unnecessary and expensive.  At
- * this point it is not possible to tell if a page is pinned or not,
- * so always write the PTE directly and rely on Xen trapping and
+ * this point it is rarely possible to tell if a page is pinned, so
+ * mostly write the PTE directly and rely on Xen trapping and
  * emulating any updates as necessary.
  */
+static void __init xen_set_pte_init(pte_t *ptep, pte_t pte)
+{
+	if (unlikely(is_early_ioremap_ptep(ptep)))
+		__xen_set_pte(ptep, pte);
+	else
+		native_set_pte(ptep, pte);
+}
+
 __visible pte_t xen_make_pte_init(pteval_t pte)
 {
 	unsigned long pfn;
@@ -1447,11 +1462,6 @@ __visible pte_t xen_make_pte_init(pteval
 }
 PV_CALLEE_SAVE_REGS_THUNK(xen_make_pte_init);
 
-static void __init xen_set_pte_init(pte_t *ptep, pte_t pte)
-{
-	__xen_set_pte(ptep, pte);
-}
-
 /* Early in boot, while setting up the initial pagetable, assume
    everything is pinned. */
 static void __init xen_alloc_pte_init(struct mm_struct *mm, unsigned long pfn)
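
To summarize the resulting policy, a sketch in comment form (not part
of the patch), using the names from the diff above:

/*
 * Until xen_pagetable_init():   pv_ops.mmu.set_pte == xen_set_pte_init
 *   Most early page tables are still plain writable memory from Xen's
 *   point of view, so native_set_pte() - a simple store - is cheapest.
 *   Only early-ioremap PTEs, which live in a table Xen already treats
 *   as a page table (and hence keeps r/o), take the hypercall path
 *   via __xen_set_pte().
 *
 * From xen_pagetable_init() on:  pv_ops.mmu.set_pte == __xen_set_pte
 *   By now most tables have been announced to Xen and write-protected,
 *   so a direct store would trap-and-emulate; issuing the hypercall
 *   right away avoids those (almost 256k) emulations.
 */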



* [PATCH 3/6] xen/x86: adjust xen_set_fixmap()
  2021-09-30 12:33 [PATCH 0/6] xen/x86: PV boot speedup Jan Beulich
  2021-09-30 12:34 ` [PATCH 1/6] xen/x86: streamline set_pte_mfn() Jan Beulich
  2021-09-30 12:35 ` [PATCH 2/6] xen/x86: restore (fix) xen_set_pte_init() behavior Jan Beulich
@ 2021-09-30 12:35 ` Jan Beulich
  2021-09-30 12:36 ` [PATCH 4/6] xen/x86: adjust handling of the L3 user vsyscall special page table Jan Beulich
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Jan Beulich @ 2021-09-30 12:35 UTC (permalink / raw)
  To: Juergen Gross, Boris Ostrovsky; +Cc: Stefano Stabellini, lkml, xen-devel

Using __native_set_fixmap() here means guaranteed trap-and-emulate
instances the hypervisor has to deal with. Since the virtual address
covered by the page table entry to be adjusted is easy to determine (and
actually already gets obtained in a special case), simply use an
available, easy-to-invoke hypercall instead.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2010,6 +2010,7 @@ static unsigned char dummy_mapping[PAGE_
 static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
 {
 	pte_t pte;
+	unsigned long vaddr;
 
 	phys >>= PAGE_SHIFT;
 
@@ -2050,15 +2051,15 @@ static void xen_set_fixmap(unsigned idx,
 		break;
 	}
 
-	__native_set_fixmap(idx, pte);
+	vaddr = __fix_to_virt(idx);
+	if (HYPERVISOR_update_va_mapping(vaddr, pte, UVMF_INVLPG))
+		BUG();
 
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
 	/* Replicate changes to map the vsyscall page into the user
 	   pagetable vsyscall mapping. */
-	if (idx == VSYSCALL_PAGE) {
-		unsigned long vaddr = __fix_to_virt(idx);
+	if (idx == VSYSCALL_PAGE)
 		set_pte_vaddr_pud(level3_user_vsyscall, vaddr, pte);
-	}
 #endif
 }
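
For context, a brief sketch (not part of the patch): __native_set_fixmap()
boils down to set_pte_vaddr() on the fixmap slot's virtual address, i.e. a
full page-table walk ending in a store Xen would trap and emulate. Since
that virtual address is trivially computed from the slot index, the same
effect is one hypercall away:

/* Sketch only: what the hunk above effectively does. */
static void sketch_fixmap_update(unsigned int idx, pte_t pte)
{
	unsigned long vaddr = __fix_to_virt(idx);

	/* Old: __native_set_fixmap(idx, pte) - walk plus trapped store.
	 * New: one update_va_mapping hypercall. */
	if (HYPERVISOR_update_va_mapping(vaddr, pte, UVMF_INVLPG))
		BUG();
}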
 



* [PATCH 4/6] xen/x86: adjust handling of the L3 user vsyscall special page table
  2021-09-30 12:33 [PATCH 0/6] xen/x86: PV boot speedup Jan Beulich
                   ` (2 preceding siblings ...)
  2021-09-30 12:35 ` [PATCH 3/6] xen/x86: adjust xen_set_fixmap() Jan Beulich
@ 2021-09-30 12:36 ` Jan Beulich
  2021-09-30 12:36 ` [PATCH 5/6] xen/x86: there's no highmem anymore in PV mode Jan Beulich
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Jan Beulich @ 2021-09-30 12:36 UTC (permalink / raw)
  To: Juergen Gross, Boris Ostrovsky; +Cc: Stefano Stabellini, lkml, xen-devel

Marking the page table as pinned without ever actually pinning it was
probably an oversight in the first place. The main reason for the change
is more subtle, though: The writes of the one present entry each, here
and in the subsequently allocated L2 table, engage a code path in the
hypervisor which exists only for thought-to-be-broken guests: an mmu-
update operation to a page which is neither a page table nor marked
writable. The hypervisor merely assumes (or should I say "hopes") that
the fact that a writable reference to the page can be obtained means it
is okay to actually write to that page in response to such a hypercall.

While there, make all involved code and data dependent upon
X86_VSYSCALL_EMULATION (some code already was).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -86,8 +86,10 @@
 #include "mmu.h"
 #include "debugfs.h"
 
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
 /* l3 pud for userspace vsyscall mapping */
 static pud_t level3_user_vsyscall[PTRS_PER_PUD] __page_aligned_bss;
+#endif
 
 /*
  * Protects atomic reservation decrease/increase against concurrent increases.
@@ -791,7 +793,9 @@ static void __init xen_mark_pinned(struc
 static void __init xen_after_bootmem(void)
 {
 	static_branch_enable(&xen_struct_pages_ready);
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
 	SetPagePinned(virt_to_page(level3_user_vsyscall));
+#endif
 	xen_pgd_walk(&init_mm, xen_mark_pinned, FIXADDR_TOP);
 }
 
@@ -1761,7 +1765,6 @@ void __init xen_setup_kernel_pagetable(p
 	set_page_prot(init_top_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
-	set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
 	set_page_prot(level2_ident_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level2_fixmap_pgt, PAGE_KERNEL_RO);
@@ -1778,6 +1781,13 @@ void __init xen_setup_kernel_pagetable(p
 	/* Unpin Xen-provided one */
 	pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
 
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
+	/* Pin user vsyscall L3 */
+	set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
+	pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE,
+			  PFN_DOWN(__pa_symbol(level3_user_vsyscall)));
+#endif
+
 	/*
 	 * At this stage there can be no user pgd, and no page structure to
 	 * attach it to, so make sure we just set kernel pgd.
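
As background, a short sketch (restating the hunk above, not new code):
Xen refuses to pin a table the guest still holds writable mappings of,
which is why the r/o conversion has to precede the pin, and why a pinned
table is subsequently updated via explicit MMU hypercalls rather than
through the "writable reference" fallback path described in the commit
message above.

/* Sketch only: the two steps of pinning an L3 table, in this order. */
set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE,
		  PFN_DOWN(__pa_symbol(level3_user_vsyscall)));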



* [PATCH 5/6] xen/x86: there's no highmem anymore in PV mode
  2021-09-30 12:33 [PATCH 0/6] xen/x86: PV boot speedup Jan Beulich
                   ` (3 preceding siblings ...)
  2021-09-30 12:36 ` [PATCH 4/6] xen/x86: adjust handling of the L3 user vsyscall special page table Jan Beulich
@ 2021-09-30 12:36 ` Jan Beulich
  2021-09-30 12:37 ` [PATCH 6/6] xen/x86: restrict PV Dom0 identity mapping Jan Beulich
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Jan Beulich @ 2021-09-30 12:36 UTC (permalink / raw)
  To: Juergen Gross, Boris Ostrovsky; +Cc: Stefano Stabellini, lkml, xen-devel

Considerations for it are a leftover from when 32-bit was still
supported.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -306,10 +306,6 @@ static void __init xen_update_mem_tables
 		BUG();
 	}
 
-	/* Update kernel mapping, but not for highmem. */
-	if (pfn >= PFN_UP(__pa(high_memory - 1)))
-		return;
-
 	if (HYPERVISOR_update_va_mapping((unsigned long)__va(pfn << PAGE_SHIFT),
 					 mfn_pte(mfn, PAGE_KERNEL), 0)) {
 		WARN(1, "Failed to update kernel mapping for mfn=%ld pfn=%ld\n",



* [PATCH 6/6] xen/x86: restrict PV Dom0 identity mapping
  2021-09-30 12:33 [PATCH 0/6] xen/x86: PV boot speedup Jan Beulich
                   ` (4 preceding siblings ...)
  2021-09-30 12:36 ` [PATCH 5/6] xen/x86: there's no highmem anymore in PV mode Jan Beulich
@ 2021-09-30 12:37 ` Jan Beulich
  2021-10-02  0:33 ` [PATCH 0/6] xen/x86: PV boot speedup Boris Ostrovsky
  2021-10-27 13:17 ` Boris Ostrovsky
  7 siblings, 0 replies; 9+ messages in thread
From: Jan Beulich @ 2021-09-30 12:37 UTC (permalink / raw)
  To: Juergen Gross, Boris Ostrovsky; +Cc: Stefano Stabellini, lkml, xen-devel

When moving RAM pages away, the fact that there used to be a mapping of
them is not a proper indication that MMIO should be mapped there
instead. At this point in time this effectively covers only the low
megabyte, and mapping that is the job of init_mem_mapping() anyway.
Comparing the two, one can also spot that we've been wrongly (or at
least inconsistently) using PAGE_KERNEL_IO here.

Simply zap any such mappings instead.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -425,13 +425,13 @@ static unsigned long __init xen_set_iden
 	}
 
 	/*
-	 * If the PFNs are currently mapped, the VA mapping also needs
-	 * to be updated to be 1:1.
+	 * If the PFNs are currently mapped, their VA mappings need to be
+	 * zapped.
 	 */
 	for (pfn = start_pfn; pfn <= max_pfn_mapped && pfn < end_pfn; pfn++)
 		(void)HYPERVISOR_update_va_mapping(
 			(unsigned long)__va(pfn << PAGE_SHIFT),
-			mfn_pte(pfn, PAGE_KERNEL_IO), 0);
+			native_make_pte(0), 0);
 
 	return remap_pfn;
 }
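
For clarity, a tiny sketch (not part of the patch): native_make_pte(0)
constructs an all-zero PTE, i.e. one with _PAGE_PRESENT clear, so
installing it via the hypercall unmaps the page instead of re-pointing
it at MMIO.

pte_t zapped = native_make_pte(0);	/* pte_val(zapped) == 0 */
/* pte_present(zapped) is false: a later access faults rather than
 * silently hitting a stale (and wrongly attributed) mapping. */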



* Re: [PATCH 0/6] xen/x86: PV boot speedup
  2021-09-30 12:33 [PATCH 0/6] xen/x86: PV boot speedup Jan Beulich
                   ` (5 preceding siblings ...)
  2021-09-30 12:37 ` [PATCH 6/6] xen/x86: restrict PV Dom0 identity mapping Jan Beulich
@ 2021-10-02  0:33 ` Boris Ostrovsky
  2021-10-27 13:17 ` Boris Ostrovsky
  7 siblings, 0 replies; 9+ messages in thread
From: Boris Ostrovsky @ 2021-10-02  0:33 UTC (permalink / raw)
  To: Jan Beulich, Juergen Gross; +Cc: Stefano Stabellini, lkml, xen-devel


On 9/30/21 8:33 AM, Jan Beulich wrote:
> The observed (by the human eye) performance difference of early boot
> between native and PV-on-Xen was just too large to not look into. As
> it turns out, gaining performance back wasn't all that difficult.
>
> While the series (re)introduces a small number of PTWR emulations on
> the boot path (from phys_pte_init()), there has been a much larger
> number of them post-boot. Hence I think if this was of concern, the
> post-boot instances would want eliminating first.
>
> Some of the later changes aren't directly related to the main goal of
> the series; these address aspects noticed while doing the investigation.
>
> 1: streamline set_pte_mfn()
> 2: restore (fix) xen_set_pte_init() behavior
> 3: adjust xen_set_fixmap()
> 4: adjust handling of the L3 user vsyscall special page table
> 5: there's no highmem anymore in PV mode
> 6: restrict PV Dom0 identity mapping



For the series:


Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>



* Re: [PATCH 0/6] xen/x86: PV boot speedup
  2021-09-30 12:33 [PATCH 0/6] xen/x86: PV boot speedup Jan Beulich
                   ` (6 preceding siblings ...)
  2021-10-02  0:33 ` [PATCH 0/6] xen/x86: PV boot speedup Boris Ostrovsky
@ 2021-10-27 13:17 ` Boris Ostrovsky
  7 siblings, 0 replies; 9+ messages in thread
From: Boris Ostrovsky @ 2021-10-27 13:17 UTC (permalink / raw)
  To: Jan Beulich, Juergen Gross; +Cc: Stefano Stabellini, lkml, xen-devel


On 9/30/21 8:33 AM, Jan Beulich wrote:
> The observed (by the human eye) performance difference of early boot
> between native and PV-on-Xen was just too large to not look into. As
> it turns out, gaining performance back wasn't all that difficult.
>
> While the series (re)introduces a small number of PTWR emulations on
> the boot path (from phys_pte_init()), there has been a much larger
> number of them post-boot. Hence I think if this was of concern, the
> post-boot instances would want eliminating first.
>
> Some of the later changes aren't directly related to the main goal of
> the series; these address aspects noticed while doing the investigation.
>
> 1: streamline set_pte_mfn()
> 2: restore (fix) xen_set_pte_init() behavior
> 3: adjust xen_set_fixmap()
> 4: adjust handling of the L3 user vsyscall special page table
> 5: there's no highmem anymore in PV mode
> 6: restrict PV Dom0 identity mapping



Applied to for-linus-5.16


-boris


