* [PATCH v2 00/12] x86: memcpy() / memset() (non-)ERMS flavors plus fallout
@ 2021-05-27 12:29 Jan Beulich
  2021-05-27 12:30 ` [PATCH v2 01/12] x86: introduce ioremap_wc() Jan Beulich
                   ` (12 more replies)
  0 siblings, 13 replies; 32+ messages in thread
From: Jan Beulich @ 2021-05-27 12:29 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

While performance varies quite a bit between older (pre-ERMS) and newer
(ERMS) hardware, so far we have been using just a single flavor of each
of these two functions, and oddly enough flavors not even consistent
with one another. Using plain memcpy() / memset() on MMIO (the video
frame buffer) is generally okay, but the ERMS variant of memcpy()
turned out to regress (boot) performance in a way easily visible to the
human eye.
Hence as a prerequisite step this series switches the frame buffer
(and VGA) mapping to be write-combining independent of firmware
arrangements (of MTRRs in particular).

v2, besides addressing review feedback (where it wasn't responded to
verbally), extends things to
- driving gcc's inlining of __builtin_mem{cpy,set}(),
- page clearing and scrubbing.

01: x86: introduce ioremap_wc()
02: x86: re-work memset()
03: x86: re-work memcpy()
04: x86: control memset() and memcpy() inlining
05: x86: introduce "hot" and "cold" page clearing functions
06: page-alloc: make scrub_one_page() static
07: mm: allow page scrubbing routine(s) to be arch controlled
08: x86: move .text.kexec
09: video/vesa: unmap frame buffer when relinquishing console
10: video/vesa: drop "vesa-mtrr" command line option
12: video/vesa: adjust (not just) command line option handling

Side note: While strictly speaking the xen/drivers/video/ changes fall
under REST maintainership, with that code getting built for x86 only
I'm restricting Cc-s to x86 maintainers.

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 01/12] x86: introduce ioremap_wc()
  2021-05-27 12:29 [PATCH v2 00/12] x86: memcpy() / memset() (non-)ERMS flavors plus fallout Jan Beulich
@ 2021-05-27 12:30 ` Jan Beulich
  2021-05-27 12:48   ` Julien Grall
  2021-05-27 12:31 ` [PATCH v2 02/12] x86: re-work memset() Jan Beulich
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2021-05-27 12:30 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

In order for the to-be-introduced ERMS form of memcpy() not to regress
boot performance on certain systems when video output is active, we
first need to avoid any further dependency on firmware setting up MTRRs
in a way we can actually modify afterwards. On many systems, due to the
continuously growing amounts of installed memory, MTRRs get configured
with at least one huge WB range, with MMIO ranges below 4GiB then
forced to UC via overlapping MTRRs. mtrr_add(), as it stands today,
can't deal with such a setup. Hence on such systems we presently leave
the frame buffer mapped UC, leading to significantly reduced
performance when using REP STOSB / REP MOVSB.

On post-Pentium II hardware (i.e. anything capable of running 64-bit
code), an effective memory type of WC can be achieved without MTRRs, by
simply referencing the respective PAT entry from the PTEs. While this
will leave the switch to ERMS forms of memset() and memcpy() with
largely unchanged performance, the change here on its own improves
performance on affected systems quite significantly: Measuring just the
individual affected memcpy() invocations yielded a speedup by a factor
of over 250 on my initial (Skylake) test system. memset() isn't getting
improved by as much there, but still by a factor of about 20.

While adding {__,}PAGE_HYPERVISOR_WC, also add {__,}PAGE_HYPERVISOR_WT
to, at the very least, make clear what PTE flags this memory type uses.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Mark ioremap_wc() __init.
---
TBD: If the VGA range is WC in the fixed range MTRRs, reusing the low
     1st Mb mapping (like ioremap() does) would be an option.
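As an illustrative aside (not part of the patch): the memory type such a
mapping ends up with is determined by the PAT index a leaf PTE selects via
its PAT, PCD and PWT bits. A minimal C model of that index computation,
consistent with the __PAGE_HYPERVISOR_* defines in the hunks below (which
memory type an index yields depends on how the PAT MSR is programmed; WC
at index 4, i.e. _PAGE_PAT alone, is what the new define implies):

```c
#include <assert.h>

/*
 * Sketch only: the PAT index selected by a leaf PTE is built from three
 * PTE bits: PAT (index bit 2), PCD (index bit 1) and PWT (index bit 0).
 */
static unsigned int pat_index(int pat, int pcd, int pwt)
{
    return (!!pat << 2) | (!!pcd << 1) | !!pwt;
}
```

With the defines added below, __PAGE_HYPERVISOR_WC (just _PAGE_PAT)
selects index 4, while the pre-existing UC (PCD|PWT) and UC- (PCD)
variants select indexes 3 and 2 respectively.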

--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5881,6 +5881,20 @@ void __iomem *ioremap(paddr_t pa, size_t
     return (void __force __iomem *)va;
 }
 
+void __iomem *__init ioremap_wc(paddr_t pa, size_t len)
+{
+    mfn_t mfn = _mfn(PFN_DOWN(pa));
+    unsigned int offs = pa & (PAGE_SIZE - 1);
+    unsigned int nr = PFN_UP(offs + len);
+    void *va;
+
+    WARN_ON(page_is_ram_type(mfn_x(mfn), RAM_TYPE_CONVENTIONAL));
+
+    va = __vmap(&mfn, nr, 1, 1, PAGE_HYPERVISOR_WC, VMAP_DEFAULT);
+
+    return (void __force __iomem *)(va + offs);
+}
+
 int create_perdomain_mapping(struct domain *d, unsigned long va,
                              unsigned int nr, l1_pgentry_t **pl1tab,
                              struct page_info **ppg)
--- a/xen/drivers/video/vesa.c
+++ b/xen/drivers/video/vesa.c
@@ -9,9 +9,9 @@
 #include <xen/param.h>
 #include <xen/xmalloc.h>
 #include <xen/kernel.h>
+#include <xen/mm.h>
 #include <xen/vga.h>
 #include <asm/io.h>
-#include <asm/page.h>
 #include "font.h"
 #include "lfb.h"
 
@@ -103,7 +103,7 @@ void __init vesa_init(void)
     lfbp.text_columns = vlfb_info.width / font->width;
     lfbp.text_rows = vlfb_info.height / font->height;
 
-    lfbp.lfb = lfb = ioremap(lfb_base(), vram_remap);
+    lfbp.lfb = lfb = ioremap_wc(lfb_base(), vram_remap);
     if ( !lfb )
         return;
 
@@ -179,8 +179,7 @@ void __init vesa_mtrr_init(void)
 
 static void lfb_flush(void)
 {
-    if ( vesa_mtrr == 3 )
-        __asm__ __volatile__ ("sfence" : : : "memory");
+    __asm__ __volatile__ ("sfence" : : : "memory");
 }
 
 void __init vesa_endboot(bool_t keep)
--- a/xen/drivers/video/vga.c
+++ b/xen/drivers/video/vga.c
@@ -79,7 +79,7 @@ void __init video_init(void)
     {
     case XEN_VGATYPE_TEXT_MODE_3:
         if ( page_is_ram_type(paddr_to_pfn(0xB8000), RAM_TYPE_CONVENTIONAL) ||
-             ((video = ioremap(0xB8000, 0x8000)) == NULL) )
+             ((video = ioremap_wc(0xB8000, 0x8000)) == NULL) )
             return;
         outw(0x200a, 0x3d4); /* disable cursor */
         columns = vga_console_info.u.text_mode_3.columns;
@@ -164,7 +164,11 @@ void __init video_endboot(void)
     {
     case XEN_VGATYPE_TEXT_MODE_3:
         if ( !vgacon_keep )
+        {
             memset(video, 0, columns * lines * 2);
+            iounmap(video);
+            video = ZERO_BLOCK_PTR;
+        }
         break;
     case XEN_VGATYPE_VESA_LFB:
     case XEN_VGATYPE_EFI_LFB:
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -615,6 +615,8 @@ void destroy_perdomain_mapping(struct do
                                unsigned int nr);
 void free_perdomain_mappings(struct domain *);
 
+void __iomem *ioremap_wc(paddr_t, size_t);
+
 extern int memory_add(unsigned long spfn, unsigned long epfn, unsigned int pxm);
 
 void domain_set_alloc_bitsize(struct domain *d);
--- a/xen/include/asm-x86/page.h
+++ b/xen/include/asm-x86/page.h
@@ -349,8 +349,10 @@ void efi_update_l4_pgtable(unsigned int
 #define __PAGE_HYPERVISOR_RX      (_PAGE_PRESENT | _PAGE_ACCESSED)
 #define __PAGE_HYPERVISOR         (__PAGE_HYPERVISOR_RX | \
                                    _PAGE_DIRTY | _PAGE_RW)
+#define __PAGE_HYPERVISOR_WT      (__PAGE_HYPERVISOR | _PAGE_PWT)
 #define __PAGE_HYPERVISOR_UCMINUS (__PAGE_HYPERVISOR | _PAGE_PCD)
 #define __PAGE_HYPERVISOR_UC      (__PAGE_HYPERVISOR | _PAGE_PCD | _PAGE_PWT)
+#define __PAGE_HYPERVISOR_WC      (__PAGE_HYPERVISOR | _PAGE_PAT)
 #define __PAGE_HYPERVISOR_SHSTK   (__PAGE_HYPERVISOR_RO | _PAGE_DIRTY)
 
 #define MAP_SMALL_PAGES _PAGE_AVAIL0 /* don't use superpages mappings */
--- a/xen/include/asm-x86/x86_64/page.h
+++ b/xen/include/asm-x86/x86_64/page.h
@@ -154,6 +154,10 @@ static inline intpte_t put_pte_flags(uns
                                  _PAGE_GLOBAL | _PAGE_NX)
 #define PAGE_HYPERVISOR_UC      (__PAGE_HYPERVISOR_UC | \
                                  _PAGE_GLOBAL | _PAGE_NX)
+#define PAGE_HYPERVISOR_WC      (__PAGE_HYPERVISOR_WC | \
+                                 _PAGE_GLOBAL | _PAGE_NX)
+#define PAGE_HYPERVISOR_WT      (__PAGE_HYPERVISOR_WT | \
+                                 _PAGE_GLOBAL | _PAGE_NX)
 
 #endif /* __X86_64_PAGE_H__ */
 




* [PATCH v2 02/12] x86: re-work memset()
  2021-05-27 12:29 [PATCH v2 00/12] x86: memcpy() / memset() (non-)ERMS flavors plus fallout Jan Beulich
  2021-05-27 12:30 ` [PATCH v2 01/12] x86: introduce ioremap_wc() Jan Beulich
@ 2021-05-27 12:31 ` Jan Beulich
  2021-05-27 12:31 ` [PATCH v2 03/12] x86: re-work memcpy() Jan Beulich
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2021-05-27 12:31 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

Move the function to its own assembly file. Having it in C just for the
entire body to be an asm() isn't really helpful. Then have two flavors:
A "basic" version using qword steps for the bulk of the operation, and an
ERMS version for modern hardware, to be substituted in via alternatives
patching.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
We may want to consider branching over the REP STOSQ as well, if the
number of qwords turns out to be zero.
We may also want to consider using non-REP STOS{L,W,B} for the tail.
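As a side note on the non-ERMS flavor: the fill byte is broadcast into
all eight byte lanes of %rax by multiplying with 0x0101010101010101, so
REP STOSQ can store it a qword at a time. A C model of that broadcast
(illustration only; broadcast_byte() is a made-up name):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of "movzbl %sil, %esi; mov $0x0101010101010101, %rax;
 * imul %rsi, %rax": multiplying the zero-extended fill byte by a mask
 * of repeated 0x01 bytes replicates it into every byte of the qword.
 */
static uint64_t broadcast_byte(int c)
{
    return (uint64_t)(uint8_t)c * 0x0101010101010101ULL;
}
```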

--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -43,6 +43,7 @@ obj-$(CONFIG_INDIRECT_THUNK) += indirect
 obj-y += ioport_emulate.o
 obj-y += irq.o
 obj-$(CONFIG_KEXEC) += machine_kexec.o
+obj-y += memset.o
 obj-y += mm.o x86_64/mm.o
 obj-$(CONFIG_HVM) += monitor.o
 obj-y += mpparse.o
--- /dev/null
+++ b/xen/arch/x86/memset.S
@@ -0,0 +1,31 @@
+#include <asm/asm_defns.h>
+
+.macro memset
+        and     $7, %edx
+        shr     $3, %rcx
+        movzbl  %sil, %esi
+        mov     $0x0101010101010101, %rax
+        imul    %rsi, %rax
+        mov     %rdi, %rsi
+        rep stosq
+        or      %edx, %ecx
+        jz      0f
+        rep stosb
+0:
+        mov     %rsi, %rax
+        ret
+.endm
+
+.macro memset_erms
+        mov     %esi, %eax
+        mov     %rdi, %rsi
+        rep stosb
+        mov     %rsi, %rax
+        ret
+.endm
+
+ENTRY(memset)
+        mov     %rdx, %rcx
+        ALTERNATIVE memset, memset_erms, X86_FEATURE_ERMS
+        .type memset, @function
+        .size memset, . - memset
--- a/xen/arch/x86/string.c
+++ b/xen/arch/x86/string.c
@@ -22,19 +22,6 @@ void *(memcpy)(void *dest, const void *s
     return dest;
 }
 
-void *(memset)(void *s, int c, size_t n)
-{
-    long d0, d1;
-
-    asm volatile (
-        "rep stosb"
-        : "=&c" (d0), "=&D" (d1)
-        : "a" (c), "1" (s), "0" (n)
-        : "memory");
-
-    return s;
-}
-
 void *(memmove)(void *dest, const void *src, size_t n)
 {
     long d0, d1, d2;




* [PATCH v2 03/12] x86: re-work memcpy()
  2021-05-27 12:29 [PATCH v2 00/12] x86: memcpy() / memset() (non-)ERMS flavors plus fallout Jan Beulich
  2021-05-27 12:30 ` [PATCH v2 01/12] x86: introduce ioremap_wc() Jan Beulich
  2021-05-27 12:31 ` [PATCH v2 02/12] x86: re-work memset() Jan Beulich
@ 2021-05-27 12:31 ` Jan Beulich
  2021-05-27 12:31 ` [PATCH v2 04/12] x86: control memset() and memcpy() inlining Jan Beulich
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2021-05-27 12:31 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

Move the function to its own assembly file. Having it in C just for the
entire body to be an asm() isn't really helpful. Then have two flavors:
A "basic" version using qword steps for the bulk of the operation, and an
ERMS version for modern hardware, to be substituted in via alternatives
patching.

Alternatives patching, however, requires an extra precaution: It uses
memcpy() itself, and hence the function may patch itself. Luckily the
patched-in code only replaces the prolog of the original function. Make
sure this remains this way.

Additionally alternatives patching, while supposedly safe via enforcing
a control flow change when modifying already prefetched code, may not
really be. Afaict a request is pending to drop the first of the two
options in the SDM's "Handling Self- and Cross-Modifying Code" section.
Insert a serializing instruction there. To avoid having to introduce a
local variable, also switch text_poke() to return void: Neither of its
callers cares about the returned value.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
We may want to consider branching over the REP MOVSQ as well, if the
number of qwords turns out to be zero.
We may also want to consider using non-REP MOVS{L,W,B} for the tail.
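For reference, the qword/tail split the non-ERMS path performs can be
modelled in C as follows (a sketch only; model_memcpy() is a made-up
name, and the real code does this with REP MOVSQ / REP MOVSB):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Bulk copy in 8-byte "qwords" (n >> 3), then an (n & 7)-byte tail,
 * which is skipped entirely when empty, mirroring the
 * "or %edx, %ecx; jz 1f" sequence in the assembly.
 */
static void *model_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t q = n >> 3, tail = n & 7;

    for ( ; q; --q, d += 8, s += 8 )
        memcpy(d, s, 8);            /* one qword step */
    for ( ; tail; --tail )
        *d++ = *s++;                /* byte tail */
    return dst;
}
```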

--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -43,6 +43,7 @@ obj-$(CONFIG_INDIRECT_THUNK) += indirect
 obj-y += ioport_emulate.o
 obj-y += irq.o
 obj-$(CONFIG_KEXEC) += machine_kexec.o
+obj-y += memcpy.o
 obj-y += memset.o
 obj-y += mm.o x86_64/mm.o
 obj-$(CONFIG_HVM) += monitor.o
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -164,12 +164,14 @@ void init_or_livepatch add_nops(void *in
  * executing.
  *
  * "noinline" to cause control flow change and thus invalidate I$ and
- * cause refetch after modification.
+ * cause refetch after modification.  While the SDM continues to suggest this
+ * is sufficient, it may not be - issue a serializing insn afterwards as well.
  */
-static void *init_or_livepatch noinline
+static void init_or_livepatch noinline
 text_poke(void *addr, const void *opcode, size_t len)
 {
-    return memcpy(addr, opcode, len);
+    memcpy(addr, opcode, len);
+    cpuid_eax(0);
 }
 
 /*
--- /dev/null
+++ b/xen/arch/x86/memcpy.S
@@ -0,0 +1,21 @@
+#include <asm/asm_defns.h>
+
+ENTRY(memcpy)
+        mov     %rdx, %rcx
+        mov     %rdi, %rax
+        /*
+         * We need to be careful here: memcpy() is involved in alternatives
+         * patching, so the code doing the actual copying (i.e. past setting
+         * up registers) may not be subject to patching (unless further
+         * precautions were taken).
+         */
+        ALTERNATIVE "and $7, %edx; shr $3, %rcx", \
+                    "rep movsb; ret", X86_FEATURE_ERMS
+        rep movsq
+        or      %edx, %ecx
+        jz      1f
+        rep movsb
+1:
+        ret
+        .type memcpy, @function
+        .size memcpy, . - memcpy
--- a/xen/arch/x86/string.c
+++ b/xen/arch/x86/string.c
@@ -7,21 +7,6 @@
 
 #include <xen/lib.h>
 
-void *(memcpy)(void *dest, const void *src, size_t n)
-{
-    long d0, d1, d2;
-
-    asm volatile (
-        "   rep ; movs"__OS" ; "
-        "   mov %k4,%k3      ; "
-        "   rep ; movsb        "
-        : "=&c" (d0), "=&D" (d1), "=&S" (d2)
-        : "0" (n/BYTES_PER_LONG), "r" (n%BYTES_PER_LONG), "1" (dest), "2" (src)
-        : "memory" );
-
-    return dest;
-}
-
 void *(memmove)(void *dest, const void *src, size_t n)
 {
     long d0, d1, d2;




* [PATCH v2 04/12] x86: control memset() and memcpy() inlining
  2021-05-27 12:29 [PATCH v2 00/12] x86: memcpy() / memset() (non-)ERMS flavors plus fallout Jan Beulich
                   ` (2 preceding siblings ...)
  2021-05-27 12:31 ` [PATCH v2 03/12] x86: re-work memcpy() Jan Beulich
@ 2021-05-27 12:31 ` Jan Beulich
  2021-05-27 12:32 ` [PATCH v2 05/12] x86: introduce "hot" and "cold" page clearing functions Jan Beulich
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2021-05-27 12:31 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

Stop the compiler from inlining non-trivial memset() and memcpy() (for
memset() see e.g. map_vcpu_info() or kimage_load_segments() for
examples). This way we even keep the compiler from using REP STOSQ /
REP MOVSQ when we'd prefer REP STOSB / REP MOVSB (when ERMS is
available).

With gcc10 this yields a modest .text size reduction (release build) of
around 2k.

Unfortunately these options aren't understood by the clang versions I
have readily available for testing with; I'm unaware of equivalents.

Note also that using cc-option-add is not an option here, or at least I
couldn't make things work with it (in case the option was not supported
by the compiler): The embedded comma in the option looks to be getting
in the way.

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.
---
The boundary values are of course up for discussion - I wasn't really
certain whether to use 16 or 32; I'd be less certain about using yet
larger values.

Similarly whether to permit the compiler to emit REP STOSQ / REP MOVSQ
for known size, properly aligned blocks is up for discussion.
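The comma problem mentioned above can be illustrated with a minimal make
fragment (a sketch; cc-option stands for Xen's macro of that name):

```make
# A literal comma inside a $(call ...) argument acts as an argument
# separator, splitting the option apart; expanding the comma from a
# variable sidesteps this, which is what the change below relies on.
comma := ,

# Broken: make sees an extra argument after "...:noalign".
#   $(call cc-option,$(CC),-mmemset-strategy=libcall:128:noalign,libcall:-1:noalign)
# Working:
#   $(call cc-option,$(CC),-mmemset-strategy=libcall:128:noalign$(comma)libcall:-1:noalign)
```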

--- a/xen/arch/x86/arch.mk
+++ b/xen/arch/x86/arch.mk
@@ -51,6 +51,9 @@ CFLAGS-$(CONFIG_INDIRECT_THUNK) += -fno-
 $(call cc-option-add,CFLAGS-stack-boundary,CC,-mpreferred-stack-boundary=3)
 export CFLAGS-stack-boundary
 
+CFLAGS += $(call cc-option,$(CC),-mmemcpy-strategy=unrolled_loop:16:noalign$(comma)libcall:-1:noalign)
+CFLAGS += $(call cc-option,$(CC),-mmemset-strategy=unrolled_loop:16:noalign$(comma)libcall:-1:noalign)
+
 ifeq ($(CONFIG_UBSAN),y)
 # Don't enable alignment sanitisation.  x86 has efficient unaligned accesses,
 # and various things (ACPI tables, hypercall pages, stubs, etc) are wont-fix.




* [PATCH v2 05/12] x86: introduce "hot" and "cold" page clearing functions
  2021-05-27 12:29 [PATCH v2 00/12] x86: memcpy() / memset() (non-)ERMS flavors plus fallout Jan Beulich
                   ` (3 preceding siblings ...)
  2021-05-27 12:31 ` [PATCH v2 04/12] x86: control memset() and memcpy() inlining Jan Beulich
@ 2021-05-27 12:32 ` Jan Beulich
  2021-05-27 12:32 ` [PATCH v2 06/12] page-alloc: make scrub_one_page() static Jan Beulich
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2021-05-27 12:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

The present clear_page_sse2() is useful in case a page isn't going to
get touched again soon, or if we want to limit churn on the caches.
Amend it by alternatively using CLZERO, which has been found to be quite
a bit faster on Zen2 hardware at least. Note that to use CLZERO, we need
to know the cache line size, and hence a feature dependency on CLFLUSH
gets introduced.

For cases where latency is the most important aspect, or when it is
expected that sufficiently large parts of a page will get accessed again
soon after the clearing, introduce a "hot" alternative. Again use
alternatives patching to select between a "legacy" and an ERMS variant.

Don't switch any callers just yet - this will be the subject of
subsequent changes.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.
---
Note: Ankur indicates that for ~L3-size or larger regions MOVNT/CLZERO
      is better even latency-wise.
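The parameter patching performed in early_cpu_init() can be summarised
by the following standalone sketch (clzero_params() is a made-up helper;
the real code patches the values into the assembly through the
clear_page_clzero_post_* labels):

```c
#include <assert.h>

#define PAGE_SIZE 4096u

/*
 * CPUID leaf 1 reports the CLFLUSH line size in EBX[15:8], in units of
 * 8 bytes. Cap it at 128 and require a power of two; on success yield
 * the loop count (PAGE_SIZE / size) and the negative step (-size) to
 * patch in, else report failure (the real code then clears
 * X86_FEATURE_CLZERO via setup_clear_cpu_cap()).
 */
static int clzero_params(unsigned int ebx, unsigned int *count, int *neg_size)
{
    unsigned int size = ((ebx >> 8) & 0xff) * 8;

    if ( size > 128 )
        size = 128;
    if ( !size || (size & (size - 1)) )
        return 0;
    *count = PAGE_SIZE / size;
    *neg_size = -(int)size;
    return 1;
}
```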

--- a/xen/arch/x86/clear_page.S
+++ b/xen/arch/x86/clear_page.S
@@ -1,8 +1,9 @@
         .file __FILE__
 
-#include <asm/page.h>
+#include <asm/asm_defns.h>
+#include <xen/page-size.h>
 
-ENTRY(clear_page_sse2)
+        .macro clear_page_sse2
         mov     $PAGE_SIZE/32, %ecx
         xor     %eax,%eax
 
@@ -16,3 +17,45 @@ ENTRY(clear_page_sse2)
 
         sfence
         ret
+        .endm
+
+        .macro clear_page_clzero
+        mov     %rdi, %rax
+        mov     $PAGE_SIZE/64, %ecx
+        .globl clear_page_clzero_post_count
+clear_page_clzero_post_count:
+
+0:      clzero
+        sub     $-64, %rax
+        .globl clear_page_clzero_post_neg_size
+clear_page_clzero_post_neg_size:
+        sub     $1, %ecx
+        jnz     0b
+
+        sfence
+        ret
+        .endm
+
+ENTRY(clear_page_cold)
+        ALTERNATIVE clear_page_sse2, clear_page_clzero, X86_FEATURE_CLZERO
+        .type clear_page_cold, @function
+        .size clear_page_cold, . - clear_page_cold
+
+        .macro clear_page_stosb
+        mov     $PAGE_SIZE, %ecx
+        xor     %eax,%eax
+        rep stosb
+        ret
+        .endm
+
+        .macro clear_page_stosq
+        mov     $PAGE_SIZE/8, %ecx
+        xor     %eax, %eax
+        rep stosq
+        ret
+        .endm
+
+ENTRY(clear_page_hot)
+        ALTERNATIVE clear_page_stosq, clear_page_stosb, X86_FEATURE_ERMS
+        .type clear_page_hot, @function
+        .size clear_page_hot, . - clear_page_hot
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -56,6 +56,9 @@ static unsigned int forced_caps[NCAPINTS
 
 DEFINE_PER_CPU(bool, full_gdt_loaded);
 
+extern uint32_t clear_page_clzero_post_count[];
+extern int8_t clear_page_clzero_post_neg_size[];
+
 void __init setup_clear_cpu_cap(unsigned int cap)
 {
 	const uint32_t *dfs;
@@ -331,8 +334,38 @@ void __init early_cpu_init(void)
 
 	edx &= ~cleared_caps[cpufeat_word(X86_FEATURE_FPU)];
 	ecx &= ~cleared_caps[cpufeat_word(X86_FEATURE_SSE3)];
-	if (edx & cpufeat_mask(X86_FEATURE_CLFLUSH))
-		c->x86_cache_alignment = ((ebx >> 8) & 0xff) * 8;
+	if (edx & cpufeat_mask(X86_FEATURE_CLFLUSH)) {
+		unsigned int size = ((ebx >> 8) & 0xff) * 8;
+
+		c->x86_cache_alignment = size;
+
+		/*
+		 * Patch in parameters of clear_page_cold()'s CLZERO
+		 * alternative. Note that for now we cap this at 128 bytes.
+		 * Larger cache line sizes would still be dealt with
+		 * correctly, but would cause redundant work to be done.
+		 */
+		if (size > 128)
+			size = 128;
+		if (size && !(size & (size - 1))) {
+			/*
+			 * Need to play some games to keep the compiler from
+			 * recognizing the negative array index as being out
+			 * of bounds. The labels in assembler code really are
+			 * _after_ the locations to be patched, so the
+			 * negative index is intentional.
+			 */
+			uint32_t *pcount = clear_page_clzero_post_count;
+			int8_t *neg_size = clear_page_clzero_post_neg_size;
+
+			OPTIMIZER_HIDE_VAR(pcount);
+			OPTIMIZER_HIDE_VAR(neg_size);
+			pcount[-1] = PAGE_SIZE / size;
+			neg_size[-1] = -size;
+		}
+		else
+			setup_clear_cpu_cap(X86_FEATURE_CLZERO);
+	}
 	/* Leaf 0x1 capabilities filled in early for Xen. */
 	c->x86_capability[cpufeat_word(X86_FEATURE_FPU)] = edx;
 	c->x86_capability[cpufeat_word(X86_FEATURE_SSE3)] = ecx;
--- a/xen/include/asm-x86/asm-defns.h
+++ b/xen/include/asm-x86/asm-defns.h
@@ -20,6 +20,10 @@
     .byte 0x0f, 0x01, 0xdd
 .endm
 
+.macro clzero
+    .byte 0x0f, 0x01, 0xfc
+.endm
+
 .macro INDIRECT_BRANCH insn:req arg:req
 /*
  * Create an indirect branch.  insn is one of call/jmp, arg is a single
--- a/xen/include/asm-x86/page.h
+++ b/xen/include/asm-x86/page.h
@@ -232,10 +232,11 @@ typedef struct { u64 pfn; } pagetable_t;
 #define pagetable_from_paddr(p) pagetable_from_pfn((p)>>PAGE_SHIFT)
 #define pagetable_null()        pagetable_from_pfn(0)
 
-void clear_page_sse2(void *);
+void clear_page_hot(void *);
+void clear_page_cold(void *);
 void copy_page_sse2(void *, const void *);
 
-#define clear_page(_p)      clear_page_sse2(_p)
+#define clear_page(_p)      clear_page_cold(_p)
 #define copy_page(_t, _f)   copy_page_sse2(_t, _f)
 
 /* Convert between Xen-heap virtual addresses and machine addresses. */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -182,6 +182,10 @@ def crunch_numbers(state):
         # the first place.
         APIC: [X2APIC, TSC_DEADLINE, EXTAPIC],
 
+        # The CLZERO insn requires a means to determine the cache line size,
+        # which is tied to the CLFLUSH insn.
+        CLFLUSH: [CLZERO],
+
         # AMD built MMXExtentions and 3DNow as extentions to MMX.
         MMX: [MMXEXT, _3DNOW],
 




* [PATCH v2 06/12] page-alloc: make scrub_one_page() static
  2021-05-27 12:29 [PATCH v2 00/12] x86: memcpy() / memset() (non-)ERMS flavors plus fallout Jan Beulich
                   ` (4 preceding siblings ...)
  2021-05-27 12:32 ` [PATCH v2 05/12] x86: introduce "hot" and "cold" page clearing functions Jan Beulich
@ 2021-05-27 12:32 ` Jan Beulich
  2021-05-27 12:33 ` [PATCH v2 07/12] mm: allow page scrubbing routine(s) to be arch controlled Jan Beulich
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2021-05-27 12:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monné,
	Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, Wei Liu

Before starting to alter its properties, restrict the function's
visibility. The only external user is mem-paging, which we can
accommodate by different means.

Also move the function up in its source file, so we won't need to
forward-declare it. Constify its parameter at the same time.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/xen/arch/x86/mm/mem_paging.c
+++ b/xen/arch/x86/mm/mem_paging.c
@@ -316,9 +316,6 @@ static int evict(struct domain *d, gfn_t
     ret = p2m_set_entry(p2m, gfn, INVALID_MFN, PAGE_ORDER_4K,
                         p2m_ram_paged, a);
 
-    /* Clear content before returning the page to Xen */
-    scrub_one_page(page);
-
     /* Track number of paged gfns */
     atomic_inc(&d->paged_pages);
 
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -136,6 +136,7 @@
 #include <xen/numa.h>
 #include <xen/nodemask.h>
 #include <xen/event.h>
+#include <xen/vm_event.h>
 #include <public/sysctl.h>
 #include <public/sched.h>
 #include <asm/page.h>
@@ -757,6 +758,21 @@ static void page_list_add_scrub(struct p
 #endif
 #define SCRUB_BYTE_PATTERN   (SCRUB_PATTERN & 0xff)
 
+static void scrub_one_page(const struct page_info *pg)
+{
+    if ( unlikely(pg->count_info & PGC_broken) )
+        return;
+
+#ifndef NDEBUG
+    /* Avoid callers relying on allocations returning zeroed pages. */
+    unmap_domain_page(memset(__map_domain_page(pg),
+                             SCRUB_BYTE_PATTERN, PAGE_SIZE));
+#else
+    /* For a production build, clear_page() is the fastest way to scrub. */
+    clear_domain_page(_mfn(page_to_mfn(pg)));
+#endif
+}
+
 static void poison_one_page(struct page_info *pg)
 {
 #ifdef CONFIG_SCRUB_DEBUG
@@ -2431,10 +2447,12 @@ void free_domheap_pages(struct page_info
             /*
              * Normally we expect a domain to clear pages before freeing them,
              * if it cares about the secrecy of their contents. However, after
-             * a domain has died we assume responsibility for erasure. We do
-             * scrub regardless if option scrub_domheap is set.
+             * a domain has died or if it has mem-paging enabled we assume
+             * responsibility for erasure. We do scrub regardless if option
+             * scrub_domheap is set.
              */
-            scrub = d->is_dying || scrub_debug || opt_scrub_domheap;
+            scrub = d->is_dying || mem_paging_enabled(d) ||
+                    scrub_debug || opt_scrub_domheap;
         }
         else
         {
@@ -2519,21 +2537,6 @@ static __init int pagealloc_keyhandler_i
 __initcall(pagealloc_keyhandler_init);
 
 
-void scrub_one_page(struct page_info *pg)
-{
-    if ( unlikely(pg->count_info & PGC_broken) )
-        return;
-
-#ifndef NDEBUG
-    /* Avoid callers relying on allocations returning zeroed pages. */
-    unmap_domain_page(memset(__map_domain_page(pg),
-                             SCRUB_BYTE_PATTERN, PAGE_SIZE));
-#else
-    /* For a production build, clear_page() is the fastest way to scrub. */
-    clear_domain_page(_mfn(page_to_mfn(pg)));
-#endif
-}
-
 static void dump_heap(unsigned char key)
 {
     s_time_t      now = NOW();
--- a/xen/include/asm-x86/mem_paging.h
+++ b/xen/include/asm-x86/mem_paging.h
@@ -24,12 +24,6 @@
 
 int mem_paging_memop(XEN_GUEST_HANDLE_PARAM(xen_mem_paging_op_t) arg);
 
-#ifdef CONFIG_MEM_PAGING
-# define mem_paging_enabled(d) vm_event_check_ring((d)->vm_event_paging)
-#else
-# define mem_paging_enabled(d) false
-#endif
-
 #endif /*__ASM_X86_MEM_PAGING_H__ */
 
 /*
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -498,8 +498,6 @@ static inline unsigned int get_order_fro
     return order;
 }
 
-void scrub_one_page(struct page_info *);
-
 #ifndef arch_free_heap_page
 #define arch_free_heap_page(d, pg) \
     page_list_del(pg, page_to_list(d, pg))
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -1117,6 +1117,12 @@ static always_inline bool is_iommu_enabl
     return evaluate_nospec(d->options & XEN_DOMCTL_CDF_iommu);
 }
 
+#ifdef CONFIG_MEM_PAGING
+# define mem_paging_enabled(d) vm_event_check_ring((d)->vm_event_paging)
+#else
+# define mem_paging_enabled(d) false
+#endif
+
 extern bool sched_smt_power_savings;
 extern bool sched_disable_smt_switching;
 




* [PATCH v2 07/12] mm: allow page scrubbing routine(s) to be arch controlled
  2021-05-27 12:29 [PATCH v2 00/12] x86: memcpy() / memset() (non-)ERMS flavors plus fallout Jan Beulich
                   ` (5 preceding siblings ...)
  2021-05-27 12:32 ` [PATCH v2 06/12] page-alloc: make scrub_one_page() static Jan Beulich
@ 2021-05-27 12:33 ` Jan Beulich
  2021-05-27 13:06   ` Julien Grall
  2021-05-27 12:34 ` [PATCH v2 08/12] x86: move .text.kexec Jan Beulich
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2021-05-27 12:33 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monné,
	Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, Wei Liu

Especially when dealing with large amounts of memory, memset() may not
be very efficient; this can be bad enough that even for debug builds a
custom function is warranted. We additionally want to distinguish "hot"
and "cold" cases.

Keep the default fallback to clear_page_*() in common code; this may
want to be revisited down the road.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.
---
The choice between hot and cold in scrub_one_page()'s callers is
certainly up for discussion / improvement.
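For reference, the one heuristic the series does adopt (in
alloc_heap_pages()) can be modelled as below; scrub_cold() is a made-up
name for the predicate, and the rationale in the comment is my reading
of the change, not the author's wording:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct domain { int id; };   /* stand-in for Xen's struct domain */

/*
 * Model of "bool cold = d && d != current->domain": pages allocated on
 * behalf of a domain other than the currently running one are unlikely
 * to be touched by this CPU soon, so the cache-avoiding "cold" scrub is
 * used; anonymous (d == NULL) and self allocations use the "hot" one.
 */
static bool scrub_cold(const struct domain *d, const struct domain *curr)
{
    return d && d != curr;
}
```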

--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -55,6 +55,7 @@ obj-y += percpu.o
 obj-y += physdev.o
 obj-$(CONFIG_COMPAT) += x86_64/physdev.o
 obj-y += psr.o
+obj-bin-$(CONFIG_DEBUG) += scrub_page.o
 obj-y += setup.o
 obj-y += shutdown.o
 obj-y += smp.o
--- /dev/null
+++ b/xen/arch/x86/scrub_page.S
@@ -0,0 +1,41 @@
+        .file __FILE__
+
+#include <asm/asm_defns.h>
+#include <xen/page-size.h>
+#include <xen/scrub.h>
+
+ENTRY(scrub_page_cold)
+        mov     $PAGE_SIZE/32, %ecx
+        mov     $SCRUB_PATTERN, %rax
+
+0:      movnti  %rax,   (%rdi)
+        movnti  %rax,  8(%rdi)
+        movnti  %rax, 16(%rdi)
+        movnti  %rax, 24(%rdi)
+        add     $32, %rdi
+        sub     $1, %ecx
+        jnz     0b
+
+        sfence
+        ret
+        .type scrub_page_cold, @function
+        .size scrub_page_cold, . - scrub_page_cold
+
+        .macro scrub_page_stosb
+        mov     $PAGE_SIZE, %ecx
+        mov     $SCRUB_BYTE_PATTERN, %eax
+        rep stosb
+        ret
+        .endm
+
+        .macro scrub_page_stosq
+        mov     $PAGE_SIZE/8, %ecx
+        mov     $SCRUB_PATTERN, %rax
+        rep stosq
+        ret
+        .endm
+
+ENTRY(scrub_page_hot)
+        ALTERNATIVE scrub_page_stosq, scrub_page_stosb, X86_FEATURE_ERMS
+        .type scrub_page_hot, @function
+        .size scrub_page_hot, . - scrub_page_hot
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -124,6 +124,7 @@
 #include <xen/types.h>
 #include <xen/lib.h>
 #include <xen/sched.h>
+#include <xen/scrub.h>
 #include <xen/spinlock.h>
 #include <xen/mm.h>
 #include <xen/param.h>
@@ -750,27 +751,31 @@ static void page_list_add_scrub(struct p
         page_list_add(pg, &heap(node, zone, order));
 }
 
-/* SCRUB_PATTERN needs to be a repeating series of bytes. */
-#ifndef NDEBUG
-#define SCRUB_PATTERN        0xc2c2c2c2c2c2c2c2ULL
-#else
-#define SCRUB_PATTERN        0ULL
+/*
+ * While in debug builds we want callers to avoid relying on allocations
+ * returning zeroed pages, for a production build, clear_page_*() is the
+ * fastest way to scrub.
+ */
+#ifndef CONFIG_DEBUG
+# undef  scrub_page_hot
+# define scrub_page_hot clear_page_hot
+# undef  scrub_page_cold
+# define scrub_page_cold clear_page_cold
 #endif
-#define SCRUB_BYTE_PATTERN   (SCRUB_PATTERN & 0xff)
 
-static void scrub_one_page(const struct page_info *pg)
+static void scrub_one_page(const struct page_info *pg, bool cold)
 {
+    void *ptr;
+
     if ( unlikely(pg->count_info & PGC_broken) )
         return;
 
-#ifndef NDEBUG
-    /* Avoid callers relying on allocations returning zeroed pages. */
-    unmap_domain_page(memset(__map_domain_page(pg),
-                             SCRUB_BYTE_PATTERN, PAGE_SIZE));
-#else
-    /* For a production build, clear_page() is the fastest way to scrub. */
-    clear_domain_page(_mfn(page_to_mfn(pg)));
-#endif
+    ptr = __map_domain_page(pg);
+    if ( cold )
+        scrub_page_cold(ptr);
+    else
+        scrub_page_hot(ptr);
+    unmap_domain_page(ptr);
 }
 
 static void poison_one_page(struct page_info *pg)
@@ -1046,12 +1051,14 @@ static struct page_info *alloc_heap_page
     if ( first_dirty != INVALID_DIRTY_IDX ||
          (scrub_debug && !(memflags & MEMF_no_scrub)) )
     {
+        bool cold = d && d != current->domain;
+
         for ( i = 0; i < (1U << order); i++ )
         {
             if ( test_and_clear_bit(_PGC_need_scrub, &pg[i].count_info) )
             {
                 if ( !(memflags & MEMF_no_scrub) )
-                    scrub_one_page(&pg[i]);
+                    scrub_one_page(&pg[i], cold);
 
                 dirty_cnt++;
             }
@@ -1308,7 +1315,7 @@ bool scrub_free_pages(void)
                 {
                     if ( test_bit(_PGC_need_scrub, &pg[i].count_info) )
                     {
-                        scrub_one_page(&pg[i]);
+                        scrub_one_page(&pg[i], true);
                         /*
                          * We can modify count_info without holding heap
                          * lock since we effectively locked this buddy by
@@ -1947,7 +1954,7 @@ static void __init smp_scrub_heap_pages(
         if ( !mfn_valid(_mfn(mfn)) || !page_state_is(pg, free) )
             continue;
 
-        scrub_one_page(pg);
+        scrub_one_page(pg, true);
     }
 }
 
--- a/xen/include/asm-arm/page.h
+++ b/xen/include/asm-arm/page.h
@@ -135,6 +135,12 @@ extern size_t dcache_line_bytes;
 
 #define copy_page(dp, sp) memcpy(dp, sp, PAGE_SIZE)
 
+#define clear_page_hot  clear_page
+#define clear_page_cold clear_page
+
+#define scrub_page_hot(page) memset(page, SCRUB_BYTE_PATTERN, PAGE_SIZE)
+#define scrub_page_cold      scrub_page_hot
+
 static inline size_t read_dcache_line_bytes(void)
 {
     register_t ctr;
--- a/xen/include/asm-x86/page.h
+++ b/xen/include/asm-x86/page.h
@@ -239,6 +239,11 @@ void copy_page_sse2(void *, const void *
 #define clear_page(_p)      clear_page_cold(_p)
 #define copy_page(_t, _f)   copy_page_sse2(_t, _f)
 
+#ifdef CONFIG_DEBUG
+void scrub_page_hot(void *);
+void scrub_page_cold(void *);
+#endif
+
 /* Convert between Xen-heap virtual addresses and machine addresses. */
 #define __pa(x)             (virt_to_maddr(x))
 #define __va(x)             (maddr_to_virt(x))
--- /dev/null
+++ b/xen/include/xen/scrub.h
@@ -0,0 +1,24 @@
+#ifndef __XEN_SCRUB_H__
+#define __XEN_SCRUB_H__
+
+#include <xen/const.h>
+
+/* SCRUB_PATTERN needs to be a repeating series of bytes. */
+#ifdef CONFIG_DEBUG
+# define SCRUB_PATTERN       _AC(0xc2c2c2c2c2c2c2c2,ULL)
+#else
+# define SCRUB_PATTERN       _AC(0,ULL)
+#endif
+#define SCRUB_BYTE_PATTERN   (SCRUB_PATTERN & 0xff)
+
+#endif /* __XEN_SCRUB_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 08/12] x86: move .text.kexec
  2021-05-27 12:29 [PATCH v2 00/12] x86: memcpy() / memset() (non-)ERMS flavors plus fallout Jan Beulich
                   ` (6 preceding siblings ...)
  2021-05-27 12:33 ` [PATCH v2 07/12] mm: allow page scrubbing routine(s) to be arch controlled Jan Beulich
@ 2021-05-27 12:34 ` Jan Beulich
  2022-02-18 13:34   ` Andrew Cooper
  2021-05-27 12:34 ` [PATCH v2 09/12] video/vesa: unmap frame buffer when relinquishing console Jan Beulich
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2021-05-27 12:34 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

The source file requests page alignment - avoid a padding hole by
placing it right after .text.entry. On average this yields a .text size
reduction of 2k.

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -83,10 +83,11 @@ SECTIONS
        . = ALIGN(PAGE_SIZE);
        _etextentry = .;
 
+       *(.text.kexec)          /* Page aligned in the object file. */
+
        *(.text.cold)
        *(.text.unlikely)
        *(.fixup)
-       *(.text.kexec)
        *(.gnu.warning)
        _etext = .;             /* End of text section */
   } PHDR(text) = 0x9090




* [PATCH v2 09/12] video/vesa: unmap frame buffer when relinquishing console
  2021-05-27 12:29 [PATCH v2 00/12] x86: memcpy() / memset() (non-)ERMS flavors plus fallout Jan Beulich
                   ` (7 preceding siblings ...)
  2021-05-27 12:34 ` [PATCH v2 08/12] x86: move .text.kexec Jan Beulich
@ 2021-05-27 12:34 ` Jan Beulich
  2022-02-18 13:36   ` Andrew Cooper
  2021-05-27 12:35 ` [PATCH v2 10/12] video/vesa: drop "vesa-mtrr" command line option Jan Beulich
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2021-05-27 12:34 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

There's no point in keeping the VA space occupied when no further output
will occur.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/video/lfb.c
+++ b/xen/drivers/video/lfb.c
@@ -168,4 +168,5 @@ void lfb_free(void)
     xfree(lfb.lbuf);
     xfree(lfb.text_buf);
     xfree(lfb.line_len);
+    lfb.lfbp.lfb = ZERO_BLOCK_PTR;
 }
--- a/xen/drivers/video/vesa.c
+++ b/xen/drivers/video/vesa.c
@@ -197,5 +197,7 @@ void __init vesa_endboot(bool_t keep)
                    vlfb_info.width * bpp);
         lfb_flush();
         lfb_free();
+        iounmap(lfb);
+        lfb = ZERO_BLOCK_PTR;
     }
 }




* [PATCH v2 10/12] video/vesa: drop "vesa-mtrr" command line option
  2021-05-27 12:29 [PATCH v2 00/12] x86: memcpy() / memset() (non-)ERMS flavors plus fallout Jan Beulich
                   ` (8 preceding siblings ...)
  2021-05-27 12:34 ` [PATCH v2 09/12] video/vesa: unmap frame buffer when relinquishing console Jan Beulich
@ 2021-05-27 12:35 ` Jan Beulich
  2021-05-27 12:35 ` [PATCH v2 11/12] video/vesa: drop "vesa-remap" " Jan Beulich
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2021-05-27 12:35 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

Now that we use ioremap_wc() for mapping the frame buffer, there's no
need for this option anymore. As noted in the change introducing the
use of ioremap_wc(), mtrr_add() didn't work in certain cases anyway.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,9 +6,10 @@ The format is based on [Keep a Changelog
 
 ## [unstable UNRELEASED](https://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=staging) - TBD
 
-### Removed
+### Removed / support downgraded
  - XENSTORED_ROOTDIR environment variable from configuartion files and
    initscripts, due to being unused.
+ - dropped support for the (x86-only) "vesa-mtrr" command line option
 
 ## [4.15.0 UNRELEASED](https://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=RELEASE-4.15.0) - TBD
 
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2369,9 +2369,6 @@ cache-warming. 1ms (1000) has been measu
 ### vesa-map
 > `= <integer>`
 
-### vesa-mtrr
-> `= <integer>`
-
 ### vesa-ram
 > `= <integer>`
 
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1816,8 +1816,6 @@ void __init noreturn __start_xen(unsigne
 
     local_irq_enable();
 
-    vesa_mtrr_init();
-
     early_msi_init();
 
     iommu_setup();    /* setup iommu if available */
--- a/xen/drivers/video/vesa.c
+++ b/xen/drivers/video/vesa.c
@@ -145,38 +145,6 @@ void __init vesa_init(void)
     video_puts = lfb_redraw_puts;
 }
 
-#include <asm/mtrr.h>
-
-static unsigned int vesa_mtrr;
-integer_param("vesa-mtrr", vesa_mtrr);
-
-void __init vesa_mtrr_init(void)
-{
-    static const int mtrr_types[] = {
-        0, MTRR_TYPE_UNCACHABLE, MTRR_TYPE_WRBACK,
-        MTRR_TYPE_WRCOMB, MTRR_TYPE_WRTHROUGH };
-    unsigned int size_total;
-    int rc, type;
-
-    if ( !lfb || (vesa_mtrr == 0) || (vesa_mtrr >= ARRAY_SIZE(mtrr_types)) )
-        return;
-
-    type = mtrr_types[vesa_mtrr];
-    if ( !type )
-        return;
-
-    /* Find the largest power-of-two */
-    size_total = vram_total;
-    while ( size_total & (size_total - 1) )
-        size_total &= size_total - 1;
-
-    /* Try and find a power of two to add */
-    do {
-        rc = mtrr_add(lfb_base(), size_total, type, 1);
-        size_total >>= 1;
-    } while ( (size_total >= PAGE_SIZE) && (rc == -EINVAL) );
-}
-
 static void lfb_flush(void)
 {
     __asm__ __volatile__ ("sfence" : : : "memory");
--- a/xen/include/asm-x86/setup.h
+++ b/xen/include/asm-x86/setup.h
@@ -25,10 +25,8 @@ void init_IRQ(void);
 
 #ifdef CONFIG_VIDEO
 void vesa_init(void);
-void vesa_mtrr_init(void);
 #else
 static inline void vesa_init(void) {};
-static inline void vesa_mtrr_init(void) {};
 #endif
 
 int construct_dom0(




* [PATCH v2 11/12] video/vesa: drop "vesa-remap" command line option
  2021-05-27 12:29 [PATCH v2 00/12] x86: memcpy() / memset() (non-)ERMS flavors plus fallout Jan Beulich
                   ` (9 preceding siblings ...)
  2021-05-27 12:35 ` [PATCH v2 10/12] video/vesa: drop "vesa-mtrr" command line option Jan Beulich
@ 2021-05-27 12:35 ` Jan Beulich
  2022-02-18 13:35   ` Andrew Cooper
  2021-05-27 12:36 ` [PATCH v2 12/12] video/vesa: adjust (not just) command line option handling Jan Beulich
  2022-02-17 11:01 ` [PATCH RESEND v2] x86: introduce ioremap_wc() Jan Beulich
  12 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2021-05-27 12:35 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

If we get mode dimensions wrong, having the remapping size controllable
via command line option isn't going to help much. Drop the option.

While adjusting this also
- add __initdata to the variable,
- use ROUNDUP() instead of open-coding it.

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,7 +9,7 @@ The format is based on [Keep a Changelog
 ### Removed / support downgraded
  - XENSTORED_ROOTDIR environment variable from configuartion files and
    initscripts, due to being unused.
- - dropped support for the (x86-only) "vesa-mtrr" command line option
+ - dropped support for the (x86-only) "vesa-mtrr" and "vesa-remap" command line options
 
 ## [4.15.0 UNRELEASED](https://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=RELEASE-4.15.0) - TBD
 
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2366,9 +2366,6 @@ PCPUs when using the credit1 scheduler.
 of a VCPU between CPUs, and reduces the implicit overheads such as
 cache-warming. 1ms (1000) has been measured as a good value.
 
-### vesa-map
-> `= <integer>`
-
 ### vesa-ram
 > `= <integer>`
 
--- a/xen/drivers/video/vesa.c
+++ b/xen/drivers/video/vesa.c
@@ -26,8 +26,7 @@ static bool_t vga_compat;
 static unsigned int vram_total;
 integer_param("vesa-ram", vram_total);
 
-static unsigned int vram_remap;
-integer_param("vesa-map", vram_remap);
+static unsigned int __initdata vram_remap;
 
 static int font_height;
 static int __init parse_font_height(const char *s)
@@ -79,12 +78,8 @@ void __init vesa_early_init(void)
      *                 use for vesafb.  With modern cards it is no
      *                 option to simply use vram_total as that
      *                 wastes plenty of kernel address space. */
-    vram_remap = (vram_remap ?
-                  (vram_remap << 20) :
-                  ((vram_vmode + (1 << L2_PAGETABLE_SHIFT) - 1) &
-                   ~((1 << L2_PAGETABLE_SHIFT) - 1)));
-    vram_remap = max_t(unsigned int, vram_remap, vram_vmode);
-    vram_remap = min_t(unsigned int, vram_remap, vram_total);
+    vram_remap = ROUNDUP(vram_vmode, 1 << L2_PAGETABLE_SHIFT);
+    vram_remap = min(vram_remap, vram_total);
 }
 
 void __init vesa_init(void)




* [PATCH v2 12/12] video/vesa: adjust (not just) command line option handling
  2021-05-27 12:29 [PATCH v2 00/12] x86: memcpy() / memset() (non-)ERMS flavors plus fallout Jan Beulich
                   ` (10 preceding siblings ...)
  2021-05-27 12:35 ` [PATCH v2 11/12] video/vesa: drop "vesa-remap" " Jan Beulich
@ 2021-05-27 12:36 ` Jan Beulich
  2022-02-17 11:01 ` [PATCH RESEND v2] x86: introduce ioremap_wc() Jan Beulich
  12 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2021-05-27 12:36 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

Document the remaining option. Add section annotation to the variable
holding the parsed value as well as a few adjacent ones. Adjust the
types of font_height and vga_compat.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v2: Re-base over added earlier patch.

--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2369,6 +2369,11 @@ cache-warming. 1ms (1000) has been measu
 ### vesa-ram
 > `= <integer>`
 
+> Default: `0`
+
+This allows overriding the amount of video RAM, in MiB, determined to be
+present.
+
 ### vga
 > `= ( ask | current | text-80x<rows> | gfx-<width>x<height>x<depth> | mode-<mode> )[,keep]`
 
--- a/xen/drivers/video/vesa.c
+++ b/xen/drivers/video/vesa.c
@@ -19,16 +19,16 @@
 
 static void lfb_flush(void);
 
-static unsigned char *lfb;
-static const struct font_desc *font;
-static bool_t vga_compat;
+static unsigned char *__read_mostly lfb;
+static const struct font_desc *__initdata font;
+static bool __initdata vga_compat;
 
-static unsigned int vram_total;
+static unsigned int __initdata vram_total;
 integer_param("vesa-ram", vram_total);
 
 static unsigned int __initdata vram_remap;
 
-static int font_height;
+static unsigned int __initdata font_height;
 static int __init parse_font_height(const char *s)
 {
     if ( simple_strtoul(s, &s, 10) == 8 && (*s++ == 'x') )




* Re: [PATCH v2 01/12] x86: introduce ioremap_wc()
  2021-05-27 12:30 ` [PATCH v2 01/12] x86: introduce ioremap_wc() Jan Beulich
@ 2021-05-27 12:48   ` Julien Grall
  2021-05-27 13:09     ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: Julien Grall @ 2021-05-27 12:48 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

Hi Jan,

On 27/05/2021 13:30, Jan Beulich wrote:
> In order for a to-be-introduced ERMS form of memcpy() to not regress
> boot performance on certain systems when video output is active, we
> first need to arrange for avoiding further dependency on firmware
> setting up MTRRs in a way we can actually further modify. On many
> systems, due to the continuously growing amounts of installed memory,
> MTRRs get configured with at least one huge WB range, and with MMIO
> ranges below 4Gb then forced to UC via overlapping MTRRs. mtrr_add(), as
> it is today, can't deal with such a setup. Hence on such systems we
> presently leave the frame buffer mapped UC, leading to significantly
> reduced performance when using REP STOSB / REP MOVSB.
> 
> On post-PentiumII hardware (i.e. any that's capable of running 64-bit
> code), an effective memory type of WC can be achieved without MTRRs, by
> simply referencing the respective PAT entry from the PTEs. While this
> will leave the switch to ERMS forms of memset() and memcpy() with
> largely unchanged performance, the change here on its own improves
> performance on affected systems quite significantly: Measuring just the
> individual affected memcpy() invocations yielded a speedup by a factor
> of over 250 on my initial (Skylake) test system. memset() isn't getting
> improved by as much there, but still by a factor of about 20.
> 
> While adding {__,}PAGE_HYPERVISOR_WC, also add {__,}PAGE_HYPERVISOR_WT
> to, at the very least, make clear what PTE flags this memory type uses.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: Mark ioremap_wc() __init.
> ---
> TBD: If the VGA range is WC in the fixed range MTRRs, reusing the low
>       1st Mb mapping (like ioremap() does) would be an option.
> 
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -5881,6 +5881,20 @@ void __iomem *ioremap(paddr_t pa, size_t
>       return (void __force __iomem *)va;
>   }
>   
> +void __iomem *__init ioremap_wc(paddr_t pa, size_t len)
> +{
> +    mfn_t mfn = _mfn(PFN_DOWN(pa));
> +    unsigned int offs = pa & (PAGE_SIZE - 1);
> +    unsigned int nr = PFN_UP(offs + len);
> +    void *va;
> +
> +    WARN_ON(page_is_ram_type(mfn_x(mfn), RAM_TYPE_CONVENTIONAL));
> +
> +    va = __vmap(&mfn, nr, 1, 1, PAGE_HYPERVISOR_WC, VMAP_DEFAULT);
> +
> +    return (void __force __iomem *)(va + offs);
> +}

Arm is already providing ioremap_wc() which is a wrapper to 
ioremap_attr(). Can this be moved to the common code to avoid duplication?

Cheers,

-- 
Julien Grall



* Re: [PATCH v2 07/12] mm: allow page scrubbing routine(s) to be arch controlled
  2021-05-27 12:33 ` [PATCH v2 07/12] mm: allow page scrubbing routine(s) to be arch controlled Jan Beulich
@ 2021-05-27 13:06   ` Julien Grall
  2021-05-27 13:58     ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: Julien Grall @ 2021-05-27 13:06 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Roger Pau Monné,
	Andrew Cooper, George Dunlap, Ian Jackson, Stefano Stabellini,
	Wei Liu

Hi Jan,

On 27/05/2021 13:33, Jan Beulich wrote:
> Especially when dealing with large amounts of memory, memset() may not
> be very efficient; this can be bad enough that even for debug builds a
> custom function is warranted. We additionally want to distinguish "hot"
> and "cold" cases.

Do you have any benchmark showing the performance improvement?

> 
> Keep the default fallback to clear_page_*() in common code; this may
> want to be revisited down the road.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: New.
> ---
> The choice between hot and cold in scrub_one_page()'s callers is
> certainly up for discussion / improvement.

To get the discussion started, can you explain how you made the decision 
between hot/cold? This will also want to be written down in the commit 
message.

> 
> --- a/xen/arch/x86/Makefile
> +++ b/xen/arch/x86/Makefile
> @@ -55,6 +55,7 @@ obj-y += percpu.o
>   obj-y += physdev.o
>   obj-$(CONFIG_COMPAT) += x86_64/physdev.o
>   obj-y += psr.o
> +obj-bin-$(CONFIG_DEBUG) += scrub_page.o
>   obj-y += setup.o
>   obj-y += shutdown.o
>   obj-y += smp.o
> --- /dev/null
> +++ b/xen/arch/x86/scrub_page.S
> @@ -0,0 +1,41 @@
> +        .file __FILE__
> +
> +#include <asm/asm_defns.h>
> +#include <xen/page-size.h>
> +#include <xen/scrub.h>
> +
> +ENTRY(scrub_page_cold)
> +        mov     $PAGE_SIZE/32, %ecx
> +        mov     $SCRUB_PATTERN, %rax
> +
> +0:      movnti  %rax,   (%rdi)
> +        movnti  %rax,  8(%rdi)
> +        movnti  %rax, 16(%rdi)
> +        movnti  %rax, 24(%rdi)
> +        add     $32, %rdi
> +        sub     $1, %ecx
> +        jnz     0b
> +
> +        sfence
> +        ret
> +        .type scrub_page_cold, @function
> +        .size scrub_page_cold, . - scrub_page_cold
> +
> +        .macro scrub_page_stosb
> +        mov     $PAGE_SIZE, %ecx
> +        mov     $SCRUB_BYTE_PATTERN, %eax
> +        rep stosb
> +        ret
> +        .endm
> +
> +        .macro scrub_page_stosq
> +        mov     $PAGE_SIZE/8, %ecx
> +        mov     $SCRUB_PATTERN, %rax
> +        rep stosq
> +        ret
> +        .endm
> +
> +ENTRY(scrub_page_hot)
> +        ALTERNATIVE scrub_page_stosq, scrub_page_stosb, X86_FEATURE_ERMS
> +        .type scrub_page_hot, @function
> +        .size scrub_page_hot, . - scrub_page_hot

From the commit message, it is not clear how the implementation for 
hot/cold was chosen. Can you outline in the commit message what are the 
assumption for each helper?

This will be helpful for anyone who may notice a regression, or even for 
other arches if they need to implement it.

> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -124,6 +124,7 @@
>   #include <xen/types.h>
>   #include <xen/lib.h>
>   #include <xen/sched.h>
> +#include <xen/scrub.h>
>   #include <xen/spinlock.h>
>   #include <xen/mm.h>
>   #include <xen/param.h>
> @@ -750,27 +751,31 @@ static void page_list_add_scrub(struct p
>           page_list_add(pg, &heap(node, zone, order));
>   }
>   
> -/* SCRUB_PATTERN needs to be a repeating series of bytes. */
> -#ifndef NDEBUG
> -#define SCRUB_PATTERN        0xc2c2c2c2c2c2c2c2ULL
> -#else
> -#define SCRUB_PATTERN        0ULL
> +/*
> + * While in debug builds we want callers to avoid relying on allocations
> + * returning zeroed pages, for a production build, clear_page_*() is the
> + * fastest way to scrub.
> + */
> +#ifndef CONFIG_DEBUG
> +# undef  scrub_page_hot
> +# define scrub_page_hot clear_page_hot
> +# undef  scrub_page_cold
> +# define scrub_page_cold clear_page_cold
>   #endif
> -#define SCRUB_BYTE_PATTERN   (SCRUB_PATTERN & 0xff)
>   
> -static void scrub_one_page(const struct page_info *pg)
> +static void scrub_one_page(const struct page_info *pg, bool cold)
>   {
> +    void *ptr;
> +
>       if ( unlikely(pg->count_info & PGC_broken) )
>           return;
>   
> -#ifndef NDEBUG
> -    /* Avoid callers relying on allocations returning zeroed pages. */
> -    unmap_domain_page(memset(__map_domain_page(pg),
> -                             SCRUB_BYTE_PATTERN, PAGE_SIZE));
> -#else
> -    /* For a production build, clear_page() is the fastest way to scrub. */
> -    clear_domain_page(_mfn(page_to_mfn(pg)));
> -#endif
> +    ptr = __map_domain_page(pg);
> +    if ( cold )
> +        scrub_page_cold(ptr);
> +    else
> +        scrub_page_hot(ptr);
> +    unmap_domain_page(ptr);
>   }
>   
>   static void poison_one_page(struct page_info *pg)
> @@ -1046,12 +1051,14 @@ static struct page_info *alloc_heap_page
>       if ( first_dirty != INVALID_DIRTY_IDX ||
>            (scrub_debug && !(memflags & MEMF_no_scrub)) )
>       {
> +        bool cold = d && d != current->domain;

So the assumption is that if the domain is not running, then the content is 
not in the cache. Is that correct?

> +
>           for ( i = 0; i < (1U << order); i++ )
>           {
>               if ( test_and_clear_bit(_PGC_need_scrub, &pg[i].count_info) )
>               {
>                   if ( !(memflags & MEMF_no_scrub) )
> -                    scrub_one_page(&pg[i]);
> +                    scrub_one_page(&pg[i], cold);
>   
>                   dirty_cnt++;
>               }
> @@ -1308,7 +1315,7 @@ bool scrub_free_pages(void)
>                   {
>                       if ( test_bit(_PGC_need_scrub, &pg[i].count_info) )
>                       {
> -                        scrub_one_page(&pg[i]);
> +                        scrub_one_page(&pg[i], true);
>                           /*
>                            * We can modify count_info without holding heap
>                            * lock since we effectively locked this buddy by
> @@ -1947,7 +1954,7 @@ static void __init smp_scrub_heap_pages(
>           if ( !mfn_valid(_mfn(mfn)) || !page_state_is(pg, free) )
>               continue;
>   
> -        scrub_one_page(pg);
> +        scrub_one_page(pg, true);
>       }
>   }
>   
> --- a/xen/include/asm-arm/page.h
> +++ b/xen/include/asm-arm/page.h
> @@ -135,6 +135,12 @@ extern size_t dcache_line_bytes;
>   
>   #define copy_page(dp, sp) memcpy(dp, sp, PAGE_SIZE)
>   
> +#define clear_page_hot  clear_page
> +#define clear_page_cold clear_page
> +
> +#define scrub_page_hot(page) memset(page, SCRUB_BYTE_PATTERN, PAGE_SIZE)
> +#define scrub_page_cold      scrub_page_hot
> +
>   static inline size_t read_dcache_line_bytes(void)
>   {
>       register_t ctr;
> --- a/xen/include/asm-x86/page.h
> +++ b/xen/include/asm-x86/page.h
> @@ -239,6 +239,11 @@ void copy_page_sse2(void *, const void *
>   #define clear_page(_p)      clear_page_cold(_p)
>   #define copy_page(_t, _f)   copy_page_sse2(_t, _f)
>   
> +#ifdef CONFIG_DEBUG
> +void scrub_page_hot(void *);
> +void scrub_page_cold(void *);
> +#endif
> +
>   /* Convert between Xen-heap virtual addresses and machine addresses. */
>   #define __pa(x)             (virt_to_maddr(x))
>   #define __va(x)             (maddr_to_virt(x))
> --- /dev/null
> +++ b/xen/include/xen/scrub.h
> @@ -0,0 +1,24 @@
> +#ifndef __XEN_SCRUB_H__
> +#define __XEN_SCRUB_H__
> +
> +#include <xen/const.h>
> +
> +/* SCRUB_PATTERN needs to be a repeating series of bytes. */
> +#ifdef CONFIG_DEBUG
> +# define SCRUB_PATTERN       _AC(0xc2c2c2c2c2c2c2c2,ULL)
> +#else
> +# define SCRUB_PATTERN       _AC(0,ULL)
> +#endif
> +#define SCRUB_BYTE_PATTERN   (SCRUB_PATTERN & 0xff)
> +
> +#endif /* __XEN_SCRUB_H__ */
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> 
> 

Cheers,

-- 
Julien Grall



* Re: [PATCH v2 01/12] x86: introduce ioremap_wc()
  2021-05-27 12:48   ` Julien Grall
@ 2021-05-27 13:09     ` Jan Beulich
  2021-05-27 13:30       ` Julien Grall
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2021-05-27 13:09 UTC (permalink / raw)
  To: Julien Grall; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, xen-devel

On 27.05.2021 14:48, Julien Grall wrote:
> On 27/05/2021 13:30, Jan Beulich wrote:
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -5881,6 +5881,20 @@ void __iomem *ioremap(paddr_t pa, size_t
>>       return (void __force __iomem *)va;
>>   }
>>   
>> +void __iomem *__init ioremap_wc(paddr_t pa, size_t len)
>> +{
>> +    mfn_t mfn = _mfn(PFN_DOWN(pa));
>> +    unsigned int offs = pa & (PAGE_SIZE - 1);
>> +    unsigned int nr = PFN_UP(offs + len);
>> +    void *va;
>> +
>> +    WARN_ON(page_is_ram_type(mfn_x(mfn), RAM_TYPE_CONVENTIONAL));
>> +
>> +    va = __vmap(&mfn, nr, 1, 1, PAGE_HYPERVISOR_WC, VMAP_DEFAULT);
>> +
>> +    return (void __force __iomem *)(va + offs);
>> +}
> 
> Arm is already providing ioremap_wc() which is a wrapper to 
> ioremap_attr().

I did notice this, yes.

> Can this be moved to the common code to avoid duplication?

If by "this" you mean ioremap_attr(), then I wasn't convinced we want
a function of this name on x86. In particular you may note that
x86'es ioremap() is sort of the equivalent of Arm's ioremap_nocache(),
but is different from the new ioremap_wc() by more than just the
different PTE attributes.

Also I was specifically asked to make ioremap_wc() __init; ioremap()
cannot be, because of at least the use from pci_vtd_quirk().

Plus I'd need to clean up Arm's lack of __iomem if I wanted to fold
things. Or wait - it's the declaration and definition which are out of
sync there, i.e. a pre-existing issue.

Bottom line - while I did consider folding, I don't think that's
feasible at this point in time.

Jan



* Re: [PATCH v2 01/12] x86: introduce ioremap_wc()
  2021-05-27 13:09     ` Jan Beulich
@ 2021-05-27 13:30       ` Julien Grall
  2021-05-27 14:57         ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: Julien Grall @ 2021-05-27 13:30 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, xen-devel

Hi Jan,

On 27/05/2021 14:09, Jan Beulich wrote:
> On 27.05.2021 14:48, Julien Grall wrote:
>> On 27/05/2021 13:30, Jan Beulich wrote:
>>> --- a/xen/arch/x86/mm.c
>>> +++ b/xen/arch/x86/mm.c
>>> @@ -5881,6 +5881,20 @@ void __iomem *ioremap(paddr_t pa, size_t
>>>        return (void __force __iomem *)va;
>>>    }
>>>    
>>> +void __iomem *__init ioremap_wc(paddr_t pa, size_t len)
>>> +{
>>> +    mfn_t mfn = _mfn(PFN_DOWN(pa));
>>> +    unsigned int offs = pa & (PAGE_SIZE - 1);
>>> +    unsigned int nr = PFN_UP(offs + len);
>>> +    void *va;
>>> +
>>> +    WARN_ON(page_is_ram_type(mfn_x(mfn), RAM_TYPE_CONVENTIONAL));
>>> +
>>> +    va = __vmap(&mfn, nr, 1, 1, PAGE_HYPERVISOR_WC, VMAP_DEFAULT);
>>> +
>>> +    return (void __force __iomem *)(va + offs);
>>> +}
>>
>> Arm is already providing ioremap_wc() which is a wrapper to
>> ioremap_attr().
> 
> I did notice this, yes.
> 
>> Can this be moved to the common code to avoid duplication?
> 
> If by "this" you mean ioremap_attr(), then I wasn't convinced we want
> a function of this name on x86.

I am open to other name.

> In particular you may note that
> x86'es ioremap() is sort of the equivalent of Arm's ioremap_nocache(),
> but is different from the new ioremap_wc() by more than just the
> different PTE attributes.
That's because ioremap() will not vmap() the first MB, am I correct? If 
so, I am not sure why you want to do that in ioremap() but not 
ioremap_wc(). Wouldn't this result in accessing the memory with mismatched 
attributes?

> Also I was specifically asked to make ioremap_wc() __init; ioremap()
> cannot be, because of at least the use from pci_vtd_quirk().

I am not sure this is relevant to the conversation here. I am sure there 
are other functions that would benefit from being __init in one arch but 
can't in the other. Yet, common code can be beneficial.

> 
> Plus I'd need to clean up Arm's lack of __iomem if I wanted to fold
> things. 

__iomem is a NOP on Xen. So while the annotation may not be consistently 
used, I don't see the clean-up as a requirement to consolidate the code...

> Or wait - it's the declaration and definition which are out of
> sync there, i.e. a pre-existing issue.

We don't usually add __init on both the declaration and definition. So 
why would it be necessary to add __iomem in both cases?

Cheers,

-- 
Julien Grall



* Re: [PATCH v2 07/12] mm: allow page scrubbing routine(s) to be arch controlled
  2021-05-27 13:06   ` Julien Grall
@ 2021-05-27 13:58     ` Jan Beulich
  2021-06-03  9:39       ` Julien Grall
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2021-05-27 13:58 UTC (permalink / raw)
  To: Julien Grall
  Cc: Roger Pau Monné,
	Andrew Cooper, George Dunlap, Ian Jackson, Stefano Stabellini,
	Wei Liu, xen-devel

On 27.05.2021 15:06, Julien Grall wrote:
> On 27/05/2021 13:33, Jan Beulich wrote:
>> Especially when dealing with large amounts of memory, memset() may not
>> be very efficient; this can be bad enough that even for debug builds a
>> custom function is warranted. We additionally want to distinguish "hot"
>> and "cold" cases.
> 
> Do you have any benchmark showing the performance improvement?

This is based on the numbers provided at
https://lists.xen.org/archives/html/xen-devel/2021-04/msg00716.html
with the thread with some of the prior discussion rooted at
https://lists.xen.org/archives/html/xen-devel/2021-04/msg00425.html

I'm afraid I lack ideas on how to sensibly measure _all_ of the
effects (i.e. including the amount of disturbing of caches).

>> ---
>> The choice between hot and cold in scrub_one_page()'s callers is
>> certainly up for discussion / improvement.
> 
> To get the discussion started, can you explain how you made the decision 
> between hot/cold? This will also want to be written down in the commit 
> message.

Well, the initial trivial heuristic is "allocation for oneself" vs
"allocation for someone else, or freeing, or scrubbing", i.e. whether
it would be likely that the page will soon be accessed again (or for
the first time).

>> --- /dev/null
>> +++ b/xen/arch/x86/scrub_page.S
>> @@ -0,0 +1,41 @@
>> +        .file __FILE__
>> +
>> +#include <asm/asm_defns.h>
>> +#include <xen/page-size.h>
>> +#include <xen/scrub.h>
>> +
>> +ENTRY(scrub_page_cold)
>> +        mov     $PAGE_SIZE/32, %ecx
>> +        mov     $SCRUB_PATTERN, %rax
>> +
>> +0:      movnti  %rax,   (%rdi)
>> +        movnti  %rax,  8(%rdi)
>> +        movnti  %rax, 16(%rdi)
>> +        movnti  %rax, 24(%rdi)
>> +        add     $32, %rdi
>> +        sub     $1, %ecx
>> +        jnz     0b
>> +
>> +        sfence
>> +        ret
>> +        .type scrub_page_cold, @function
>> +        .size scrub_page_cold, . - scrub_page_cold
>> +
>> +        .macro scrub_page_stosb
>> +        mov     $PAGE_SIZE, %ecx
>> +        mov     $SCRUB_BYTE_PATTERN, %eax
>> +        rep stosb
>> +        ret
>> +        .endm
>> +
>> +        .macro scrub_page_stosq
>> +        mov     $PAGE_SIZE/8, %ecx
>> +        mov     $SCRUB_PATTERN, %rax
>> +        rep stosq
>> +        ret
>> +        .endm
>> +
>> +ENTRY(scrub_page_hot)
>> +        ALTERNATIVE scrub_page_stosq, scrub_page_stosb, X86_FEATURE_ERMS
>> +        .type scrub_page_hot, @function
>> +        .size scrub_page_hot, . - scrub_page_hot
> 
>  From the commit message, it is not clear how the implementation for 
> hot/cold was chosen. Can you outline in the commit message what the 
> assumptions for each helper are?

I've added 'The goal is for accesses of "cold" pages to not
disturb caches (albeit finding a good balance between this
and the higher latency looks to be difficult).'

>> @@ -1046,12 +1051,14 @@ static struct page_info *alloc_heap_page
>>       if ( first_dirty != INVALID_DIRTY_IDX ||
>>            (scrub_debug && !(memflags & MEMF_no_scrub)) )
>>       {
>> +        bool cold = d && d != current->domain;
> 
> So the assumption is if the domain is not running, then the content is 
> not in the cache. Is that correct?

Not exactly: For one, instead of "not running" it is "is not the current
domain", i.e. there may still be vCPU-s of the domain running elsewhere.
And for the cache the question isn't so much of "is in cache", but to
avoid needlessly bringing contents into the cache when the data is
unlikely to be used again soon.

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 01/12] x86: introduce ioremap_wc()
  2021-05-27 13:30       ` Julien Grall
@ 2021-05-27 14:57         ` Jan Beulich
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2021-05-27 14:57 UTC (permalink / raw)
  To: Julien Grall; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, xen-devel

On 27.05.2021 15:30, Julien Grall wrote:
> On 27/05/2021 14:09, Jan Beulich wrote:
>> On 27.05.2021 14:48, Julien Grall wrote:
>>> On 27/05/2021 13:30, Jan Beulich wrote:
>>>> --- a/xen/arch/x86/mm.c
>>>> +++ b/xen/arch/x86/mm.c
>>>> @@ -5881,6 +5881,20 @@ void __iomem *ioremap(paddr_t pa, size_t
>>>>        return (void __force __iomem *)va;
>>>>    }
>>>>    
>>>> +void __iomem *__init ioremap_wc(paddr_t pa, size_t len)
>>>> +{
>>>> +    mfn_t mfn = _mfn(PFN_DOWN(pa));
>>>> +    unsigned int offs = pa & (PAGE_SIZE - 1);
>>>> +    unsigned int nr = PFN_UP(offs + len);
>>>> +    void *va;
>>>> +
>>>> +    WARN_ON(page_is_ram_type(mfn_x(mfn), RAM_TYPE_CONVENTIONAL));
>>>> +
>>>> +    va = __vmap(&mfn, nr, 1, 1, PAGE_HYPERVISOR_WC, VMAP_DEFAULT);
>>>> +
>>>> +    return (void __force __iomem *)(va + offs);
>>>> +}
>>>
>>> Arm is already providing ioremap_wc() which is a wrapper to
>>> ioremap_attr().
>>
>> I did notice this, yes.
>>
>>> Can this be moved to the common code to avoid duplication?
>>
>> If by "this" you mean ioremap_attr(), then I wasn't convinced we want
>> a function of this name on x86.
> 
> I am open to other name.

My remark wasn't so much about the name, but about there being a
"more capable" backing function for a number of wrappers.

>> In particular you may note that
>> x86'es ioremap() is sort of the equivalent of Arm's ioremap_nocache(),
>> but is different from the new ioremap_wc() by more than just the
>> different PTE attributes.
> That's because ioremap() will not vmap() the first MB, am I correct? If 
> so, I am not sure why you want to do that in ioremap() but not 
> ioremap_wc(). Wouldn't this result in accessing the memory with 
> mismatched attributes?

UC and WC aren't really conflicting cache attributes - they both
fall in the "uncachable" category. In fact I have a TBD in the
post-commit-message area regarding this very aspect of possibly
reusing the low 1Mb mapping.

>> Plus I'd need to clean up Arm's lack of __iomem if I wanted to fold
>> things. 
> 
> __iomem is a NOP on Xen. So while the annotation may not be consistently 
> used, I don't see the clean-up as a requirement to consolidate the code...
> 
>> Or wait - it's declaration and definition which are out of
>> sync there, i.e. a pre-existing issue.
> 
> We don't usually add __init on both the declaration and definition. So 
> why would it be necessary to add __iomem in both cases?

__init is an attribute that is meaningful only for functions and
only on their definitions (because it controls what section the
code gets emitted to by the compiler, while it is of no interest
at all to any caller of the function, as far as the compiler is
concerned). __iomem, otoh, is a modifier for pointer types, so
doesn't apply to the function as a whole but to its return types.
Such types (when they're not NOP) need to be consistent between
declaration and definition. You can try this with a more or less
arbitrary (but valid) __attribute__(()) of your choice and with a
not overly old compiler - you should see it complain about such
inconsistencies.

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 07/12] mm: allow page scrubbing routine(s) to be arch controlled
  2021-05-27 13:58     ` Jan Beulich
@ 2021-06-03  9:39       ` Julien Grall
  2021-06-04 13:23         ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: Julien Grall @ 2021-06-03  9:39 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné,
	Andrew Cooper, George Dunlap, Ian Jackson, Stefano Stabellini,
	Wei Liu, xen-devel

Hi Jan,

On 27/05/2021 14:58, Jan Beulich wrote:
> On 27.05.2021 15:06, Julien Grall wrote:
>> On 27/05/2021 13:33, Jan Beulich wrote:
>>> Especially when dealing with large amounts of memory, memset() may not
>>> be very efficient; this can be bad enough that even for debug builds a
>>> custom function is warranted. We additionally want to distinguish "hot"
>>> and "cold" cases.
>>
>> Do you have any benchmark showing the performance improvement?
> 
> This is based on the numbers provided at
> https://lists.xen.org/archives/html/xen-devel/2021-04/msg00716.html
> with the thread with some of the prior discussion rooted at
> https://lists.xen.org/archives/html/xen-devel/2021-04/msg00425.html

Thanks for the pointer!

> I'm afraid I lack ideas on how to sensibly measure _all_ of the
> effects (i.e. including the amount of disturbing of caches).

I think it is quite important to provide some benchmark (or at least 
rationale) in the commit message.

We had a similar situation in the past (see the discussion [1]) where a 
commit message claimed it would improve the performance but in reality 
it also introduced a regression. Unfortunately, there is no easy way 
forward as the rationale is now forgotten...

>>> ---
>>> The choice between hot and cold in scrub_one_page()'s callers is
>>> certainly up for discussion / improvement.
>>
>> To get the discussion started, can you explain how you made the decision
>> between hot/cold? This will also want to be written down in the commit
>> message.
> 
> Well, the initial trivial heuristic is "allocation for oneself" vs
> "allocation for someone else, or freeing, or scrubbing", i.e. whether
> it would be likely that the page will soon be accessed again (or for
> the first time).
> 
>>> --- /dev/null
>>> +++ b/xen/arch/x86/scrub_page.S
>>> @@ -0,0 +1,41 @@
>>> +        .file __FILE__
>>> +
>>> +#include <asm/asm_defns.h>
>>> +#include <xen/page-size.h>
>>> +#include <xen/scrub.h>
>>> +
>>> +ENTRY(scrub_page_cold)
>>> +        mov     $PAGE_SIZE/32, %ecx
>>> +        mov     $SCRUB_PATTERN, %rax
>>> +
>>> +0:      movnti  %rax,   (%rdi)
>>> +        movnti  %rax,  8(%rdi)
>>> +        movnti  %rax, 16(%rdi)
>>> +        movnti  %rax, 24(%rdi)
>>> +        add     $32, %rdi
>>> +        sub     $1, %ecx
>>> +        jnz     0b
>>> +
>>> +        sfence
>>> +        ret
>>> +        .type scrub_page_cold, @function
>>> +        .size scrub_page_cold, . - scrub_page_cold
>>> +
>>> +        .macro scrub_page_stosb
>>> +        mov     $PAGE_SIZE, %ecx
>>> +        mov     $SCRUB_BYTE_PATTERN, %eax
>>> +        rep stosb
>>> +        ret
>>> +        .endm
>>> +
>>> +        .macro scrub_page_stosq
>>> +        mov     $PAGE_SIZE/8, %ecx
>>> +        mov     $SCRUB_PATTERN, %rax
>>> +        rep stosq
>>> +        ret
>>> +        .endm
>>> +
>>> +ENTRY(scrub_page_hot)
>>> +        ALTERNATIVE scrub_page_stosq, scrub_page_stosb, X86_FEATURE_ERMS
>>> +        .type scrub_page_hot, @function
>>> +        .size scrub_page_hot, . - scrub_page_hot
>>
>>   From the commit message, it is not clear how the implementation for
>> hot/cold was chosen. Can you outline in the commit message what the
>> assumptions for each helper are?
> 
> I've added 'The goal is for accesses of "cold" pages to not
> disturb caches (albeit finding a good balance between this
> and the higher latency looks to be difficult).'
> 
>>> @@ -1046,12 +1051,14 @@ static struct page_info *alloc_heap_page
>>>        if ( first_dirty != INVALID_DIRTY_IDX ||
>>>             (scrub_debug && !(memflags & MEMF_no_scrub)) )
>>>        {
>>> +        bool cold = d && d != current->domain;
>>
>> So the assumption is if the domain is not running, then the content is
>> not in the cache. Is that correct?
> 
> Not exactly: For one, instead of "not running" it is "is not the current
> domain", i.e. there may still be vCPU-s of the domain running elsewhere.
> And for the cache the question isn't so much of "is in cache", but to
> avoid needlessly bringing contents into the cache when the data is
> unlikely to be used again soon.

Ok. Can this be clarified in the commit message?

As to the approach itself, I'd like an ack from one of the x86 
maintainers to confirm that distinguishing cold vs hot pages is worth it.

Cheers,

[1] 
<de46590ad566d9be55b26eaca0bc4dc7fbbada59.1585063311.git.hongyxia@amazon.com>

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 07/12] mm: allow page scrubbing routine(s) to be arch controlled
  2021-06-03  9:39       ` Julien Grall
@ 2021-06-04 13:23         ` Jan Beulich
  2021-06-07 18:12           ` Julien Grall
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2021-06-04 13:23 UTC (permalink / raw)
  To: Julien Grall
  Cc: Roger Pau Monné,
	Andrew Cooper, George Dunlap, Ian Jackson, Stefano Stabellini,
	Wei Liu, xen-devel

On 03.06.2021 11:39, Julien Grall wrote:
> On 27/05/2021 14:58, Jan Beulich wrote:
>> On 27.05.2021 15:06, Julien Grall wrote:
>>> On 27/05/2021 13:33, Jan Beulich wrote:
>>>> @@ -1046,12 +1051,14 @@ static struct page_info *alloc_heap_page
>>>>        if ( first_dirty != INVALID_DIRTY_IDX ||
>>>>             (scrub_debug && !(memflags & MEMF_no_scrub)) )
>>>>        {
>>>> +        bool cold = d && d != current->domain;
>>>
>>> So the assumption is if the domain is not running, then the content is
>>> not in the cache. Is that correct?
>>
>> Not exactly: For one, instead of "not running" it is "is not the current
>> domain", i.e. there may still be vCPU-s of the domain running elsewhere.
>> And for the cache the question isn't so much of "is in cache", but to
>> avoid needlessly bringing contents into the cache when the data is
>> unlikely to be used again soon.
> 
> Ok. Can this be clarified in the commit message?

I had updated it already the other day to

"Especially when dealing with large amounts of memory, memset() may not
 be very efficient; this can be bad enough that even for debug builds a
 custom function is warranted. We additionally want to distinguish "hot"
 and "cold" cases (with, as initial heuristic, "hot" being for any
 allocations a domain does for itself, assuming that in all other cases
 the page wouldn't be accessed [again] soon). The goal is for accesses
 of "cold" pages to not disturb caches (albeit finding a good balance
 between this and the higher latency looks to be difficult)."

Is this good enough?

Jan



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 07/12] mm: allow page scrubbing routine(s) to be arch controlled
  2021-06-04 13:23         ` Jan Beulich
@ 2021-06-07 18:12           ` Julien Grall
  0 siblings, 0 replies; 32+ messages in thread
From: Julien Grall @ 2021-06-07 18:12 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné,
	Andrew Cooper, George Dunlap, Ian Jackson, Stefano Stabellini,
	Wei Liu, xen-devel

Hi Jan,

On 04/06/2021 14:23, Jan Beulich wrote:
> On 03.06.2021 11:39, Julien Grall wrote:
>> On 27/05/2021 14:58, Jan Beulich wrote:
>>> On 27.05.2021 15:06, Julien Grall wrote:
>>>> On 27/05/2021 13:33, Jan Beulich wrote:
>>>>> @@ -1046,12 +1051,14 @@ static struct page_info *alloc_heap_page
>>>>>         if ( first_dirty != INVALID_DIRTY_IDX ||
>>>>>              (scrub_debug && !(memflags & MEMF_no_scrub)) )
>>>>>         {
>>>>> +        bool cold = d && d != current->domain;
>>>>
>>>> So the assumption is if the domain is not running, then the content is
>>>> not in the cache. Is that correct?
>>>
>>> Not exactly: For one, instead of "not running" it is "is not the current
>>> domain", i.e. there may still be vCPU-s of the domain running elsewhere.
>>> And for the cache the question isn't so much of "is in cache", but to
>>> avoid needlessly bringing contents into the cache when the data is
>>> unlikely to be used again soon.
>>
>> Ok. Can this be clarified in the commit message?
> 
> I had updated it already the other day to
> 
> "Especially when dealing with large amounts of memory, memset() may not
>   be very efficient; this can be bad enough that even for debug builds a
>   custom function is warranted. We additionally want to distinguish "hot"
>   and "cold" cases (with, as initial heuristic, "hot" being for any
>   allocations a domain does for itself, assuming that in all other cases
>   the page wouldn't be accessed [again] soon). The goal is for accesses
>   of "cold" pages to not disturb caches (albeit finding a good balance
>   between this and the higher latency looks to be difficult)."
> 
> Is this good enough?

Yes. Thank you for proposing an update to the commit message!

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH RESEND v2] x86: introduce ioremap_wc()
  2021-05-27 12:29 [PATCH v2 00/12] x86: memcpy() / memset() (non-)ERMS flavors plus fallout Jan Beulich
                   ` (11 preceding siblings ...)
  2021-05-27 12:36 ` [PATCH v2 12/12] video/vesa: adjust (not just) command line option handling Jan Beulich
@ 2022-02-17 11:01 ` Jan Beulich
  2022-02-17 14:47   ` Roger Pau Monné
  12 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2022-02-17 11:01 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

In order for a to-be-introduced ERMS form of memcpy() to not regress
boot performance on certain systems when video output is active, we
first need to arrange for avoiding further dependency on firmware
setting up MTRRs in a way we can actually further modify. On many
systems, due to the continuously growing amounts of installed memory,
MTRRs get configured with at least one huge WB range, and with MMIO
ranges below 4Gb then forced to UC via overlapping MTRRs. mtrr_add(), as
it is today, can't deal with such a setup. Hence on such systems we
presently leave the frame buffer mapped UC, leading to significantly
reduced performance when using REP STOSB / REP MOVSB.

On post-PentiumII hardware (i.e. any that's capable of running 64-bit
code), an effective memory type of WC can be achieved without MTRRs, by
simply referencing the respective PAT entry from the PTEs. While this
will leave the switch to ERMS forms of memset() and memcpy() with
largely unchanged performance, the change here on its own improves
performance on affected systems quite significantly: Measuring just the
individual affected memcpy() invocations yielded a speedup by a factor
of over 250 on my initial (Skylake) test system. memset() isn't getting
improved by as much there, but still by a factor of about 20.

While adding {__,}PAGE_HYPERVISOR_WC, also add {__,}PAGE_HYPERVISOR_WT
to, at the very least, make clear what PTE flags this memory type uses.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
REPOST (in isolation) upon Roger's request. The header location change I
don't really consider a "re-base".

v2: Mark ioremap_wc() __init.
---
TBD: If the VGA range is WC in the fixed range MTRRs, reusing the low
     1st Mb mapping (like ioremap() does) would be an option.

--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -602,6 +602,8 @@ void destroy_perdomain_mapping(struct do
                                unsigned int nr);
 void free_perdomain_mappings(struct domain *);
 
+void __iomem *ioremap_wc(paddr_t, size_t);
+
 extern int memory_add(unsigned long spfn, unsigned long epfn, unsigned int pxm);
 
 void domain_set_alloc_bitsize(struct domain *d);
--- a/xen/arch/x86/include/asm/page.h
+++ b/xen/arch/x86/include/asm/page.h
@@ -349,8 +349,10 @@ void efi_update_l4_pgtable(unsigned int
 #define __PAGE_HYPERVISOR_RX      (_PAGE_PRESENT | _PAGE_ACCESSED)
 #define __PAGE_HYPERVISOR         (__PAGE_HYPERVISOR_RX | \
                                    _PAGE_DIRTY | _PAGE_RW)
+#define __PAGE_HYPERVISOR_WT      (__PAGE_HYPERVISOR | _PAGE_PWT)
 #define __PAGE_HYPERVISOR_UCMINUS (__PAGE_HYPERVISOR | _PAGE_PCD)
 #define __PAGE_HYPERVISOR_UC      (__PAGE_HYPERVISOR | _PAGE_PCD | _PAGE_PWT)
+#define __PAGE_HYPERVISOR_WC      (__PAGE_HYPERVISOR | _PAGE_PAT)
 #define __PAGE_HYPERVISOR_SHSTK   (__PAGE_HYPERVISOR_RO | _PAGE_DIRTY)
 
 #define MAP_SMALL_PAGES _PAGE_AVAIL0 /* don't use superpages mappings */
--- a/xen/arch/x86/include/asm/x86_64/page.h
+++ b/xen/arch/x86/include/asm/x86_64/page.h
@@ -152,6 +152,10 @@ static inline intpte_t put_pte_flags(uns
                                  _PAGE_GLOBAL | _PAGE_NX)
 #define PAGE_HYPERVISOR_UC      (__PAGE_HYPERVISOR_UC | \
                                  _PAGE_GLOBAL | _PAGE_NX)
+#define PAGE_HYPERVISOR_WC      (__PAGE_HYPERVISOR_WC | \
+                                 _PAGE_GLOBAL | _PAGE_NX)
+#define PAGE_HYPERVISOR_WT      (__PAGE_HYPERVISOR_WT | \
+                                 _PAGE_GLOBAL | _PAGE_NX)
 
 #endif /* __X86_64_PAGE_H__ */
 
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5895,6 +5895,20 @@ void __iomem *ioremap(paddr_t pa, size_t
     return (void __force __iomem *)va;
 }
 
+void __iomem *__init ioremap_wc(paddr_t pa, size_t len)
+{
+    mfn_t mfn = _mfn(PFN_DOWN(pa));
+    unsigned int offs = pa & (PAGE_SIZE - 1);
+    unsigned int nr = PFN_UP(offs + len);
+    void *va;
+
+    WARN_ON(page_is_ram_type(mfn_x(mfn), RAM_TYPE_CONVENTIONAL));
+
+    va = __vmap(&mfn, nr, 1, 1, PAGE_HYPERVISOR_WC, VMAP_DEFAULT);
+
+    return (void __force __iomem *)(va + offs);
+}
+
 int create_perdomain_mapping(struct domain *d, unsigned long va,
                              unsigned int nr, l1_pgentry_t **pl1tab,
                              struct page_info **ppg)
--- a/xen/drivers/video/vesa.c
+++ b/xen/drivers/video/vesa.c
@@ -9,9 +9,9 @@
 #include <xen/param.h>
 #include <xen/xmalloc.h>
 #include <xen/kernel.h>
+#include <xen/mm.h>
 #include <xen/vga.h>
 #include <asm/io.h>
-#include <asm/page.h>
 #include "font.h"
 #include "lfb.h"
 
@@ -103,7 +103,7 @@ void __init vesa_init(void)
     lfbp.text_columns = vlfb_info.width / font->width;
     lfbp.text_rows = vlfb_info.height / font->height;
 
-    lfbp.lfb = lfb = ioremap(lfb_base(), vram_remap);
+    lfbp.lfb = lfb = ioremap_wc(lfb_base(), vram_remap);
     if ( !lfb )
         return;
 
@@ -179,8 +179,7 @@ void __init vesa_mtrr_init(void)
 
 static void lfb_flush(void)
 {
-    if ( vesa_mtrr == 3 )
-        __asm__ __volatile__ ("sfence" : : : "memory");
+    __asm__ __volatile__ ("sfence" : : : "memory");
 }
 
 void __init vesa_endboot(bool_t keep)
--- a/xen/drivers/video/vga.c
+++ b/xen/drivers/video/vga.c
@@ -79,7 +79,7 @@ void __init video_init(void)
     {
     case XEN_VGATYPE_TEXT_MODE_3:
         if ( page_is_ram_type(paddr_to_pfn(0xB8000), RAM_TYPE_CONVENTIONAL) ||
-             ((video = ioremap(0xB8000, 0x8000)) == NULL) )
+             ((video = ioremap_wc(0xB8000, 0x8000)) == NULL) )
             return;
         outw(0x200a, 0x3d4); /* disable cursor */
         columns = vga_console_info.u.text_mode_3.columns;
@@ -164,7 +164,11 @@ void __init video_endboot(void)
     {
     case XEN_VGATYPE_TEXT_MODE_3:
         if ( !vgacon_keep )
+        {
             memset(video, 0, columns * lines * 2);
+            iounmap(video);
+            video = ZERO_BLOCK_PTR;
+        }
         break;
     case XEN_VGATYPE_VESA_LFB:
     case XEN_VGATYPE_EFI_LFB:



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RESEND v2] x86: introduce ioremap_wc()
  2022-02-17 11:01 ` [PATCH RESEND v2] x86: introduce ioremap_wc() Jan Beulich
@ 2022-02-17 14:47   ` Roger Pau Monné
  2022-02-17 15:02     ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: Roger Pau Monné @ 2022-02-17 14:47 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Thu, Feb 17, 2022 at 12:01:08PM +0100, Jan Beulich wrote:
> In order for a to-be-introduced ERMS form of memcpy() to not regress
> boot performance on certain systems when video output is active, we
> first need to arrange for avoiding further dependency on firmware
> setting up MTRRs in a way we can actually further modify. On many
> systems, due to the continuously growing amounts of installed memory,
> MTRRs get configured with at least one huge WB range, and with MMIO
> ranges below 4Gb then forced to UC via overlapping MTRRs. mtrr_add(), as
> it is today, can't deal with such a setup. Hence on such systems we
> presently leave the frame buffer mapped UC, leading to significantly
> reduced performance when using REP STOSB / REP MOVSB.
> 
> On post-PentiumII hardware (i.e. any that's capable of running 64-bit
> code), an effective memory type of WC can be achieved without MTRRs, by
> simply referencing the respective PAT entry from the PTEs. While this
> will leave the switch to ERMS forms of memset() and memcpy() with
> largely unchanged performance, the change here on its own improves
> performance on affected systems quite significantly: Measuring just the
> individual affected memcpy() invocations yielded a speedup by a factor
> of over 250 on my initial (Skylake) test system. memset() isn't getting
> improved by as much there, but still by a factor of about 20.
> 
> While adding {__,}PAGE_HYPERVISOR_WC, also add {__,}PAGE_HYPERVISOR_WT
> to, at the very least, make clear what PTE flags this memory type uses.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> REPOST (in isolation) upon Roger's request. The header location change I
> don't really consider a "re-base".
> 
> v2: Mark ioremap_wc() __init.
> ---
> TBD: If the VGA range is WC in the fixed range MTRRs, reusing the low
>      1st Mb mapping (like ioremap() does) would be an option.
> 
> --- a/xen/arch/x86/include/asm/mm.h
> +++ b/xen/arch/x86/include/asm/mm.h
> @@ -602,6 +602,8 @@ void destroy_perdomain_mapping(struct do
>                                 unsigned int nr);
>  void free_perdomain_mappings(struct domain *);
>  
> +void __iomem *ioremap_wc(paddr_t, size_t);
> +
>  extern int memory_add(unsigned long spfn, unsigned long epfn, unsigned int pxm);
>  
>  void domain_set_alloc_bitsize(struct domain *d);
> --- a/xen/arch/x86/include/asm/page.h
> +++ b/xen/arch/x86/include/asm/page.h
> @@ -349,8 +349,10 @@ void efi_update_l4_pgtable(unsigned int
>  #define __PAGE_HYPERVISOR_RX      (_PAGE_PRESENT | _PAGE_ACCESSED)
>  #define __PAGE_HYPERVISOR         (__PAGE_HYPERVISOR_RX | \
>                                     _PAGE_DIRTY | _PAGE_RW)
> +#define __PAGE_HYPERVISOR_WT      (__PAGE_HYPERVISOR | _PAGE_PWT)
>  #define __PAGE_HYPERVISOR_UCMINUS (__PAGE_HYPERVISOR | _PAGE_PCD)
>  #define __PAGE_HYPERVISOR_UC      (__PAGE_HYPERVISOR | _PAGE_PCD | _PAGE_PWT)
> +#define __PAGE_HYPERVISOR_WC      (__PAGE_HYPERVISOR | _PAGE_PAT)
>  #define __PAGE_HYPERVISOR_SHSTK   (__PAGE_HYPERVISOR_RO | _PAGE_DIRTY)
>  
>  #define MAP_SMALL_PAGES _PAGE_AVAIL0 /* don't use superpages mappings */
> --- a/xen/arch/x86/include/asm/x86_64/page.h
> +++ b/xen/arch/x86/include/asm/x86_64/page.h
> @@ -152,6 +152,10 @@ static inline intpte_t put_pte_flags(uns
>                                   _PAGE_GLOBAL | _PAGE_NX)
>  #define PAGE_HYPERVISOR_UC      (__PAGE_HYPERVISOR_UC | \
>                                   _PAGE_GLOBAL | _PAGE_NX)
> +#define PAGE_HYPERVISOR_WC      (__PAGE_HYPERVISOR_WC | \
> +                                 _PAGE_GLOBAL | _PAGE_NX)
> +#define PAGE_HYPERVISOR_WT      (__PAGE_HYPERVISOR_WT | \
> +                                 _PAGE_GLOBAL | _PAGE_NX)
>  
>  #endif /* __X86_64_PAGE_H__ */
>  
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -5895,6 +5895,20 @@ void __iomem *ioremap(paddr_t pa, size_t
>      return (void __force __iomem *)va;
>  }
>  
> +void __iomem *__init ioremap_wc(paddr_t pa, size_t len)
> +{
> +    mfn_t mfn = _mfn(PFN_DOWN(pa));
> +    unsigned int offs = pa & (PAGE_SIZE - 1);
> +    unsigned int nr = PFN_UP(offs + len);
> +    void *va;
> +
> +    WARN_ON(page_is_ram_type(mfn_x(mfn), RAM_TYPE_CONVENTIONAL));
> +
> +    va = __vmap(&mfn, nr, 1, 1, PAGE_HYPERVISOR_WC, VMAP_DEFAULT);
> +
> +    return (void __force __iomem *)(va + offs);
> +}
> +
>  int create_perdomain_mapping(struct domain *d, unsigned long va,
>                               unsigned int nr, l1_pgentry_t **pl1tab,
>                               struct page_info **ppg)
> --- a/xen/drivers/video/vesa.c
> +++ b/xen/drivers/video/vesa.c
> @@ -9,9 +9,9 @@
>  #include <xen/param.h>
>  #include <xen/xmalloc.h>
>  #include <xen/kernel.h>
> +#include <xen/mm.h>
>  #include <xen/vga.h>
>  #include <asm/io.h>
> -#include <asm/page.h>
>  #include "font.h"
>  #include "lfb.h"
>  
> @@ -103,7 +103,7 @@ void __init vesa_init(void)
>      lfbp.text_columns = vlfb_info.width / font->width;
>      lfbp.text_rows = vlfb_info.height / font->height;
>  
> -    lfbp.lfb = lfb = ioremap(lfb_base(), vram_remap);
> +    lfbp.lfb = lfb = ioremap_wc(lfb_base(), vram_remap);
>      if ( !lfb )
>          return;
>  
> @@ -179,8 +179,7 @@ void __init vesa_mtrr_init(void)
>  
>  static void lfb_flush(void)
>  {
> -    if ( vesa_mtrr == 3 )
> -        __asm__ __volatile__ ("sfence" : : : "memory");
> +    __asm__ __volatile__ ("sfence" : : : "memory");

Now that the cache attribute is forced to WC using PAT don't we need
to drop vesa_mtrr_init and vesa_mtrr? Even more so given that the
option is fully undocumented.

>  }
>  
>  void __init vesa_endboot(bool_t keep)
> --- a/xen/drivers/video/vga.c
> +++ b/xen/drivers/video/vga.c
> @@ -79,7 +79,7 @@ void __init video_init(void)
>      {
>      case XEN_VGATYPE_TEXT_MODE_3:
>          if ( page_is_ram_type(paddr_to_pfn(0xB8000), RAM_TYPE_CONVENTIONAL) ||
> -             ((video = ioremap(0xB8000, 0x8000)) == NULL) )
> +             ((video = ioremap_wc(0xB8000, 0x8000)) == NULL) )
>              return;
>          outw(0x200a, 0x3d4); /* disable cursor */
>          columns = vga_console_info.u.text_mode_3.columns;
> @@ -164,7 +164,11 @@ void __init video_endboot(void)
>      {
>      case XEN_VGATYPE_TEXT_MODE_3:
>          if ( !vgacon_keep )
> +        {
>              memset(video, 0, columns * lines * 2);
> +            iounmap(video);
> +            video = ZERO_BLOCK_PTR;
> +        }
>          break;
>      case XEN_VGATYPE_VESA_LFB:
>      case XEN_VGATYPE_EFI_LFB:

I think in vesa_endboot you also need to iounmap the framebuffer
iomem?

I would assume this was also required before your change, yet I'm not
finding any iounmap call that would do it.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RESEND v2] x86: introduce ioremap_wc()
  2022-02-17 14:47   ` Roger Pau Monné
@ 2022-02-17 15:02     ` Jan Beulich
  2022-02-17 15:50       ` Roger Pau Monné
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2022-02-17 15:02 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu

On 17.02.2022 15:47, Roger Pau Monné wrote:
> On Thu, Feb 17, 2022 at 12:01:08PM +0100, Jan Beulich wrote:
>> @@ -179,8 +179,7 @@ void __init vesa_mtrr_init(void)
>>  
>>  static void lfb_flush(void)
>>  {
>> -    if ( vesa_mtrr == 3 )
>> -        __asm__ __volatile__ ("sfence" : : : "memory");
>> +    __asm__ __volatile__ ("sfence" : : : "memory");
> 
> Now that the cache attribute is forced to WC using PAT don't we need
> to drop vesa_mtrr_init and vesa_mtrr? Even more so given that the
> option is fully undocumented.

Yes indeed. You did ask to re-send this patch in isolation. This removal
is part of the full series.

>> --- a/xen/drivers/video/vga.c
>> +++ b/xen/drivers/video/vga.c
>> @@ -79,7 +79,7 @@ void __init video_init(void)
>>      {
>>      case XEN_VGATYPE_TEXT_MODE_3:
>>          if ( page_is_ram_type(paddr_to_pfn(0xB8000), RAM_TYPE_CONVENTIONAL) ||
>> -             ((video = ioremap(0xB8000, 0x8000)) == NULL) )
>> +             ((video = ioremap_wc(0xB8000, 0x8000)) == NULL) )
>>              return;
>>          outw(0x200a, 0x3d4); /* disable cursor */
>>          columns = vga_console_info.u.text_mode_3.columns;
>> @@ -164,7 +164,11 @@ void __init video_endboot(void)
>>      {
>>      case XEN_VGATYPE_TEXT_MODE_3:
>>          if ( !vgacon_keep )
>> +        {
>>              memset(video, 0, columns * lines * 2);
>> +            iounmap(video);
>> +            video = ZERO_BLOCK_PTR;
>> +        }
>>          break;
>>      case XEN_VGATYPE_VESA_LFB:
>>      case XEN_VGATYPE_EFI_LFB:
> 
> I think in vesa_endboot you also need to iounmap the framebuffer
> iomem?

Again part of the full series. I guess I was a little inconsistent
with leaving the VGA unmap in here, but breaking out the VESA part.
It's been a long time, but I guess I did so because the VESA part
needs to touch two files.

> I would assume this was also required before your change, yet I'm not
> finding any iounmap call that would do it.

Indeed, this has been missing all the time.

Jan



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RESEND v2] x86: introduce ioremap_wc()
  2022-02-17 15:02     ` Jan Beulich
@ 2022-02-17 15:50       ` Roger Pau Monné
  2022-02-17 15:57         ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: Roger Pau Monné @ 2022-02-17 15:50 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Thu, Feb 17, 2022 at 04:02:39PM +0100, Jan Beulich wrote:
> On 17.02.2022 15:47, Roger Pau Monné wrote:
> > On Thu, Feb 17, 2022 at 12:01:08PM +0100, Jan Beulich wrote:
> >> @@ -179,8 +179,7 @@ void __init vesa_mtrr_init(void)
> >>  
> >>  static void lfb_flush(void)
> >>  {
> >> -    if ( vesa_mtrr == 3 )
> >> -        __asm__ __volatile__ ("sfence" : : : "memory");
> >> +    __asm__ __volatile__ ("sfence" : : : "memory");
> > 
> > Now that the cache attribute is forced to WC using PAT, don't we need
> > to drop vesa_mtrr_init() and vesa_mtrr? All the more so given that the
> > option is fully undocumented.
> 
> Yes indeed. You did ask to re-send this patch in isolation. This removal
> is part of the full series.
> 
> >> --- a/xen/drivers/video/vga.c
> >> +++ b/xen/drivers/video/vga.c
> >> @@ -79,7 +79,7 @@ void __init video_init(void)
> >>      {
> >>      case XEN_VGATYPE_TEXT_MODE_3:
> >>          if ( page_is_ram_type(paddr_to_pfn(0xB8000), RAM_TYPE_CONVENTIONAL) ||
> >> -             ((video = ioremap(0xB8000, 0x8000)) == NULL) )
> >> +             ((video = ioremap_wc(0xB8000, 0x8000)) == NULL) )
> >>              return;
> >>          outw(0x200a, 0x3d4); /* disable cursor */
> >>          columns = vga_console_info.u.text_mode_3.columns;
> >> @@ -164,7 +164,11 @@ void __init video_endboot(void)
> >>      {
> >>      case XEN_VGATYPE_TEXT_MODE_3:
> >>          if ( !vgacon_keep )
> >> +        {
> >>              memset(video, 0, columns * lines * 2);
> >> +            iounmap(video);
> >> +            video = ZERO_BLOCK_PTR;
> >> +        }
> >>          break;
> >>      case XEN_VGATYPE_VESA_LFB:
> >>      case XEN_VGATYPE_EFI_LFB:
> > 
> > I think in vesa_endboot you also need to iounmap the framebuffer
> > iomem?
> 
> Again part of the full series. I guess I was a little inconsistent
> with leaving the VGA unmap in here, but breaking out the VESA part.
> It's been a long time, but I guess I did so because the VESA part
> needs to touch two files.

I think you are hesitant to include the chunks for the above items? (or
maybe I'm not properly accounting for their complexity).

Thanks, Roger.



* Re: [PATCH RESEND v2] x86: introduce ioremap_wc()
  2022-02-17 15:50       ` Roger Pau Monné
@ 2022-02-17 15:57         ` Jan Beulich
  2022-02-18  9:09           ` Roger Pau Monné
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2022-02-17 15:57 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu

On 17.02.2022 16:50, Roger Pau Monné wrote:
> On Thu, Feb 17, 2022 at 04:02:39PM +0100, Jan Beulich wrote:
>> On 17.02.2022 15:47, Roger Pau Monné wrote:
>>> On Thu, Feb 17, 2022 at 12:01:08PM +0100, Jan Beulich wrote:
>>>> --- a/xen/drivers/video/vga.c
>>>> +++ b/xen/drivers/video/vga.c
>>>> @@ -79,7 +79,7 @@ void __init video_init(void)
>>>>      {
>>>>      case XEN_VGATYPE_TEXT_MODE_3:
>>>>          if ( page_is_ram_type(paddr_to_pfn(0xB8000), RAM_TYPE_CONVENTIONAL) ||
>>>> -             ((video = ioremap(0xB8000, 0x8000)) == NULL) )
>>>> +             ((video = ioremap_wc(0xB8000, 0x8000)) == NULL) )
>>>>              return;
>>>>          outw(0x200a, 0x3d4); /* disable cursor */
>>>>          columns = vga_console_info.u.text_mode_3.columns;
>>>> @@ -164,7 +164,11 @@ void __init video_endboot(void)
>>>>      {
>>>>      case XEN_VGATYPE_TEXT_MODE_3:
>>>>          if ( !vgacon_keep )
>>>> +        {
>>>>              memset(video, 0, columns * lines * 2);
>>>> +            iounmap(video);
>>>> +            video = ZERO_BLOCK_PTR;
>>>> +        }
>>>>          break;
>>>>      case XEN_VGATYPE_VESA_LFB:
>>>>      case XEN_VGATYPE_EFI_LFB:
>>>
>>> I think in vesa_endboot you also need to iounmap the framebuffer
>>> iomem?
>>
>> Again part of the full series. I guess I was a little inconsistent
>> with leaving the VGA unmap in here, but breaking out the VESA part.
>> It's been a long time, but I guess I did so because the VESA part
>> needs to touch two files.
> 
> I think you are hesitant to include the chunks for the above items? (or
> maybe I'm not properly accounting for their complexity).

There's no complexity; it's really just that the zapping of the pointer
needs to be done in a different place from where the unmap is. See below.

Jan

video/vesa: unmap frame buffer when relinquishing console

There's no point in keeping the VA space occupied when no further output
will occur.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- unstable.orig/xen/drivers/video/lfb.c
+++ unstable/xen/drivers/video/lfb.c
@@ -168,4 +168,5 @@ void lfb_free(void)
     xfree(lfb.lbuf);
     xfree(lfb.text_buf);
     xfree(lfb.line_len);
+    lfb.lfbp.lfb = ZERO_BLOCK_PTR;
 }
--- unstable.orig/xen/drivers/video/vesa.c
+++ unstable/xen/drivers/video/vesa.c
@@ -197,5 +197,7 @@ void __init vesa_endboot(bool_t keep)
                    vlfb_info.width * bpp);
         lfb_flush();
         lfb_free();
+        iounmap(lfb);
+        lfb = ZERO_BLOCK_PTR;
     }
 }




* Re: [PATCH RESEND v2] x86: introduce ioremap_wc()
  2022-02-17 15:57         ` Jan Beulich
@ 2022-02-18  9:09           ` Roger Pau Monné
  2022-02-18  9:23             ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: Roger Pau Monné @ 2022-02-18  9:09 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Thu, Feb 17, 2022 at 04:57:41PM +0100, Jan Beulich wrote:
> On 17.02.2022 16:50, Roger Pau Monné wrote:
> > On Thu, Feb 17, 2022 at 04:02:39PM +0100, Jan Beulich wrote:
> >> On 17.02.2022 15:47, Roger Pau Monné wrote:
> >>> On Thu, Feb 17, 2022 at 12:01:08PM +0100, Jan Beulich wrote:
> >>>> --- a/xen/drivers/video/vga.c
> >>>> +++ b/xen/drivers/video/vga.c
> >>>> @@ -79,7 +79,7 @@ void __init video_init(void)
> >>>>      {
> >>>>      case XEN_VGATYPE_TEXT_MODE_3:
> >>>>          if ( page_is_ram_type(paddr_to_pfn(0xB8000), RAM_TYPE_CONVENTIONAL) ||
> >>>> -             ((video = ioremap(0xB8000, 0x8000)) == NULL) )
> >>>> +             ((video = ioremap_wc(0xB8000, 0x8000)) == NULL) )
> >>>>              return;
> >>>>          outw(0x200a, 0x3d4); /* disable cursor */
> >>>>          columns = vga_console_info.u.text_mode_3.columns;
> >>>> @@ -164,7 +164,11 @@ void __init video_endboot(void)
> >>>>      {
> >>>>      case XEN_VGATYPE_TEXT_MODE_3:
> >>>>          if ( !vgacon_keep )
> >>>> +        {
> >>>>              memset(video, 0, columns * lines * 2);
> >>>> +            iounmap(video);
> >>>> +            video = ZERO_BLOCK_PTR;
> >>>> +        }
> >>>>          break;
> >>>>      case XEN_VGATYPE_VESA_LFB:
> >>>>      case XEN_VGATYPE_EFI_LFB:
> >>>
> >>> I think in vesa_endboot you also need to iounmap the framebuffer
> >>> iomem?
> >>
> >> Again part of the full series. I guess I was a little inconsistent
> >> with leaving the VGA unmap in here, but breaking out the VESA part.
> >> It's been a long time, but I guess I did so because the VESA part
> >> needs to touch two files.
> > 
> > I think you are hesitant to include the chunks for the above items? (or
> > maybe I'm not properly accounting for their complexity).
> 
> There's no complexity; it's really just that the zapping of the pointer
> needs to be done in a different place from where the unmap is. See below.
> 
> Jan
> 
> video/vesa: unmap frame buffer when relinquishing console
> 
> There's no point in keeping the VA space occupied when no further output
> will occur.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

For both patches, the one inline here and "x86: introduce
ioremap_wc()".

While at it, I think you should also push "video/vesa: drop
"vesa-mtrr" command line option".

Thanks, Roger.



* Re: [PATCH RESEND v2] x86: introduce ioremap_wc()
  2022-02-18  9:09           ` Roger Pau Monné
@ 2022-02-18  9:23             ` Jan Beulich
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2022-02-18  9:23 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu

On 18.02.2022 10:09, Roger Pau Monné wrote:
> On Thu, Feb 17, 2022 at 04:57:41PM +0100, Jan Beulich wrote:
>> On 17.02.2022 16:50, Roger Pau Monné wrote:
>>> On Thu, Feb 17, 2022 at 04:02:39PM +0100, Jan Beulich wrote:
>>>> On 17.02.2022 15:47, Roger Pau Monné wrote:
>>>>> On Thu, Feb 17, 2022 at 12:01:08PM +0100, Jan Beulich wrote:
>>>>>> --- a/xen/drivers/video/vga.c
>>>>>> +++ b/xen/drivers/video/vga.c
>>>>>> @@ -79,7 +79,7 @@ void __init video_init(void)
>>>>>>      {
>>>>>>      case XEN_VGATYPE_TEXT_MODE_3:
>>>>>>          if ( page_is_ram_type(paddr_to_pfn(0xB8000), RAM_TYPE_CONVENTIONAL) ||
>>>>>> -             ((video = ioremap(0xB8000, 0x8000)) == NULL) )
>>>>>> +             ((video = ioremap_wc(0xB8000, 0x8000)) == NULL) )
>>>>>>              return;
>>>>>>          outw(0x200a, 0x3d4); /* disable cursor */
>>>>>>          columns = vga_console_info.u.text_mode_3.columns;
>>>>>> @@ -164,7 +164,11 @@ void __init video_endboot(void)
>>>>>>      {
>>>>>>      case XEN_VGATYPE_TEXT_MODE_3:
>>>>>>          if ( !vgacon_keep )
>>>>>> +        {
>>>>>>              memset(video, 0, columns * lines * 2);
>>>>>> +            iounmap(video);
>>>>>> +            video = ZERO_BLOCK_PTR;
>>>>>> +        }
>>>>>>          break;
>>>>>>      case XEN_VGATYPE_VESA_LFB:
>>>>>>      case XEN_VGATYPE_EFI_LFB:
>>>>>
>>>>> I think in vesa_endboot you also need to iounmap the framebuffer
>>>>> iomem?
>>>>
>>>> Again part of the full series. I guess I was a little inconsistent
>>>> with leaving the VGA unmap in here, but breaking out the VESA part.
>>>> It's been a long time, but I guess I did so because the VESA part
>>>> needs to touch two files.
>>>
>>> I think you are hesitant to include the chunks for the above items? (or
>>> maybe I'm not properly accounting for their complexity).
>>
>> There's no complexity; it's really just that the zapping of the pointer
>> needs to be done in a different place from where the unmap is. See below.
>>
>> Jan
>>
>> video/vesa: unmap frame buffer when relinquishing console
>>
>> There's no point in keeping the VA space occupied when no further output
>> will occur.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> For both patches, the one inline here and "x86: introduce
> ioremap_wc()".

Thanks. Actually, while looking back at the original thread, to re-check
what pending objections there might have been, I did find the reason for
the split: In the patch here I would have introduced another leak, while
the other patch fixes an existing one.

> While at it, I think you should also push "video/vesa: drop
> "vesa-mtrr" command line option".

Yes, that one's merely dependent on the one here.

Jan




* Re: [PATCH v2 08/12] x86: move .text.kexec
  2021-05-27 12:34 ` [PATCH v2 08/12] x86: move .text.kexec Jan Beulich
@ 2022-02-18 13:34   ` Andrew Cooper
  0 siblings, 0 replies; 32+ messages in thread
From: Andrew Cooper @ 2022-02-18 13:34 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Wei Liu, Roger Pau Monne

On 27/05/2021 13:34, Jan Beulich wrote:
> The source file requests page alignment - avoid a padding hole by
> placing it right after .text.entry. On average this yields a .text size
> reduction of 2k.
>
> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

I'll rebase my kexec metadata patch over this.

~Andrew


* Re: [PATCH v2 11/12] video/vesa: drop "vesa-remap" command line option
  2021-05-27 12:35 ` [PATCH v2 11/12] video/vesa: drop "vesa-remap" " Jan Beulich
@ 2022-02-18 13:35   ` Andrew Cooper
  0 siblings, 0 replies; 32+ messages in thread
From: Andrew Cooper @ 2022-02-18 13:35 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Wei Liu, Roger Pau Monne

On 27/05/2021 13:35, Jan Beulich wrote:
> If we get mode dimensions wrong, having the remapping size controllable
> via command line option isn't going to help much. Drop the option.
>
> While adjusting this also
> - add __initdata to the variable,
> - use ROUNDUP() instead of open-coding it.
>
> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v2 09/12] video/vesa: unmap frame buffer when relinquishing console
  2021-05-27 12:34 ` [PATCH v2 09/12] video/vesa: unmap frame buffer when relinquishing console Jan Beulich
@ 2022-02-18 13:36   ` Andrew Cooper
  0 siblings, 0 replies; 32+ messages in thread
From: Andrew Cooper @ 2022-02-18 13:36 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Wei Liu, Roger Pau Monne

On 27/05/2021 13:34, Jan Beulich wrote:
> There's no point in keeping the VA space occupied when no further output
> will occur.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


end of thread, other threads:[~2022-02-18 13:36 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-27 12:29 [PATCH v2 00/12] x86: memcpy() / memset() (non-)ERMS flavors plus fallout Jan Beulich
2021-05-27 12:30 ` [PATCH v2 01/12] x86: introduce ioremap_wc() Jan Beulich
2021-05-27 12:48   ` Julien Grall
2021-05-27 13:09     ` Jan Beulich
2021-05-27 13:30       ` Julien Grall
2021-05-27 14:57         ` Jan Beulich
2021-05-27 12:31 ` [PATCH v2 02/12] x86: re-work memset() Jan Beulich
2021-05-27 12:31 ` [PATCH v2 03/12] x86: re-work memcpy() Jan Beulich
2021-05-27 12:31 ` [PATCH v2 04/12] x86: control memset() and memcpy() inlining Jan Beulich
2021-05-27 12:32 ` [PATCH v2 05/12] x86: introduce "hot" and "cold" page clearing functions Jan Beulich
2021-05-27 12:32 ` [PATCH v2 06/12] page-alloc: make scrub_on_page() static Jan Beulich
2021-05-27 12:33 ` [PATCH v2 07/12] mm: allow page scrubbing routine(s) to be arch controlled Jan Beulich
2021-05-27 13:06   ` Julien Grall
2021-05-27 13:58     ` Jan Beulich
2021-06-03  9:39       ` Julien Grall
2021-06-04 13:23         ` Jan Beulich
2021-06-07 18:12           ` Julien Grall
2021-05-27 12:34 ` [PATCH v2 08/12] x86: move .text.kexec Jan Beulich
2022-02-18 13:34   ` Andrew Cooper
2021-05-27 12:34 ` [PATCH v2 09/12] video/vesa: unmap frame buffer when relinquishing console Jan Beulich
2022-02-18 13:36   ` Andrew Cooper
2021-05-27 12:35 ` [PATCH v2 10/12] video/vesa: drop "vesa-mtrr" command line option Jan Beulich
2021-05-27 12:35 ` [PATCH v2 11/12] video/vesa: drop "vesa-remap" " Jan Beulich
2022-02-18 13:35   ` Andrew Cooper
2021-05-27 12:36 ` [PATCH v2 12/12] video/vesa: adjust (not just) command line option handling Jan Beulich
2022-02-17 11:01 ` [PATCH RESEND v2] x86: introduce ioremap_wc() Jan Beulich
2022-02-17 14:47   ` Roger Pau Monné
2022-02-17 15:02     ` Jan Beulich
2022-02-17 15:50       ` Roger Pau Monné
2022-02-17 15:57         ` Jan Beulich
2022-02-18  9:09           ` Roger Pau Monné
2022-02-18  9:23             ` Jan Beulich
