Re: [PATCH 2/7] x86: introduce ioremap_wc()

From: Jan Beulich <jbeulich@suse.com>
To: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: "Wei Liu" <wl@xen.org>, "Roger Pau Monné" <roger.pau@citrix.com>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
Subject: Re: [PATCH 2/7] x86: introduce ioremap_wc()
Date: Wed, 28 Apr 2021 11:41:31 +0200	[thread overview]
Message-ID: <5a2c3037-67b3-7e6e-9080-1a7a97a14ffb@suse.com> (raw)
In-Reply-To: <8cb67c2c-d5f9-a72b-d0d3-9289f6c9b512@citrix.com>

On 27.04.2021 19:13, Andrew Cooper wrote:
> On 27/04/2021 13:54, Jan Beulich wrote:
>> In order for a to-be-introduced ERMS form of memcpy() to not regress
>> boot performance on certain systems when video output is active, we
>> first need to arrange for avoiding further dependency on firmware
>> setting up MTRRs in a way we can actually further modify. On many
>> systems, due to the continuously growing amounts of installed memory,
>> MTRRs get configured with at least one huge WB range, and with MMIO
>> ranges below 4Gb then forced to UC via overlapping MTRRs. mtrr_add(), as
>> it is today, can't deal with such a setup. Hence on such systems we
>> presently leave the frame buffer mapped UC, leading to significantly
>> reduced performance when using REP STOSB / REP MOVSB.
>>
>> On post-PentiumII hardware (i.e. any that's capable of running 64-bit
>> code), an effective memory type of WC can be achieved without MTRRs, by
>> simply referencing the respective PAT entry from the PTEs. While this
>> will leave the switch to ERMS forms of memset() and memcpy() with
>> largely unchanged performance, the change here on its own improves
>> performance on affected systems quite significantly: Measuring just the
>> individual affected memcpy() invocations yielded a speedup by a factor
>> of over 250 on my initial (Skylake) test system. memset() isn't getting
>> improved by as much there, but still by a factor of about 20.
>>
>> While adding {__,}PAGE_HYPERVISOR_WC, also add {__,}PAGE_HYPERVISOR_WT
>> to, at the very least, make clear what PTE flags this memory type uses.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
> 
> Seeing as MTRRs are full of firmware issues, shouldn't we also
> cross-check that the vram is marked WC, or we'll still fall into bad
> perf from combining down to UC.  (Obviously follow-up work if so.)

VRAM generally isn't marked WC nowadays, at least from my observations.
And it doesn't need to be - there's no "combining down to UC" when the
WC PAT entry is chosen by the PTE. Plus firmware doing so may actually
be counter-productive, as the VRAM address range may easily move when
BARs get re-assigned by an OS.

>> TBD: Both callers are __init, so in principle ioremap_wc() could be so,
>>      too, at least for the time being.
> 
> I don't see us making use this at runtime.  Uses of WC are few and far
> between.

Okay, will mark the function __init then.

>> TBD: If the VGA range is WC in the fixed range MTRRs, reusing the low
>>      1st Mb mapping (like ioremap() does) would be an option.
> 
> It might be fine to do that unconditionally.  The low VRAM has had known
> settings for 2 decades now.
> 
> That said, the low 1MB does use UC- mappings, which means we're entirely
> dependent on MTRRs specifying WC for sensible performance.

The two parts of your reply look to contradict one another: I've been
suggesting to use the low-Mb mapping precisely only when the fixed
range MTRRs covering the range are all saying WC.

>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -5883,6 +5883,20 @@ void __iomem *ioremap(paddr_t pa, size_t
>>      return (void __force __iomem *)va;
>>  }
>>  
>> +void __iomem *ioremap_wc(paddr_t pa, size_t len)
>> +{
>> +    mfn_t mfn = _mfn(PFN_DOWN(pa));
>> +    unsigned int offs = pa & (PAGE_SIZE - 1);
>> +    unsigned int nr = PFN_UP(offs + len);
>> +    void *va;
>> +
>> +    WARN_ON(page_is_ram_type(mfn_x(mfn), RAM_TYPE_CONVENTIONAL));
>> +
>> +    va = __vmap(&mfn, nr, 1, 1, PAGE_HYPERVISOR_WC, VMAP_DEFAULT);
> 
> This doesn't look correct.  granularity and nr are passed the wrong way
> around, but maybe that's related to the fact that only a single mfn is
> passed.  I'm confused.

It's identical to ioremap(). And yes, it's this way because of there
just being a single MFN getting passed in here. If I flipped the 2nd
and 3rd arguments, nr calls to map_pages_to_xen() would result
instead of just one.

> Also, several truncations will occur for a framebuffer > 4G, both with
> calculations here, and the types of __vmap()'s parameters.

Truncations may occur, indeed, but for frame buffers >= 16Tb. I
didn't think I needed to worry about this more here than ioremap()
does.

>> @@ -179,8 +179,7 @@ void __init vesa_mtrr_init(void)
>>  
>>  static void lfb_flush(void)
>>  {
>> -    if ( vesa_mtrr == 3 )
>> -        __asm__ __volatile__ ("sfence" : : : "memory");
>> +    __asm__ __volatile__ ("sfence" : : : "memory");
> 
> wmb(), seeing as that is the operation we mean here?

Not sure, to be honest. It's not been all this long ago that
smp_wmb() and wmb() were the same. And it's also not a write
barrier we mean here, but specifically SFENCE (which merely
happens to be the insn of choice for implementing wmb()). We
don't care about having a full-fledged write barrier here -
other writes getting reordered would in principle be fine.

>> --- a/xen/drivers/video/vga.c
>> +++ b/xen/drivers/video/vga.c
>> @@ -79,7 +79,7 @@ void __init video_init(void)
>>      {
>>      case XEN_VGATYPE_TEXT_MODE_3:
>>          if ( page_is_ram_type(paddr_to_pfn(0xB8000), RAM_TYPE_CONVENTIONAL) ||
>> -             ((video = ioremap(0xB8000, 0x8000)) == NULL) )
>> +             ((video = ioremap_wc(0xB8000, 0x8000)) == NULL) )
>>              return;
>>          outw(0x200a, 0x3d4); /* disable cursor */
>>          columns = vga_console_info.u.text_mode_3.columns;
>> @@ -164,7 +164,11 @@ void __init video_endboot(void)
>>      {
>>      case XEN_VGATYPE_TEXT_MODE_3:
>>          if ( !vgacon_keep )
>> +        {
>>              memset(video, 0, columns * lines * 2);
>> +            iounmap(video);
>> +            video = ZERO_BLOCK_PTR;
>> +        }
>>          break;
> 
> Shouldn't this hunk be in patch 5?

Only if ioremap_wc() unconditionally re-used the low-Mb mapping;
I don't want to introduce another leak, paralleling the one which
patch 5 addresses. And as per above I see ioremap_wc() use __va()
at best conditionally.

>> --- a/xen/include/asm-x86/mm.h
>> +++ b/xen/include/asm-x86/mm.h
>> @@ -615,6 +615,8 @@ void destroy_perdomain_mapping(struct do
>>                                 unsigned int nr);
>>  void free_perdomain_mappings(struct domain *);
>>  
>> +void __iomem *ioremap_wc(paddr_t, size_t);
> 
> I'm not sure we want to add a second prototype.  ARM has ioremap_wc()
> too, and we absolutely don't want them to get out of sync, and we have
> two new architectures on the horizon.
> 
> Perhaps a new xen/ioremap.h which includes asm/ioremap.h  (although
> thinking forward to encrypted RAM, we might want something which can
> also encompass the memremap*() variants.)
> 
> ARM can #define ioremap_wc ioremap_wc and provide their inline wrapper. 
> x86 can fall back to the common prototype.  Other architectures can do
> whatever is best for them.

I'm afraid I don't see how what you suggest would not lead to
duplicate prototypes. Arm wanting the function be an inline
one precludes exposing the extern declaration there. Hence
there will still be a risk of things going out of sync. In
fact I first had the declaration next to ioremap()'s, until I
ran into the build issue on Arm.

Jan