All of lore.kernel.org
 help / color / mirror / Atom feed
* XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC
@ 2016-08-17 20:32 ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 16+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-08-17 20:32 UTC (permalink / raw)
  To: xen-devel, chuck.anderson, Boris Ostrovsky, david.vrabel, jgross,
	jbeulich, stefan.bader
  Cc: linux-kernel

Hey Jan, et. al.,

One of the interesting things about XSA 154 fix ("x86: enforce consistent
cachability of MMIO mappings") is that when certain applications (mcelog)
are trying to map /dev/mmap and lurk in ISA regions - we get:

[   49.399053] WARNING: CPU: 0 PID: 2471 at arch/x86/mm/pat.c:913 untrack_pfn+0x93/0xc0()
[   49.399055] Modules linked in: bnx2fc fcoe libfcoe libfc 8021q mrp garp stp llc bonding dm_multipath vfat fat iTCO_wdt iTCO_vendor_support pcspkr ipmi_devintf ipmi_si ipmi_msghandler sb_edac edac_core i2c_i801 i2c_core lpc_ich mfd_core shpchp ioatdma sg ext4 jbd2 mbcache sr_mod cdrom sd_mod usb_storage ahci libahci megaraid_sas qla2xxx scsi_transport_fc crc32c_intel be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi ipv6 cxgb3 libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi ixgbe dca ptp pps_core vxlan udp_tunnel ip6_udp_tunnel mdio dm_mirror dm_region_hash dm_log dm_mod
[   49.399131] CPU: 0 PID: 2471 Comm: mcelog Not tainted 4.1
[   49.399134] Hardware name: Oracle Corporation SUN SERVER X4-2       /ASSY,MB,X4-2, 1U      , BIOS 25030100 04/15/2015
[   49.399138]  0000000000000000 ffff880074673c28 ffffffff816c66f0 0000000000000000
[   49.399143]  0000000000000391 ffff880074673c68 ffffffff81084745 ffff880074673c78
[   49.399148]  ffff88014b625db0 0000000000000000 ffff880074673d58 00007f39290ab000
[   49.399152] Call Trace:
[   49.399166]  [<ffffffff816c66f0>] dump_stack+0x63/0x83
[   49.399175]  [<ffffffff81084745>] warn_slowpath_common+0x95/0xe0
[   49.399180]  [<ffffffff810847aa>] warn_slowpath_null+0x1a/0x20
[   49.399183]  [<ffffffff810725f3>] untrack_pfn+0x93/0xc0
[   49.399190]  [<ffffffff811b90f9>] unmap_single_vma+0xa9/0x100
[   49.399194]  [<ffffffff811b9644>] unmap_vmas+0x54/0xa0
[   49.399199]  [<ffffffff811bf0da>] exit_mmap+0x9a/0x150
[   49.399204]  [<ffffffff810825d3>] mmput+0x73/0x110
[   49.399208]  [<ffffffff81082775>] dup_mm+0x105/0x110
[   49.399213]  [<ffffffff81083b1d>] copy_process+0x11ed/0x1240
[   49.399218]  [<ffffffff81084009>] do_fork+0x79/0x280
[   49.399226]  [<ffffffff810259d3>] ? syscall_trace_enter_phase1+0x153/0x180
[   49.399231]  [<ffffffff81084226>] SyS_clone+0x16/0x20
[   49.399235]  [<ffffffff816cb3ee>] system_call_fastpath+0x12/0x71
[   49.399239] ---[ end trace a61cd3d271a53a54 ]---

The reason is that Linux kernel assumes that the range from 640KB -> 1MB can
be mapped as write-back (see is_new_memtype_allowed and x86_platform.is_untracked_pat_range).
But we enforce the uncached mode and Linux complains.

With the mmio-relax=1, Linux gets its way and is happy.

With the patch below:

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 70a38c1..e5ff5a5 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -288,6 +287,12 @@ static void __init xen_banner(void)
 	       version >> 16, version & 0xffff, extra.extraversion,
 	       xen_feature(XENFEAT_mmu_pt_update_preserve_ad) ? " (preserve-AD)" : "");
 }
+
+static bool xen_ignore(u64 s, u64 e)
+{
+	return false;
+}
+
 /* Check if running on Xen version (major, minor) or later */
 bool
 xen_running_on_version_or_later(unsigned int major, unsigned int minor)
@@ -1563,7 +1570,7 @@ asmlinkage __visible void __init xen_start_kernel(void)
 		x86_init.resources.memory_setup = xen_memory_setup;
 	x86_init.oem.arch_setup = xen_arch_setup;
 	x86_init.oem.banner = xen_banner;
-
+	x86_platform.is_untracked_pat_range = xen_ignore;
 	xen_init_time_ops();
 
 	/*

Things work much better - as we don't treat the 640KB->1MB region specially.

Anyhow what I am wondering:

 a) Should we add a edge case in the hypervisor to allow multiple mappings
   for this region? I am thinking no.. but it sounds like mapping ISA region
   as WB is safe even in baremetal?

 b) Or would it be better to let Linux do its thing and treat 640KB->1MB
   as uncached instead of writeback?

   Looking at the kernel it assumes that WB is ok for 640KB->1MB.
   The comment says:
   " /* Low ISA region is always mapped WB in page table. No need to track *"

   which is probably true on baremetal. But with Xen PV:

 856         /*                                                                      
 857          * In domU, the ISA region is normal, usable memory, but we             
 858          * reserve ISA memory anyway because too many things poke               
 859          * about in there.                                                      
 860          */                                                                     
 861         e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS, 
 862                         E820_RESERVED);                                 

   which would imply we don't have any page table mappings.

   And then the quick fix I provided above looks like the right solution?


CC-ing Boris, Daniel, Juergen, Steve, and Chuck.

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC
@ 2016-08-17 20:32 ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 16+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-08-17 20:32 UTC (permalink / raw)
  To: xen-devel, chuck.anderson, Boris Ostrovsky, david.vrabel, jgross,
	jbeulich, stefan.bader
  Cc: linux-kernel

Hey Jan, et. al.,

One of the interesting things about XSA 154 fix ("x86: enforce consistent
cachability of MMIO mappings") is that when certain applications (mcelog)
are trying to map /dev/mmap and lurk in ISA regions - we get:

[   49.399053] WARNING: CPU: 0 PID: 2471 at arch/x86/mm/pat.c:913 untrack_pfn+0x93/0xc0()
[   49.399055] Modules linked in: bnx2fc fcoe libfcoe libfc 8021q mrp garp stp llc bonding dm_multipath vfat fat iTCO_wdt iTCO_vendor_support pcspkr ipmi_devintf ipmi_si ipmi_msghandler sb_edac edac_core i2c_i801 i2c_core lpc_ich mfd_core shpchp ioatdma sg ext4 jbd2 mbcache sr_mod cdrom sd_mod usb_storage ahci libahci megaraid_sas qla2xxx scsi_transport_fc crc32c_intel be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi ipv6 cxgb3 libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi ixgbe dca ptp pps_core vxlan udp_tunnel ip6_udp_tunnel mdio dm_mirror dm_region_hash dm_log dm_mod
[   49.399131] CPU: 0 PID: 2471 Comm: mcelog Not tainted 4.1
[   49.399134] Hardware name: Oracle Corporation SUN SERVER X4-2       /ASSY,MB,X4-2, 1U      , BIOS 25030100 04/15/2015
[   49.399138]  0000000000000000 ffff880074673c28 ffffffff816c66f0 0000000000000000
[   49.399143]  0000000000000391 ffff880074673c68 ffffffff81084745 ffff880074673c78
[   49.399148]  ffff88014b625db0 0000000000000000 ffff880074673d58 00007f39290ab000
[   49.399152] Call Trace:
[   49.399166]  [<ffffffff816c66f0>] dump_stack+0x63/0x83
[   49.399175]  [<ffffffff81084745>] warn_slowpath_common+0x95/0xe0
[   49.399180]  [<ffffffff810847aa>] warn_slowpath_null+0x1a/0x20
[   49.399183]  [<ffffffff810725f3>] untrack_pfn+0x93/0xc0
[   49.399190]  [<ffffffff811b90f9>] unmap_single_vma+0xa9/0x100
[   49.399194]  [<ffffffff811b9644>] unmap_vmas+0x54/0xa0
[   49.399199]  [<ffffffff811bf0da>] exit_mmap+0x9a/0x150
[   49.399204]  [<ffffffff810825d3>] mmput+0x73/0x110
[   49.399208]  [<ffffffff81082775>] dup_mm+0x105/0x110
[   49.399213]  [<ffffffff81083b1d>] copy_process+0x11ed/0x1240
[   49.399218]  [<ffffffff81084009>] do_fork+0x79/0x280
[   49.399226]  [<ffffffff810259d3>] ? syscall_trace_enter_phase1+0x153/0x180
[   49.399231]  [<ffffffff81084226>] SyS_clone+0x16/0x20
[   49.399235]  [<ffffffff816cb3ee>] system_call_fastpath+0x12/0x71
[   49.399239] ---[ end trace a61cd3d271a53a54 ]---

The reason is that Linux kernel assumes that the range from 640KB -> 1MB can
be mapped as write-back (see is_new_memtype_allowed and x86_platform.is_untracked_pat_range).
But we enforce the uncached mode and Linux complains.

With the mmio-relax=1, Linux gets its way and is happy.

With the patch below:

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 70a38c1..e5ff5a5 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -288,6 +287,12 @@ static void __init xen_banner(void)
 	       version >> 16, version & 0xffff, extra.extraversion,
 	       xen_feature(XENFEAT_mmu_pt_update_preserve_ad) ? " (preserve-AD)" : "");
 }
+
+static bool xen_ignore(u64 s, u64 e)
+{
+	return false;
+}
+
 /* Check if running on Xen version (major, minor) or later */
 bool
 xen_running_on_version_or_later(unsigned int major, unsigned int minor)
@@ -1563,7 +1570,7 @@ asmlinkage __visible void __init xen_start_kernel(void)
 		x86_init.resources.memory_setup = xen_memory_setup;
 	x86_init.oem.arch_setup = xen_arch_setup;
 	x86_init.oem.banner = xen_banner;
-
+	x86_platform.is_untracked_pat_range = xen_ignore;
 	xen_init_time_ops();
 
 	/*

Things work much better - as we don't treat the 640KB->1MB region specially.

Anyhow what I am wondering:

 a) Should we add a edge case in the hypervisor to allow multiple mappings
   for this region? I am thinking no.. but it sounds like mapping ISA region
   as WB is safe even in baremetal?

 b) Or would it be better to let Linux do its thing and treat 640KB->1MB
   as uncached instead of writeback?

   Looking at the kernel it assumes that WB is ok for 640KB->1MB.
   The comment says:
   " /* Low ISA region is always mapped WB in page table. No need to track *"

   which is probably true on baremetal. But with Xen PV:

 856         /*                                                                      
 857          * In domU, the ISA region is normal, usable memory, but we             
 858          * reserve ISA memory anyway because too many things poke               
 859          * about in there.                                                      
 860          */                                                                     
 861         e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS, 
 862                         E820_RESERVED);                                 

   which would imply we don't have any page table mappings.

   And then the quick fix I provided above looks like the right solution?


CC-ing Boris, Daniel, Juergen, Steve, and Chuck.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC
  2016-08-17 20:32 ` Konrad Rzeszutek Wilk
  (?)
  (?)
@ 2016-08-18 10:06 ` Jan Beulich
  2016-08-18 10:16   ` Andrew Cooper
                     ` (2 more replies)
  -1 siblings, 3 replies; 16+ messages in thread
From: Jan Beulich @ 2016-08-18 10:06 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: stefan.bader, david.vrabel, xen-devel, Boris Ostrovsky,
	chuck.anderson, Juergen Gross, linux-kernel

>>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:
> One of the interesting things about XSA 154 fix ("x86: enforce consistent
> cachability of MMIO mappings") is that when certain applications (mcelog)
> are trying to map /dev/mmap and lurk in ISA regions - we get:

DYM /dev/mem ? Most accesses to which are bogus in PV guests
(often including Dom0) anyway.

> [   49.399053] WARNING: CPU: 0 PID: 2471 at arch/x86/mm/pat.c:913 untrack_pfn+0x93/0xc0()

What Linux version is this? untrack_pfn() doesn't span line 913 in
4.8-rc2. And follow_phys() appears to only check whether the write
flag is set as expected; I can't see any cachability checks. Plus it
gets called only when both incoming address and size are zero.

> Anyhow what I am wondering:
> 
>  a) Should we add a edge case in the hypervisor to allow multiple mappings
>    for this region? I am thinking no.. but it sounds like mapping ISA region
>    as WB is safe even in baremetal?

We should never allow multiple mappings with different cachability.
And I don't understand what makes you think the ISA region is safe
to map WB? There might be ROMs, MMIO regions, or simply nothing
there, neither of which is safe to map WB. ROMs certainly could be
WP, but that would require a way to reliably size not only ISA
extension ROMs, but also main and video BIOS.

>  b) Or would it be better to let Linux do its thing and treat 640KB->1MB
>    as uncached instead of writeback?

According to what you wrote earlier the two parts of the sentence
read contradictory to me.

>    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
>    The comment says:
>    " /* Low ISA region is always mapped WB in page table. No need to track *"

As per above it's not clear to me what this comment is backed by.

Jan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC
  2016-08-17 20:32 ` Konrad Rzeszutek Wilk
  (?)
@ 2016-08-18 10:06 ` Jan Beulich
  -1 siblings, 0 replies; 16+ messages in thread
From: Jan Beulich @ 2016-08-18 10:06 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Juergen Gross, linux-kernel, stefan.bader, david.vrabel,
	chuck.anderson, xen-devel, Boris Ostrovsky

>>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:
> One of the interesting things about XSA 154 fix ("x86: enforce consistent
> cachability of MMIO mappings") is that when certain applications (mcelog)
> are trying to map /dev/mmap and lurk in ISA regions - we get:

DYM /dev/mem ? Most accesses to which are bogus in PV guests
(often including Dom0) anyway.

> [   49.399053] WARNING: CPU: 0 PID: 2471 at arch/x86/mm/pat.c:913 untrack_pfn+0x93/0xc0()

What Linux version is this? untrack_pfn() doesn't span line 913 in
4.8-rc2. And follow_phys() appears to only check whether the write
flag is set as expected; I can't see any cachability checks. Plus it
gets called only when both incoming address and size are zero.

> Anyhow what I am wondering:
> 
>  a) Should we add a edge case in the hypervisor to allow multiple mappings
>    for this region? I am thinking no.. but it sounds like mapping ISA region
>    as WB is safe even in baremetal?

We should never allow multiple mappings with different cachability.
And I don't understand what makes you think the ISA region is safe
to map WB? There might be ROMs, MMIO regions, or simply nothing
there, neither of which is safe to map WB. ROMs certainly could be
WP, but that would require a way to reliably size not only ISA
extension ROMs, but also main and video BIOS.

>  b) Or would it be better to let Linux do its thing and treat 640KB->1MB
>    as uncached instead of writeback?

According to what you wrote earlier the two parts of the sentence
read contradictory to me.

>    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
>    The comment says:
>    " /* Low ISA region is always mapped WB in page table. No need to track *"

As per above it's not clear to me what this comment is backed by.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Xen-devel] XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC
  2016-08-18 10:06 ` Jan Beulich
  2016-08-18 10:16   ` Andrew Cooper
@ 2016-08-18 10:16   ` Andrew Cooper
  2016-08-18 11:12     ` Jan Beulich
  2016-08-18 11:12     ` Jan Beulich
  2016-08-19 14:52     ` Konrad Rzeszutek Wilk
  2 siblings, 2 replies; 16+ messages in thread
From: Andrew Cooper @ 2016-08-18 10:16 UTC (permalink / raw)
  To: Jan Beulich, Konrad Rzeszutek Wilk
  Cc: Juergen Gross, linux-kernel, stefan.bader, david.vrabel,
	chuck.anderson, xen-devel, Boris Ostrovsky

On 18/08/16 11:06, Jan Beulich wrote:
>>>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:
>> One of the interesting things about XSA 154 fix ("x86: enforce consistent
>> cachability of MMIO mappings") is that when certain applications (mcelog)
>> are trying to map /dev/mmap and lurk in ISA regions - we get:
> DYM /dev/mem ? Most accesses to which are bogus in PV guests
> (often including Dom0) anyway.
>
>> [   49.399053] WARNING: CPU: 0 PID: 2471 at arch/x86/mm/pat.c:913 untrack_pfn+0x93/0xc0()
> What Linux version is this? untrack_pfn() doesn't span line 913 in
> 4.8-rc2. And follow_phys() appears to only check whether the write
> flag is set as expected; I can't see any cachability checks. Plus it
> gets called only when both incoming address and size are zero.
>
>> Anyhow what I am wondering:
>>
>>  a) Should we add a edge case in the hypervisor to allow multiple mappings
>>    for this region? I am thinking no.. but it sounds like mapping ISA region
>>    as WB is safe even in baremetal?
> We should never allow multiple mappings with different cachability.
> And I don't understand what makes you think the ISA region is safe
> to map WB? There might be ROMs, MMIO regions, or simply nothing
> there, neither of which is safe to map WB. ROMs certainly could be
> WP, but that would require a way to reliably size not only ISA
> extension ROMs, but also main and video BIOS.
>
>>  b) Or would it be better to let Linux do its thing and treat 640KB->1MB
>>    as uncached instead of writeback?
> According to what you wrote earlier the two parts of the sentence
> read contradictory to me.
>
>>    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
>>    The comment says:
>>    " /* Low ISA region is always mapped WB in page table. No need to track *"
> As per above it's not clear to me what this comment is backed by.

This states what is in the pagetables.  Not the combined result with MTRRs.

WB in the pagetables and WC/UB in the MTRRs is a legal combination which
functions correctly.

~Andrew

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC
  2016-08-18 10:06 ` Jan Beulich
@ 2016-08-18 10:16   ` Andrew Cooper
  2016-08-18 10:16   ` [Xen-devel] " Andrew Cooper
  2016-08-19 14:52     ` Konrad Rzeszutek Wilk
  2 siblings, 0 replies; 16+ messages in thread
From: Andrew Cooper @ 2016-08-18 10:16 UTC (permalink / raw)
  To: Jan Beulich, Konrad Rzeszutek Wilk
  Cc: Juergen Gross, linux-kernel, stefan.bader, david.vrabel,
	chuck.anderson, xen-devel, Boris Ostrovsky

On 18/08/16 11:06, Jan Beulich wrote:
>>>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:
>> One of the interesting things about XSA 154 fix ("x86: enforce consistent
>> cachability of MMIO mappings") is that when certain applications (mcelog)
>> are trying to map /dev/mmap and lurk in ISA regions - we get:
> DYM /dev/mem ? Most accesses to which are bogus in PV guests
> (often including Dom0) anyway.
>
>> [   49.399053] WARNING: CPU: 0 PID: 2471 at arch/x86/mm/pat.c:913 untrack_pfn+0x93/0xc0()
> What Linux version is this? untrack_pfn() doesn't span line 913 in
> 4.8-rc2. And follow_phys() appears to only check whether the write
> flag is set as expected; I can't see any cachability checks. Plus it
> gets called only when both incoming address and size are zero.
>
>> Anyhow what I am wondering:
>>
>>  a) Should we add a edge case in the hypervisor to allow multiple mappings
>>    for this region? I am thinking no.. but it sounds like mapping ISA region
>>    as WB is safe even in baremetal?
> We should never allow multiple mappings with different cachability.
> And I don't understand what makes you think the ISA region is safe
> to map WB? There might be ROMs, MMIO regions, or simply nothing
> there, neither of which is safe to map WB. ROMs certainly could be
> WP, but that would require a way to reliably size not only ISA
> extension ROMs, but also main and video BIOS.
>
>>  b) Or would it be better to let Linux do its thing and treat 640KB->1MB
>>    as uncached instead of writeback?
> According to what you wrote earlier the two parts of the sentence
> read contradictory to me.
>
>>    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
>>    The comment says:
>>    " /* Low ISA region is always mapped WB in page table. No need to track *"
> As per above it's not clear to me what this comment is backed by.

This states what is in the pagetables.  Not the combined result with MTRRs.

WB in the pagetables and WC/UB in the MTRRs is a legal combination which
functions correctly.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Xen-devel] XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC
  2016-08-18 10:16   ` [Xen-devel] " Andrew Cooper
@ 2016-08-18 11:12     ` Jan Beulich
  2016-08-18 15:35         ` One Thousand Gnomes
  2016-08-18 11:12     ` Jan Beulich
  1 sibling, 1 reply; 16+ messages in thread
From: Jan Beulich @ 2016-08-18 11:12 UTC (permalink / raw)
  To: Andrew Cooper, Konrad Rzeszutek Wilk
  Cc: stefan.bader, david.vrabel, xen-devel, Boris Ostrovsky,
	chuck.anderson, Juergen Gross, linux-kernel

>>> On 18.08.16 at 12:16, <andrew.cooper3@citrix.com> wrote:
> On 18/08/16 11:06, Jan Beulich wrote:
>>>>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:
>>>    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
>>>    The comment says:
>>>    " /* Low ISA region is always mapped WB in page table. No need to track 
> *"
>> As per above it's not clear to me what this comment is backed by.
> 
> This states what is in the pagetables.  Not the combined result with MTRRs.
> 
> WB in the pagetables and WC/UB in the MTRRs is a legal combination which
> functions correctly.

True, but then again - haven't I been told multiple times that Linux
nowadays prefers to run without using MTRRs?

Jan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC
  2016-08-18 10:16   ` [Xen-devel] " Andrew Cooper
  2016-08-18 11:12     ` Jan Beulich
@ 2016-08-18 11:12     ` Jan Beulich
  1 sibling, 0 replies; 16+ messages in thread
From: Jan Beulich @ 2016-08-18 11:12 UTC (permalink / raw)
  To: Andrew Cooper, Konrad Rzeszutek Wilk
  Cc: Juergen Gross, linux-kernel, stefan.bader, david.vrabel,
	chuck.anderson, xen-devel, Boris Ostrovsky

>>> On 18.08.16 at 12:16, <andrew.cooper3@citrix.com> wrote:
> On 18/08/16 11:06, Jan Beulich wrote:
>>>>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:
>>>    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
>>>    The comment says:
>>>    " /* Low ISA region is always mapped WB in page table. No need to track 
> *"
>> As per above it's not clear to me what this comment is backed by.
> 
> This states what is in the pagetables.  Not the combined result with MTRRs.
> 
> WB in the pagetables and WC/UB in the MTRRs is a legal combination which
> functions correctly.

True, but then again - haven't I been told multiple times that Linux
nowadays prefers to run without using MTRRs?

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Xen-devel] XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC
  2016-08-18 11:12     ` Jan Beulich
@ 2016-08-18 15:35         ` One Thousand Gnomes
  0 siblings, 0 replies; 16+ messages in thread
From: One Thousand Gnomes @ 2016-08-18 15:35 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Konrad Rzeszutek Wilk, stefan.bader, david.vrabel,
	xen-devel, Boris Ostrovsky, chuck.anderson, Juergen Gross,
	linux-kernel

On Thu, 18 Aug 2016 05:12:54 -0600
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 18.08.16 at 12:16, <andrew.cooper3@citrix.com> wrote:  
> > On 18/08/16 11:06, Jan Beulich wrote:  
> >>>>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:  
> >>>    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
> >>>    The comment says:
> >>>    " /* Low ISA region is always mapped WB in page table. No need to track   
> > *"  
> >> As per above it's not clear to me what this comment is backed by.  
> > 
> > This states what is in the pagetables.  Not the combined result with MTRRs.
> > 
> > WB in the pagetables and WC/UB in the MTRRs is a legal combination which
> > functions correctly.  
> 
> True, but then again - haven't I been told multiple times that Linux
> nowadays prefers to run without using MTRRs?

The BIOS sets up the fixed MTRR registers for the 640K-1MB window. Those
are separate to the variable range MTRR registers used for main memory
with specific mappings for segments A000 to BFFF then C000-C7FF /
C800-CFFF / etc up to FFFF.

Alan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC
@ 2016-08-18 15:35         ` One Thousand Gnomes
  0 siblings, 0 replies; 16+ messages in thread
From: One Thousand Gnomes @ 2016-08-18 15:35 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Andrew Cooper, linux-kernel, stefan.bader,
	david.vrabel, chuck.anderson, xen-devel, Boris Ostrovsky

On Thu, 18 Aug 2016 05:12:54 -0600
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 18.08.16 at 12:16, <andrew.cooper3@citrix.com> wrote:  
> > On 18/08/16 11:06, Jan Beulich wrote:  
> >>>>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:  
> >>>    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
> >>>    The comment says:
> >>>    " /* Low ISA region is always mapped WB in page table. No need to track   
> > *"  
> >> As per above it's not clear to me what this comment is backed by.  
> > 
> > This states what is in the pagetables.  Not the combined result with MTRRs.
> > 
> > WB in the pagetables and WC/UB in the MTRRs is a legal combination which
> > functions correctly.  
> 
> True, but then again - haven't I been told multiple times that Linux
> nowadays prefers to run without using MTRRs?

The BIOS sets up the fixed MTRR registers for the 640K-1MB window. Those
are separate to the variable range MTRR registers used for main memory
with specific mappings for segments A000 to BFFF then C000-C7FF /
C800-CFFF / etc up to FFFF.

Alan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC
  2016-08-18 10:06 ` Jan Beulich
@ 2016-08-19 14:52     ` Konrad Rzeszutek Wilk
  2016-08-18 10:16   ` [Xen-devel] " Andrew Cooper
  2016-08-19 14:52     ` Konrad Rzeszutek Wilk
  2 siblings, 0 replies; 16+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-08-19 14:52 UTC (permalink / raw)
  To: Jan Beulich
  Cc: stefan.bader, david.vrabel, xen-devel, Boris Ostrovsky,
	chuck.anderson, Juergen Gross, linux-kernel

On Thu, Aug 18, 2016 at 04:06:33AM -0600, Jan Beulich wrote:
> >>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:
> > One of the interesting things about XSA 154 fix ("x86: enforce consistent
> > cachability of MMIO mappings") is that when certain applications (mcelog)
> > are trying to map /dev/mmap and lurk in ISA regions - we get:
> 
> DYM /dev/mem ? Most accesses to which are bogus in PV guests
> (often including Dom0) anyway.

Yes.
> 
> > [   49.399053] WARNING: CPU: 0 PID: 2471 at arch/x86/mm/pat.c:913 untrack_pfn+0x93/0xc0()
> 
> What Linux version is this? untrack_pfn() doesn't span line 913 in

4.1
> 4.8-rc2. And follow_phys() appears to only check whether the write
> flag is set as expected; I can't see any cachability checks. Plus it
> gets called only when both incoming address and size are zero.

The error that happens is much sooner - that is when the VMA is setup
with the incorrect page attributes. Specifically: reserve_memtype which

 548         /* Low ISA region is always mapped WB in page table. No need to track */
 549         if (x86_platform.is_untracked_pat_range(start, end)) {                  
 550                 if (new_type)                                                   
 551                         *new_type = _PAGE_CACHE_MODE_WB;                        
 552                 return 0;                                                       
 553         }                                           

(And this for a change is v4.8-rc2)
> 
> > Anyhow what I am wondering:
> > 
> >  a) Should we add a edge case in the hypervisor to allow multiple mappings
> >    for this region? I am thinking no.. but it sounds like mapping ISA region
> >    as WB is safe even in baremetal?
> 
> We should never allow multiple mappings with different cachability.
> And I don't understand what makes you think the ISA region is safe
> to map WB? There might be ROMs, MMIO regions, or simply nothing
> there, neither of which is safe to map WB. ROMs certainly could be
> WP, but that would require a way to reliably size not only ISA
> extension ROMs, but also main and video BIOS.
> 
> >  b) Or would it be better to let Linux do its thing and treat 640KB->1MB
> >    as uncached instead of writeback?
> 
> According to what you wrote earlier the two parts of the sentence
> read contradictory to me.
> 
> >    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
> >    The comment says:
> >    " /* Low ISA region is always mapped WB in page table. No need to track *"
> 
> As per above it's not clear to me what this comment is backed by.

I was hoping you would know :-)

Ah, commit 2e5d9c857d4e6c9e7b7d8c8c86a68a7842d213d6
Author: venkatesh.pallipadi@intel.com <venkatesh.pallipadi@intel.com>
Date:   Tue Mar 18 17:00:14 2008 -0700

    x86: PAT infrastructure patch
    
    Sets up pat_init() infrastructure.
    

which sets the MTRR for that region.
> 
> Jan
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC
@ 2016-08-19 14:52     ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 16+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-08-19 14:52 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, linux-kernel, stefan.bader, david.vrabel,
	chuck.anderson, xen-devel, Boris Ostrovsky

On Thu, Aug 18, 2016 at 04:06:33AM -0600, Jan Beulich wrote:
> >>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:
> > One of the interesting things about XSA 154 fix ("x86: enforce consistent
> > cachability of MMIO mappings") is that when certain applications (mcelog)
> > are trying to map /dev/mmap and lurk in ISA regions - we get:
> 
> DYM /dev/mem ? Most accesses to which are bogus in PV guests
> (often including Dom0) anyway.

Yes.
> 
> > [   49.399053] WARNING: CPU: 0 PID: 2471 at arch/x86/mm/pat.c:913 untrack_pfn+0x93/0xc0()
> 
> What Linux version is this? untrack_pfn() doesn't span line 913 in

4.1
> 4.8-rc2. And follow_phys() appears to only check whether the write
> flag is set as expected; I can't see any cachability checks. Plus it
> gets called only when both incoming address and size are zero.

The error that happens is much sooner - that is when the VMA is setup
with the incorrect page attributes. Specifically: reserve_memtype which

 548         /* Low ISA region is always mapped WB in page table. No need to track */
 549         if (x86_platform.is_untracked_pat_range(start, end)) {                  
 550                 if (new_type)                                                   
 551                         *new_type = _PAGE_CACHE_MODE_WB;                        
 552                 return 0;                                                       
 553         }                                           

(And this for a change is v4.8-rc2)
> 
> > Anyhow what I am wondering:
> > 
> >  a) Should we add a edge case in the hypervisor to allow multiple mappings
> >    for this region? I am thinking no.. but it sounds like mapping ISA region
> >    as WB is safe even in baremetal?
> 
> We should never allow multiple mappings with different cachability.
> And I don't understand what makes you think the ISA region is safe
> to map WB? There might be ROMs, MMIO regions, or simply nothing
> there, neither of which is safe to map WB. ROMs certainly could be
> WP, but that would require a way to reliably size not only ISA
> extension ROMs, but also main and video BIOS.
> 
> >  b) Or would it be better to let Linux do its thing and treat 640KB->1MB
> >    as uncached instead of writeback?
> 
> According to what you wrote earlier the two parts of the sentence
> read contradictory to me.
> 
> >    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
> >    The comment says:
> >    " /* Low ISA region is always mapped WB in page table. No need to track *"
> 
> As per above it's not clear to me what this comment is backed by.

I was hoping you would know :-)

Ah, commit 2e5d9c857d4e6c9e7b7d8c8c86a68a7842d213d6
Author: venkatesh.pallipadi@intel.com <venkatesh.pallipadi@intel.com>
Date:   Tue Mar 18 17:00:14 2008 -0700

    x86: PAT infrastructure patch
    
    Sets up pat_init() infrastructure.
    

which sets the MTRR for that region.
> 
> Jan
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC
  2016-08-19 14:52     ` Konrad Rzeszutek Wilk
@ 2016-08-19 15:14       ` Jan Beulich
  -1 siblings, 0 replies; 16+ messages in thread
From: Jan Beulich @ 2016-08-19 15:14 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: stefan.bader, david.vrabel, xen-devel, Boris Ostrovsky,
	chuck.anderson, Juergen Gross, linux-kernel

>>> On 19.08.16 at 16:52, <konrad.wilk@oracle.com> wrote:
> On Thu, Aug 18, 2016 at 04:06:33AM -0600, Jan Beulich wrote:
>> >>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:
>> >    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
>> >    The comment says:
>> >    " /* Low ISA region is always mapped WB in page table. No need to track *"
>> 
>> As per above it's not clear to me what this comment is backed by.
> 
> I was hoping you would know :-)
> 
> Ah, commit 2e5d9c857d4e6c9e7b7d8c8c86a68a7842d213d6
> Author: venkatesh.pallipadi@intel.com <venkatesh.pallipadi@intel.com>
> Date:   Tue Mar 18 17:00:14 2008 -0700
> 
>     x86: PAT infrastructure patch
>     
>     Sets up pat_init() infrastructure.
>     
> 
> which sets the MTRR for that region.

Hmm, that's the commit which introduced pat.c years ago. I can't
see it writing an MTRRs though, nor can I see it backing the comment
in adds in any way. I guess I'm confused...

Jan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC
@ 2016-08-19 15:14       ` Jan Beulich
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Beulich @ 2016-08-19 15:14 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Juergen Gross, linux-kernel, stefan.bader, david.vrabel,
	chuck.anderson, xen-devel, Boris Ostrovsky

>>> On 19.08.16 at 16:52, <konrad.wilk@oracle.com> wrote:
> On Thu, Aug 18, 2016 at 04:06:33AM -0600, Jan Beulich wrote:
>> >>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:
>> >    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
>> >    The comment says:
>> >    " /* Low ISA region is always mapped WB in page table. No need to track *"
>> 
>> As per above it's not clear to me what this comment is backed by.
> 
> I was hoping you would know :-)
> 
> Ah, commit 2e5d9c857d4e6c9e7b7d8c8c86a68a7842d213d6
> Author: venkatesh.pallipadi@intel.com <venkatesh.pallipadi@intel.com>
> Date:   Tue Mar 18 17:00:14 2008 -0700
> 
>     x86: PAT infrastructure patch
>     
>     Sets up pat_init() infrastructure.
>     
> 
> which sets the MTRR for that region.

Hmm, that's the commit which introduced pat.c years ago. I can't
see it writing an MTRRs though, nor can I see it backing the comment
in adds in any way. I guess I'm confused...

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Xen-devel] XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC
  2016-08-18 15:35         ` One Thousand Gnomes
@ 2016-08-19 15:27           ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 16+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-08-19 15:27 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Jan Beulich, Andrew Cooper, stefan.bader, david.vrabel,
	xen-devel, Boris Ostrovsky, chuck.anderson, Juergen Gross,
	linux-kernel

On Thu, Aug 18, 2016 at 04:35:44PM +0100, One Thousand Gnomes wrote:
> On Thu, 18 Aug 2016 05:12:54 -0600
> "Jan Beulich" <JBeulich@suse.com> wrote:
> 
> > >>> On 18.08.16 at 12:16, <andrew.cooper3@citrix.com> wrote:  
> > > On 18/08/16 11:06, Jan Beulich wrote:  
> > >>>>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:  
> > >>>    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
> > >>>    The comment says:
> > >>>    " /* Low ISA region is always mapped WB in page table. No need to track   
> > > *"  
> > >> As per above it's not clear to me what this comment is backed by.  
> > > 
> > > This states what is in the pagetables.  Not the combined result with MTRRs.
> > > 
> > > WB in the pagetables and WC/UB in the MTRRs is a legal combination which
> > > functions correctly.  
> > 
> > True, but then again - haven't I been told multiple times that Linux
> > nowadays prefers to run without using MTRRs?
> 
> The BIOS sets up the fixed MTRR registers for the 640K-1MB window. Those
> are separate to the variable range MTRR registers used for main memory
> with specific mappings for segments A000 to BFFF then C000-C7FF /
> C800-CFFF / etc up to FFFF.

OK, so BIOS-inherited.

Looking at the Intel SDM (figure 11-7), if the MTRR is UC for that, then 
having pagetables being either UC or WB are fine. Except Linux's use
of the quirk (is_untracked_pat_range) ends up always requesting WB.

And to combat the splat, the patch:


>From 5209635f23786fb88cf0ce77719da8acda63bf65 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 19 Aug 2016 11:06:44 -0400
Subject: [PATCH] x86/xen: Add x86_platform.is_untracked_pat_range quirk to
 ignore ISA regions.

On x86 whenever VMAs are setup, the 'is_ISA_range quirk' (which this
patch re-implements) is used to figure whether to ignore the
requested PAT type and always use WB (see 'reserve_memtype').
Specifically it forces the WB type for any region in the ISA space.

>From the Intel SDM, the combination of MTRR (UC, which is setup by
the BIOS) and PAT (UC or WB) for the ISA region ends up with the same
value - UC.

However on Xen, due to XSA 154 we enforce that mappings that _ANY_
pagetable entry to MMIO ranges MUST have the same the same cachability
mapping - and in this case we enforce UC.

Which means that with XSA 154 (and without this patch) any application
that maps /dev/mem to get SMBIOS information (like mcelog), and pokes
in the ISA region will not have an PTE set. That is due to
reserve_pfn_range returning -EINVAL which results in the PTE not being set.

[These are debug entries added in 'reserve_pfn_range']
mcelog:2471 0xf0000->0xf1000, req_type=write-back new_type=write-back
mcelog:2471 0xeb000->0xed000, req_type=write-back new_type=write-back

.. above are successfull ones, but:
mcelog:2471 0xeb000->0xed000, req_type=uncached new_type=uncached
[again, a debug one:]
mcelog:2471 want=uncached got=write-back strict 0x000eb000-0x000ecfff
mcelog:2471 map pfn expected mapping type uncached for [mem 0x000eb000-0x000ecfff], got write-back
 ------------[ cut here ]------------

 [<ffffffff816c66f0>] dump_stack+0x63/0x83
 [<ffffffff81084745>] warn_slowpath_common+0x95/0xe0
 [<ffffffff810847aa>] warn_slowpath_null+0x1a/0x20
 [<ffffffff810725f3>] untrack_pfn+0x93/0xc0
 [<ffffffff811b90f9>] unmap_single_vma+0xa9/0x100
 [<ffffffff811b9644>] unmap_vmas+0x54/0xa0
 [<ffffffff811bf0da>] exit_mmap+0x9a/0x150
 [<ffffffff810825d3>] mmput+0x73/0x110
 [<ffffffff81082775>] dup_mm+0x105/0x110
 [<ffffffff81083b1d>] copy_process+0x11ed/0x1240
 [<ffffffff81084009>] do_fork+0x79/0x280
 [<ffffffff810259d3>] ? syscall_trace_enter_phase1+0x153/0x180
 [<ffffffff81084226>] SyS_clone+0x16/0x20
 [<ffffffff816cb3ee>] system_call_fastpath+0x12/0x71

results in that splat.

The effective result of the function below is for 'reserver_memtype'
to ignore the result from 'x86_platform.is_untracked_pat_range' quirk.
Which means that the splat above does not happen.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/enlighten.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 8ffb089..3238d04 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -283,6 +283,27 @@ static void __init xen_banner(void)
 	       version >> 16, version & 0xffff, extra.extraversion,
 	       xen_feature(XENFEAT_mmu_pt_update_preserve_ad) ? " (preserve-AD)" : "");
 }
+
+/*
+ * On x86 whenever VMAs are setup, the 'is_ISA_range quirk' (which we
+ * re-implement below) is used to figure whether to ignore the
+ * requested PAT type and always use WB (see 'reserve_memtype').
+ *
+ * The combination of MTRR (UC) and PAT (UC or WB) for the ISA region ends
+ * up with the same value - UC.
+ *
+ * However on Xen, due to XSA 154 we enforce that mappings to _ANY_ MMIO
+ * range MUST have the same the same cachability mapping - and in this case
+ * we enforce UC for everything.
+ *
+ * The effective result of the function below is for 'reserver_memtype'
+ * to ignore the result from 'x86_platform.is_untracked_pat_range' quirk.
+ */
+static bool xen_ignore(u64 s, u64 e)
+{
+	return false;
+}
+
 /* Check if running on Xen version (major, minor) or later */
 bool
 xen_running_on_version_or_later(unsigned int major, unsigned int minor)
@@ -1730,6 +1751,8 @@ asmlinkage __visible void __init xen_start_kernel(void)
 		x86_init.mpparse.get_smp_config = x86_init_uint_noop;
 
 		xen_boot_params_init_edd();
+
+		x86_platform.is_untracked_pat_range = xen_ignore;
 	}
 #ifdef CONFIG_PCI
 	/* PCI BIOS service won't work from a PV guest. */
-- 
2.5.5

> 
> Alan

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC
@ 2016-08-19 15:27           ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 16+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-08-19 15:27 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Juergen Gross, chuck.anderson, Andrew Cooper, linux-kernel,
	stefan.bader, david.vrabel, Jan Beulich, xen-devel,
	Boris Ostrovsky

On Thu, Aug 18, 2016 at 04:35:44PM +0100, One Thousand Gnomes wrote:
> On Thu, 18 Aug 2016 05:12:54 -0600
> "Jan Beulich" <JBeulich@suse.com> wrote:
> 
> > >>> On 18.08.16 at 12:16, <andrew.cooper3@citrix.com> wrote:  
> > > On 18/08/16 11:06, Jan Beulich wrote:  
> > >>>>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:  
> > >>>    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
> > >>>    The comment says:
> > >>>    " /* Low ISA region is always mapped WB in page table. No need to track   
> > > *"  
> > >> As per above it's not clear to me what this comment is backed by.  
> > > 
> > > This states what is in the pagetables.  Not the combined result with MTRRs.
> > > 
> > > WB in the pagetables and WC/UB in the MTRRs is a legal combination which
> > > functions correctly.  
> > 
> > True, but then again - haven't I been told multiple times that Linux
> > nowadays prefers to run without using MTRRs?
> 
> The BIOS sets up the fixed MTRR registers for the 640K-1MB window. Those
> are separate to the variable range MTRR registers used for main memory
> with specific mappings for segments A000 to BFFF then C000-C7FF /
> C800-CFFF / etc up to FFFF.

OK, so BIOS-inherited.

Looking at the Intel SDM (figure 11-7), if the MTRR is UC for that, then 
having pagetables being either UC or WB are fine. Except Linux's use
of the quirk (is_untracked_pat_range) ends up always requesting WB.

And to combat the splat, the patch:


From 5209635f23786fb88cf0ce77719da8acda63bf65 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 19 Aug 2016 11:06:44 -0400
Subject: [PATCH] x86/xen: Add x86_platform.is_untracked_pat_range quirk to
 ignore ISA regions.

On x86 whenever VMAs are setup, the 'is_ISA_range quirk' (which this
patch re-implements) is used to figure whether to ignore the
requested PAT type and always use WB (see 'reserve_memtype').
Specifically it forces the WB type for any region in the ISA space.

From the Intel SDM, the combination of MTRR (UC, which is setup by
the BIOS) and PAT (UC or WB) for the ISA region ends up with the same
value - UC.

However on Xen, due to XSA 154 we enforce that mappings that _ANY_
pagetable entry to MMIO ranges MUST have the same the same cachability
mapping - and in this case we enforce UC.

Which means that with XSA 154 (and without this patch) any application
that maps /dev/mem to get SMBIOS information (like mcelog), and pokes
in the ISA region will not have an PTE set. That is due to
reserve_pfn_range returning -EINVAL which results in the PTE not being set.

[These are debug entries added in 'reserve_pfn_range']
mcelog:2471 0xf0000->0xf1000, req_type=write-back new_type=write-back
mcelog:2471 0xeb000->0xed000, req_type=write-back new_type=write-back

.. above are successfull ones, but:
mcelog:2471 0xeb000->0xed000, req_type=uncached new_type=uncached
[again, a debug one:]
mcelog:2471 want=uncached got=write-back strict 0x000eb000-0x000ecfff
mcelog:2471 map pfn expected mapping type uncached for [mem 0x000eb000-0x000ecfff], got write-back
 ------------[ cut here ]------------

 [<ffffffff816c66f0>] dump_stack+0x63/0x83
 [<ffffffff81084745>] warn_slowpath_common+0x95/0xe0
 [<ffffffff810847aa>] warn_slowpath_null+0x1a/0x20
 [<ffffffff810725f3>] untrack_pfn+0x93/0xc0
 [<ffffffff811b90f9>] unmap_single_vma+0xa9/0x100
 [<ffffffff811b9644>] unmap_vmas+0x54/0xa0
 [<ffffffff811bf0da>] exit_mmap+0x9a/0x150
 [<ffffffff810825d3>] mmput+0x73/0x110
 [<ffffffff81082775>] dup_mm+0x105/0x110
 [<ffffffff81083b1d>] copy_process+0x11ed/0x1240
 [<ffffffff81084009>] do_fork+0x79/0x280
 [<ffffffff810259d3>] ? syscall_trace_enter_phase1+0x153/0x180
 [<ffffffff81084226>] SyS_clone+0x16/0x20
 [<ffffffff816cb3ee>] system_call_fastpath+0x12/0x71

results in that splat.

The effective result of the function below is for 'reserver_memtype'
to ignore the result from 'x86_platform.is_untracked_pat_range' quirk.
Which means that the splat above does not happen.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/enlighten.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 8ffb089..3238d04 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -283,6 +283,27 @@ static void __init xen_banner(void)
 	       version >> 16, version & 0xffff, extra.extraversion,
 	       xen_feature(XENFEAT_mmu_pt_update_preserve_ad) ? " (preserve-AD)" : "");
 }
+
+/*
+ * On x86 whenever VMAs are setup, the 'is_ISA_range quirk' (which we
+ * re-implement below) is used to figure whether to ignore the
+ * requested PAT type and always use WB (see 'reserve_memtype').
+ *
+ * The combination of MTRR (UC) and PAT (UC or WB) for the ISA region ends
+ * up with the same value - UC.
+ *
+ * However on Xen, due to XSA 154 we enforce that mappings to _ANY_ MMIO
+ * range MUST have the same the same cachability mapping - and in this case
+ * we enforce UC for everything.
+ *
+ * The effective result of the function below is for 'reserver_memtype'
+ * to ignore the result from 'x86_platform.is_untracked_pat_range' quirk.
+ */
+static bool xen_ignore(u64 s, u64 e)
+{
+	return false;
+}
+
 /* Check if running on Xen version (major, minor) or later */
 bool
 xen_running_on_version_or_later(unsigned int major, unsigned int minor)
@@ -1730,6 +1751,8 @@ asmlinkage __visible void __init xen_start_kernel(void)
 		x86_init.mpparse.get_smp_config = x86_init_uint_noop;
 
 		xen_boot_params_init_edd();
+
+		x86_platform.is_untracked_pat_range = xen_ignore;
 	}
 #ifdef CONFIG_PCI
 	/* PCI BIOS service won't work from a PV guest. */
-- 
2.5.5

> 
> Alan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-08-19 15:28 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-17 20:32 XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC Konrad Rzeszutek Wilk
2016-08-17 20:32 ` Konrad Rzeszutek Wilk
2016-08-18 10:06 ` Jan Beulich
2016-08-18 10:06 ` Jan Beulich
2016-08-18 10:16   ` Andrew Cooper
2016-08-18 10:16   ` [Xen-devel] " Andrew Cooper
2016-08-18 11:12     ` Jan Beulich
2016-08-18 15:35       ` One Thousand Gnomes
2016-08-18 15:35         ` One Thousand Gnomes
2016-08-19 15:27         ` [Xen-devel] " Konrad Rzeszutek Wilk
2016-08-19 15:27           ` Konrad Rzeszutek Wilk
2016-08-18 11:12     ` Jan Beulich
2016-08-19 14:52   ` Konrad Rzeszutek Wilk
2016-08-19 14:52     ` Konrad Rzeszutek Wilk
2016-08-19 15:14     ` Jan Beulich
2016-08-19 15:14       ` Jan Beulich

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.