All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] x86 fixes for 3.3 impacting distros (v1).
@ 2012-02-10 15:34 Konrad Rzeszutek Wilk
  2012-02-10 15:34 ` [PATCH] x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations Konrad Rzeszutek Wilk
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-02-10 15:34 UTC (permalink / raw)
  To: rostedt, tglx, mingo, hpa, x86, linux-kernel; +Cc: xen-devel

The attached patch fixes RH BZ #742032, #787403, and #745574
and touch x86 subsystem.

The patch description gives a very good overview of the problem and
one solution. The one solution it chooses is not the most architecturally
sound but it does not cause performance degradation. If this your
first time reading this, please read the patch first and then come back to
this cover letter as I've some perf numbers and more detailed explanation here.

A bit of overview of the __page_change_attr_set_clr:

Its purpose is to change page attributes from one type to another.
It is important to understand that the entrance that code:
__page_change_attr_set_clr is guarded by cpa_lock spin-lock - which makes
that whole code be single threaded.

Albeit it seems that if debug mode is turned on, it can run in parallel. The
effect of using the posted patch is that __page_change_attr_set_clr() will be
affected when we change caching attributes on 4KB pages and/or the NX flag.

The execution of __page_change_attr_set_clr is concentrated in
(looked for ioremap_* and set_pages_*):
 - during bootup ("Write protecting the ..")
 - suspend/resume and graphic adapters evicting their buffers from the card
   to RAM (which is usually done during suspend but can be done via the
   'evict' attribute in debugfs)
 - when setting the memory for the cursor (AGP cards using i8xx chipset) -
   done during bootup and startup of Xserver.
 - setting up memory for Intel GTT scratch (i9xx) page (done during bootup)
 - payload (purgatory code) for kexec (done during kexec -l).
 - ioremap_* during PCI devices load - InfiniBand and video cards like to use
   ioremap_wc.
 - Intel, radeon, nouveau running into memory pressure and evicting pages from
   their GEM/TTM pool (once an hour or so if compiling a lot with only 4GB).

These are the cases I found when running on baremetal (and Xen) using a normal
Fedora Core 16 distro.

The alternate solution to the problem I am trying to solve, which is much
more architecturally sound (but has some perf disadvantages) is to wrap
the pte_flags with paravirt call everywhere. For that these patches two patches:
http://darnok.org/results/baseline_pte_flags_pte_attrs/0001-x86-paravirt-xen-Introduce-pte_flags.patch
http://darnok.org/results/baseline_pte_flags_pte_attrs/0002-x86-paravirt-xen-Optimize-pte_flags-by-marking-it-as.patch

make the pte_flags function (after bootup and patching with alternative asm)
look as so:

   48 89 f8             	mov    %rdi,%rax
   66 66 66 90          	data32 data32 xchg %ax,%ax

[the 66 66 .. is 'nop']. Looks good right? Well, it does work very well on Intel
(used an i3 2100), but on AMD A8-3850 it hits a performance wall - that I found out
is a result of CONFIG_FUNCTION_TRACER (too many nops??) being compiled in (but the tracer
is set to the default 'nop'). If I disable that specific config option the numbers
are the same as the baseline (with CONFIG_FUNCTION_TRACER disabled) on the AMD box.
Interestingly enough I only see these on AMD machines - not on the Intel ones.

The benchmarks (kernbench 5 times) were carried out using a Fedora Core 16 .config in
single-mode bootup with the swap disk enabled.

I ran some benchmarks (kernbench) under an i3 2100 and AMD A8-3850 and
the results are in:

https://docs.google.com/spreadsheet/ccc?key=0Aj5ukWh4htwMdHBOaUJzckRrNTdtN0FZcm5pLXNCbWc
and
https://docs.google.com/spreadsheet/ccc?key=0Aj5ukWh4htwMdDJLRTZGbDB0S0Eta0FHd0lkZThOM0E

respectively (all aggregated with raw data and pictures is in
http://darnok.org/results/baseline_pte_flags_pte_attrs/)

The interesting thing is that on Intel i3 2100 it does not matter which solution
is chosen - just to make sure I also ran it on 32 logical CPU SandyBridge beast and it just
breezed through this with no trouble. On AMD A8-3850 it stumbles when using the non-posted
version - wrapping the pte_flags with paravirt everywhere. Even if the code is optimized so
that it ends up looking as so:

        pushq   %rax,%rdi	[the pte]
        mov     %rdi,%rax       [the paravirt ident patching - used to be callq]
        nop
        nop
        nop
        nop
	[and here the check for !pgprot_val(cpa->mask_clr) happens]
        movq    16(%rbx), %rdi          [cpa->mask_clr -> rdi]
        notq    %rdi
        andq    %rax, %rdi

vs where there was none of this (that is a virgin 3.2-rc4):

        ...
	[and here the check for !pgprot_val(cpa->mask_clr) happens]
        movq    16(%rbx), %rdi          [cpa->mask_clr -> rdi]
        notq    %rdi
        andq    %rax, %rdi

[The full .S annotate outputs, are at http://darnok.org/results/baseline_pte_flags_pte_attrs/]

So on AMD if I use the posted patch  - where I wrap the pte_flags in just one specific place
(__page_change_attr_set_clr) - the degradation disappears.

I called that case the pte_attrs in the benchmarks numbers. It could be that the
AMD degradation is due to insertion of improper nops in the 'callq *pv_mmu_ops+104' space
but not sure. [edit: That was my original thinking but after some more runs it looks
to be due to whatever CONFIG_FUNCTION_TRACER introduces]

The posted patch (pte_attrs), introduces three extra operations
(http://darnok.org/results/baseline_pte_flags_pte_attrs/baseline-vs-pte_attrs.diff)

        call    pte_val         [this calls mcount and does the pvops call (which gets patched over)]
        movq    %r14, %rdi      [used for pte_pfn call right after the next movq]
        movq    %rax, -144(%rbp)[save our new pte flag value on stack]

        [.. call pte_pfn - which the original code did as well]

Performance wise it was on par (on in one case faster?) than baseline (v3.2-rc4 virgin).
This is on Intel and AMD.

Naturally if the "# CONFIG_PARAVIRT is not set", the assembler code is _exactly_ the same
as before this patch (no surprise since it ends up becoming native_pte_flags)
and the performance is on par.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations.
  2012-02-10 15:34 [PATCH] x86 fixes for 3.3 impacting distros (v1) Konrad Rzeszutek Wilk
@ 2012-02-10 15:34 ` Konrad Rzeszutek Wilk
  2012-02-21  1:01 ` [PATCH] x86 fixes for 3.3 impacting distros (v1) Steven Rostedt
  2012-02-22  2:53 ` Jason Garrett-Glaser
  2 siblings, 0 replies; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-02-10 15:34 UTC (permalink / raw)
  To: rostedt, tglx, mingo, hpa, x86, linux-kernel
  Cc: xen-devel, Konrad Rzeszutek Wilk, stable

When using the paravirt interface, most of the page operations are wrapped
in the pvops interface. The one that is not is the pte_flags. The reason
being that for most cases, the "raw" PTE flag values for baremetal and whatever
pvops platform is running (in this case) - share the same bit meaning.

Except for PAT. Under Linux, the PAT MSR is written to be:

          PAT4                 PAT0
+---+----+----+----+-----+----+----+
 WC | WC | WB | UC | UC- | WC | WB |  <= Linux
+---+----+----+----+-----+----+----+
 WC | WT | WB | UC | UC- | WT | WB |  <= BIOS
+---+----+----+----+-----+----+----+
 WC | WP | WC | UC | UC- | WT | WB |  <= Xen
+---+----+----+----+-----+----+----+

The lookup of this index table translates to looking up
Bit 7, Bit 4, and Bit 3 of PTE:

 PAT/PSE (bit 7) ... PCD (bit 4) .. PWT (bit 3).

If all bits are off, then we are using PAT0. If bit 3 turned on,
then we are using PAT1, if bit 3 and bit 4, then PAT2..

Back to the PAT MSR table:

As you can see, the PAT1 translates to PAT4 under Xen. Under Linux
we only use PAT0, PAT1, and PAT2 for the caching as:

 WB = none (so PAT0)
 WC = PWT (bit 3 on)
 UC = PWT | PCD (bit 3 and 4 are on).

But to make it work with Xen, we end up doing for WC a translation:

 PWT (so bit 3 on) --> PAT (so bit 7 is on) and clear bit 3

And to translate back (when the paravirt pte_val is used) we would:

 PAT (bit 7 on) --> PWT (bit 3 on) and clear bit 7.

This works quite well, except if code uses the pte_flags, as pte_flags
reads the raw value and does not go through the paravirt. Which means
that if (when running under Xen):

 1) we allocate some pages.
 2) call set_pages_array_wc, which ends up calling:
     __page_change_att_set_clr(.., __pgprot(__PAGE_WC),  /* set */
                                 , __pgprot(__PAGE_MASK), /* clear */
    which ends up reading the _raw_ PTE flags and _only_ look at the
    _PTE_FLAG_MASK contents with __PAGE_MASK cleared (0x18) and
    __PAGE_WC (0x8) set.

     read raw *pte -> 0x67
     *pte = 0x67 & ^0x18 | 0x8
     *pte = 0x67 & 0xfffffe7 | 0x8
     *pte = 0x6f

   [now set_pte_atomic is called, and 0x6f is written in, but under
    xen_make_pte, the bit 3 is translated to bit 7, so it ends up
    writting 0xa7, which is correct]

 3) do something to them.
 4) call set_pages_array_wb
     __page_change_att_set_clr(.., __pgprot(__PAGE_WB),  /* set */
                                 , __pgprot(__PAGE_MASK), /* clear */
    which ends up reading the _raw_ PTE and _only_ look at the
    _PTE_FLAG_MASK contents with _PAGE_MASK cleared (0x18) and
    __PAGE_WB (0x0) set:

     read raw *pte -> 0xa7
     *pte = 0xa7 & &0x18 | 0
     *pte = 0xa7 & 0xfffffe7 | 0
     *pte = 0xa7

   [we check whether the old PTE is different from the new one

    if (pte_val(old_pte) != pte_val(new_pte)) {
        set_pte_atomic(kpte, new_pte);
        ...

   and find out that 0xA7 == 0xA7 so we do not write the new PTE value in]

   End result is that we failed at removing the WC caching bit!

 5) free them.
   [and have pages with PAT4 (bit 7) set, so other subsystems end up using
    the pages that have the write combined bit set resulting in crashes. Yikes!].

The fix, which this patch proposes, is to wrap the pte_pgprot in the CPA
code with newly introduced pte_attrs which can go through the pvops interface
to get the "emulated" value instead of the raw. Naturally if CONFIG_PARAVIRT is
not set, it would end calling native_pte_val.

The other way to fix this is by wrapping pte_flags and go through the pvops
interface and it really is the Right Thing to do.  The problem is, that past
experience with mprotect stuff demonstrates that it be really expensive in inner
loops, and pte_flags() is used in some very perf-critical areas.

Example code to run this and see the various mysterious subsystems/applications
crashing

MODULE_AUTHOR("Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>");
MODULE_DESCRIPTION("wb_to_wc_and_back");
MODULE_LICENSE("GPL");
MODULE_VERSION(WB_TO_WC);

static int thread(void *arg)
{
	struct page *a[MAX_PAGES];
	unsigned int i, j;
	do {
		for (j = 0, i = 0;i < MAX_PAGES; i++, j++) {
			a[i] = alloc_page(GFP_KERNEL);
			if (!a[i])
				break;
		}
		set_pages_array_wc(a, j);
		set_current_state(TASK_INTERRUPTIBLE);
		schedule_timeout_interruptible(HZ);
		for (i = 0; i < j; i++) {
			unsigned long *addr = page_address(a[i]);
			if (addr) {
				memset(addr, 0xc2, PAGE_SIZE);
			}
		}
		set_pages_array_wb(a, j);
		for (i = 0; i< MAX_PAGES; i++) {
			if (a[i])
				__free_page(a[i]);
			a[i] = NULL;
		}
	} while (!kthread_should_stop());
	return 0;
}
static struct task_struct *t;
static int __init wb_to_wc_init(void)
{
	t = kthread_run(thread, NULL, "wb_to_wc_and_back");
	return 0;
}
static void __exit wb_to_wc_exit(void)
{
	if (t)
		kthread_stop(t);
}
module_init(wb_to_wc_init);
module_exit(wb_to_wc_exit);

This fixes RH BZ #742032, #787403, and #745574
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Tom Goetz <tom.goetz@virtualcomputer.com>
CC: stable@kernel.org
---
 arch/x86/include/asm/pgtable.h |    5 +++++
 arch/x86/mm/pageattr.c         |    2 +-
 2 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 49afb3f..fa7bd2c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -349,6 +349,11 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 	return __pgprot(preservebits | addbits);
 }
 
+static inline pgprot_t pte_attrs(pte_t pte)
+{
+	return __pgprot(pte_val(pte) & PTE_FLAGS_MASK);
+}
+
 #define pte_pgprot(x) __pgprot(pte_flags(x) & PTE_FLAGS_MASK)
 
 #define canon_pgprot(p) __pgprot(massage_pgprot(p))
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index e1ebde3..1ae1b4b 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -651,7 +651,7 @@ repeat:
 
 	if (level == PG_LEVEL_4K) {
 		pte_t new_pte;
-		pgprot_t new_prot = pte_pgprot(old_pte);
+		pgprot_t new_prot = pte_attrs(old_pte);
 		unsigned long pfn = pte_pfn(old_pte);
 
 		pgprot_val(new_prot) &= ~pgprot_val(cpa->mask_clr);
-- 
1.7.7.5


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] x86 fixes for 3.3 impacting distros (v1).
  2012-02-10 15:34 [PATCH] x86 fixes for 3.3 impacting distros (v1) Konrad Rzeszutek Wilk
  2012-02-10 15:34 ` [PATCH] x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations Konrad Rzeszutek Wilk
@ 2012-02-21  1:01 ` Steven Rostedt
  2012-02-21  1:38   ` H. Peter Anvin
  2012-02-21  3:32   ` Konrad Rzeszutek Wilk
  2012-02-22  2:53 ` Jason Garrett-Glaser
  2 siblings, 2 replies; 15+ messages in thread
From: Steven Rostedt @ 2012-02-21  1:01 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: tglx, mingo, hpa, x86, linux-kernel, xen-devel

On Fri, 2012-02-10 at 10:34 -0500, Konrad Rzeszutek Wilk wrote:
>    66 66 66 90          	data32 data32 xchg %ax,%ax
> 
> [the 66 66 .. is 'nop']. Looks good right? Well, it does work very well on Intel
> (used an i3 2100), but on AMD A8-3850 it hits a performance wall - that I found out
> is a result of CONFIG_FUNCTION_TRACER (too many nops??) being compiled in (but the tracer
> is set to the default 'nop'). If I disable that specific config option the numbers
> are the same as the baseline (with CONFIG_FUNCTION_TRACER disabled) on the AMD box.
> Interestingly enough I only see these on AMD machines - not on the Intel ones.

All paravirt ops should be labeled with "notrace" so that function
tracer does not trace those functions. Have you annotated your new
paravirt ops with notrace?

-- Steve



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] x86 fixes for 3.3 impacting distros (v1).
  2012-02-21  1:01 ` [PATCH] x86 fixes for 3.3 impacting distros (v1) Steven Rostedt
@ 2012-02-21  1:38   ` H. Peter Anvin
  2012-02-21  3:32   ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 15+ messages in thread
From: H. Peter Anvin @ 2012-02-21  1:38 UTC (permalink / raw)
  To: Steven Rostedt, Konrad Rzeszutek Wilk
  Cc: tglx, mingo, x86, linux-kernel, xen-devel

Odd... given that that is the NOP recommended by AMD.

Steven Rostedt <rostedt@goodmis.org> wrote:

>On Fri, 2012-02-10 at 10:34 -0500, Konrad Rzeszutek Wilk wrote:
>>    66 66 66 90          	data32 data32 xchg %ax,%ax
>> 
>> [the 66 66 .. is 'nop']. Looks good right? Well, it does work very
>well on Intel
>> (used an i3 2100), but on AMD A8-3850 it hits a performance wall -
>that I found out
>> is a result of CONFIG_FUNCTION_TRACER (too many nops??) being
>compiled in (but the tracer
>> is set to the default 'nop'). If I disable that specific config
>option the numbers
>> are the same as the baseline (with CONFIG_FUNCTION_TRACER disabled)
>on the AMD box.
>> Interestingly enough I only see these on AMD machines - not on the
>Intel ones.
>
>All paravirt ops should be labeled with "notrace" so that function
>tracer does not trace those functions. Have you annotated your new
>paravirt ops with notrace?
>
>-- Steve

-- 
Sent from my mobile phone. Please excuse my brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] x86 fixes for 3.3 impacting distros (v1).
  2012-02-21  1:01 ` [PATCH] x86 fixes for 3.3 impacting distros (v1) Steven Rostedt
  2012-02-21  1:38   ` H. Peter Anvin
@ 2012-02-21  3:32   ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-02-21  3:32 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: tglx, mingo, hpa, x86, linux-kernel, xen-devel

On Mon, Feb 20, 2012 at 08:01:43PM -0500, Steven Rostedt wrote:
> On Fri, 2012-02-10 at 10:34 -0500, Konrad Rzeszutek Wilk wrote:
> >    66 66 66 90          	data32 data32 xchg %ax,%ax
> > 
> > [the 66 66 .. is 'nop']. Looks good right? Well, it does work very well on Intel
> > (used an i3 2100), but on AMD A8-3850 it hits a performance wall - that I found out
> > is a result of CONFIG_FUNCTION_TRACER (too many nops??) being compiled in (but the tracer
> > is set to the default 'nop'). If I disable that specific config option the numbers
> > are the same as the baseline (with CONFIG_FUNCTION_TRACER disabled) on the AMD box.
> > Interestingly enough I only see these on AMD machines - not on the Intel ones.
> 
> All paravirt ops should be labeled with "notrace" so that function
> tracer does not trace those functions. Have you annotated your new
> paravirt ops with notrace?

No. I hadn't realized that flag existed until your email a couple of days ago - I hadn't
had a chance to see if the notrace would solve this. But let me do that and get
back on this.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] x86 fixes for 3.3 impacting distros (v1).
  2012-02-10 15:34 [PATCH] x86 fixes for 3.3 impacting distros (v1) Konrad Rzeszutek Wilk
  2012-02-10 15:34 ` [PATCH] x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations Konrad Rzeszutek Wilk
  2012-02-21  1:01 ` [PATCH] x86 fixes for 3.3 impacting distros (v1) Steven Rostedt
@ 2012-02-22  2:53 ` Jason Garrett-Glaser
  2012-05-10 15:34   ` [Xen-devel] " Konrad Rzeszutek Wilk
  2 siblings, 1 reply; 15+ messages in thread
From: Jason Garrett-Glaser @ 2012-02-22  2:53 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: rostedt, tglx, mingo, hpa, x86, linux-kernel, xen-devel

On Fri, Feb 10, 2012 at 7:34 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> The attached patch fixes RH BZ #742032, #787403, and #745574
> and touch x86 subsystem.
>
> The patch description gives a very good overview of the problem and
> one solution. The one solution it chooses is not the most architecturally
> sound but it does not cause performance degradation. If this your
> first time reading this, please read the patch first and then come back to
> this cover letter as I've some perf numbers and more detailed explanation here.
>
> A bit of overview of the __page_change_attr_set_clr:
>
> Its purpose is to change page attributes from one type to another.
> It is important to understand that the entrance that code:
> __page_change_attr_set_clr is guarded by cpa_lock spin-lock - which makes
> that whole code be single threaded.
>
> Albeit it seems that if debug mode is turned on, it can run in parallel. The
> effect of using the posted patch is that __page_change_attr_set_clr() will be
> affected when we change caching attributes on 4KB pages and/or the NX flag.
>
> The execution of __page_change_attr_set_clr is concentrated in
> (looked for ioremap_* and set_pages_*):
>  - during bootup ("Write protecting the ..")
>  - suspend/resume and graphic adapters evicting their buffers from the card
>   to RAM (which is usually done during suspend but can be done via the
>   'evict' attribute in debugfs)
>  - when setting the memory for the cursor (AGP cards using i8xx chipset) -
>   done during bootup and startup of Xserver.
>  - setting up memory for Intel GTT scratch (i9xx) page (done during bootup)
>  - payload (purgatory code) for kexec (done during kexec -l).
>  - ioremap_* during PCI devices load - InfiniBand and video cards like to use
>   ioremap_wc.
>  - Intel, radeon, nouveau running into memory pressure and evicting pages from
>   their GEM/TTM pool (once an hour or so if compiling a lot with only 4GB).
>
> These are the cases I found when running on baremetal (and Xen) using a normal
> Fedora Core 16 distro.
>
> The alternate solution to the problem I am trying to solve, which is much
> more architecturally sound (but has some perf disadvantages) is to wrap
> the pte_flags with paravirt call everywhere. For that these patches two patches:
> http://darnok.org/results/baseline_pte_flags_pte_attrs/0001-x86-paravirt-xen-Introduce-pte_flags.patch
> http://darnok.org/results/baseline_pte_flags_pte_attrs/0002-x86-paravirt-xen-Optimize-pte_flags-by-marking-it-as.patch
>
> make the pte_flags function (after bootup and patching with alternative asm)
> look as so:
>
>   48 89 f8                     mov    %rdi,%rax
>   66 66 66 90                  data32 data32 xchg %ax,%ax
>
> [the 66 66 .. is 'nop']. Looks good right? Well, it does work very well on Intel
> (used an i3 2100), but on AMD A8-3850 it hits a performance wall - that I found out
> is a result of CONFIG_FUNCTION_TRACER (too many nops??) being compiled in (but the tracer
> is set to the default 'nop'). If I disable that specific config option the numbers
> are the same as the baseline (with CONFIG_FUNCTION_TRACER disabled) on the AMD box.
> Interestingly enough I only see these on AMD machines - not on the Intel ones.

The AMD software optimization manual says that -- at least on some
chips -- too many prefixes forces the instruction decoder into a slow
mode (basically microcoded) where it takes literally dozens of cycles
for a single instruction.  I believe more than 2 prefixes will do
this; check the manual itself for specifics.

Jason

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Xen-devel] [PATCH] x86 fixes for 3.3 impacting distros (v1).
  2012-02-22  2:53 ` Jason Garrett-Glaser
@ 2012-05-10 15:34   ` Konrad Rzeszutek Wilk
  2012-06-27 23:17     ` Cyclonus J
  0 siblings, 1 reply; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-05-10 15:34 UTC (permalink / raw)
  To: Jason Garrett-Glaser
  Cc: xen-devel, x86, linux-kernel, rostedt, mingo, hpa, tglx

On Tue, Feb 21, 2012 at 06:53:35PM -0800, Jason Garrett-Glaser wrote:
> On Fri, Feb 10, 2012 at 7:34 AM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> > The attached patch fixes RH BZ #742032, #787403, and #745574
> > and touch x86 subsystem.
> >
> > The patch description gives a very good overview of the problem and
> > one solution. The one solution it chooses is not the most architecturally
> > sound but it does not cause performance degradation. If this your
> > first time reading this, please read the patch first and then come back to
> > this cover letter as I've some perf numbers and more detailed explanation here.
> >
> > A bit of overview of the __page_change_attr_set_clr:
> >
> > Its purpose is to change page attributes from one type to another.
> > It is important to understand that the entrance that code:
> > __page_change_attr_set_clr is guarded by cpa_lock spin-lock - which makes
> > that whole code be single threaded.
> >
> > Albeit it seems that if debug mode is turned on, it can run in parallel. The
> > effect of using the posted patch is that __page_change_attr_set_clr() will be
> > affected when we change caching attributes on 4KB pages and/or the NX flag.
> >
> > The execution of __page_change_attr_set_clr is concentrated in
> > (looked for ioremap_* and set_pages_*):
> >  - during bootup ("Write protecting the ..")
> >  - suspend/resume and graphic adapters evicting their buffers from the card
> >   to RAM (which is usually done during suspend but can be done via the
> >   'evict' attribute in debugfs)
> >  - when setting the memory for the cursor (AGP cards using i8xx chipset) -
> >   done during bootup and startup of Xserver.
> >  - setting up memory for Intel GTT scratch (i9xx) page (done during bootup)
> >  - payload (purgatory code) for kexec (done during kexec -l).
> >  - ioremap_* during PCI devices load - InfiniBand and video cards like to use
> >   ioremap_wc.
> >  - Intel, radeon, nouveau running into memory pressure and evicting pages from
> >   their GEM/TTM pool (once an hour or so if compiling a lot with only 4GB).
> >
> > These are the cases I found when running on baremetal (and Xen) using a normal
> > Fedora Core 16 distro.
> >
> > The alternate solution to the problem I am trying to solve, which is much
> > more architecturally sound (but has some perf disadvantages) is to wrap
> > the pte_flags with paravirt call everywhere. For that these patches two patches:
> > http://darnok.org/results/baseline_pte_flags_pte_attrs/0001-x86-paravirt-xen-Introduce-pte_flags.patch
> > http://darnok.org/results/baseline_pte_flags_pte_attrs/0002-x86-paravirt-xen-Optimize-pte_flags-by-marking-it-as.patch
> >
> > make the pte_flags function (after bootup and patching with alternative asm)
> > look as so:
> >
> >   48 89 f8                     mov    %rdi,%rax
> >   66 66 66 90                  data32 data32 xchg %ax,%ax
> >
> > [the 66 66 .. is 'nop']. Looks good right? Well, it does work very well on Intel
> > (used an i3 2100), but on AMD A8-3850 it hits a performance wall - that I found out
> > is a result of CONFIG_FUNCTION_TRACER (too many nops??) being compiled in (but the tracer
> > is set to the default 'nop'). If I disable that specific config option the numbers
> > are the same as the baseline (with CONFIG_FUNCTION_TRACER disabled) on the AMD box.
> > Interestingly enough I only see these on AMD machines - not on the Intel ones.
> 
> The AMD software optimization manual says that -- at least on some
> chips -- too many prefixes forces the instruction decoder into a slow
> mode (basically microcoded) where it takes literally dozens of cycles
> for a single instruction.  I believe more than 2 prefixes will do
> this; check the manual itself for specifics.

I've been digging in to this during my "free" time to figure out what
is going on. Sadly, haven't progressed much besides writting a set of patches
that would add the 'notrace' to the pvops calls and playing with
those patches.

In other words - still not sure what is happening.
> 
> Jason
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Xen-devel] [PATCH] x86 fixes for 3.3 impacting distros (v1).
  2012-05-10 15:34   ` [Xen-devel] " Konrad Rzeszutek Wilk
@ 2012-06-27 23:17     ` Cyclonus J
  2012-06-28 14:28         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 15+ messages in thread
From: Cyclonus J @ 2012-06-27 23:17 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jason Garrett-Glaser, xen-devel, x86, linux-kernel, rostedt,
	mingo, hpa, tglx

On Thu, May 10, 2012 at 11:34:57AM -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Feb 21, 2012 at 06:53:35PM -0800, Jason Garrett-Glaser wrote:
> > On Fri, Feb 10, 2012 at 7:34 AM, Konrad Rzeszutek Wilk
> > <konrad.wilk@oracle.com> wrote:
> > > The attached patch fixes RH BZ #742032, #787403, and #745574
> > > and touch x86 subsystem.
> > >
> > > The patch description gives a very good overview of the problem and
> > > one solution. The one solution it chooses is not the most architecturally
> > > sound but it does not cause performance degradation. If this your
> > > first time reading this, please read the patch first and then come back to
> > > this cover letter as I've some perf numbers and more detailed explanation here.
> > >
> > > A bit of overview of the __page_change_attr_set_clr:
> > >
> > > Its purpose is to change page attributes from one type to another.
> > > It is important to understand that the entrance that code:
> > > __page_change_attr_set_clr is guarded by cpa_lock spin-lock - which makes
> > > that whole code be single threaded.
> > >
> > > Albeit it seems that if debug mode is turned on, it can run in parallel. The
> > > effect of using the posted patch is that __page_change_attr_set_clr() will be
> > > affected when we change caching attributes on 4KB pages and/or the NX flag.
> > >
> > > The execution of __page_change_attr_set_clr is concentrated in
> > > (looked for ioremap_* and set_pages_*):
> > >  - during bootup ("Write protecting the ..")
> > >  - suspend/resume and graphic adapters evicting their buffers from the card
> > >   to RAM (which is usually done during suspend but can be done via the
> > >   'evict' attribute in debugfs)
> > >  - when setting the memory for the cursor (AGP cards using i8xx chipset) -
> > >   done during bootup and startup of Xserver.
> > >  - setting up memory for Intel GTT scratch (i9xx) page (done during bootup)
> > >  - payload (purgatory code) for kexec (done during kexec -l).
> > >  - ioremap_* during PCI devices load - InfiniBand and video cards like to use
> > >   ioremap_wc.
> > >  - Intel, radeon, nouveau running into memory pressure and evicting pages from
> > >   their GEM/TTM pool (once an hour or so if compiling a lot with only 4GB).
> > >
> > > These are the cases I found when running on baremetal (and Xen) using a normal
> > > Fedora Core 16 distro.
> > >
> > > The alternate solution to the problem I am trying to solve, which is much
> > > more architecturally sound (but has some perf disadvantages) is to wrap
> > > the pte_flags with paravirt call everywhere. For that these patches two patches:
> > > http://darnok.org/results/baseline_pte_flags_pte_attrs/0001-x86-paravirt-xen-Introduce-pte_flags.patch
> > > http://darnok.org/results/baseline_pte_flags_pte_attrs/0002-x86-paravirt-xen-Optimize-pte_flags-by-marking-it-as.patch
> > >
> > > make the pte_flags function (after bootup and patching with alternative asm)
> > > look as so:
> > >
> > >   48 89 f8                     mov    %rdi,%rax
> > >   66 66 66 90                  data32 data32 xchg %ax,%ax
> > >
> > > [the 66 66 .. is 'nop']. Looks good right? Well, it does work very well on Intel
> > > (used an i3 2100), but on AMD A8-3850 it hits a performance wall - that I found out
> > > is a result of CONFIG_FUNCTION_TRACER (too many nops??) being compiled in (but the tracer
> > > is set to the default 'nop'). If I disable that specific config option the numbers
> > > are the same as the baseline (with CONFIG_FUNCTION_TRACER disabled) on the AMD box.
> > > Interestingly enough I only see these on AMD machines - not on the Intel ones.
> > 
> > The AMD software optimization manual says that -- at least on some
> > chips -- too many prefixes forces the instruction decoder into a slow
> > mode (basically microcoded) where it takes literally dozens of cycles
> > for a single instruction.  I believe more than 2 prefixes will do
> > this; check the manual itself for specifics.
> 
> I've been digging in to this during my "free" time to figure out what
> is going on. Sadly, haven't progressed much besides writting a set of patches
> that would add the 'notrace' to the pvops calls and playing with
> those patches.
> 
> In other words - still not sure what is happening.

I would like to spend some time looking into this issue as it blocks my
project in some cases.

I saw a temporary fix 8eaffa67b43e99ae581622c5133e20b0f48bcef1 went into 3.3 to disable
PAT support on dom0. May I ask what would be the recommended fix to support PAT on dom0? 
Will it be a possible solution to make Xen use the same PAT settings as Linux? Sorry, I don't 
know the background why Xen is doing this differently.

Suggestions?

Thanks,
CJ

> > 
> > Jason
> > 
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > http://lists.xen.org/xen-devel
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Xen-devel] [PATCH] x86 fixes for 3.3 impacting distros (v1).
  2012-06-27 23:17     ` Cyclonus J
@ 2012-06-28 14:28         ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-06-28 14:28 UTC (permalink / raw)
  To: cyclonusj, marmarek, hpa
  Cc: xen-devel, x86, Jason Garrett-Glaser, linux-kernel, rostedt,
	mingo, hpa, tglx

On Wed, Jun 27, 2012 at 04:17:56PM -0700, Cyclonus J wrote:
> On Thu, May 10, 2012 at 11:34:57AM -0400, Konrad Rzeszutek Wilk wrote:
> > On Tue, Feb 21, 2012 at 06:53:35PM -0800, Jason Garrett-Glaser wrote:
> > > On Fri, Feb 10, 2012 at 7:34 AM, Konrad Rzeszutek Wilk
> > > <konrad.wilk@oracle.com> wrote:
> > > > The attached patch fixes RH BZ #742032, #787403, and #745574
> > > > and touch x86 subsystem.
> > > >
> > > > The patch description gives a very good overview of the problem and
> > > > one solution. The one solution it chooses is not the most architecturally
> > > > sound but it does not cause performance degradation. If this your
> > > > first time reading this, please read the patch first and then come back to
> > > > this cover letter as I've some perf numbers and more detailed explanation here.
> > > >
> > > > A bit of overview of the __page_change_attr_set_clr:
> > > >
> > > > Its purpose is to change page attributes from one type to another.
> > > > It is important to understand that the entrance that code:
> > > > __page_change_attr_set_clr is guarded by cpa_lock spin-lock - which makes
> > > > that whole code be single threaded.
> > > >
> > > > Albeit it seems that if debug mode is turned on, it can run in parallel. The
> > > > effect of using the posted patch is that __page_change_attr_set_clr() will be
> > > > affected when we change caching attributes on 4KB pages and/or the NX flag.
> > > >
> > > > The execution of __page_change_attr_set_clr is concentrated in
> > > > (looked for ioremap_* and set_pages_*):
> > > >  - during bootup ("Write protecting the ..")
> > > >  - suspend/resume and graphic adapters evicting their buffers from the card
> > > >   to RAM (which is usually done during suspend but can be done via the
> > > >   'evict' attribute in debugfs)
> > > >  - when setting the memory for the cursor (AGP cards using i8xx chipset) -
> > > >   done during bootup and startup of Xserver.
> > > >  - setting up memory for Intel GTT scratch (i9xx) page (done during bootup)
> > > >  - payload (purgatory code) for kexec (done during kexec -l).
> > > >  - ioremap_* during PCI devices load - InfiniBand and video cards like to use
> > > >   ioremap_wc.
> > > >  - Intel, radeon, nouveau running into memory pressure and evicting pages from
> > > >   their GEM/TTM pool (once an hour or so if compiling a lot with only 4GB).
> > > >
> > > > These are the cases I found when running on baremetal (and Xen) using a normal
> > > > Fedora Core 16 distro.
> > > >
> > > > The alternate solution to the problem I am trying to solve, which is much
> > > > more architecturally sound (but has some perf disadvantages) is to wrap
> > > > the pte_flags with paravirt call everywhere. For that these patches two patches:
> > > > http://darnok.org/results/baseline_pte_flags_pte_attrs/0001-x86-paravirt-xen-Introduce-pte_flags.patch
> > > > http://darnok.org/results/baseline_pte_flags_pte_attrs/0002-x86-paravirt-xen-Optimize-pte_flags-by-marking-it-as.patch
> > > >
> > > > make the pte_flags function (after bootup and patching with alternative asm)
> > > > look as so:
> > > >
> > > >   48 89 f8                     mov    %rdi,%rax
> > > >   66 66 66 90                  data32 data32 xchg %ax,%ax
> > > >
> > > > [the 66 66 .. is 'nop']. Looks good right? Well, it does work very well on Intel
> > > > (used an i3 2100), but on AMD A8-3850 it hits a performance wall - that I found out
> > > > is a result of CONFIG_FUNCTION_TRACER (too many nops??) being compiled in (but the tracer
> > > > is set to the default 'nop'). If I disable that specific config option the numbers
> > > > are the same as the baseline (with CONFIG_FUNCTION_TRACER disabled) on the AMD box.
> > > > Interestingly enough I only see these on AMD machines - not on the Intel ones.
> > > 
> > > The AMD software optimization manual says that -- at least on some
> > > chips -- too many prefixes forces the instruction decoder into a slow
> > > mode (basically microcoded) where it takes literally dozens of cycles
> > > for a single instruction.  I believe more than 2 prefixes will do
> > > this; check the manual itself for specifics.
> > 
> > I've been digging in to this during my "free" time to figure out what
> > is going on. Sadly, haven't progressed much besides writting a set of patches
> > that would add the 'notrace' to the pvops calls and playing with
> > those patches.
> > 
> > In other words - still not sure what is happening.
> 
> I would like to spend some time looking into this issue as it blocks my
> project in some cases.

Ah, somebody else pinged me about this too. Let me CC them here.
> 
> I saw a temporary fix 8eaffa67b43e99ae581622c5133e20b0f48bcef1 went into 3.3 to disable
> PAT support on dom0. May I ask what would be the recommended fix to support PAT on dom0? 
> Will it be a possible solution to make Xen use the same PAT settings as Linux? Sorry, I don't 
> know the background why Xen is doing this differently.
> 
> Suggestions?

You could use those two patches I mentioned in the thread and revert the temporary fix.
That should get PAT support working - but .. that would be for your own tree
and wouldn't be in mainline.

The issues I had - with the CONFIG_FUNCTION_TRACER, I hadn't had a chance to dig
deeper in it. I would recommend for you to tree a simple test-case (I used kernelbench)
to see if the patches cause a regression or not on baremetal.

Make sure to start the kernel in high CPU frequency (meaning the performance governor)
otherwise your results could be skewed.

Peter mentioned to me had some ideas about software PAT table lookup. I am not
exactly sure what he meant by that.

Just to summarize, there were two ways proposed to fix this:

 1). Make __page_change_attr_set_clr use a new wrapper: pte_attr, that calls
     pte_val (pvops call) instead of pte_flag (native). Here is the patch:
     http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=commitdiff;h=4f93aa02acd0e34806d4ac9c3a700bb5d040eab6
     (no perf regressions across all platforms)

 2). Introduce a new pvops call - pte_flags, which would make pte_flags
     (which currently is doing just a bit mask) be pvops-fied.
     http://darnok.org/results/baseline_pte_flags_pte_attrs/0001-x86-paravirt-xen-Introduce-pte_flags.patch
     http://darnok.org/results/baseline_pte_flags_pte_attrs/0002-x86-paravirt-xen-Optimize-pte_flags-by-marking-it-as.patch
     (weird results on AMD, other platforms had no perf degradations)

  3). (not posted), was to do 2), but alter the alternative_asm and instead use asm_goto to
     make the compiler use less registers and hopefully reduce the code:
     http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=shortlog;h=refs/heads/devel/mmu-perf
     But the results I got showed worst performance on baremetal.. which was weird?
     Perhaps it is compiler related - never got to follow up on it.


I also chatted with the core Xen hypervisor folks about adding in the context switch code
to alter the PAT layout - but they were not keen a about it - and I am not sure how much
CPU cycles one loses by doing a wrmsr to the PAT register on every guest context switch
(worst case when on has a pvops kernel and a old-style one - where the WC bit would differ)?



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Xen-devel] [PATCH] x86 fixes for 3.3 impacting distros (v1).
@ 2012-06-28 14:28         ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-06-28 14:28 UTC (permalink / raw)
  To: cyclonusj, marmarek
  Cc: xen-devel, x86, Jason Garrett-Glaser, linux-kernel, rostedt,
	mingo, hpa, tglx

On Wed, Jun 27, 2012 at 04:17:56PM -0700, Cyclonus J wrote:
> On Thu, May 10, 2012 at 11:34:57AM -0400, Konrad Rzeszutek Wilk wrote:
> > On Tue, Feb 21, 2012 at 06:53:35PM -0800, Jason Garrett-Glaser wrote:
> > > On Fri, Feb 10, 2012 at 7:34 AM, Konrad Rzeszutek Wilk
> > > <konrad.wilk@oracle.com> wrote:
> > > > The attached patch fixes RH BZ #742032, #787403, and #745574
> > > > and touch x86 subsystem.
> > > >
> > > > The patch description gives a very good overview of the problem and
> > > > one solution. The one solution it chooses is not the most architecturally
> > > > sound but it does not cause performance degradation. If this your
> > > > first time reading this, please read the patch first and then come back to
> > > > this cover letter as I've some perf numbers and more detailed explanation here.
> > > >
> > > > A bit of overview of the __page_change_attr_set_clr:
> > > >
> > > > Its purpose is to change page attributes from one type to another.
> > > > It is important to understand that the entrance that code:
> > > > __page_change_attr_set_clr is guarded by cpa_lock spin-lock - which makes
> > > > that whole code be single threaded.
> > > >
> > > > Albeit it seems that if debug mode is turned on, it can run in parallel. The
> > > > effect of using the posted patch is that __page_change_attr_set_clr() will be
> > > > affected when we change caching attributes on 4KB pages and/or the NX flag.
> > > >
> > > > The execution of __page_change_attr_set_clr is concentrated in
> > > > (looked for ioremap_* and set_pages_*):
> > > >  - during bootup ("Write protecting the ..")
> > > >  - suspend/resume and graphic adapters evicting their buffers from the card
> > > >   to RAM (which is usually done during suspend but can be done via the
> > > >   'evict' attribute in debugfs)
> > > >  - when setting the memory for the cursor (AGP cards using i8xx chipset) -
> > > >   done during bootup and startup of Xserver.
> > > >  - setting up memory for Intel GTT scratch (i9xx) page (done during bootup)
> > > >  - payload (purgatory code) for kexec (done during kexec -l).
> > > >  - ioremap_* during PCI devices load - InfiniBand and video cards like to use
> > > >   ioremap_wc.
> > > >  - Intel, radeon, nouveau running into memory pressure and evicting pages from
> > > >   their GEM/TTM pool (once an hour or so if compiling a lot with only 4GB).
> > > >
> > > > These are the cases I found when running on baremetal (and Xen) using a normal
> > > > Fedora Core 16 distro.
> > > >
> > > > The alternate solution to the problem I am trying to solve, which is much
> > > > more architecturally sound (but has some perf disadvantages) is to wrap
> > > > the pte_flags with paravirt call everywhere. For that these patches two patches:
> > > > http://darnok.org/results/baseline_pte_flags_pte_attrs/0001-x86-paravirt-xen-Introduce-pte_flags.patch
> > > > http://darnok.org/results/baseline_pte_flags_pte_attrs/0002-x86-paravirt-xen-Optimize-pte_flags-by-marking-it-as.patch
> > > >
> > > > make the pte_flags function (after bootup and patching with alternative asm)
> > > > look as so:
> > > >
> > > >   48 89 f8                     mov    %rdi,%rax
> > > >   66 66 66 90                  data32 data32 xchg %ax,%ax
> > > >
> > > > [the 66 66 .. is 'nop']. Looks good right? Well, it does work very well on Intel
> > > > (used an i3 2100), but on AMD A8-3850 it hits a performance wall - that I found out
> > > > is a result of CONFIG_FUNCTION_TRACER (too many nops??) being compiled in (but the tracer
> > > > is set to the default 'nop'). If I disable that specific config option the numbers
> > > > are the same as the baseline (with CONFIG_FUNCTION_TRACER disabled) on the AMD box.
> > > > Interestingly enough I only see these on AMD machines - not on the Intel ones.
> > > 
> > > The AMD software optimization manual says that -- at least on some
> > > chips -- too many prefixes forces the instruction decoder into a slow
> > > mode (basically microcoded) where it takes literally dozens of cycles
> > > for a single instruction.  I believe more than 2 prefixes will do
> > > this; check the manual itself for specifics.
> > 
> > I've been digging in to this during my "free" time to figure out what
> > is going on. Sadly, haven't progressed much besides writting a set of patches
> > that would add the 'notrace' to the pvops calls and playing with
> > those patches.
> > 
> > In other words - still not sure what is happening.
> 
> I would like to spend some time looking into this issue as it blocks my
> project in some cases.

Ah, somebody else pinged me about this too. Let me CC them here.
> 
> I saw a temporary fix 8eaffa67b43e99ae581622c5133e20b0f48bcef1 went into 3.3 to disable
> PAT support on dom0. May I ask what would be the recommended fix to support PAT on dom0? 
> Will it be a possible solution to make Xen use the same PAT settings as Linux? Sorry, I don't 
> know the background why Xen is doing this differently.
> 
> Suggestions?

You could use those two patches I mentioned in the thread and revert the temporary fix.
That should get PAT support working - but .. that would be for your own tree
and wouldn't be in mainline.

The issues I had - with the CONFIG_FUNCTION_TRACER, I hadn't had a chance to dig
deeper in it. I would recommend for you to tree a simple test-case (I used kernelbench)
to see if the patches cause a regression or not on baremetal.

Make sure to start the kernel in high CPU frequency (meaning the performance governor)
otherwise your results could be skewed.

Peter mentioned to me had some ideas about software PAT table lookup. I am not
exactly sure what he meant by that.

Just to summarize, there were two ways proposed to fix this:

 1). Make __page_change_attr_set_clr use a new wrapper: pte_attr, that calls
     pte_val (pvops call) instead of pte_flag (native). Here is the patch:
     http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=commitdiff;h=4f93aa02acd0e34806d4ac9c3a700bb5d040eab6
     (no perf regressions across all platforms)

 2). Introduce a new pvops call - pte_flags, which would make pte_flags
     (which currently is doing just a bit mask) be pvops-fied.
     http://darnok.org/results/baseline_pte_flags_pte_attrs/0001-x86-paravirt-xen-Introduce-pte_flags.patch
     http://darnok.org/results/baseline_pte_flags_pte_attrs/0002-x86-paravirt-xen-Optimize-pte_flags-by-marking-it-as.patch
     (weird results on AMD, other platforms had no perf degradations)

  3). (not posted), was to do 2), but alter the alternative_asm and instead use asm_goto to
     make the compiler use less registers and hopefully reduce the code:
     http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=shortlog;h=refs/heads/devel/mmu-perf
     But the results I got showed worst performance on baremetal.. which was weird?
     Perhaps it is compiler related - never got to follow up on it.


I also chatted with the core Xen hypervisor folks about adding in the context switch code
to alter the PAT layout - but they were not keen a about it - and I am not sure how much
CPU cycles one loses by doing a wrmsr to the PAT register on every guest context switch
(worst case when on has a pvops kernel and a old-style one - where the WC bit would differ)?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Xen-devel] [PATCH] x86 fixes for 3.3 impacting distros (v1).
  2012-06-28 14:28         ` Konrad Rzeszutek Wilk
  (?)
@ 2012-06-28 14:42         ` H. Peter Anvin
  2012-06-28 15:38             ` Jan Beulich
  2012-06-29 21:52           ` Cyclonus J
  -1 siblings, 2 replies; 15+ messages in thread
From: H. Peter Anvin @ 2012-06-28 14:42 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: cyclonusj, marmarek, xen-devel, x86, Jason Garrett-Glaser,
	linux-kernel, rostedt, mingo, tglx, Siddha, Suresh B

On 06/28/2012 07:28 AM, Konrad Rzeszutek Wilk wrote:
> 
> Peter mentioned to me had some ideas about software PAT table lookup. I am not
> exactly sure what he meant by that.
> 

I could see the kernel have programmable PAT values rather than fixed if
and only if it can be showed to have no measurable performance impact.

> Just to summarize, there were two ways proposed to fix this:
> 
>  1). Make __page_change_attr_set_clr use a new wrapper: pte_attr, that calls
>      pte_val (pvops call) instead of pte_flag (native). Here is the patch:
>      http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=commitdiff;h=4f93aa02acd0e34806d4ac9c3a700bb5d040eab6
>      (no perf regressions across all platforms)
> 
>  2). Introduce a new pvops call - pte_flags, which would make pte_flags
>      (which currently is doing just a bit mask) be pvops-fied.
>      http://darnok.org/results/baseline_pte_flags_pte_attrs/0001-x86-paravirt-xen-Introduce-pte_flags.patch
>      http://darnok.org/results/baseline_pte_flags_pte_attrs/0002-x86-paravirt-xen-Optimize-pte_flags-by-marking-it-as.patch
>      (weird results on AMD, other platforms had no perf degradations)
> 
>   3). (not posted), was to do 2), but alter the alternative_asm and instead use asm_goto to
>      make the compiler use less registers and hopefully reduce the code:
>      http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=shortlog;h=refs/heads/devel/mmu-perf
>      But the results I got showed worst performance on baremetal.. which was weird?
>      Perhaps it is compiler related - never got to follow up on it.
> 

OK, let me be blunt: I will unconditionally veto any of these.

> 
> I also chatted with the core Xen hypervisor folks about adding in the context switch code
> to alter the PAT layout - but they were not keen a about it - and I am not sure how much
> CPU cycles one loses by doing a wrmsr to the PAT register on every guest context switch
> (worst case when on has a pvops kernel and a old-style one - where the WC bit would differ)?
> 

And you're comparing that to a bunch of new pvops calls?  The discussion
shouldn't even have started until you had ruled out this solution and
had data to show it.

	-hpa

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Xen-devel] [PATCH] x86 fixes for 3.3 impacting distros (v1).
  2012-06-28 14:42         ` H. Peter Anvin
@ 2012-06-28 15:38             ` Jan Beulich
  2012-06-29 21:52           ` Cyclonus J
  1 sibling, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2012-06-28 15:38 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, H. Peter Anvin
  Cc: cyclonusj, rostedt, Suresh B Siddha, marmarek, x86, tglx,
	xen-devel, mingo, linux-kernel, Jason Garrett-Glaser

>>> On 28.06.12 at 16:42, "H. Peter Anvin" <hpa@zytor.com> wrote:
> On 06/28/2012 07:28 AM, Konrad Rzeszutek Wilk wrote:
>> I also chatted with the core Xen hypervisor folks about adding in the 
> context switch code
>> to alter the PAT layout - but they were not keen a about it - and I am not 
> sure how much
>> CPU cycles one loses by doing a wrmsr to the PAT register on every guest 
> context switch
>> (worst case when on has a pvops kernel and a old-style one - where the WC bit 
> would differ)?
>> 
> 
> And you're comparing that to a bunch of new pvops calls?  The discussion
> shouldn't even have started until you had ruled out this solution and
> had data to show it.

That's definitely not an option: Xen itself may be (and is, under
certain circumstances at least) using WC page table entries, so
we can't allow on-the-fly changes to the meaning of the various
indexes.

Jan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Xen-devel] [PATCH] x86 fixes for 3.3 impacting distros (v1).
@ 2012-06-28 15:38             ` Jan Beulich
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2012-06-28 15:38 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, H. Peter Anvin
  Cc: cyclonusj, rostedt, Suresh B Siddha, marmarek, x86, tglx,
	xen-devel, mingo, linux-kernel, Jason Garrett-Glaser

>>> On 28.06.12 at 16:42, "H. Peter Anvin" <hpa@zytor.com> wrote:
> On 06/28/2012 07:28 AM, Konrad Rzeszutek Wilk wrote:
>> I also chatted with the core Xen hypervisor folks about adding in the 
> context switch code
>> to alter the PAT layout - but they were not keen a about it - and I am not 
> sure how much
>> CPU cycles one loses by doing a wrmsr to the PAT register on every guest 
> context switch
>> (worst case when on has a pvops kernel and a old-style one - where the WC bit 
> would differ)?
>> 
> 
> And you're comparing that to a bunch of new pvops calls?  The discussion
> shouldn't even have started until you had ruled out this solution and
> had data to show it.

That's definitely not an option: Xen itself may be (and is, under
certain circumstances at least) using WC page table entries, so
we can't allow on-the-fly changes to the meaning of the various
indexes.

Jan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Xen-devel] [PATCH] x86 fixes for 3.3 impacting distros (v1).
  2012-06-28 14:42         ` H. Peter Anvin
  2012-06-28 15:38             ` Jan Beulich
@ 2012-06-29 21:52           ` Cyclonus J
  2012-06-29 22:29             ` H. Peter Anvin
  1 sibling, 1 reply; 15+ messages in thread
From: Cyclonus J @ 2012-06-29 21:52 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Konrad Rzeszutek Wilk, marmarek, xen-devel, x86,
	Jason Garrett-Glaser, linux-kernel, rostedt, mingo, tglx, Siddha,
	Suresh B

On Thu, Jun 28, 2012 at 07:42:12AM -0700, H. Peter Anvin wrote:
> On 06/28/2012 07:28 AM, Konrad Rzeszutek Wilk wrote:
> > 
> > Peter mentioned to me had some ideas about software PAT table lookup. I am not
> > exactly sure what he meant by that.
> > 
> 
> I could see the kernel have programmable PAT values rather than fixed if
> and only if it can be showed to have no measurable performance impact.
> 
> > Just to summarize, there were two ways proposed to fix this:
> > 
> >  1). Make __page_change_attr_set_clr use a new wrapper: pte_attr, that calls
> >      pte_val (pvops call) instead of pte_flag (native). Here is the patch:
> >      http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=commitdiff;h=4f93aa02acd0e34806d4ac9c3a700bb5d040eab6
> >      (no perf regressions across all platforms)
> > 
> >  2). Introduce a new pvops call - pte_flags, which would make pte_flags
> >      (which currently is doing just a bit mask) be pvops-fied.
> >      http://darnok.org/results/baseline_pte_flags_pte_attrs/0001-x86-paravirt-xen-Introduce-pte_flags.patch
> >      http://darnok.org/results/baseline_pte_flags_pte_attrs/0002-x86-paravirt-xen-Optimize-pte_flags-by-marking-it-as.patch
> >      (weird results on AMD, other platforms had no perf degradations)
> > 
> >   3). (not posted), was to do 2), but alter the alternative_asm and instead use asm_goto to
> >      make the compiler use less registers and hopefully reduce the code:
> >      http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=shortlog;h=refs/heads/devel/mmu-perf
> >      But the results I got showed worst performance on baremetal.. which was weird?
> >      Perhaps it is compiler related - never got to follow up on it.
> > 
> 
> OK, let me be blunt: I will unconditionally veto any of these.

Peter,

hmm, It looks like option 1 doesn't have any perf regression, but it is still
not acceptable? That is fine. If you prefer to have a software PAT table lookup, could you provide
some details so I can try to get something align that direction?

CJ

> 
> > 
> > I also chatted with the core Xen hypervisor folks about adding in the context switch code
> > to alter the PAT layout - but they were not keen a about it - and I am not sure how much
> > CPU cycles one loses by doing a wrmsr to the PAT register on every guest context switch
> > (worst case when on has a pvops kernel and a old-style one - where the WC bit would differ)?
> > 
> 
> And you're comparing that to a bunch of new pvops calls?  The discussion
> shouldn't even have started until you had ruled out this solution and
> had data to show it.
> 
> 	-hpa

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Xen-devel] [PATCH] x86 fixes for 3.3 impacting distros (v1).
  2012-06-29 21:52           ` Cyclonus J
@ 2012-06-29 22:29             ` H. Peter Anvin
  0 siblings, 0 replies; 15+ messages in thread
From: H. Peter Anvin @ 2012-06-29 22:29 UTC (permalink / raw)
  To: Cyclonus J
  Cc: Konrad Rzeszutek Wilk, marmarek, xen-devel, x86,
	Jason Garrett-Glaser, linux-kernel, rostedt, mingo, tglx, Siddha,
	Suresh B

On 06/29/2012 02:52 PM, Cyclonus J wrote:
> 
> Peter,
> 
> hmm, It looks like option 1 doesn't have any perf regression, but it is still
> not acceptable? That is fine. If you prefer to have a software PAT table lookup, could you provide
> some details so I can try to get something align that direction?
> 

It has no perf regression, but it really buries deep in the code a
strange abstraction which only happens to work on Xen and is confusing
as hell.

The idea with a software PAT table is that the PAT numbers used by the
kernel should come from a table in the kernel instead of being
hard-coded.  That might take some work, and it remains to be seen if it
is practical.

It *may* be that we need to hard-code 0 as WB, still, but that should be
true on any sane platform.

	-hpa

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2012-06-29 22:30 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-10 15:34 [PATCH] x86 fixes for 3.3 impacting distros (v1) Konrad Rzeszutek Wilk
2012-02-10 15:34 ` [PATCH] x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations Konrad Rzeszutek Wilk
2012-02-21  1:01 ` [PATCH] x86 fixes for 3.3 impacting distros (v1) Steven Rostedt
2012-02-21  1:38   ` H. Peter Anvin
2012-02-21  3:32   ` Konrad Rzeszutek Wilk
2012-02-22  2:53 ` Jason Garrett-Glaser
2012-05-10 15:34   ` [Xen-devel] " Konrad Rzeszutek Wilk
2012-06-27 23:17     ` Cyclonus J
2012-06-28 14:28       ` Konrad Rzeszutek Wilk
2012-06-28 14:28         ` Konrad Rzeszutek Wilk
2012-06-28 14:42         ` H. Peter Anvin
2012-06-28 15:38           ` Jan Beulich
2012-06-28 15:38             ` Jan Beulich
2012-06-29 21:52           ` Cyclonus J
2012-06-29 22:29             ` H. Peter Anvin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.