* [PATCH] x86: Add an explicit barrier() to clflushopt()
@ 2015-10-19  9:58 Chris Wilson
  2015-10-19 10:16 ` Borislav Petkov
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Chris Wilson @ 2015-10-19  9:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Chris Wilson, Ross Zwisler, H . Peter Anvin, Imre Deak,
	Daniel Vetter, dri-devel

During testing we observed that the last cacheline was not being flushed
from a

	mb()
	for (addr = addr & -clflush_size; addr < end; addr += clflush_size)
		clflushopt();
	mb()

loop (where the initial addr and end were not cacheline aligned).

Changing the loop from addr < end to addr <= end, or replacing the
clflushopt() with clflush(), both fixed the testcase, hinting that GCC
was miscompiling the assembly within the loop and specifically that the
alternative within clflushopt() was confusing the loop optimizer.

Adding a barrier() into clflushopt() is enough for GCC to dtrt, but
solving why GCC is not seeing the constraints from the alternative_io()
would be smarter...
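
(For reference, barrier() is the plain compiler barrier from
include/linux/compiler-gcc.h:

	#define barrier() __asm__ __volatile__("": : :"memory")

It emits no instructions and only stops the compiler from reordering or
caching memory accesses across it.)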

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92501
Testcase: gem_tiled_partial_pwrite_pread/read
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Imre Deak <imre.deak@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: dri-devel@lists.freedesktop.org
---
 arch/x86/include/asm/special_insns.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 2270e41b32fd..0c7aedbf8930 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -199,6 +199,11 @@ static inline void clflushopt(volatile void *__p)
 		       ".byte 0x66; clflush %P0",
 		       X86_FEATURE_CLFLUSHOPT,
 		       "+m" (*(volatile char __force *)__p));
+	/* GCC (4.9.1 and 5.2.1 at least) appears to be very confused when
+	 * meeting this alternative() and demonstrably miscompiles loops
+	 * iterating over clflushopts.
+	 */
+	barrier();
 }
 
 static inline void clwb(volatile void *__p)
-- 
2.6.1



* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2015-10-19  9:58 [PATCH] x86: Add an explicit barrier() to clflushopt() Chris Wilson
@ 2015-10-19 10:16 ` Borislav Petkov
  2015-10-19 11:05   ` Chris Wilson
  2015-10-19 18:29 ` Ross Zwisler
  2016-01-07 10:16 ` Chris Wilson
  2 siblings, 1 reply; 29+ messages in thread
From: Borislav Petkov @ 2015-10-19 10:16 UTC (permalink / raw)
  To: Chris Wilson
  Cc: linux-kernel, Ross Zwisler, H . Peter Anvin, Imre Deak,
	Daniel Vetter, dri-devel

On Mon, Oct 19, 2015 at 10:58:55AM +0100, Chris Wilson wrote:
> During testing we observed that the last cacheline was not being flushed
> from a
> 
> 	mb()
> 	for (addr = addr & -clflush_size; addr < end; addr += clflush_size)
> 		clflushopt();
> 	mb()
> 
> loop (where the initial addr and end were not cacheline aligned).
> 
> Changing the loop from addr < end to addr <= end, or replacing the
> clflushopt() with clflush(), both fixed the testcase, hinting that GCC
> was miscompiling the assembly within the loop and specifically that the
> alternative within clflushopt() was confusing the loop optimizer.
> 
> Adding a barrier() into clflushopt() is enough for GCC to dtrt, but
> solving why GCC is not seeing the constraints from the alternative_io()
> would be smarter...

Hmm, would something like adding the memory clobber to the
alternative_io() definition work?

---
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 7bfc85bbb8ff..d923e5dacdb1 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -207,7 +207,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
 /* Like alternative_input, but with a single output argument */
 #define alternative_io(oldinstr, newinstr, feature, output, input...)	\
 	asm volatile (ALTERNATIVE(oldinstr, newinstr, feature)		\
-		: output : "i" (0), ## input)
+		: output : "i" (0), ## input : "memory")
 
 /* Like alternative_io, but for replacing a direct call with another one. */
 #define alternative_call(oldfunc, newfunc, feature, output, input...)	\

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2015-10-19 10:16 ` Borislav Petkov
@ 2015-10-19 11:05   ` Chris Wilson
  2015-10-19 11:25     ` Borislav Petkov
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2015-10-19 11:05 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, Ross Zwisler, H . Peter Anvin, Imre Deak,
	Daniel Vetter, dri-devel

On Mon, Oct 19, 2015 at 12:16:12PM +0200, Borislav Petkov wrote:
> On Mon, Oct 19, 2015 at 10:58:55AM +0100, Chris Wilson wrote:
> > Adding a barrier() into clflushopt() is enough for GCC to dtrt, but
> > solving why GCC is not seeing the constraints from the alternative_io()
> > would be smarter...
> 
> Hmm, would something like adding the memory clobber to the
> alternative_io() definition work?
> 
> ---
> diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
> index 7bfc85bbb8ff..d923e5dacdb1 100644
> --- a/arch/x86/include/asm/alternative.h
> +++ b/arch/x86/include/asm/alternative.h
> @@ -207,7 +207,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
>  /* Like alternative_input, but with a single output argument */
>  #define alternative_io(oldinstr, newinstr, feature, output, input...)	\
>  	asm volatile (ALTERNATIVE(oldinstr, newinstr, feature)		\
> -		: output : "i" (0), ## input)
> +		: output : "i" (0), ## input : "memory")
>  
>  /* Like alternative_io, but for replacing a direct call with another one. */
>  #define alternative_call(oldfunc, newfunc, feature, output, input...)	\

In order to add the clobbers, I had to adjust the macro slightly:

+#define alternative_output(oldinstr, newinstr, feature, output)        \
+       asm volatile (ALTERNATIVE(oldinstr, newinstr, feature)          \
+               : output : "i" (0) : "memory")

and that also works. The concern I have here is that the asm does use
"+m" as its output already, and the clflush() asm doesn't require the
manual clobber.
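
For reference, clflush() in the same header is just the bare asm with the
"+m" output and no explicit clobber (roughly):

static inline void clflush(volatile void *__p)
{
	asm volatile("clflush %0" : "+m" (*(volatile char __force *)__p));
}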
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2015-10-19 11:05   ` Chris Wilson
@ 2015-10-19 11:25     ` Borislav Petkov
  0 siblings, 0 replies; 29+ messages in thread
From: Borislav Petkov @ 2015-10-19 11:25 UTC (permalink / raw)
  To: Chris Wilson
  Cc: linux-kernel, Ross Zwisler, H . Peter Anvin, Imre Deak,
	Daniel Vetter, dri-devel

On Mon, Oct 19, 2015 at 12:05:29PM +0100, Chris Wilson wrote:
> In order to add the clobbers, I had to adjust the macro slightly:
> 
> +#define alternative_output(oldinstr, newinstr, feature, output)        \
> +       asm volatile (ALTERNATIVE(oldinstr, newinstr, feature)          \
> +               : output : "i" (0) : "memory")
> 
> and that also works. The concern I have here is that the asm does use
> "+m" as its output already, and the clflush() asm doesn't require the
> manual clobber.

Yeah, you could make the above

#define alternative_output(oldinstr, newinstr, feature, output, clobber...)

and use it to define clflushopt(). hpa might have a better idea, though...
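
i.e. something like this (untested sketch, reusing the existing byte
encoding from clflushopt()):

#define alternative_output(oldinstr, newinstr, feature, output, clobber...) \
	asm volatile (ALTERNATIVE(oldinstr, newinstr, feature)		\
		: output : "i" (0) : clobber)

static inline void clflushopt(volatile void *__p)
{
	alternative_output(".byte 0x3e; clflush %P0",
			   ".byte 0x66; clflush %P0",
			   X86_FEATURE_CLFLUSHOPT,
			   "+m" (*(volatile char __force *)__p),
			   "memory");
}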

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2015-10-19  9:58 [PATCH] x86: Add an explicit barrier() to clflushopt() Chris Wilson
  2015-10-19 10:16 ` Borislav Petkov
@ 2015-10-19 18:29 ` Ross Zwisler
  2016-01-07 10:16 ` Chris Wilson
  2 siblings, 0 replies; 29+ messages in thread
From: Ross Zwisler @ 2015-10-19 18:29 UTC (permalink / raw)
  To: Chris Wilson
  Cc: linux-kernel, Ross Zwisler, H . Peter Anvin, Imre Deak,
	Daniel Vetter, dri-devel

On Mon, Oct 19, 2015 at 10:58:55AM +0100, Chris Wilson wrote:
> During testing we observed that the last cacheline was not being flushed
> from a
> 
> 	mb()
> 	for (addr = addr & -clflush_size; addr < end; addr += clflush_size)
> 		clflushopt();
> 	mb()
> 
> loop (where the initial addr and end were not cacheline aligned).
> 
> Changing the loop from addr < end to addr <= end, or replacing the
> clflushopt() with clflush(), both fixed the testcase, hinting that GCC
> was miscompiling the assembly within the loop and specifically that the
> alternative within clflushopt() was confusing the loop optimizer.
> 
> Adding a barrier() into clflushopt() is enough for GCC to dtrt, but
> solving why GCC is not seeing the constraints from the alternative_io()
> would be smarter...
> 
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92501
> Testcase: gem_tiled_partial_pwrite_pread/read
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: H. Peter Anvin <hpa@linux.intel.com>
> Cc: Imre Deak <imre.deak@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: dri-devel@lists.freedesktop.org
> ---
>  arch/x86/include/asm/special_insns.h | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
> index 2270e41b32fd..0c7aedbf8930 100644
> --- a/arch/x86/include/asm/special_insns.h
> +++ b/arch/x86/include/asm/special_insns.h
> @@ -199,6 +199,11 @@ static inline void clflushopt(volatile void *__p)
>  		       ".byte 0x66; clflush %P0",
>  		       X86_FEATURE_CLFLUSHOPT,
>  		       "+m" (*(volatile char __force *)__p));
> +	/* GCC (4.9.1 and 5.2.1 at least) appears to be very confused when
> +	 * meeting this alternative() and demonstrably miscompiles loops
> +	 * iterating over clflushopts.
> +	 */
> +	barrier();
>  }
>  
>  static inline void clwb(volatile void *__p)
> -- 
> 2.6.1

I'm having trouble recreating this - can you help me get a simple reproducer?

I've been using the patch at the end of this mail as a test case.  

If the issue you are seeing is indeed a compiler error, the issue should be
visible in the resulting assembly.  I've dumped the assembly using objdump
for the new loop in pmem_init(), using both clflush() and clflushopt() in
the loop, and the loops are exactly the same except that the clflushopt
case has an extra byte as part of the clflush instruction - a no-op prefix
we need so that the instruction is the right length for clflushopt to be
patched in.

clflush() in loop:
	  3a:	0f ae 3a             	clflush (%rdx)

clflushopt() in loop:
	  3a:	3e 0f ae 3a          	clflush %ds:(%rdx)

We also get an extra entry in the <.altinstr_replacement> section in the
clflushopt case, as expected:

  15:	66                   	data16
  16:	0f ae 3a             	clflush (%rdx)

This all looks correct to me.  I've tested with both GCC 4.9.2 and GCC 5.1.1,
and they behave the same.

I also tried inserting a barrier() in the clflushopt() inline function, and it
didn't change the resulting assembly for me.

What am I missing?

- Ross

-- 8< --
From 3e8c9cf1205c4505dcc29e09700c2990a3786664 Mon Sep 17 00:00:00 2001
From: Ross Zwisler <ross.zwisler@linux.intel.com>
Date: Mon, 19 Oct 2015 11:59:02 -0600
Subject: [PATCH] TESTING: understand clflushopt() compilation error

---
 drivers/nvdimm/pmem.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 0ba6a97..3f735a6 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -451,6 +451,20 @@ static int __init pmem_init(void)
 {
 	int error;
 
+	unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
+	const unsigned int size = PAGE_SIZE;
+	void *addr = kmalloc(size, GFP_KERNEL);
+	void *end = addr + size;
+	void *p;
+
+	mb();
+	for (p = (void *)((unsigned long)addr & -clflush_size); p < end;
+			p += clflush_size)
+		clflush(p);
+	mb();
+
+	kfree(addr);
+
 	pmem_major = register_blkdev(0, "pmem");
 	if (pmem_major < 0)
 		return pmem_major;
-- 
2.1.0



* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2015-10-19  9:58 [PATCH] x86: Add an explicit barrier() to clflushopt() Chris Wilson
  2015-10-19 10:16 ` Borislav Petkov
  2015-10-19 18:29 ` Ross Zwisler
@ 2016-01-07 10:16 ` Chris Wilson
  2016-01-07 17:55   ` Andy Lutomirski
  2 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2016-01-07 10:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H . Peter Anvin, Andy Lutomirski, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linus Torvalds,
	Thomas Gleixner, Imre Deak, Daniel Vetter, dri-devel

On Mon, Oct 19, 2015 at 10:58:55AM +0100, Chris Wilson wrote:
> During testing we observed that the last cacheline was not being flushed
> from a
> 
> 	mb()
> 	for (addr = addr & -clflush_size; addr < end; addr += clflush_size)
> 		clflushopt();
> 	mb()
> 
> loop (where the initial addr and end were not cacheline aligned).
> 
> Changing the loop from addr < end to addr <= end, or replacing the
> clflushopt() with clflush(), both fixed the testcase, hinting that GCC
> was miscompiling the assembly within the loop and specifically that the
> alternative within clflushopt() was confusing the loop optimizer.
> 
> Adding a barrier() into clflushopt() is enough for GCC to dtrt, but
> solving why GCC is not seeing the constraints from the alternative_io()
> would be smarter...
> 
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92501
> Testcase: gem_tiled_partial_pwrite_pread/read
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: H. Peter Anvin <hpa@linux.intel.com>
> Cc: Imre Deak <imre.deak@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: dri-devel@lists.freedesktop.org
> ---
>  arch/x86/include/asm/special_insns.h | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
> index 2270e41b32fd..0c7aedbf8930 100644
> --- a/arch/x86/include/asm/special_insns.h
> +++ b/arch/x86/include/asm/special_insns.h
> @@ -199,6 +199,11 @@ static inline void clflushopt(volatile void *__p)
>  		       ".byte 0x66; clflush %P0",
>  		       X86_FEATURE_CLFLUSHOPT,
>  		       "+m" (*(volatile char __force *)__p));
> +	/* GCC (4.9.1 and 5.2.1 at least) appears to be very confused when
> +	 * meeting this alternative() and demonstrably miscompiles loops
> +	 * iterating over clflushopts.
> +	 */
> +	barrier();
>  }

Or an alternative:

+#define alternative_output(oldinstr, newinstr, feature, output)        \
+       asm volatile (ALTERNATIVE(oldinstr, newinstr, feature)          \
+               : output : "i" (0) : "memory")

I would really appreciate some knowledgeable folks taking a look at the
asm for clflushopt() as it still affects today's kernel and gcc.

Fwiw, I have confirmed that arch/x86/mm/pageattr.c clflush_cache_range()
is similarly affected.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-07 10:16 ` Chris Wilson
@ 2016-01-07 17:55   ` Andy Lutomirski
  2016-01-07 19:44     ` Chris Wilson
  0 siblings, 1 reply; 29+ messages in thread
From: Andy Lutomirski @ 2016-01-07 17:55 UTC (permalink / raw)
  To: Chris Wilson, linux-kernel, Ross Zwisler, H . Peter Anvin,
	Andy Lutomirski, Borislav Petkov, Brian Gerst, Denys Vlasenko,
	H. Peter Anvin, Linus Torvalds, Thomas Gleixner, Imre Deak,
	Daniel Vetter, DRI

On Thu, Jan 7, 2016 at 2:16 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Mon, Oct 19, 2015 at 10:58:55AM +0100, Chris Wilson wrote:
>> During testing we observed that the last cacheline was not being flushed
>> from a
>>
>>       mb()
>>       for (addr = addr & -clflush_size; addr < end; addr += clflush_size)
>>               clflushopt();
>>       mb()
>>
>> loop (where the initial addr and end were not cacheline aligned).
>>
>> Changing the loop from addr < end to addr <= end, or replacing the
>> clflushopt() with clflush(), both fixed the testcase, hinting that GCC
>> was miscompiling the assembly within the loop and specifically that the
>> alternative within clflushopt() was confusing the loop optimizer.
>>
>> Adding a barrier() into clflushopt() is enough for GCC to dtrt, but
>> solving why GCC is not seeing the constraints from the alternative_io()
>> would be smarter...
>>
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92501
>> Testcase: gem_tiled_partial_pwrite_pread/read
>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: H. Peter Anvin <hpa@linux.intel.com>
>> Cc: Imre Deak <imre.deak@intel.com>
>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>> Cc: dri-devel@lists.freedesktop.org
>> ---
>>  arch/x86/include/asm/special_insns.h | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
>> index 2270e41b32fd..0c7aedbf8930 100644
>> --- a/arch/x86/include/asm/special_insns.h
>> +++ b/arch/x86/include/asm/special_insns.h
>> @@ -199,6 +199,11 @@ static inline void clflushopt(volatile void *__p)
>>                      ".byte 0x66; clflush %P0",
>>                      X86_FEATURE_CLFLUSHOPT,
>>                      "+m" (*(volatile char __force *)__p));
>> +     /* GCC (4.9.1 and 5.2.1 at least) appears to be very confused when
>> +      * meeting this alternative() and demonstrably miscompiles loops
>> +      * iterating over clflushopts.
>> +      */
>> +     barrier();
>>  }
>
> Or an alternative:
>
> +#define alternative_output(oldinstr, newinstr, feature, output)        \
> +       asm volatile (ALTERNATIVE(oldinstr, newinstr, feature)          \
> +               : output : "i" (0) : "memory")
>
> I would really appreciate some knowledgeable folks taking a look at the
> asm for clflushopt() as it still affects today's kernel and gcc.
>
> Fwiw, I have confirmed that arch/x86/mm/pageattr.c clflush_cache_range()
> is similarly affected.

Unless I'm mis-reading the asm, clflush_cache_range() is compiled
correctly for me.  (I don't know what the %P is for in the asm, but
that shouldn't matter.)  The ALTERNATIVE shouldn't even be visible to
the optimizer.

Can you attach a bad .s file and let us know what gcc version this is?
 (You can usually do 'make foo/bar/baz.s' to get a .s file.)  I'd also
be curious whether changing clflushopt to clwb works around the issue.
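
For the clflush_cache_range() case that would be, e.g.:

	make arch/x86/mm/pageattr.s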

--Andy


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-07 17:55   ` Andy Lutomirski
@ 2016-01-07 19:44     ` Chris Wilson
  2016-01-07 21:05       ` H. Peter Anvin
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2016-01-07 19:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Ross Zwisler, H . Peter Anvin, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linus Torvalds,
	Thomas Gleixner, Imre Deak, Daniel Vetter, DRI

On Thu, Jan 07, 2016 at 09:55:51AM -0800, Andy Lutomirski wrote:
> On Thu, Jan 7, 2016 at 2:16 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> >> +     /* GCC (4.9.1 and 5.2.1 at least) appears to be very confused when
> >> +      * meeting this alternative() and demonstrably miscompiles loops
> >> +      * iterating over clflushopts.
> >> +      */
> >> +     barrier();
> >>  }
> >
> > Or an alternative:
> >
> > +#define alternative_output(oldinstr, newinstr, feature, output)        \
> > +       asm volatile (ALTERNATIVE(oldinstr, newinstr, feature)          \
> > +               : output : "i" (0) : "memory")
> >
> > I would really appreciate some knowledgeable folks taking a look at the
> > asm for clflushopt() as it still affects today's kernel and gcc.
> >
> > Fwiw, I have confirmed that arch/x86/mm/pageattr.c clflush_cache_range()
> > is similarly affected.
> 
> Unless I'm mis-reading the asm, clflush_cache_range() is compiled
> correctly for me.  (I don't know what the %P is for in the asm, but
> that shouldn't matter.)  The ALTERNATIVE shouldn't even be visible to
> the optimizer.
> 
> Can you attach a bad .s file and let us know what gcc version this is?
>  (You can usually do 'make foo/bar/baz.s' to get a .s file.)  I'd also
> be curious whether changing clflushopt to clwb works around the issue.

Now I feel silly. Looking at the .s, there is no difference with the
addition of the barrier to clflush_cache_range(). And sure enough
letting the test run for longer, we see a failure. I fell for a placebo.

The failing assertion is always on the last cacheline and is always one
value behind. Oh well, back to wondering where we miss the flush.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-07 19:44     ` Chris Wilson
@ 2016-01-07 21:05       ` H. Peter Anvin
  2016-01-07 21:54         ` Chris Wilson
  0 siblings, 1 reply; 29+ messages in thread
From: H. Peter Anvin @ 2016-01-07 21:05 UTC (permalink / raw)
  To: Chris Wilson, Andy Lutomirski, linux-kernel, Ross Zwisler,
	H . Peter Anvin, Borislav Petkov, Brian Gerst, Denys Vlasenko,
	Linus Torvalds, Thomas Gleixner, Imre Deak, Daniel Vetter, DRI

On 01/07/16 11:44, Chris Wilson wrote:
> 
> Now I feel silly. Looking at the .s, there is no difference with the
> addition of the barrier to clflush_cache_range(). And sure enough
> letting the test run for longer, we see a failure. I fell for a placebo.
> 
> The failing assertion is always on the last cacheline and is always one
> value behind. Oh well, back to wondering where we miss the flush.
> -Chris
> 

Could you include the assembly here?

	-hpa


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-07 21:05       ` H. Peter Anvin
@ 2016-01-07 21:54         ` Chris Wilson
  2016-01-07 22:29           ` H. Peter Anvin
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2016-01-07 21:54 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andy Lutomirski, linux-kernel, Ross Zwisler, H . Peter Anvin,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, Linus Torvalds,
	Thomas Gleixner, Imre Deak, Daniel Vetter, DRI

On Thu, Jan 07, 2016 at 01:05:35PM -0800, H. Peter Anvin wrote:
> On 01/07/16 11:44, Chris Wilson wrote:
> > 
> > Now I feel silly. Looking at the .s, there is no difference with the
> > addition of the barrier to clflush_cache_range(). And sure enough
> > letting the test run for longer, we see a failure. I fell for a placebo.
> > 
> > The failing assertion is always on the last cacheline and is always one
> > value behind. Oh well, back to wondering where we miss the flush.
> > -Chris
> > 
> 
> Could you include the assembly here?

Sure, here you go:

.LHOTB18:
	.p2align 4,,15
	.globl	clflush_cache_range
	.type	clflush_cache_range, @function
clflush_cache_range:
.LFB2505:
	.loc 1 131 0
	.cfi_startproc
.LVL194:
1:	call	__fentry__
	.loc 1 132 0
	movzwl	boot_cpu_data+198(%rip), %eax
	.loc 1 131 0
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	.loc 1 133 0
	movl	%esi, %esi
.LVL195:
	addq	%rdi, %rsi
.LVL196:
	.loc 1 131 0
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	.loc 1 132 0
	subl	$1, %eax
	cltq
.LVL197:
	.loc 1 136 0
#APP
# 136 "arch/x86/mm/pageattr.c" 1
	mfence
# 0 "" 2
	.loc 1 138 0
#NO_APP
	notq	%rax
.LVL198:
	andq	%rax, %rdi
.LVL199:
	cmpq	%rdi, %rsi
	jbe	.L216
.L217:
.LBB1741:
.LBB1742:
	.loc 8 198 0
#APP
# 198 "./arch/x86/include/asm/special_insns.h" 1
	661:
	.byte 0x3e; clflush (%rdi)
662:
.skip -(((6651f-6641f)-(662b-661b)) > 0) * ((6651f-6641f)-(662b-661b)),0x90
663:
.pushsection .altinstructions,"a"
 .long 661b - .
 .long 6641f - .
 .word ( 9*32+23)
 .byte 663b-661b
 .byte 6651f-6641f
 .byte 663b-662b
.popsection
.pushsection .altinstr_replacement, "ax"
6641:
	.byte 0x66; clflush (%rdi)
6651:
	.popsection
# 0 "" 2
#NO_APP
.LBE1742:
.LBE1741:
	.loc 1 141 0
	.loc 1 139 0
	movzwl	boot_cpu_data+198(%rip), %eax
	addq	%rax, %rdi
	.loc 1 138 0
	cmpq	%rdi, %rsi
	ja	.L217
.L216:
	.loc 1 144 0
#APP
# 144 "arch/x86/mm/pageattr.c" 1
	mfence
# 0 "" 2
	.loc 1 145 0
#NO_APP
	popq	%rbp
	.cfi_restore 6
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE2505:
	.size	clflush_cache_range, .-clflush_cache_range
	.section	.text.unlikely


Whilst you are looking at this asm, note that we reload
boot_cpu_data.x86_clflush_size every time around the loop. That's a small
but noticeable extra cost (especially when we are only flushing a single
cacheline).

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index a3137a4..2cd2b4b 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -129,14 +129,13 @@ within(unsigned long addr, unsigned long start, unsigned long end)
  */
 void clflush_cache_range(void *vaddr, unsigned int size)
 {
-       unsigned long clflush_mask = boot_cpu_data.x86_clflush_size - 1;
+       unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
        void *vend = vaddr + size;
-       void *p;
+       void *p = (void *)((unsigned long)vaddr & ~(clflush_size - 1));
 
        mb();
 
-       for (p = (void *)((unsigned long)vaddr & ~clflush_mask);
-            p < vend; p += boot_cpu_data.x86_clflush_size)
+       for (; p < vend; p += clflush_size)
                clflushopt(p);
 
        mb();

-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-07 21:54         ` Chris Wilson
@ 2016-01-07 22:29           ` H. Peter Anvin
  2016-01-07 22:32             ` H. Peter Anvin
  0 siblings, 1 reply; 29+ messages in thread
From: H. Peter Anvin @ 2016-01-07 22:29 UTC (permalink / raw)
  To: Chris Wilson, Andy Lutomirski, linux-kernel, Ross Zwisler,
	H . Peter Anvin, Borislav Petkov, Brian Gerst, Denys Vlasenko,
	Linus Torvalds, Thomas Gleixner, Imre Deak, Daniel Vetter, DRI

On 01/07/16 13:54, Chris Wilson wrote:
> 
> Whilst you are looking at this asm, note that we reload
> boot_cpu_data.x86_clflush_size every time around the loop. That's a small
> but noticeable extra cost (especially when we are only flushing a single
> cacheline).
> 

I did notice that; I don't know if this is a compiler failure to do an
obvious hoisting (which should then be merged with the other load) or a
consequence of the volatile.  It might be useful to report this to the
gcc bugzilla to let them look at it.

Either way, the diff looks good and you can add my:

Acked-by: H. Peter Anvin <hpa@linux.intel.com>

However, I see absolutely nothing wrong with the assembly other than
minor optimization possibilities: since gcc generates an early-out test
as well as a late-out test anyway, we could add an explicit:

	if (p >= vend)
		return;

before the first mb() at no additional cost (assuming gcc is smart
enough to skip the second test; otherwise that can easily be done
manually by replacing the for loop with a do { } while loop).

I would be very interested in knowing if replacing the final clflushopt
with a clflush would resolve your problems (in which case the last mb()
shouldn't be necessary either.)

Something like:

void clflush_cache_range(void *vaddr, unsigned int size)
{
	unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
	char *vend = (char *)vaddr + size;
	char *p = (char *)((unsigned long)vaddr & ~(clflush_size - 1));

	if (p >= vend)
		return;

	mb();

	for (; p < vend - clflush_size; p += clflush_size)
		clflushopt(p);

	clflush(p);		/* Serializing and thus a barrier */
}


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-07 22:29           ` H. Peter Anvin
@ 2016-01-07 22:32             ` H. Peter Anvin
  2016-01-09  5:55               ` H. Peter Anvin
  2016-01-09  8:01               ` Chris Wilson
  0 siblings, 2 replies; 29+ messages in thread
From: H. Peter Anvin @ 2016-01-07 22:32 UTC (permalink / raw)
  To: Chris Wilson, Andy Lutomirski, linux-kernel, Ross Zwisler,
	H . Peter Anvin, Borislav Petkov, Brian Gerst, Denys Vlasenko,
	Linus Torvalds, Thomas Gleixner, Imre Deak, Daniel Vetter, DRI

On 01/07/16 14:29, H. Peter Anvin wrote:
> 
> I would be very interested in knowing if replacing the final clflushopt
> with a clflush would resolve your problems (in which case the last mb()
> shouldn't be necessary either.)
> 

Nevermind.  CLFLUSH is not ordered with regards to CLFLUSHOPT to the
same cache line.

Could you add a sync_cpu(); call to the end (can replace the final mb())
and see if that helps your case?

	-hpa


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-07 22:32             ` H. Peter Anvin
@ 2016-01-09  5:55               ` H. Peter Anvin
  2016-01-09  8:01               ` Chris Wilson
  1 sibling, 0 replies; 29+ messages in thread
From: H. Peter Anvin @ 2016-01-09  5:55 UTC (permalink / raw)
  To: Chris Wilson, Andy Lutomirski, linux-kernel, Ross Zwisler,
	H . Peter Anvin, Borislav Petkov, Brian Gerst, Denys Vlasenko,
	Linus Torvalds, Thomas Gleixner, Imre Deak, Daniel Vetter, DRI

On 01/07/16 14:32, H. Peter Anvin wrote:
>
> Nevermind.  CLFLUSH is not ordered with regards to CLFLUSHOPT to the
> same cache line.
> 

*Except* to the same cache line, dammit.

	-hpa


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-07 22:32             ` H. Peter Anvin
  2016-01-09  5:55               ` H. Peter Anvin
@ 2016-01-09  8:01               ` Chris Wilson
  2016-01-09 22:36                 ` Andy Lutomirski
  1 sibling, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2016-01-09  8:01 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andy Lutomirski, linux-kernel, Ross Zwisler, H . Peter Anvin,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, Linus Torvalds,
	Thomas Gleixner, Imre Deak, Daniel Vetter, DRI

On Thu, Jan 07, 2016 at 02:32:23PM -0800, H. Peter Anvin wrote:
> On 01/07/16 14:29, H. Peter Anvin wrote:
> > 
> > I would be very interested in knowing if replacing the final clflushopt
> > with a clflush would resolve your problems (in which case the last mb()
> > shouldn't be necessary either.)
> > 
> 
> Nevermind.  CLFLUSH is not ordered with regards to CLFLUSHOPT to the
> same cache line.
> 
> Could you add a sync_cpu(); call to the end (can replace the final mb())
> and see if that helps your case?

s/sync_cpu()/sync_core()/

No. I still see failures on Baytrail and Braswell (Pineview is not
affected) with the final mb() replaced with sync_core(). I can reproduce
failures on Pineview by tweaking the clflush_cache_range() parameters,
so I am fairly confident that it is validating the current code.

iirc sync_core() is cpuid, a heavy serialising instruction, an
alternative to mfence.  Is there anything else that I can infer about
the nature of my bug from this result?
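
(For reference, on 64-bit sync_core() at the time was essentially:

	static inline void sync_core(void)
	{
		int tmp;

		asm volatile("cpuid"
			     : "=a" (tmp)
			     : "0" (1)
			     : "ebx", "ecx", "edx", "memory");
	}

i.e. a leaf-1 cpuid used purely for its serialising side effect.)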
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-09  8:01               ` Chris Wilson
@ 2016-01-09 22:36                 ` Andy Lutomirski
  2016-01-11 11:28                   ` Chris Wilson
  0 siblings, 1 reply; 29+ messages in thread
From: Andy Lutomirski @ 2016-01-09 22:36 UTC (permalink / raw)
  To: Chris Wilson, H. Peter Anvin, Andy Lutomirski, linux-kernel,
	Ross Zwisler, H . Peter Anvin, Borislav Petkov, Brian Gerst,
	Denys Vlasenko, Linus Torvalds, Thomas Gleixner, Imre Deak,
	Daniel Vetter, DRI

On Sat, Jan 9, 2016 at 12:01 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Thu, Jan 07, 2016 at 02:32:23PM -0800, H. Peter Anvin wrote:
>> On 01/07/16 14:29, H. Peter Anvin wrote:
>> >
>> > I would be very interested in knowing if replacing the final clflushopt
>> > with a clflush would resolve your problems (in which case the last mb()
>> > shouldn't be necessary either.)
>> >
>>
>> Nevermind.  CLFLUSH is not ordered with regards to CLFLUSHOPT to the
>> same cache line.
>>
>> Could you add a sync_cpu(); call to the end (can replace the final mb())
>> and see if that helps your case?
>
> s/sync_cpu()/sync_core()/
>
> No. I still see failures on Baytrail and Braswell (Pineview is not
> affected) with the final mb() replaced with sync_core(). I can reproduce
> failures on Pineview by tweaking the clflush_cache_range() parameters,
> so I am fairly confident that it is validating the current code.
>
> iirc sync_core() is cpuid, a heavy serialising instruction, an
> alternative to mfence.  Is there anything else that I can infer about
> the nature of my bug from this result?

No clue, but I don't know much about the underlying architecture.

Can you try clflush_cache_range() over one cacheline less and then manually
doing clflushopt; mb on the last cache line, just to make sure that
the helper is really doing the right thing?  You could also try
clflush instead of clflushopt to see if that makes a difference.
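
Something like this, maybe (untested sketch, assuming the range spans at
least two cachelines):

	unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
	void *last = (void *)(((unsigned long)vaddr + size - 1) &
			      ~(clflush_size - 1));

	clflush_cache_range(vaddr, last - vaddr);	/* all but the last line */
	clflushopt(last);				/* last line by hand */
	mb();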

--Andy


-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-09 22:36                 ` Andy Lutomirski
@ 2016-01-11 11:28                   ` Chris Wilson
  2016-01-11 20:11                     ` Linus Torvalds
  2016-01-12 17:17                     ` H. Peter Anvin
  0 siblings, 2 replies; 29+ messages in thread
From: Chris Wilson @ 2016-01-11 11:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, linux-kernel, Ross Zwisler, H . Peter Anvin,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, Linus Torvalds,
	Thomas Gleixner, Imre Deak, Daniel Vetter, DRI

On Sat, Jan 09, 2016 at 02:36:03PM -0800, Andy Lutomirski wrote:
> On Sat, Jan 9, 2016 at 12:01 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Thu, Jan 07, 2016 at 02:32:23PM -0800, H. Peter Anvin wrote:
> >> On 01/07/16 14:29, H. Peter Anvin wrote:
> >> >
> >> > I would be very interested in knowing if replacing the final clflushopt
> >> > with a clflush would resolve your problems (in which case the last mb()
> >> > shouldn't be necessary either.)
> >> >
> >>
> >> Nevermind.  CLFLUSH is not ordered with regards to CLFLUSHOPT to the
> >> same cache line.
> >>
> >> Could you add a sync_cpu(); call to the end (can replace the final mb())
> >> and see if that helps your case?
> >
> > s/sync_cpu()/sync_core()/
> >
> > No. I still see failures on Baytrail and Braswell (Pineview is not
> > affected) with the final mb() replaced with sync_core(). I can reproduce
> > failures on Pineview by tweaking the clflush_cache_range() parameters,
> > so I am fairly confident that it is validating the current code.
> >
> > iirc sync_core() is cpuid, a heavy serialising instruction, an
> > alternative to mfence.  Is there anything else that I can infer
> > the nature of my bug from this result?
> 
> No clue, but I don't know much about the underlying architecture.
> 
> Can you try clflush_cache_range() over one cacheline less and then manually
> doing clflushopt; mb on the last cache line, just to make sure that
> the helper is really doing the right thing?  You could also try
> clflush instead of clflushopt to see if that makes a difference.

I had looked at increasing the range over which clflush_cache_range()
runs (using roundup/rounddown by cache lines), but it took something
like +/- 256 bytes to pass all the tests. I also did
s/clflushopt/clflush/ to confirm that made no difference.

Bizarrely,

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 6000ad7..cf074400 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -141,6 +141,7 @@ void clflush_cache_range(void *vaddr, unsigned int size)
        for (; p < vend; p += clflush_size)
                clflushopt(p);
 
+       clflushopt(vend-1);
        mb();
 }
 EXPORT_SYMBOL_GPL(clflush_cache_range);

works like a charm.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-11 11:28                   ` Chris Wilson
@ 2016-01-11 20:11                     ` Linus Torvalds
  2016-01-11 21:05                       ` Chris Wilson
  2016-01-12 17:17                     ` H. Peter Anvin
  1 sibling, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2016-01-11 20:11 UTC (permalink / raw)
  To: Chris Wilson, Andy Lutomirski, H. Peter Anvin, linux-kernel,
	Ross Zwisler, H . Peter Anvin, Borislav Petkov, Brian Gerst,
	Denys Vlasenko, Linus Torvalds, Thomas Gleixner, Imre Deak,
	Daniel Vetter, DRI

On Mon, Jan 11, 2016 at 3:28 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>
> Bizarrely,
>
> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> index 6000ad7..cf074400 100644
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -141,6 +141,7 @@ void clflush_cache_range(void *vaddr, unsigned int size)
>         for (; p < vend; p += clflush_size)
>                 clflushopt(p);
>
> +       clflushopt(vend-1);
>         mb();
>  }
>  EXPORT_SYMBOL_GPL(clflush_cache_range);
>
> works like a charm.

Have you checked all your callers? If the above makes a difference, it
really sounds like the caller has passed in a size of zero, resulting
in no cache flush, because the caller had incorrect ranges. The
additional clflushopt now flushes the previous cacheline that wasn't
flushed correctly before.
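
Concretely: with 64-byte cachelines, vaddr == 0x1040 and size == 0 give
p == vend == 0x1040, so the "p < vend" loop never runs; the extra
clflushopt(vend-1) then touches 0x103f, i.e. the previous line at 0x1000
that the loop never flushed.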

That "size was zero" thing would explain why changing the loop to "p
<= vend" also fixes things for you.

IOW, just how sure are you that all the ranges are correct?

                  Linus


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-11 20:11                     ` Linus Torvalds
@ 2016-01-11 21:05                       ` Chris Wilson
  2016-01-12 16:37                         ` Chris Wilson
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2016-01-11 21:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, H. Peter Anvin, linux-kernel, Ross Zwisler,
	H . Peter Anvin, Borislav Petkov, Brian Gerst, Denys Vlasenko,
	Thomas Gleixner, Imre Deak, Daniel Vetter, DRI

On Mon, Jan 11, 2016 at 12:11:05PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 3:28 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> >
> > Bizarrely,
> >
> > diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> > index 6000ad7..cf074400 100644
> > --- a/arch/x86/mm/pageattr.c
> > +++ b/arch/x86/mm/pageattr.c
> > @@ -141,6 +141,7 @@ void clflush_cache_range(void *vaddr, unsigned int size)
> >         for (; p < vend; p += clflush_size)
> >                 clflushopt(p);
> >
> > +       clflushopt(vend-1);
> >         mb();
> >  }
> >  EXPORT_SYMBOL_GPL(clflush_cache_range);
> >
> > works like a charm.
> 
> Have you checked all your callers? If the above makes a difference, it
> really sounds like the caller has passed in a size of zero, resulting
> in no cache flush, because the caller had incorrect ranges. The
> additional clflushopt now flushes the previous cacheline that wasn't
> flushed correctly before.
> 
> That "size was zero" thing would explain why changing the loop to "p
> <= vend" also fixes things for you.

This is on top of HPA's suggestion to do the size==0 check up front.
 
> IOW, just how sure are you that all the ranges are correct?

All our callers are of the pattern:

memcpy(dst, vaddr, size)
clflush_cache_range(dst, size)

or 

clflush_cache_range(vaddr, size)
memcpy(dst, vaddr, size)

I am reasonably confident that the ranges are sane. I've tried to verify
that we do the clflushes by forcing them. However, if I clflush the
whole object instead of the cachelines around the copies, the tests pass.
(Flushing up to a couple of megabytes instead of a few hundred bytes, it
is hard to draw any conclusions about what the bug might be.)

I can narrow down the principal buggy path by doing the clflush(vend-1)
in the callers at least.

The problem is that the tests that fail are those looking for bugs in
the coherency code, which may just as well be caused by the GPU writing
into those ranges at the same time as the CPU trying to read them. I've
looked into timing and tried adding udelay()s or uncached mmio along the
suspect paths, but that didn't change the presentation - having a udelay
fix the issue is usually a good indicator of a GPU write that hasn't landed
before the CPU read.

The bug only affects a couple of recent non-coherent platforms; earlier
Atoms and older Core parts seem unaffected. That may also mean that it is the
GPU flush instruction that changed between platforms and isn't working
(as we intended at least).

Thanks for everyone's help and suggestions,
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-11 21:05                       ` Chris Wilson
@ 2016-01-12 16:37                         ` Chris Wilson
  2016-01-12 17:05                           ` Linus Torvalds
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2016-01-12 16:37 UTC (permalink / raw)
  To: Linus Torvalds, Andy Lutomirski, H. Peter Anvin, linux-kernel,
	Ross Zwisler, H . Peter Anvin, Borislav Petkov, Brian Gerst,
	Denys Vlasenko, Thomas Gleixner, Imre Deak, Daniel Vetter, DRI

On Mon, Jan 11, 2016 at 09:05:06PM +0000, Chris Wilson wrote:
> I can narrow down the principal buggy path by doing the clflush(vend-1)
> in the callers at least.

That leads to the suspect path being a read back of a cache line from
main memory that was just written to by the GPU. Writes to memory before
using them on the GPU do not seem to be affected (or at least we have
sufficient flushing in sending the commands to the GPU that we don't
notice anything wrong).

And back to the oddity.

Instead of doing:

	clflush_cache_range(vaddr + offset, size);
	clflush(vaddr+offset+size-1);
	mb();
	memcpy(user, vaddr+offset, size);

what also worked was:

	clflush_cache_range(vaddr + offset, size);
	clflush(vaddr);
	mb();
	memcpy(user, vaddr+offset, size);

(size is definitely non-zero, offset is offset_in_page(), vaddr is from
kmap_atomic()).

i.e.

void clflush_cache_range(void *vaddr, unsigned int size)
{
        const unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
        void *p = (void *)((unsigned long)vaddr & ~(clflush_size - 1));
        void *vend = vaddr + size;

        if (p >= vend)
                return;

        mb();

        for (; p < vend; p += clflush_size)
                clflushopt(p);

	clflushopt(vaddr);

        mb();
}

I have also confirmed that this doesn't just happen for single
cachelines (i.e. where the earlier clflush(vend-1) and this clflush(vaddr)
would be equivalent).

At the moment I am more inclined to think this is serialising the clflush()
(since a clflush to the same cacheline is regarded as ordered with respect
to the earlier clflush, iirc) than to suspect the writes not landing in
time from the GPU.

Am I completely going mad?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-12 16:37                         ` Chris Wilson
@ 2016-01-12 17:05                           ` Linus Torvalds
  2016-01-12 21:13                             ` Chris Wilson
  0 siblings, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2016-01-12 17:05 UTC (permalink / raw)
  To: Chris Wilson, Linus Torvalds, Andy Lutomirski, H. Peter Anvin,
	linux-kernel, Ross Zwisler, H . Peter Anvin, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, Thomas Gleixner, Imre Deak,
	Daniel Vetter, DRI

On Tue, Jan 12, 2016 at 8:37 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Mon, Jan 11, 2016 at 09:05:06PM +0000, Chris Wilson wrote:
>> I can narrow down the principal buggy path by doing the clflush(vend-1)
>> in the callers at least.
>
> That leads to the suspect path being a read back of a cache line from
> main memory that was just written to by the GPU.

How do you know it was written by the GPU?

Maybe it's a memory ordering issue on the GPU. Say it writes something
to memory, then sets the "I'm done" flag (or whatever you check), but
because of ordering on the GPU the "I'm done" flag becomes visible before
the data.

So the reason you see the old content may just be that the GPU writes
are still buffered on the GPU. And you adding a clflushopt on the same
address just changes the timing enough that you don't see the memory
ordering any more (or it's just much harder to see, it might still be
there).

Maybe the reason you only see the problem with the last cacheline is
simply that the "last" cacheline is also the last that was written by
the GPU, and it's still in the GPU write buffers.

Also, did you ever print out the value of clflush_size? Maybe we just
got it wrong and it's bogus data.
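
(A throwaway

	pr_info("clflush_size = %d\n", boot_cpu_data.x86_clflush_size);

at the top of clflush_cache_range() would settle that.)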

                    Linus


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-11 11:28                   ` Chris Wilson
  2016-01-11 20:11                     ` Linus Torvalds
@ 2016-01-12 17:17                     ` H. Peter Anvin
  1 sibling, 0 replies; 29+ messages in thread
From: H. Peter Anvin @ 2016-01-12 17:17 UTC (permalink / raw)
  To: Chris Wilson, Andy Lutomirski
  Cc: linux-kernel, Ross Zwisler, H . Peter Anvin, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, Linus Torvalds, Thomas Gleixner,
	Imre Deak, Daniel Vetter, DRI

On January 11, 2016 3:28:01 AM PST, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>On Sat, Jan 09, 2016 at 02:36:03PM -0800, Andy Lutomirski wrote:
>> On Sat, Jan 9, 2016 at 12:01 AM, Chris Wilson
>> <chris@chris-wilson.co.uk> wrote:
>> > On Thu, Jan 07, 2016 at 02:32:23PM -0800, H. Peter Anvin wrote:
>> >> On 01/07/16 14:29, H. Peter Anvin wrote:
>> >> >
>> >> > I would be very interested in knowing if replacing the final
>> >> > clflushopt with a clflush would resolve your problems (in which
>> >> > case the last mb() shouldn't be necessary either.)
>> >> >
>> >>
>> >> Nevermind.  CLFLUSH is not ordered with regards to CLFLUSHOPT to
>> >> the same cache line.
>> >>
>> >> Could you add a sync_cpu(); call to the end (can replace the final
>> >> mb()) and see if that helps your case?
>> >
>> > s/sync_cpu()/sync_core()/
>> >
>> > No. I still see failures on Baytrail and Braswell (Pineview is not
>> > affected) with the final mb() replaced with sync_core(). I can
>> > reproduce failures on Pineview by tweaking the clflush_cache_range()
>> > parameters, so I am fairly confident that it is validating the
>> > current code.
>> >
>> > iirc sync_core() is cpuid, a heavy serialising instruction, an
>> > alternative to mfence.  Is there anything else that I can infer
>> > about the nature of my bug from this result?
>> 
>> No clue, but I don't know much about the underlying architecture.
>> 
>> Can you try clflush_cache_range() over one cacheline less and then
>> manually doing clflushopt; mb on the last cache line, just to make
>> sure that the helper is really doing the right thing?  You could also
>> try clflush instead of clflushopt to see if that makes a difference.
>
>I had looked at increasing the range over which clflush_cache_range()
>runs (using roundup/rounddown by cache lines), but it took something
>like +/- 256 bytes to pass all the tests. I also did
>s/clflushopt/clflush/ to confirm that made no difference.
>
>Bizarrely,
>
>diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
>index 6000ad7..cf074400 100644
>--- a/arch/x86/mm/pageattr.c
>+++ b/arch/x86/mm/pageattr.c
>@@ -141,6 +141,7 @@ void clflush_cache_range(void *vaddr, unsigned int size)
>        for (; p < vend; p += clflush_size)
>                clflushopt(p);
> 
>+       clflushopt(vend-1);
>        mb();
> }
> EXPORT_SYMBOL_GPL(clflush_cache_range);
>
>works like a charm.
>-Chris

That clflushopt touches a cache line already touched and therefore serializes with it.
-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-12 17:05                           ` Linus Torvalds
@ 2016-01-12 21:13                             ` Chris Wilson
  2016-01-12 22:07                               ` Linus Torvalds
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2016-01-12 21:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, H. Peter Anvin, linux-kernel, Ross Zwisler,
	H . Peter Anvin, Borislav Petkov, Brian Gerst, Denys Vlasenko,
	Thomas Gleixner, Imre Deak, Daniel Vetter, DRI

On Tue, Jan 12, 2016 at 09:05:19AM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 8:37 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Mon, Jan 11, 2016 at 09:05:06PM +0000, Chris Wilson wrote:
> >> I can narrow down the principal buggy path by doing the clflush(vend-1)
> >> in the callers at least.
> >
> > That leads to the suspect path being a read back of a cache line from
> > main memory that was just written to by the GPU.
> 
> How do you know it was written by the GPU?

Test construction: write some data, copy it on the GPU, read it back.
Repeat for various data, sequences of copy (differing GPU instructions,
intermediates etc), and accessors.

> Maybe it's a memory ordering issue on the GPU. Say it writes something
> to memory, then sets the "I'm done" flag (or whatever you check), but
> because of ordering on the GPU the "I'm done" flag is visible before.

That is a continual worry. To try to assuage that fear, I sent 8x GPU
flush writes between the end of the copy and setting the "I'm done"
flag. The definition of the GPU flush is that it both flushes all
previous writes before it completes and only after it completes does it
do the post-sync write (before moving onto the next command). The spec
is always a bit hazy on what order the memory writes will be visible on
the CPU though.

Sending the 8x GPU flushes before marking "I'm done" did not fix the
corruption.
 
> So the reason you see the old content may just be that the GPU writes
> are still buffered on the GPU. And you adding a clflushopt on the same
> address just changes the timing enough that you don't see the memory
> ordering any more (or it's just much harder to see, it might still be
> there).

Indeed. So I replaced the post-clflush_cache_range() clflush() with a
udelay(10) instead, and the corruption vanished. Putting the udelay(10)
before the clflush_cache_range() does not fix the corruption.
 
> Maybe the reason you only see the problem with the last cacheline is
> simply that the "last" cacheline is also the last that was written by
> the GPU, and it's still in the GPU write buffers.

Exactly the fear.
 
> Also, did you ever print out the value of clflush_size? Maybe we just
> got it wrong and it's bogus data.

It's 64 bytes as expected. And fudging it to any other value quickly
explodes :)


Since:

	/* lots of GPU flushes + GPU/CPU sync point */
	udelay(10);
	clflush_cache_range(vaddr, size);
	memcpy(user, vaddr, size);

fails, but

	/* lots of GPU flushes + GPU/CPU sync point */
	clflush_cache_range(vaddr, size);
	udelay(10);
	memcpy(user, vaddr, size);

passes, I'm inclined to point the finger at the mb() following the
clflush_cache_range().
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-12 21:13                             ` Chris Wilson
@ 2016-01-12 22:07                               ` Linus Torvalds
  2016-01-13  0:55                                 ` Chris Wilson
  0 siblings, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2016-01-12 22:07 UTC (permalink / raw)
  To: Chris Wilson, Linus Torvalds, Andy Lutomirski, H. Peter Anvin,
	linux-kernel, Ross Zwisler, H . Peter Anvin, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, Thomas Gleixner, Imre Deak,
	Daniel Vetter, DRI

On Tue, Jan 12, 2016 at 1:13 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>
> That is a continual worry. To try to assuage that fear, I sent 8x GPU
> flush writes between the end of the copy and setting the "I'm done"
> flag. The definition of the GPU flush is that it both flushes all
> previous writes before it completes and only after it completes does it
> do the post-sync write (before moving onto the next command). The spec
> is always a bit hazy on what order the memory writes will be visible on
> the CPU though.
>
> Sending the 8x GPU flushes before marking "I'm done" did not fix the
> corruption.

Ok. So assuming the GPU flushes are supposed to work, it should be all good.

>> So the reason you see the old content may just be that the GPU writes
>> are still buffered on the GPU. And you adding a clflushopt on the same
>> address just changes the timing enough that you don't see the memory
>> ordering any more (or it's just much harder to see, it might still be
>> there).
>
> Indeed. So I replaced the post-clflush_cache_range() clflush() with a
> udelay(10) instead, and the corruption vanished. Putting the udelay(10)
> before the clflush_cache_range() does not fix the corruption.

Odd.

> passes, I'm inclined to point the finger at the mb() following the
> clflush_cache_range().

We have an entirely unrelated discussion about the value of "mfence"
as a memory barrier.

Mind trying to just make the memory barrier (in
arch/x86/include/asm/barrier.h) be a locked op instead?

The docs say "Executions of the CLFLUSHOPT instruction are ordered
with respect to fence instructions and to locked read-modify-write
instructions; ..", so the mfence should be plenty good enough. But
nobody sane uses mfence for memory ordering (that's the other
discussion we're having), since a locked rmw instruction is faster.

So maybe it's a CPU bug. I'd still consider a GPU memory ordering bug
*way* more likely (the CPU core tends to be better validated in my
experience), but since you're trying odd things anyway, try changing
the "mfence" to "lock; addl $0,0(%%rsp)" instead.

I doubt it makes any difference, but ..

            Linus


* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-12 22:07                               ` Linus Torvalds
@ 2016-01-13  0:55                                 ` Chris Wilson
  2016-01-13  2:06                                   ` Linus Torvalds
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2016-01-13  0:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, H. Peter Anvin, linux-kernel, Ross Zwisler,
	H . Peter Anvin, Borislav Petkov, Brian Gerst, Denys Vlasenko,
	Thomas Gleixner, Imre Deak, Daniel Vetter, DRI

On Tue, Jan 12, 2016 at 02:07:35PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 1:13 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > Indeed. So I replaced the post-clflush_cache_range() clflush() with a
> > udelay(10) instead, and the corruption vanished. Putting the udelay(10)
> > before the clflush_cache_range() does not fix the corruption.
> 
> Odd.
> 
> > passes, I'm inclined to point the finger at the mb() following the
> > clflush_cache_range().
> 
> We have an entirely unrelated discussion about the value of "mfence"
> as a memory barrier.
> 
> Mind trying to just make the memory barrier (in
> arch/x86/include/asm/barrier.h) be a locked op instead?

Replacing the following mb() in clflush_cache_range() with
"lock; addl $0,0(%%rsp)" gave no improvement. Neither did replacing all
mb(). Undoubtably the memory stream as seen by the CPU is serialised.
The concern then is back to the visibility of the stream from the external
device. Adding an uncached mmio read after the clflush_cache_range()
also works, but I'm not clear if this achieves anything other than
inserting a delay or if it is flushing some other write buffer in the
chipset. It is odd that it is required after the clflush_cache_range() and
ineffective before. It is also odd that such a flush would be needed
multiple times after the GPU write.
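
For reference, the uncached-read experiment was along these lines (a
sketch; mmio_base and REG_SCRATCH stand in for whatever uncached BAR
mapping and register offset are convenient, any uncached read from the
device should do):

	clflush_cache_range(vaddr, size);
	(void)readl(mmio_base + REG_SCRATCH);	/* uncached posting read */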

The double clflush() remains a mystery.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-13  0:55                                 ` Chris Wilson
@ 2016-01-13  2:06                                   ` Linus Torvalds
  2016-01-13  2:42                                     ` Andy Lutomirski
  2016-01-13 12:34                                     ` Chris Wilson
  0 siblings, 2 replies; 29+ messages in thread
From: Linus Torvalds @ 2016-01-13  2:06 UTC (permalink / raw)
  To: Chris Wilson, Linus Torvalds, Andy Lutomirski, H. Peter Anvin,
	linux-kernel, Ross Zwisler, H . Peter Anvin, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, Thomas Gleixner, Imre Deak,
	Daniel Vetter, DRI

On Tue, Jan 12, 2016 at 4:55 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>
> The double clflush() remains a mystery.

Actually, I think it's explainable.

It's wrong to do the clflush *after* the GPU has done the write, which
seems to be what you are doing.

Why?

If the GPU really isn't cache coherent, what can happen is:

 - the CPU has the line cached

 - the GPU writes the data

 - you do the clflushopt to invalidate the cacheline

 - you expect to see the GPU data.

Right?

Wrong. The above is complete crap.

Why?

Very simple reason: the CPU may have had the cacheline dirty at some
level in its caches, so when you did the clflushopt, it didn't just
invalidate the CPU cacheline, it wrote it back to memory. And in the
process over-wrote the data that the GPU had written.

Now you can say "but the CPU never wrote to the cacheline, so it's not
dirty in the CPU caches". That may or may not be trie. The CPU may
have written to it quite a long time ago.

So if you are doing a GPU write, and you want to see the data that the
GPU wrote, you had better do the clflushopt long *before* the GPU ever
writes to memory.

Your pattern of doing "flush and read" is simply fundamentally buggy.
There are only two valid CPU flushing patterns:

 - write and flush (to make the writes visible to the GPU)

 - flush before starting GPU accesses, and then read

At no point can "flush and read" be right.
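
In code, the two valid patterns look roughly like this (a sketch; the
helper names are invented for illustration, not taken from any driver):

	/* CPU produces, GPU consumes: write, then flush */
	memcpy(vaddr, src, len);
	clflush_cache_range(vaddr, len);
	gpu_start_read(obj);

	/* GPU produces, CPU consumes: flush *before* the GPU writes */
	clflush_cache_range(vaddr, len);
	gpu_start_write(obj);
	gpu_wait(obj);
	memcpy(dst, vaddr, len);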

Now, I haven't actually seen your code, so I'm just going by your
high-level description of where the CPU flush and CPU read were done,
but it *sounds* like you did that invalid "flush and read" behavior.

                 Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-13  2:06                                   ` Linus Torvalds
@ 2016-01-13  2:42                                     ` Andy Lutomirski
  2016-01-13  4:39                                       ` Linus Torvalds
  2016-01-13 12:34                                     ` Chris Wilson
  1 sibling, 1 reply; 29+ messages in thread
From: Andy Lutomirski @ 2016-01-13  2:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Wilson, H. Peter Anvin, linux-kernel, Ross Zwisler,
	H . Peter Anvin, Borislav Petkov, Brian Gerst, Denys Vlasenko,
	Thomas Gleixner, Imre Deak, Daniel Vetter, DRI

On Tue, Jan 12, 2016 at 6:06 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Jan 12, 2016 at 4:55 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>>
>> The double clflush() remains a mystery.
>
> Actually, I think it's explainable.
>
> It's wrong to do the clflush *after* the GPU has done the write, which
> seems to be what you are doing.
>
> Why?
>
> If the GPU really isn't cache coherent, what can happen is:
>
>  - the CPU has the line cached
>
>  - the GPU writes the data
>
>  - you do the clflushopt to invalidate the cacheline
>
>  - you expect to see the GPU data.
>
> Right?
>
> Wrong. The above is complete crap.
>
> Why?
>
> Very simple reason: the CPU may have had the cacheline dirty at some
> level in its caches, so when you did the clflushopt, it didn't just
> invalidate the CPU cacheline, it wrote it back to memory. And in the
> process over-wrote the data that the GPU had written.
>
> Now you can say "but the CPU never wrote to the cacheline, so it's not
> dirty in the CPU caches". That may or may not be true. The CPU may
> have written to it quite a long time ago.
>
> So if you are doing a GPU write, and you want to see the data that the
> GPU wrote, you had better do the clflushopt long *before* the GPU ever
> writes to memory.
>
> Your pattern of doing "flush and read" is simply fundamentally buggy.
> There are only two valid CPU flushing patterns:
>
>  - write and flush (to make the writes visible to the GPU)
>
>  - flush before starting GPU accesses, and then read
>
> At no point can "flush and read" be right.
>
> Now, I haven't actually seen your code, so I'm just going by your
> high-level description of where the CPU flush and CPU read were done,
> but it *sounds* like you did that invalid "flush and read" behavior.

Since barriers are on my mind: how strong a barrier is needed to
prevent cache fills from being speculated across the barrier?

Concretely, if I do:

clflush A
clflush B
load A
load B

the architecture guarantees that (unless store forwarding happens) the
value I see for B is at least as new as the value I see for A *with
respect to other access within the coherency domain*.  But the GPU
isn't in the coherency domain at all.

Is:

clflush A
clflush B
load A
MFENCE
load B

good enough?  If it is, and if

clflush A
clflush B
load A
LOCK whatever
load B

is *not*, then this might account for the performance difference.

In any event, it seems to me that what i915 is trying to do isn't
really intended to be supported for WB memory.  i915 really wants to
force a read from main memory and to simultaneously prevent the CPU
from writing back to main memory.  Ick.  I'd assume that:

clflush A
clflush B
load A
serializing instruction here
load B

is good enough, as long as you make sure that the GPU does its writes
after the clflushes make it all the way out to main memory (which
might require a second serializing instruction in the case of
clflushopt), but this still relies on the hardware prefetcher not
prefetching B too early, which it's permitted to do even in the
absence of any explicit access at all.

Presumably this is good enough on any implementation:

clflush A
clflush B
load A
clflush B
load B

But that will be really, really slow.  And you're still screwed if the
hardware is permitted to arbitrarily change cache lines from S to M.

In other words, I'm not really convinced that x86 was ever intended to
have well-defined behavior if something outside the coherency domain
writes to a page of memory while that page is mapped WB.  Of course,
I'm also not sure how to reliably switch a page from WB to any other
memory type short of remapping it and doing CLFLUSH after remapping.

SDM Volume 3 11.12.4 seems to agree with me.

Could the driver be changed to use WC or UC and to use MOVNTDQA on
supported CPUs to get the performance back?  It sounds like i915 is
effectively doing PIO here, and reasonably modern CPUs have a nice set
of fast PIO instructions.
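
As a sketch of what the read side could look like with SSE4.1 (the
function name is made up; in the kernel this would also need a
kernel_fpu_begin()/kernel_fpu_end() pair around it, a CPU feature
check, and both pointers 16-byte aligned):

	/* copy one 64-byte cacheline from WC memory via streaming loads */
	static void movntdqa_copy_line(void *dst, const void *src)
	{
		asm volatile("movntdqa   (%1), %%xmm0\n\t"
			     "movntdqa 16(%1), %%xmm1\n\t"
			     "movntdqa 32(%1), %%xmm2\n\t"
			     "movntdqa 48(%1), %%xmm3\n\t"
			     "movdqa %%xmm0,   (%0)\n\t"
			     "movdqa %%xmm1, 16(%0)\n\t"
			     "movdqa %%xmm2, 32(%0)\n\t"
			     "movdqa %%xmm3, 48(%0)"
			     : : "r" (dst), "r" (src)
			     : "memory", "xmm0", "xmm1", "xmm2", "xmm3");
	}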

--Andy

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-13  2:42                                     ` Andy Lutomirski
@ 2016-01-13  4:39                                       ` Linus Torvalds
  0 siblings, 0 replies; 29+ messages in thread
From: Linus Torvalds @ 2016-01-13  4:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chris Wilson, H. Peter Anvin, linux-kernel, Ross Zwisler,
	H . Peter Anvin, Borislav Petkov, Brian Gerst, Denys Vlasenko,
	Thomas Gleixner, Imre Deak, Daniel Vetter, DRI

On Tue, Jan 12, 2016 at 6:42 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Since barriers are on my mind: how strong a barrier is needed to
> prevent cache fills from being speculated across the barrier?

I don't think there are *any* architectural guarantees.

I suspect that a real serializing instruction should do it. But I
don't think even that is guaranteed.

Non-coherent IO is crazy. I really thought Intel had learnt their
lesson, and finally made all the GPU's coherent. I'm afraid to even
ask why Chris is actually working on some sh*t that requires clflush.

In general, you should probably do something nasty like

 - flush before starting IO that generates data (to make sure you have
no dirty cachelines that will write back and mess up)

 - start the IO, wait for it to complete

 - flush after finishing IO that generates the data (to make sure you
have no speculative clean cachelines with stale data)

 - read the data now.
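
As a rough sketch, with invented helper names just to make the
ordering explicit:

	clflush_cache_range(vaddr, len);	/* 1: no dirty lines left to write back */
	start_io(dev);
	wait_for_io(dev);
	clflush_cache_range(vaddr, len);	/* 2: drop clean-but-stale speculative fills */
	memcpy(dst, vaddr, len);		/* 3: now read the IO results */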

Of course, what people actually end up doing to avoid all this is to
mark the memory noncacheable.

And finally, the *correct* thing is to not have crap hardware, and
have IO be cache coherent. Things that don't do that are shit. Really.

                 Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-13  2:06                                   ` Linus Torvalds
  2016-01-13  2:42                                     ` Andy Lutomirski
@ 2016-01-13 12:34                                     ` Chris Wilson
  2016-01-13 18:45                                       ` Linus Torvalds
  1 sibling, 1 reply; 29+ messages in thread
From: Chris Wilson @ 2016-01-13 12:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, H. Peter Anvin, linux-kernel, Ross Zwisler,
	H . Peter Anvin, Borislav Petkov, Brian Gerst, Denys Vlasenko,
	Thomas Gleixner, Imre Deak, Daniel Vetter, DRI

On Tue, Jan 12, 2016 at 06:06:34PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 4:55 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> >
> > The double clflush() remains a mystery.
> 
> Actually, I think it's explainable.
> 
> It's wrong to do the clflush *after* the GPU has done the write, which
> seems to be what you are doing.
> 
> Why?
> 
> If the GPU really isn't cache coherent, what can happen is:
> 
>  - the CPU has the line cached
> 
>  - the GPU writes the data
> 
>  - you do the clflushopt to invalidate the cacheline
> 
>  - you expect to see the GPU data.
> 
> Right?
> 
> Wrong. The above is complete crap.
> 
> Why?
> 
> Very simple reason: the CPU may have had the cacheline dirty at some
> level in its caches, so when you did the clflushopt, it didn't just
> invalidate the CPU cacheline, it wrote it back to memory. And in the
> process over-wrote the data that the GPU had written.

Forgive me for being dense, but if we overwrite the GPU data in the
backing struct page with the cacheline from the CPU, how do we see the
results from the GPU afterwards?

> Now you can say "but the CPU never wrote to the cacheline, so it's not
> dirty in the CPU caches". That may or may not be trie. The CPU may
> have written to it quite a long time ago.

What we do is we clflush the entire object after it is written to by the
CPU (including newly allocated objects from shmemfs, or pages being
returned to us by shmemfs) before any operation on that object by the
GPU. We have to, so that the GPU sees the correct page contents.
(If I change the clflush of the written objects to a wbinvd, that's not
sufficient for the tests to pass.)

We do not clflush the object after we read the backing pages on the CPU
before the next GPU operation, even if it is a GPU write. This leaves us
with clean but stale cachelines (or should). That's why we then
clflush_cache_range() prior to the next read on the object; it is
intended to be a pure cache line invalidation.

If we clflush the entire object between every CPU read back and the *next*
GPU operation, it fails. If we clflush the object before every GPU write
to it, it passes. And to refresh, we always clflush after a CPU write.

I am reasonably confident that any cachelines we dirty (or inherit)
are flushed. What you are suggesting is that there are dirty cachelines
regardless. I am also reasonably confident that even if we clflush the
entire object after touching it before the GPU write, and clflush the
individual cachelines again after the GPU write, we see the errors.

I haven't found the hole yet, or been convincingly able to explain the
differences between gen.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
  2016-01-13 12:34                                     ` Chris Wilson
@ 2016-01-13 18:45                                       ` Linus Torvalds
  0 siblings, 0 replies; 29+ messages in thread
From: Linus Torvalds @ 2016-01-13 18:45 UTC (permalink / raw)
  To: Chris Wilson, Linus Torvalds, Andy Lutomirski, H. Peter Anvin,
	linux-kernel, Ross Zwisler, H . Peter Anvin, Borislav Petkov,
	Brian Gerst, Denys Vlasenko, Thomas Gleixner, Imre Deak,
	Daniel Vetter, DRI

On Wed, Jan 13, 2016 at 4:34 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>
> Forgive me for being dense, but if we overwrite the GPU data in the
> backing struct page with the cacheline from the CPU, how do we see the
> results from the GPU afterwards?

Hmm. Good point.

Ok, all the symptoms just say "writes from GPU are delayed and out of order".

Do you have access to the GPU hardware people?

I thought that all the modern Intel GPU's are cache-coherent. If this
is some castrated chip where coherence is removed (perhaps because it
is not working? perhaps a config setting?) maybe it needs some extra
hardware setting to make the GPU "flush" operation actually do
something. In a cache-coherent model, a flush could/should be a noop,
so maybe the hardware is set for that kind of "flush does nothing"
behavior.

Or maybe the GPU is just a buggy pile of crap.

            Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread

Thread overview: 29+ messages
2015-10-19  9:58 [PATCH] x86: Add an explicit barrier() to clflushopt() Chris Wilson
2015-10-19 10:16 ` Borislav Petkov
2015-10-19 11:05   ` Chris Wilson
2015-10-19 11:25     ` Borislav Petkov
2015-10-19 18:29 ` Ross Zwisler
2016-01-07 10:16 ` Chris Wilson
2016-01-07 17:55   ` Andy Lutomirski
2016-01-07 19:44     ` Chris Wilson
2016-01-07 21:05       ` H. Peter Anvin
2016-01-07 21:54         ` Chris Wilson
2016-01-07 22:29           ` H. Peter Anvin
2016-01-07 22:32             ` H. Peter Anvin
2016-01-09  5:55               ` H. Peter Anvin
2016-01-09  8:01               ` Chris Wilson
2016-01-09 22:36                 ` Andy Lutomirski
2016-01-11 11:28                   ` Chris Wilson
2016-01-11 20:11                     ` Linus Torvalds
2016-01-11 21:05                       ` Chris Wilson
2016-01-12 16:37                         ` Chris Wilson
2016-01-12 17:05                           ` Linus Torvalds
2016-01-12 21:13                             ` Chris Wilson
2016-01-12 22:07                               ` Linus Torvalds
2016-01-13  0:55                                 ` Chris Wilson
2016-01-13  2:06                                   ` Linus Torvalds
2016-01-13  2:42                                     ` Andy Lutomirski
2016-01-13  4:39                                       ` Linus Torvalds
2016-01-13 12:34                                     ` Chris Wilson
2016-01-13 18:45                                       ` Linus Torvalds
2016-01-12 17:17                     ` H. Peter Anvin
