gem clflush optimization for media encoding

* gem clflush optimization for media encoding
@ 2011-06-22  3:13 Zou, Nanhai
  2011-06-22  4:13 ` Keith Packard
  0 siblings, 1 reply; 10+ messages in thread
From: Zou, Nanhai @ 2011-06-22  3:13 UTC
  To: intel-gfx; +Cc: Anholt, Eric

Hi,
	I have some questions about clflush usage in gem.

	For our encoding driver, each frame's input is raw YUV data, copied from the CPU to a GPU surface; the output is the encoded result, copied from the GPU back to the CPU.

	The buffers are huge: for a 1080p file, the input buffer size can be
1920x1080x1.5 bytes, so a lot of CPU time is spent in clflush.  I am trying to optimize that.

Question 1:
	If I upload the input buffer with movnti or movntdq (bypassing the cache) and an sfence (to drain the write-combining buffers) at the end, clflush should not be needed.
	How can I tell gem not to clflush the buffer in this case? Do we need to add an interface for that?

Question 2:
	How can I make sure the output buffer is never clflushed?
	Since it is a CPU read-only surface, clflush is not needed at all.

Thanks
Zou Nanhai

* Re: gem clflush optimization for media encoding
  2011-06-22  3:13 gem clflush optimization for media encoding Zou, Nanhai
@ 2011-06-22  4:13 ` Keith Packard
  2011-06-22  4:29   ` Zou, Nanhai
  0 siblings, 1 reply; 10+ messages in thread
From: Keith Packard @ 2011-06-22  4:13 UTC
  To: Zou, Nanhai, intel-gfx; +Cc: Anholt, Eric

On Wed, 22 Jun 2011 11:13:09 +0800, "Zou, Nanhai" <nanhai.zou@intel.com> wrote:

> 	If I upload the input buffer with movnti or movntdq (bypassing the
> 	cache) and an sfence (to drain the write-combining buffers) at
> 	the end, clflush should not be needed.

Alas, neither of these will flush existing cached data, so you must
still use clflush to ensure that the data makes it out to memory. All
that they do is avoid consuming additional cache lines.

You want to use a write combining mapping, which should give you full
bandwidth access to memory without hitting any caches. You can use the GTT
mapping as the aperture is configured for write combining access, or we
can figure out how to make PAT work.

> 	Since it is a CPU read-only surface, clflush is not needed at all.

You'd still have to invalidate cache lines using clflush to avoid using
stale data in the CPU cache.

-- 
keith.packard@intel.com

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

* Re: gem clflush optimization for media encoding
  2011-06-22  4:13 ` Keith Packard
@ 2011-06-22  4:29   ` Zou, Nanhai
  2011-06-22  4:37     ` Zou, Nanhai
                       ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Zou, Nanhai @ 2011-06-22  4:29 UTC
  To: Keith Packard, intel-gfx; +Cc: Anholt, Eric



>>-----Original Message-----
>>From: Keith Packard [mailto:keithp@keithp.com]
>>Sent: June 22, 2011 12:14
>>To: Zou, Nanhai; intel-gfx@lists.freedesktop.org
>>Cc: Anholt, Eric
>>Subject: Re: [Intel-gfx] gem clflush optimization for media encoding
>>
>>On Wed, 22 Jun 2011 11:13:09 +0800, "Zou, Nanhai" <nanhai.zou@intel.com> wrote:
>>
>>> 	If I upload the input buffer with movnti or movntdq (bypassing the
>>> 	cache) and an sfence (to drain the write-combining buffers) at
>>> 	the end, clflush should not be needed.
>>
>>Alas, neither of these will flush existing cached data, so you must
>>still use clflush to ensure that the data makes it out to memory. All
>>that they do is avoid consuming additional cache lines.
>>
  As I understand it,
  with movnti + sfence the data should surely reach memory. The cache should be coherent in this case.

>>You want to use a write combining mapping, which should give you full
>>bandwidth access to memory without hitting any caches. You can use the GTT
>>mapping as the aperture is configured for write combining access, or we
>>can figure out how to make PAT work.
>>
	map_gtt in current gem is super slow. 
	I've tried map_gtt but it seems that the speed is unacceptable.

>>> 	Since it is a CPU read-only surface, clflush is not needed at all.
>>
>>You'd still have to invalidate cache lines using clflush to avoid using
>>stale data in the CPU cache.
>>
>>--
  Yes, you are right, in this case clflush is still needed to invalidate the CPU cache. 

  The problem is that we do not know how large the coded output buffer will be before we do the encoding.
  So we have to allocate a large enough gem object before encoding. In most
cases the encoding result is less than 1/10 of the safe buffer size, so 9/10 of the buffer is clflushed unnecessarily.

  A fast map_gtt implementation could be the best choice here.

Thanks
Zou Nanhai

* Re: gem clflush optimization for media encoding
  2011-06-22  4:29   ` Zou, Nanhai
@ 2011-06-22  4:37     ` Zou, Nanhai
  2011-06-22  6:29     ` Daniel Vetter
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Zou, Nanhai @ 2011-06-22  4:37 UTC
  To: Zou, Nanhai, Keith Packard, intel-gfx; +Cc: Anholt, Eric



>>-----Original Message-----
>>From: intel-gfx-bounces+nanhai.zou=intel.com@lists.freedesktop.org
>>[mailto:intel-gfx-bounces+nanhai.zou=intel.com@lists.freedesktop.org] On
>>Behalf Of Zou, Nanhai
>>Sent: June 22, 2011 12:29
>>To: Keith Packard; intel-gfx@lists.freedesktop.org
>>Cc: Anholt, Eric
>>Subject: Re: [Intel-gfx] gem clflush optimization for media encoding
>>
>>  A fast map_gtt implementation could be the best choice here.
>>
	Or can we clflush cache line by cache line while reading, instead of flushing the entire object?
	This optimization will give a >40% speedup for 1080p encoding.

* Re: gem clflush optimization for media encoding
  2011-06-22  4:29   ` Zou, Nanhai
  2011-06-22  4:37     ` Zou, Nanhai
@ 2011-06-22  6:29     ` Daniel Vetter
  2011-06-22 16:20       ` Keith Packard
  2011-06-22 16:18     ` Keith Packard
  2011-06-23 17:20     ` Jesse Barnes
  3 siblings, 1 reply; 10+ messages in thread
From: Daniel Vetter @ 2011-06-22  6:29 UTC
  To: Zou, Nanhai; +Cc: Anholt, Eric, intel-gfx

2011/6/22 Zou, Nanhai <nanhai.zou@intel.com>:
>        map_gtt in current gem is super slow.
>        I've tried map_gtt but it seems that the speed is unacceptable.
map_gtt should be pretty fast for large things on the upload side. For
the gpu->cpu download, have you tried pread? btw, the counterpart
(pwrite) also beats everything else for small uploads. Might be worth
it to use that generally throughout libva.

The important thing is that you may never use the cpu mappings with
these functions (for objects of similar size). Because libdrm reuses
bos without checking their domain, you'll get tons of unnecessary
clflush even on objects that do not get accessed through the cpu
domain.
-Daniel
-- 
Daniel Vetter
daniel.vetter@ffwll.ch - +41 (0) 79 364 57 48 - http://blog.ffwll.ch

* Re: gem clflush optimization for media encoding
  2011-06-22  4:29   ` Zou, Nanhai
  2011-06-22  4:37     ` Zou, Nanhai
  2011-06-22  6:29     ` Daniel Vetter
@ 2011-06-22 16:18     ` Keith Packard
  2011-06-23 17:20     ` Jesse Barnes
  3 siblings, 0 replies; 10+ messages in thread
From: Keith Packard @ 2011-06-22 16:18 UTC
  To: Zou, Nanhai, intel-gfx; +Cc: Anholt, Eric


On Wed, 22 Jun 2011 12:29:21 +0800, "Zou, Nanhai" <nanhai.zou@intel.com> wrote:

>   As I understand it,
>   with movnti + sfence the data should surely reach memory. The cache should be coherent in this case.

I wouldn't mind seeing additional experiments in this area, but when
Eric and I tried this a couple of years ago, we found that without
clflush, data would not reliably be forced out to memory.
> 

> 	map_gtt in current gem is super slow. 
> 	I've tried map_gtt but it seems that the speed is unacceptable.

You almost certainly want to allocate a couple of sufficient GTT mapped
buffers and hang onto them; you'll call map_gtt only at startup, then
re-use the buffers.

>   A fast map_gtt implementation could be the best choice here.

Chris Wilson may have some ideas about how to speed up map_gtt, but I
suspect the best plan is to not need to speed it up by re-using the same
mapped buffers.

-- 
keith.packard@intel.com

* Re: gem clflush optimization for media encoding
  2011-06-22  6:29     ` Daniel Vetter
@ 2011-06-22 16:20       ` Keith Packard
  2011-06-22 16:49         ` Chris Wilson
  0 siblings, 1 reply; 10+ messages in thread
From: Keith Packard @ 2011-06-22 16:20 UTC
  To: Daniel Vetter, Zou, Nanhai; +Cc: intel-gfx, Anholt, Eric


On Wed, 22 Jun 2011 08:29:24 +0200, Daniel Vetter <daniel@ffwll.ch> wrote:

> The important thing is that you may never use the cpu mappings with
> these functions (for objects of similar size). Because libdrm reuses
> bos without checking their domain, you'll get tons of unnecessary
> clflush even on objects that do not get accessed through the cpu
> domain.

Yeah, the problem is that a BO which is not pinned down may get paged
out, in which case it lands in the CPU domain. I'm not sure we've ever
added an optimization to avoid flushing objects which are known not to
have been written to disk?

-- 
keith.packard@intel.com

* Re: gem clflush optimization for media encoding
  2011-06-22 16:20       ` Keith Packard
@ 2011-06-22 16:49         ` Chris Wilson
  0 siblings, 0 replies; 10+ messages in thread
From: Chris Wilson @ 2011-06-22 16:49 UTC
  To: Keith Packard, Daniel Vetter, Zou, Nanhai; +Cc: intel-gfx, Anholt, Eric

On Wed, 22 Jun 2011 09:20:35 -0700, Keith Packard <keithp@keithp.com> wrote:
> On Wed, 22 Jun 2011 08:29:24 +0200, Daniel Vetter <daniel@ffwll.ch> wrote:
> 
> > The important thing is that you may never use the cpu mappings with
> > these functions (for objects of similar size). Because libdrm reuses
> > bos without checking their domain, you'll get tons of unnecessary
> > clflush even on objects that do not get accessed through the cpu
> > domain.
> 
> Yeah, the problem is that a BO which is not pinned down may get paged
> out, in which case it lands in the CPU domain. I'm not sure we've ever
> added an optimization to avoid flushing objects which are known not to
> have been written to disk?

I've toyed with such. Can be very effective for large working sets like
firefox thrashing the aperture. A very simple example is
firefox-planet-gnome which demonstrates the effect just by scrolling
within a single page on gen3. [The trace currently spends 35% of its time
in clflush which can be entirely eliminated by such tracking.] There's
also the secondary benefit that shmemfs is quite slow for our purposes.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: gem clflush optimization for media encoding
  2011-06-22  4:29   ` Zou, Nanhai
                       ` (2 preceding siblings ...)
  2011-06-22 16:18     ` Keith Packard
@ 2011-06-23 17:20     ` Jesse Barnes
  2011-06-24  1:41       ` Zou, Nanhai
  3 siblings, 1 reply; 10+ messages in thread
From: Jesse Barnes @ 2011-06-23 17:20 UTC
  To: Zou, Nanhai; +Cc: Anholt, Eric, intel-gfx

On Wed, 22 Jun 2011 12:29:21 +0800
"Zou, Nanhai" <nanhai.zou@intel.com> wrote:
> 	map_gtt in current gem is super slow. 
> 	I've tried map_gtt but it seems that the speed is unacceptable.
> 
> >>> 	Since it is a CPU read-only surface, clflush is not needed at all.
> >>
> >>You'd still have to invalidate cache lines using clflush to avoid using
> >>stale data in the CPU cache.
> >>
> >>--
>   Yes, you are right, in this case clflush is still needed to invalidate the CPU cache. 
> 
>   The problem is that we do not know how large the coded output buffer will be before we do the encoding.
>   So we have to allocate a large enough gem object before encoding. In most
> cases the encoding result is less than 1/10 of the safe buffer size, so 9/10 of the buffer is clflushed unnecessarily.
> 
>   A fast map_gtt implementation could be the best choice here.

What's slow about it?  Are you sure you're getting a WC mapping?  If
your MTRRs or PAT are messed up you may be getting a regular UC
mapping, which would be slow.  Also you need to write the data
sequentially to get the benefits of WC.  If you write every other byte
or jump around (and of course read) you'll flush the WC buffer and slow
things down.

-- 
Jesse Barnes, Intel Open Source Technology Center

* Re: gem clflush optimization for media encoding
  2011-06-23 17:20     ` Jesse Barnes
@ 2011-06-24  1:41       ` Zou, Nanhai
  0 siblings, 0 replies; 10+ messages in thread
From: Zou, Nanhai @ 2011-06-24  1:41 UTC
  To: Jesse Barnes; +Cc: Anholt, Eric, intel-gfx



>>-----Original Message-----
>>From: Jesse Barnes [mailto:jbarnes@virtuousgeek.org]
>>Sent: June 24, 2011 1:20
>>To: Zou, Nanhai
>>Cc: Keith Packard; intel-gfx@lists.freedesktop.org; Anholt, Eric
>>Subject: Re: [Intel-gfx] gem clflush optimization for media encoding
>>
>>What's slow about it?  Are you sure you're getting a WC mapping?  If
>>your MTRRs or PAT are messed up you may be getting a regular UC
>>mapping, which would be slow.  Also you need to write the data
>>sequentially to get the benefits of WC.  If you write every other byte
>>or jump around (and of course read) you'll flush the WC buffer and slow
>>things down.
>>

Yes, I have noticed that; it seems the uploaded data was written through a UC mapping.
We are trying to fix this.

Thanks
Zou Nanhai

Thread overview: 10+ messages
2011-06-22  3:13 gem clflush optimization for media encoding Zou, Nanhai
2011-06-22  4:13 ` Keith Packard
2011-06-22  4:29   ` Zou, Nanhai
2011-06-22  4:37     ` Zou, Nanhai
2011-06-22  6:29     ` Daniel Vetter
2011-06-22 16:20       ` Keith Packard
2011-06-22 16:49         ` Chris Wilson
2011-06-22 16:18     ` Keith Packard
2011-06-23 17:20     ` Jesse Barnes
2011-06-24  1:41       ` Zou, Nanhai
