* Batched ww-mutexes, wound-wait vs wait-die etc.
@ 2018-04-11  8:27 Thomas Hellstrom
  2018-04-13 17:13 ` Daniel Vetter
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Hellstrom @ 2018-04-11  8:27 UTC (permalink / raw)
  To: dri-devel; +Cc: Maarten Lankhorst

Hi!

Thinking of adding ww-mutexes also for reservation of vmwgfx resources
(like surfaces), I became a bit worried that doubling the number of locks
taken during command submission wouldn't be a good thing, particularly on
ESX servers, where a high number of virtual machines running graphics on a
multi-core processor would generate a very high number of locked processor
cycles. The method we use today is to reserve all those resources under a
single mutex. Buffer objects still use reservation objects and hence
ww-mutexes, though.

So I figured a "middle way" would be to add batched ww-mutexes, where the
ww-mutex locking state, instead of being manipulated atomically, is
manipulated under a single lock-class global spinlock. We could then
condense the sometimes 200+ locking operations per command submission into
two: one for lock and one for unlock. Obvious drawbacks are that taking
the spinlock is slightly more expensive than an atomic operation, and that
we could introduce contention for the spinlock where there is no
contention for an atomic operation.
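
To make the idea a bit more concrete, below is a rough, untested sketch of
what a batched acquire could look like. All names (ww_batch_class,
ww_batch_mutex, ww_batch_lock_all, ...) are made up for illustration and
this is not an existing API; the point is simply that all locks of a
submission are claimed under a single spin_lock()/spin_unlock() pair
instead of one atomic cmpxchg per lock:

/*
 * Illustrative sketch only (kernel-style C; would need <linux/spinlock.h>,
 * <linux/sched.h> and <linux/errno.h>). Not an existing API.
 */
struct ww_batch_class {
        spinlock_t lock;                /* lock-class global spinlock */
};

struct ww_batch_mutex {
        struct ww_batch_class *class;
        struct task_struct *owner;      /* protected by class->lock */
};

/* Lock all mutexes of a command submission in one critical section. */
static int ww_batch_lock_all(struct ww_batch_class *class,
                             struct ww_batch_mutex **locks, int n)
{
        int i;

        spin_lock(&class->lock);
        for (i = 0; i < n; i++) {
                if (locks[i]->owner && locks[i]->owner != current) {
                        /* Contended: roll back what we took so far.
                         * Wait-die / wound-wait handling omitted. */
                        while (--i >= 0)
                                locks[i]->owner = NULL;
                        spin_unlock(&class->lock);
                        return -EDEADLK;
                }
                locks[i]->owner = current;
        }
        spin_unlock(&class->lock);
        return 0;
}

/* Unlock is the second and last locked operation of the submission. */
static void ww_batch_unlock_all(struct ww_batch_class *class,
                                struct ww_batch_mutex **locks, int n)
{
        int i;

        spin_lock(&class->lock);
        for (i = 0; i < n; i++)
                locks[i]->owner = NULL;
        spin_unlock(&class->lock);
}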

So I set out to test this in practice. After reading up a bit on the
theory, it turned out that the current in-kernel wound-wait
implementation, like TTM once did (unknowingly), actually implements not
wound-wait but wait-die. The correct name would thus be "wait-die
mutexes". Some sources on the net claim wound-wait is the better algorithm
due to a reduced number of backoffs:

http://www.mathcs.emory.edu/~cheung/Courses/554/Syllabus/8-recv+serial/deadlock-compare.html

So I implemented both algorithms in a standalone testing module:

git+ssh://people.freedesktop.org/~thomash/ww_mutex_test

Some preliminary test trends:

1) Testing uncontended sequential command submissions: Batching
ww-mutexes seems to be between 50% and 100% faster than the current
kernel implementation. Still, the kernel implementation performs much
better than I had thought.

2) Testing uncontended parallel command submission: Batching ww-mutexes
is slower (up to 50%) than the current kernel implementation, since the
in-kernel implementation can make use of multi-core parallelism where the
batching implementation sees spinlock contention. This effect should,
however, probably be relaxed by setting a longer command submission time,
reducing the spinlock contention.

3) Testing contended parallel command submission: Batching is generally
superior, usually by around 50% and sometimes by up to 100%. One of the
reasons could be that batching appears to result in a significantly lower
number of rollbacks.

4) Taking batching locks without actually batching can result in poor
performance.

5) Wound-Wait vs Wait-Die. As predicted, particularly with a low number
of parallel cs threads, wound-wait appears to give a lower number of
rollbacks, but there seem to be no overall locking-time benefits. On the
contrary, as the number of threads exceeds the number of cores,
wound-wait appears to become increasingly more time-consuming than
wait-die. One reason for this might be that wound-wait may see an
increased number of unlocks per rollback. Another is that it is not
trivial to find a good lock to wait on with wound-wait. With wait-die,
the thread rolling back just waits for the contended lock. With
wound-wait, the wounded thread is preempted, and in my implementation I
chose to lazy-preempt at the next blocking lock, so that at least we have
a lock to wait on, even if it's not a relevant lock to trigger a
rollback. (A schematic comparison of the two rules follows below.)
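
For reference, the difference between the two schemes boils down to which
transaction rolls back on contention. Very roughly, and only as an
illustration (this is not the kernel's actual code; "stamp" is the
acquire ticket, lower meaning older):

/* Illustration only; kernel-style C, needs <linux/types.h> and
 * <linux/errno.h>. Lower stamp == older transaction. */
struct tx {
        unsigned long stamp;
};

/* Wait-Die (what the current ww_mutex code is argued above to actually
 * do): a younger waiter backs off and restarts its transaction, an
 * older waiter is allowed to sleep on the lock. */
static int contend_wait_die(const struct tx *waiter, const struct tx *holder)
{
        return waiter->stamp > holder->stamp ? -EDEADLK /* waiter dies  */
                                             : 0;       /* waiter waits */
}

/* Wound-Wait: the waiter always waits; a younger lock *holder* gets
 * wounded and has to roll back (here lazily, at its next blocking lock). */
static int contend_wound_wait(const struct tx *waiter, const struct tx *holder,
                              bool *wound_holder)
{
        *wound_holder = waiter->stamp < holder->stamp;
        return 0;
}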

So this raises a couple of questions:

1) Should we implement an upstream version of batching locks, perhaps as 
a choice on a per-lock-class basis?
2) Should we add a *real* wound-wait choice to our wound-wait mutexes?
Otherwise, perhaps rename them or document that they're actually doing
wait-die.

/Thomas




* Re: Batched ww-mutexes, wound-wait vs wait-die etc.
  2018-04-11  8:27 Batched ww-mutexes, wound-wait vs wait-die etc Thomas Hellstrom
@ 2018-04-13 17:13 ` Daniel Vetter
  2018-04-13 20:23   ` Thomas Hellstrom
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Vetter @ 2018-04-13 17:13 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: Maarten Lankhorst, dri-devel

Hi Thomas,

On Wed, Apr 11, 2018 at 10:27:06AM +0200, Thomas Hellstrom wrote:
> Hi!
> 
> Thinking of adding ww-mutexes for reservation also of vmwgfx resources,
> (like surfaces), I became a bit worried that doubling the locks taken during
> command submission wouldn't be a good thing. Particularly on ESX servers
> where a high number of virtual machines running graphics on a multi-core
> processor would initiate a very high number of processor locked cycles. The
> method we use today is to reserve all those resources under a single mutex.
> Buffer objects are still using reservation objects and hence ww-mutexes,
> though.
> 
> So I figured a "middle way" would be to add batched ww-mutexes, where the
> ww-mutex locking state, instead of being manipulated atomically,
> was manipulated under a single lock-class global spinlock. We could then
> condense the sometimes 200+ locking operations per command submission to
> two, one for lock and one for unlock. Obvious drawbacks are that taking the
> spinlock is slightly more expensive than an atomic operation, and that we
> could introduce contention for the spinlock where there is no contention for
> an atomic operation.
> 
> So I set out to test this in practice. After reading up a bit on the theory
> it turned out that the current in-kernel wound-wait implementation, like
> once TTM (unknowingly), is actually not wound-wait, but wait-die. Correct
> name would thus be "wait-die mutexes", Some sources on the net claimed
> "wait-wound" is the better algorithm due to a reduced number of backoffs:

Nice stuff.

> http://www.mathcs.emory.edu/~cheung/Courses/554/Syllabus/8-recv+serial/deadlock-compare.html
> 
> So I implemented both algorithms in a standalone testing module:
> 
> git+ssh://people.freedesktop.org/~thomash/ww_mutex_test

One issue with real wound-wait is that we can't really implement it in the
kernel: Kernel context can't be aborted at arbitrary places. It should
still work (since the wound takes effect as soon as the wounded thread
tries to grab another lock, and anyone waiting for a lock will get
wounded), but the balance might be different in real-world tests (which do
some real cpu work once locks are acquired). Or did increasing CS_UDELAY
not change any of that?

> Some preliminary test trends:
> 
> 1) Testing uncontended sequential command submissions: Batching ww-mutexes
> seems to be between 50% and 100% faster than the current kernel
> implementation. Still the kernel implementation performing much better than
> I thought.

The core mutex implementation we share is real fast in the uncontended
case. Benefiting from all the micro-optimization work was part of the
reason for moving the ttm reservations into core code.

> 2) Testing uncontended parallell command submission: Batching ww-mutexes
> slower (up to 50%) of the current kernel implementation, since the in-kernel
> implementation can make use of multi-core parallellism where the batching
> implementation sees spinlock contention. This effect should, however,
> probably  be relaxed if setting a longer command submission time, reducing
> the spinlock contention.
> 
> 3) Testing contended parallell command submission: Batching is generally
> superior by usually around 50%, sometimes up to 100%, One of the reasons
> could be that batching appears to result in a significantly lower number of
> rollbacks.
> 
> 5) Taking batching locks without actually batching can result i poor
> performance.
> 
> 4) Wound-Wait vs Wait-Die. As predicted, particularly with a low number of
> parallell cs threads, Wound-wait appears to give a lower number of
> rollbacks, but there seems to be no overall locking time benefits. On the
> contrary, as the number of threads exceeds the number of cores, wound-wait
> appears to become increasingly more time-consuming than Wait-Die. One of the
> reason for this might be that Wound-Wait may see an increased number of
> unlocks per rollback. Another is that it is not trivial to find a good lock
> to wait for with Wound-Wait. With Wait-Die the thread rolling back just
> waits for the contended lock. With wound-wait the wounded thread is
> preempted, and in my implementation I choose to lazy-preempt at the next
> blocking lock, so that at least we have a lock to wait on, even if it's not
> a relevant lock to trigger a rollback.
> 
> So this raises a couple of questions:
> 
> 1) Should we implement an upstream version of batching locks, perhaps as a
> choice on a per-lock-class basis?

Not sure. At least for me ww-mutexes were meant to be highly efficient in
the uncontended case (generally you don't have people trampling all over
their feet^Wbuffers, or there's bigger problems already). But with the
guarantee that in contention cases you can't stage a denial of service,
i.e. there's guaranteed forward progress and later threads can't block
older threads forever. And given multicores are here to stay, I'd say
weighting 2) higher than 1) in your experiments makes sense.

If you need arbitrary lock ordering, in parallel, and contended, not sure
that's even possible to solve in a fast&generic way. My gut feeling would
be to reorg the algorithm to not have that?

Note: ww-mutex use in struct drm_modeset_lock is kinda different, there we
only care about the "does not deadlock" part, because generally display
updates are ratelimited and a little bit of overhead doesn't matter.

I guess your ESX use-case is a bit different, where you care less about
parallelism within a single guest instance and more about parallelism of
the overall system. Honestly not sure what to do.

> 2) Should we add a *real* wound-wait choice to our wound-wait mutexes.
> Otherwise perhaps rename them or document that they're actually doing
> wait-die.

I think a doc patch would be good at least. Including all the data you
assembled here.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: Batched ww-mutexes, wound-wait vs wait-die etc.
  2018-04-13 17:13 ` Daniel Vetter
@ 2018-04-13 20:23   ` Thomas Hellstrom
  2018-04-14  8:33     ` Daniel Vetter
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Hellstrom @ 2018-04-13 20:23 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Maarten Lankhorst, dri-devel

Hi, Daniel,

On 04/13/2018 07:13 PM, Daniel Vetter wrote:
> Hi Thomas,
>
> On Wed, Apr 11, 2018 at 10:27:06AM +0200, Thomas Hellstrom wrote:
>> Hi!
>>
>> Thinking of adding ww-mutexes for reservation also of vmwgfx resources,
>> (like surfaces), I became a bit worried that doubling the locks taken during
>> command submission wouldn't be a good thing. Particularly on ESX servers
>> where a high number of virtual machines running graphics on a multi-core
>> processor would initiate a very high number of processor locked cycles. The
>> method we use today is to reserve all those resources under a single mutex.
>> Buffer objects are still using reservation objects and hence ww-mutexes,
>> though.
>>
>> So I figured a "middle way" would be to add batched ww-mutexes, where the
>> ww-mutex locking state, instead of being manipulated atomically,
>> was manipulated under a single lock-class global spinlock. We could then
>> condense the sometimes 200+ locking operations per command submission to
>> two, one for lock and one for unlock. Obvious drawbacks are that taking the
>> spinlock is slightly more expensive than an atomic operation, and that we
>> could introduce contention for the spinlock where there is no contention for
>> an atomic operation.
>>
>> So I set out to test this in practice. After reading up a bit on the theory
>> it turned out that the current in-kernel wound-wait implementation, like
>> once TTM (unknowingly), is actually not wound-wait, but wait-die. Correct
>> name would thus be "wait-die mutexes", Some sources on the net claimed
>> "wait-wound" is the better algorithm due to a reduced number of backoffs:
> Nice stuff.
>
>> http://www.mathcs.emory.edu/~cheung/Courses/554/Syllabus/8-recv+serial/deadlock-compare.html
>>
>> So I implemented both algorithms in a standalone testing module:
>>
>> git+ssh://people.freedesktop.org/~thomash/ww_mutex_test
> One issue with real wait-wound is that we can't really implement it in the
> kernel: Kernel context can't be aborted at arbitrary places. It should
> still work (since as soon as it tries to grab another lock will wound it,
> and anyone waiting for a lock will get wounded),
So what I implemented was lazy wound-wait preemption, meaning that if a 
thread gets wounded, it aborts on the next lock it needs to wait on. This 
is safe from a deadlock-avoidance point of view and has an important 
implication, since after the abort we will wait on the lock we aborted 
on. If we instead just aborted on the next lock without then waiting on 
it, it may well be that we grab a number of locks immediately after the 
abort and get wounded again, ..and again ..and again by the same 
contending thread. It's like restarting wait-die without waiting..
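
Schematically (simplified for illustration, the names do not match the
actual ww_mutex_test code), the lazy preemption point sits in the blocking
slowpath:

/* Simplified illustration only; not the actual ww_mutex_test code. */
struct batch_mutex;

struct batch_ctx {
        spinlock_t *lock;               /* batch-group or local spinlock */
        bool wounded;                   /* set by an older contender */
        struct batch_mutex *contended;  /* lock to wait on after rollback */
};

static int batch_lock_slowpath(struct batch_ctx *ctx, struct batch_mutex *mutex)
{
        spin_lock(ctx->lock);
        if (ctx->wounded) {
                /* We were wounded earlier but only act on it here, where
                 * we would block anyway. After rolling back, the caller
                 * waits on *this* lock, so we actually wait instead of
                 * immediately retaking locks and getting wounded again. */
                ctx->contended = mutex;
                spin_unlock(ctx->lock);
                return -EDEADLK;
        }
        /* ...otherwise queue up as a waiter and sleep... */
        spin_unlock(ctx->lock);
        return 0;
}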


>   but the balance might be
> different in real world tests (which do some real cpu work once locks are
> acquired). Or did increase CS_UDELAY not change any of that?



>
>> Some preliminary test trends:
>>
>> 1) Testing uncontended sequential command submissions: Batching ww-mutexes
>> seems to be between 50% and 100% faster than the current kernel
>> implementation. Still the kernel implementation performing much better than
>> I thought.
> The core mutex implementation we share is real fast in the uncontended
> case. Benefitting from all the micro-optimization work was part of the
> reasons for moving the ttm reservations into core code.
>
>> 2) Testing uncontended parallell command submission: Batching ww-mutexes
>> slower (up to 50%) of the current kernel implementation, since the in-kernel
>> implementation can make use of multi-core parallellism where the batching
>> implementation sees spinlock contention. This effect should, however,
>> probably  be relaxed if setting a longer command submission time, reducing
>> the spinlock contention.
>>
>> 3) Testing contended parallell command submission: Batching is generally
>> superior by usually around 50%, sometimes up to 100%, One of the reasons
>> could be that batching appears to result in a significantly lower number of
>> rollbacks.
>>
>> 5) Taking batching locks without actually batching can result i poor
>> performance.
>>
>> 4) Wound-Wait vs Wait-Die. As predicted, particularly with a low number of
>> parallell cs threads, Wound-wait appears to give a lower number of
>> rollbacks, but there seems to be no overall locking time benefits. On the
>> contrary, as the number of threads exceeds the number of cores, wound-wait
>> appears to become increasingly more time-consuming than Wait-Die. One of the
>> reason for this might be that Wound-Wait may see an increased number of
>> unlocks per rollback. Another is that it is not trivial to find a good lock
>> to wait for with Wound-Wait. With Wait-Die the thread rolling back just
>> waits for the contended lock. With wound-wait the wounded thread is
>> preempted, and in my implementation I choose to lazy-preempt at the next
>> blocking lock, so that at least we have a lock to wait on, even if it's not
>> a relevant lock to trigger a rollback.
>>
>> So this raises a couple of questions:
>>
>> 1) Should we implement an upstream version of batching locks, perhaps as a
>> choice on a per-lock-class basis?
> Not sure. At least for me ww-mutexes were meant to be highly efficient in
> the uncontended case (generally you don't have people trampling all over
> their feet^Wbuffers, or there's bigger problems already). But with the
> guarantee that in contention cases you can't stage a denial of service,
> i.e. there's guaranteed forward progress and later threads can't block
> older threads forever. And given multicores are here to stay, I'd say
> weighting 2) higher than 1) in your experiments makes sense.
>
> If you need arbitrary lock ordering, in parallel, and contented, not sure
> that's even possible to solve in a fast&generic way. My gut feeling would
> be to reorg the algorithm to not have that?
>
> Note: ww-mutex use in struct drm_modeset_lock is kinda different, there we
> only care about the "does not deadlock" part, because generally display
> updates are ratelimited and a little bit of overhead doesn't matter.
>
> I guess your ESX use-case is a bit different, where you care less about
> parallelism within a single guest instance and more about parallelism of
> the overall system. Honestly not sure what to do.
>
>> 2) Should we add a *real* wound-wait choice to our wound-wait mutexes.
>> Otherwise perhaps rename them or document that they're actually doing
>> wait-die.
> I think a doc patch would be good at least. Including all the data you
> assembled here.

Actually, a further investigation appears to indicate that manipulating 
the lock state under a local spinlock is about as fast as using atomic 
operations, even for the completely uncontended cases.

This means that we could have a solution where you decide on a per-mutex 
or per-reservation-object basis whether you want to manipulate the lock 
state under a "batch group" spinlock (meaning certain performance 
characteristics) or under traditional local locking (meaning other 
performance characteristics).

Like, vmwgfx could choose batching locks and radeon traditional locks, 
but the same API would work for both, and locks could be shared between 
drivers.

I'll see if I get time to put together an RFC.

/Thomas


> Cheers, Daniel




* Re: Batched ww-mutexes, wound-wait vs wait-die etc.
  2018-04-13 20:23   ` Thomas Hellstrom
@ 2018-04-14  8:33     ` Daniel Vetter
  2018-04-16  8:23       ` Thomas Hellstrom
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Vetter @ 2018-04-14  8:33 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: Maarten Lankhorst, dri-devel

Hi Thomas,

On Fri, Apr 13, 2018 at 10:23 PM, Thomas Hellstrom
<thellstrom@vmware.com> wrote:
> On 04/13/2018 07:13 PM, Daniel Vetter wrote:
>> On Wed, Apr 11, 2018 at 10:27:06AM +0200, Thomas Hellstrom wrote:
>>> 2) Should we add a *real* wound-wait choice to our wound-wait mutexes.
>>> Otherwise perhaps rename them or document that they're actually doing
>>> wait-die.
>>
>> I think a doc patch would be good at least. Including all the data you
>> assembled here.
>
>
> Actually, a further investigation appears to indicate that manipulating the
> lock state under a local spinlock is about fast as using atomic operations
> even for the completely uncontended cases.
>
> This means that we could have a solution where you decide on a per-mutex or
> per-reservation object basis whether you want to manipulate lock-state under
> a "batch group" spinlock, meaning certain performance characteristics or
> traditional local locking, meaning other performance characteristics.
>
> Like, vmwgfx could choose batching locks, radeon traditional locks, but the
> same API would work for both and locks could be shared between drivers..

Don't we need to make this decision at least on a per-class level? Or
how will the spinlock/batch-lock approach interact with the normal
ww_mutex_lock path (which does require the atomics/ordered stores
we're trying to avoid)?

If we can't mix them I'm kinda leaning towards a
ww_batch_mutex/ww_batch_acquire_ctx, but exactly matching api
otherwise. We probably do need the new batch_start/end api, since
ww_acquire_done isn't quite the right place ...
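
Hypothetically, i.e. something with roughly this shape (none of these
names exist today, this is just to illustrate the idea; a real
ww_batch_mutex_lock() would have to drop and re-take the shared spinlock
internally whenever it needs to sleep):

/* Hypothetical API shape only; nothing like this exists in the kernel. */
struct ww_batch_class;
struct ww_batch_mutex;
struct ww_batch_acquire_ctx;

void ww_batch_acquire_init(struct ww_batch_acquire_ctx *ctx,
                           struct ww_batch_class *ww_class);
void ww_batch_start(struct ww_batch_acquire_ctx *ctx);  /* enter batch  */
int ww_batch_mutex_lock(struct ww_batch_mutex *lock,
                        struct ww_batch_acquire_ctx *ctx);
void ww_batch_end(struct ww_batch_acquire_ctx *ctx);    /* leave batch  */
void ww_batch_acquire_fini(struct ww_batch_acquire_ctx *ctx);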

> I'll see if I get time to put together an RFC.

Yeah I think there's definitely some use for batched ww locks, where
parallelism is generally low, or at least the ratio between "time
spent acquiring locks" and "time spent doing stuff while holding
locks" is small enough not to make the reduced parallelism while
acquiring an issue.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

* Re: Batched ww-mutexes, wound-wait vs wait-die etc.
  2018-04-14  8:33     ` Daniel Vetter
@ 2018-04-16  8:23       ` Thomas Hellstrom
  2018-04-16  9:19         ` Daniel Vetter
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Hellstrom @ 2018-04-16  8:23 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Maarten Lankhorst, dri-devel

On 04/14/2018 10:33 AM, Daniel Vetter wrote:
> Hi Thomas,
>
> On Fri, Apr 13, 2018 at 10:23 PM, Thomas Hellstrom
> <thellstrom@vmware.com> wrote:
>> On 04/13/2018 07:13 PM, Daniel Vetter wrote:
>>> On Wed, Apr 11, 2018 at 10:27:06AM +0200, Thomas Hellstrom wrote:
>>>> 2) Should we add a *real* wound-wait choice to our wound-wait mutexes.
>>>> Otherwise perhaps rename them or document that they're actually doing
>>>> wait-die.
>>> I think a doc patch would be good at least. Including all the data you
>>> assembled here.
>>
>> Actually, a further investigation appears to indicate that manipulating the
>> lock state under a local spinlock is about fast as using atomic operations
>> even for the completely uncontended cases.
>>
>> This means that we could have a solution where you decide on a per-mutex or
>> per-reservation object basis whether you want to manipulate lock-state under
>> a "batch group" spinlock, meaning certain performance characteristics or
>> traditional local locking, meaning other performance characteristics.
>>
>> Like, vmwgfx could choose batching locks, radeon traditional locks, but the
>> same API would work for both and locks could be shared between drivers..
> Don't we need to make this decision at least on a per-class level?

No, I was thinking more along the lines of the ww_mutex having a pointer 
to the spinlock. It could be either the local mutex "wait_lock" or a
per-batch-group lock. The mutex code wouldn't care. We do need a special 
API for batched locking, though, but not for ordinary locking.
Both APIs should be able to handle local or grouped spinlocks.
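
Something along these lines (hypothetical layout, field names invented;
today's struct ww_mutex has no such pointer):

/* Hypothetical layout, for illustration only. */
struct ww_batch_group {
        spinlock_t lock;                /* shared, e.g. one per process */
};

struct ww_mutex_batched {
        struct mutex base;
        struct ww_acquire_ctx *ctx;
        spinlock_t local_lock;          /* used when not part of a group */
        spinlock_t *state_lock;         /* &local_lock or &group->lock;
                                         * all lock-state updates happen
                                         * under *state_lock */
};

/* The lock/unlock code only ever does spin_lock(lock->state_lock);
 * whether that spinlock is local or shared becomes an init-time choice. */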

Note that this would of course require that there would be no 
performance loss for users that don't use batch groups.

I guess the most efficient use for GPU command submission would be 
per-process batch-groups. Then, when the batch encounters a ww_mutex with 
a different batch group (for example the display server's shared 
surface), it'll just switch batch lock, and this way the contention for 
the batch spinlock will be mostly eliminated.

>   Or
> how will the spinlock/batch-lock approach interact with the normal
> ww_mutex_lock path (which does require the atomics/ordered stores
> we're trying to avoid)?

We can use the same code with some extra
if (use_ww_ctx) checks in the common locking and unlocking paths.
Note that the "use_ww_ctx" parameter is defined at compile time, so the 
ordinary mutex path (binary) shouldn't change at all after optimization, 
but I need to verify that, of course.
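
That is, the generic pattern would be the following (illustration only,
not a quote of the actual mutex.c code):

/* Generic illustration of why a compile-time-constant flag is free;
 * not the actual kernel mutex.c code. */
struct dummy_lock {
        unsigned long state;
};

static __always_inline int lock_common(struct dummy_lock *lock,
                                       const bool use_ww_ctx)
{
        if (use_ww_ctx) {
                /* ww/batch-specific state handling under a spinlock */
        }
        /* common fastpath */
        return 0;
}

/* Each caller passes a constant, so the branch is folded away and the
 * binary for the plain-mutex path stays unchanged. */
static int plain_lock(struct dummy_lock *lock)
{
        return lock_common(lock, false);
}

static int ww_lock(struct dummy_lock *lock)
{
        return lock_common(lock, true);
}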

What you can't do with such a change is to lock / unlock a ww_mutex 
using the standard mutex API, like mutex_lock(&ww_mutex->base), but I 
guess that would be OK?


>
> If we can't mix them I'm kinda leaning towards a
> ww_batch_mutex/ww_batch_acquire_ctx, but exactly matching api
> otherwise. We probably do need the new batch_start/end api, since
> ww_acquire_done isn't quite the right place ...
>
>> I'll see if I get time to put together an RFC.
> Yeah I think there's definitely some use for batched ww locks, where
> parallelism is generally low, or at least the ratio between "time
> spent acquiring locks" and "time spent doing stuff while holding
> locks" small enough to not make the reduced parallelism while
> acquiring an issue.

Yes. At least it's worth bringing up for discussion. The reduced 
parallelism shouldn't be an issue if per-process batch-groups are used, 
or if, like for vmwgfx, the command submission itself is serialized due 
to a single FIFO.

/Thomas


> -Daniel



* Re: Batched ww-mutexes, wound-wait vs wait-die etc.
  2018-04-16  8:23       ` Thomas Hellstrom
@ 2018-04-16  9:19         ` Daniel Vetter
  2018-04-16 12:15           ` Thomas Hellstrom
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Vetter @ 2018-04-16  9:19 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: Maarten Lankhorst, dri-devel

On Mon, Apr 16, 2018 at 10:23 AM, Thomas Hellstrom
<thellstrom@vmware.com> wrote:
> On 04/14/2018 10:33 AM, Daniel Vetter wrote:
>>
>> Hi Thomas,
>>
>> On Fri, Apr 13, 2018 at 10:23 PM, Thomas Hellstrom
>> <thellstrom@vmware.com> wrote:
>>>
>>> On 04/13/2018 07:13 PM, Daniel Vetter wrote:
>>>>
>>>> On Wed, Apr 11, 2018 at 10:27:06AM +0200, Thomas Hellstrom wrote:
>>>>>
>>>>> 2) Should we add a *real* wound-wait choice to our wound-wait mutexes.
>>>>> Otherwise perhaps rename them or document that they're actually doing
>>>>> wait-die.
>>>>
>>>> I think a doc patch would be good at least. Including all the data you
>>>> assembled here.
>>>
>>>
>>> Actually, a further investigation appears to indicate that manipulating
>>> the
>>> lock state under a local spinlock is about fast as using atomic
>>> operations
>>> even for the completely uncontended cases.
>>>
>>> This means that we could have a solution where you decide on a per-mutex
>>> or
>>> per-reservation object basis whether you want to manipulate lock-state
>>> under
>>> a "batch group" spinlock, meaning certain performance characteristics or
>>> traditional local locking, meaning other performance characteristics.
>>>
>>> Like, vmwgfx could choose batching locks, radeon traditional locks, but
>>> the
>>> same API would work for both and locks could be shared between drivers..
>>
>> Don't we need to make this decision at least on a per-class level?
>
>
> No, I was thinking more in the line of the ww_mutex having a pointer to the
> spinlock. It could either be the local mutex "wait_lock", or a
> per-batch-group lock. The mutex code wouldn't care. We do need a special API
> for batched locking, though, but not for ordinary locking.
> Both APIs should be able to handle local or grouped spinlocks.
>
> Note that this would of course require that there would be no performance
> loss for users that don't use batch groups.
>
> I guess the most efficient use for GPU command submission would be to use
> per-process batch-groups. Then when the batch encounters a ww_mutex with a
> different batch group (for example the display server shared surface, it'll
> just switch batch lock), and this way the contention for
> the batch spinlock will be mostly eliminated.

But won't this force us through the spinlock case for all ww_mutex?
The core mutex code goes to extreme lengths to avoid that for the
uncontended fast path. That's why I meant the batched and non-batch
ww_mutex look fundamentally incompatible. Or maybe I missed something
somewhere.

>>   Or
>> how will the spinlock/batch-lock approach interact with the normal
>> ww_mutex_lock path (which does require the atomics/ordered stores
>> we're trying to avoid)?
>
>
> We can use the same code with some extra
> if (use_ww_ctx) in the common locking and unlocking path.
> Note that the "use_ww_ctx" parameter is defined at compile-time so the
> ordinary mutex path (binary) shouldn't change at all after optimization but
> I need to verify that, of course.

Agreed, no issue there.

> What you can't do with such a change is to lock / unlock a ww_mutex using
> the standard mutex API, like mutex_lock(&ww_mutex->base), but I guess that
> would be OK?

Yeah right now that works, but I don't care about that. The point of
merging ww_mutex into the core mutex was that we can fully reuse the
__mutex_trylock_fast. Same for the unlock path.

Once we don't share that code anymore, there's imo not that much point
in having ww_mutex interleaved with core mutex code.

>> If we can't mix them I'm kinda leaning towards a
>> ww_batch_mutex/ww_batch_acquire_ctx, but exactly matching api
>> otherwise. We probably do need the new batch_start/end api, since
>> ww_acquire_done isn't quite the right place ...
>>
>>> I'll see if I get time to put together an RFC.
>>
>> Yeah I think there's definitely some use for batched ww locks, where
>> parallelism is generally low, or at least the ratio between "time
>> spent acquiring locks" and "time spent doing stuff while holding
>> locks" small enough to not make the reduced parallelism while
>> acquiring an issue.
>
>
> Yes. At least it's worth bringing up for discussion. The reduced parallelism
> shouldn't be an issue if per-process batch-groups are used, or, like for
> vmwgfx the command submission itself is serialized, due to a single FIFO.

Random aside: We do seem to already implement wound semantics (or I'm
terribly confused). See __ww_mutex_add_waiter plus related wakeup code
(in __ww_mutex_lock_check_stamp).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

* Re: Batched ww-mutexes, wound-wait vs wait-die etc.
  2018-04-16  9:19         ` Daniel Vetter
@ 2018-04-16 12:15           ` Thomas Hellstrom
  2018-04-16 12:22             ` Daniel Vetter
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Hellstrom @ 2018-04-16 12:15 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Maarten Lankhorst, dri-devel

On 04/16/2018 11:19 AM, Daniel Vetter wrote:
> On Mon, Apr 16, 2018 at 10:23 AM, Thomas Hellstrom
> <thellstrom@vmware.com> wrote:
>> On 04/14/2018 10:33 AM, Daniel Vetter wrote:
>>> Hi Thomas,
>>>
>>> On Fri, Apr 13, 2018 at 10:23 PM, Thomas Hellstrom
>>> <thellstrom@vmware.com> wrote:
>>>> On 04/13/2018 07:13 PM, Daniel Vetter wrote:
>>>>> On Wed, Apr 11, 2018 at 10:27:06AM +0200, Thomas Hellstrom wrote:
>>>>>> 2) Should we add a *real* wound-wait choice to our wound-wait mutexes.
>>>>>> Otherwise perhaps rename them or document that they're actually doing
>>>>>> wait-die.
>>>>> I think a doc patch would be good at least. Including all the data you
>>>>> assembled here.
>>>>
>>>> Actually, a further investigation appears to indicate that manipulating
>>>> the
>>>> lock state under a local spinlock is about fast as using atomic
>>>> operations
>>>> even for the completely uncontended cases.
>>>>
>>>> This means that we could have a solution where you decide on a per-mutex
>>>> or
>>>> per-reservation object basis whether you want to manipulate lock-state
>>>> under
>>>> a "batch group" spinlock, meaning certain performance characteristics or
>>>> traditional local locking, meaning other performance characteristics.
>>>>
>>>> Like, vmwgfx could choose batching locks, radeon traditional locks, but
>>>> the
>>>> same API would work for both and locks could be shared between drivers..
>>> Don't we need to make this decision at least on a per-class level?
>>
>> No, I was thinking more in the line of the ww_mutex having a pointer to the
>> spinlock. It could either be the local mutex "wait_lock", or a
>> per-batch-group lock. The mutex code wouldn't care. We do need a special API
>> for batched locking, though, but not for ordinary locking.
>> Both APIs should be able to handle local or grouped spinlocks.
>>
>> Note that this would of course require that there would be no performance
>> loss for users that don't use batch groups.
>>
>> I guess the most efficient use for GPU command submission would be to use
>> per-process batch-groups. Then when the batch encounters a ww_mutex with a
>> different batch group (for example the display server shared surface, it'll
>> just switch batch lock), and this way the contention for
>> the batch spinlock will be mostly eliminated.
> But won't this force us through the spinlock case for all ww_mutex?
> The core mutex code goes to extreme lengths to avoid that for the
> uncontended fast path. That's why I meant the batched and non-batch
> ww_mutex look fundamentally incompatible. Or maybe I missed something
> somewhere.

Yes, this would require the assumption to hold that a _local_ spinlock 
path, that is, taking a local spinlock also in the fastpath, would be as 
fast as using the atomic operations directly. And that's what I'm 
seeing (or perhaps a percent or so slower, with 20000 simulated CS'es 
taking 800 uncontended locks each), both running sequentially and in 
parallel. Guess I need to verify this on my rpi as well, to make sure 
it's not an Intel-specific thing.

This is of course a prerequisite for the idea to work.

Basically the spinlock fastpath appears to be an inlined atomic_cmpxchg, 
so even theoretically there should not be any noticeable performance 
loss. I'm not sure why the sleeping locks insist on using atomic 
operations over spinlocks, but with the qspinlock implementation (seems 
to be 2014-ish), the atomic exchange on spin_unlock was eliminated, and 
I guess that changed the playground...

With a _shared_ spinlock, like with a batch group, we would see 
different performance characteristics, though.

>>>    Or
>>> how will the spinlock/batch-lock approach interact with the normal
>>> ww_mutex_lock path (which does require the atomics/ordered stores
>>> we're trying to avoid)?
>>
>> We can use the same code with some extra
>> if (use_ww_ctx) in the common locking and unlocking path.
>> Note that the "use_ww_ctx" parameter is defined at compile-time so the
>> ordinary mutex path (binary) shouldn't change at all after optimization but
>> I need to verify that, of course.
> Agreed, no issue there.
>
>> What you can't do with such a change is to lock / unlock a ww_mutex using
>> the standard mutex API, like mutex_lock(&ww_mutex->base), but I guess that
>> would be OK?
> Yeah right now that works, but I don't care about that. The point of
> merging ww_mutex into the core mutex was that we can fully reuse the
> __mutex_trylock_fast. Same for the unlock path.
>
> Once we don't share that code anymore, there's imo not that much point
> in having ww_mutex interleaved with core mutex code.
>
>>> If we can't mix them I'm kinda leaning towards a
>>> ww_batch_mutex/ww_batch_acquire_ctx, but exactly matching api
>>> otherwise. We probably do need the new batch_start/end api, since
>>> ww_acquire_done isn't quite the right place ...
>>>
>>>> I'll see if I get time to put together an RFC.
>>> Yeah I think there's definitely some use for batched ww locks, where
>>> parallelism is generally low, or at least the ratio between "time
>>> spent acquiring locks" and "time spent doing stuff while holding
>>> locks" small enough to not make the reduced parallelism while
>>> acquiring an issue.
>>
>> Yes. At least it's worth bringing up for discussion. The reduced parallelism
>> shouldn't be an issue if per-process batch-groups are used, or, like for
>> vmwgfx the command submission itself is serialized, due to a single FIFO.
> Random aside: We do seem to already implement wound semantics (or I'm
> terribly confused). See __ww_mutex_add_waiter plus related wakeup code
> (in __ww_mutex_lock_check_stamp).
> -Daniel

From what I can see, that's only the wait-die case, that is, processes 
that don't hold the lock kill their transaction. What's also implemented 
is a multi-contender logic, so that if we can deduce that a waiter will 
need to die once another waiter with higher priority becomes the lock 
holder, we wake it up early to kill its transaction. Wound-wait is when 
we preempt the lock holder. That never happens with the current code.

/Thomas




* Re: Batched ww-mutexes, wound-wait vs wait-die etc.
  2018-04-16 12:15           ` Thomas Hellstrom
@ 2018-04-16 12:22             ` Daniel Vetter
  0 siblings, 0 replies; 8+ messages in thread
From: Daniel Vetter @ 2018-04-16 12:22 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: Maarten Lankhorst, dri-devel

On Mon, Apr 16, 2018 at 2:15 PM, Thomas Hellstrom <thellstrom@vmware.com> wrote:
> On 04/16/2018 11:19 AM, Daniel Vetter wrote:
>>
>> On Mon, Apr 16, 2018 at 10:23 AM, Thomas Hellstrom
>> <thellstrom@vmware.com> wrote:
>>>
>>> On 04/14/2018 10:33 AM, Daniel Vetter wrote:
>>>>
>>>> Hi Thomas,
>>>>
>>>> On Fri, Apr 13, 2018 at 10:23 PM, Thomas Hellstrom
>>>> <thellstrom@vmware.com> wrote:
>>>>>
>>>>> On 04/13/2018 07:13 PM, Daniel Vetter wrote:
>>>>>>
>>>>>> On Wed, Apr 11, 2018 at 10:27:06AM +0200, Thomas Hellstrom wrote:
>>>>>>>
>>>>>>> 2) Should we add a *real* wound-wait choice to our wound-wait
>>>>>>> mutexes.
>>>>>>> Otherwise perhaps rename them or document that they're actually doing
>>>>>>> wait-die.
>>>>>>
>>>>>> I think a doc patch would be good at least. Including all the data you
>>>>>> assembled here.
>>>>>
>>>>>
>>>>> Actually, a further investigation appears to indicate that manipulating
>>>>> the
>>>>> lock state under a local spinlock is about fast as using atomic
>>>>> operations
>>>>> even for the completely uncontended cases.
>>>>>
>>>>> This means that we could have a solution where you decide on a
>>>>> per-mutex
>>>>> or
>>>>> per-reservation object basis whether you want to manipulate lock-state
>>>>> under
>>>>> a "batch group" spinlock, meaning certain performance characteristics
>>>>> or
>>>>> traditional local locking, meaning other performance characteristics.
>>>>>
>>>>> Like, vmwgfx could choose batching locks, radeon traditional locks, but
>>>>> the
>>>>> same API would work for both and locks could be shared between
>>>>> drivers..
>>>>
>>>> Don't we need to make this decision at least on a per-class level?
>>>
>>>
>>> No, I was thinking more in the line of the ww_mutex having a pointer to
>>> the
>>> spinlock. It could either be the local mutex "wait_lock", or a
>>> per-batch-group lock. The mutex code wouldn't care. We do need a special
>>> API
>>> for batched locking, though, but not for ordinary locking.
>>> Both APIs should be able to handle local or grouped spinlocks.
>>>
>>> Note that this would of course require that there would be no performance
>>> loss for users that don't use batch groups.
>>>
>>> I guess the most efficient use for GPU command submission would be to use
>>> per-process batch-groups. Then when the batch encounters a ww_mutex with
>>> a
>>> different batch group (for example the display server shared surface,
>>> it'll
>>> just switch batch lock), and this way the contention for
>>> the batch spinlock will be mostly eliminated.
>>
>> But won't this force us through the spinlock case for all ww_mutex?
>> The core mutex code goes to extreme lengths to avoid that for the
>> uncontended fast path. That's why I meant the batched and non-batch
>> ww_mutex look fundamentally incompatible. Or maybe I missed something
>> somewhere.
>
>
> Yes, this would require  the assumption to hold that a _local_ spinlock
> path, that is, taking a local spinlock also in the fastpath would be as fast
> as using the atomic operations directly. And that's what I'm seeing, (or
> perhaps a percent or so slower with 20000 simulated CS'es taking 800
> uncontended locks each). Both running sequentially and in parallel. Guess I
> need to verify this on my rpi as well, so it's not an Intel specific thing.
>
> This is of course a prerequisite for the idea to work.
>
> Basically the spinlock fastpath appears to be an inlined atomic_cmpxchg, so
> even theoretically there should not be any noticeable performance loss. I'm
> not sure why the sleeping locks insist using atomic operations over
> spinlocks, but with the qspinlock implementation (seems to be 2014-ish), the
> atomic exchange on spin_unlock was elminated, and I guess that changed the
> playground...
>
> With a _shared_ spinlock, like with a batch group, we would see different
> performance characteristics, though.

Hm yeah, the performance characteristics have massively converged there.

Otoh I think spinlocks are still a lot more arch-dependent than struct
mutex locking, so not sure how well this holds over the gazillion of
other architectures Linux supports. And I think since we've moved
ww_mutex to be a core locking construct we should make sure stuff
doesn't get worse on those either, even if they practically have no
relevance for gpu drivers. ppc might be the only one, besides x86 and
arm/arm64.

>>>>    Or
>>>> how will the spinlock/batch-lock approach interact with the normal
>>>> ww_mutex_lock path (which does require the atomics/ordered stores
>>>> we're trying to avoid)?
>>>
>>>
>>> We can use the same code with some extra
>>> if (use_ww_ctx) in the common locking and unlocking path.
>>> Note that the "use_ww_ctx" parameter is defined at compile-time so the
>>> ordinary mutex path (binary) shouldn't change at all after optimization
>>> but
>>> I need to verify that, of course.
>>
>> Agreed, no issue there.
>>
>>> What you can't do with such a change is to lock / unlock a ww_mutex using
>>> the standard mutex API, like mutex_lock(&ww_mutex->base), but I guess
>>> that
>>> would be OK?
>>
>> Yeah right now that works, but I don't care about that. The point of
>> merging ww_mutex into the core mutex was that we can fully reuse the
>> __mutex_trylock_fast. Same for the unlock path.
>>
>> Once we don't share that code anymore, there's imo not that much point
>> in having ww_mutex interleaved with core mutex code.
>>
>>>> If we can't mix them I'm kinda leaning towards a
>>>> ww_batch_mutex/ww_batch_acquire_ctx, but exactly matching api
>>>> otherwise. We probably do need the new batch_start/end api, since
>>>> ww_acquire_done isn't quite the right place ...
>>>>
>>>>> I'll see if I get time to put together an RFC.
>>>>
>>>> Yeah I think there's definitely some use for batched ww locks, where
>>>> parallelism is generally low, or at least the ratio between "time
>>>> spent acquiring locks" and "time spent doing stuff while holding
>>>> locks" small enough to not make the reduced parallelism while
>>>> acquiring an issue.
>>>
>>>
>>> Yes. At least it's worth bringing up for discussion. The reduced
>>> parallelism
>>> shouldn't be an issue if per-process batch-groups are used, or, like for
>>> vmwgfx the command submission itself is serialized, due to a single FIFO.
>>
>> Random aside: We do seem to already implement wound semantics (or I'm
>> terribly confused). See __ww_mutex_add_waiter plus related wakeup code
>> (in __ww_mutex_lock_check_stamp).
>> -Daniel
>
>
> From what I can see, that's only the wait-die case, that is processes that
> don't hold the lock kill their transaction. What's implemented is also a
> multi-contender logic so that if we can deduce that a waiter will need to
> die once another waiter with higher priority becomes the lock holder, we
> wake it up early to kill its transaction. Wound-wait is when we preempt the
> lock holder. That never happens with the current code.

Right. Somehow I'm too dense to realize the difference of who gets
woken up and why, and always mix them up. Upstream code only wakes
waiters; your code wakes the current holder (but it'll only notice the
next time around it tries to lock another ww_mutex). I guess we could
submit at least that change as an RFC to the ww_mutex code.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch