All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/ttm: make ttm_bo_unpin more defensive
@ 2021-03-12  9:38 Christian König
  2021-03-12 10:32 ` Matthew Auld
  2021-03-13 18:29 ` Thomas Hellström (Intel)
  0 siblings, 2 replies; 17+ messages in thread
From: Christian König @ 2021-03-12  9:38 UTC (permalink / raw)
  To: dri-devel

We seem to have some more driver bugs than thought.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 include/drm/ttm/ttm_bo_api.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h
index 4fb523dfab32..df9fe596e7c5 100644
--- a/include/drm/ttm/ttm_bo_api.h
+++ b/include/drm/ttm/ttm_bo_api.h
@@ -603,9 +603,11 @@ static inline void ttm_bo_pin(struct ttm_buffer_object *bo)
 static inline void ttm_bo_unpin(struct ttm_buffer_object *bo)
 {
 	dma_resv_assert_held(bo->base.resv);
-	WARN_ON_ONCE(!bo->pin_count);
 	WARN_ON_ONCE(!kref_read(&bo->kref));
-	--bo->pin_count;
+	if (bo->pin_count)
+		--bo->pin_count;
+	else
+		WARN_ON_ONCE(true);
 }
 
 int ttm_mem_evict_first(struct ttm_device *bdev,
-- 
2.25.1

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] drm/ttm: make ttm_bo_unpin more defensive
  2021-03-12  9:38 [PATCH] drm/ttm: make ttm_bo_unpin more defensive Christian König
@ 2021-03-12 10:32 ` Matthew Auld
  2021-03-13 18:29 ` Thomas Hellström (Intel)
  1 sibling, 0 replies; 17+ messages in thread
From: Matthew Auld @ 2021-03-12 10:32 UTC (permalink / raw)
  To: Christian König; +Cc: ML dri-devel

On Fri, 12 Mar 2021 at 09:38, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> We seem to have some more driver bugs than thought.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Matthew Auld <matthew.auld@intel.com>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] drm/ttm: make ttm_bo_unpin more defensive
  2021-03-12  9:38 [PATCH] drm/ttm: make ttm_bo_unpin more defensive Christian König
  2021-03-12 10:32 ` Matthew Auld
@ 2021-03-13 18:29 ` Thomas Hellström (Intel)
  2021-03-15  7:48   ` Thomas Zimmermann
  2021-03-15 10:26   ` Christian König
  1 sibling, 2 replies; 17+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-13 18:29 UTC (permalink / raw)
  To: Christian König, dri-devel

Hi, Christian

On 3/12/21 10:38 AM, Christian König wrote:
> We seem to have some more driver bugs than thought.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>   include/drm/ttm/ttm_bo_api.h | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h
> index 4fb523dfab32..df9fe596e7c5 100644
> --- a/include/drm/ttm/ttm_bo_api.h
> +++ b/include/drm/ttm/ttm_bo_api.h
> @@ -603,9 +603,11 @@ static inline void ttm_bo_pin(struct ttm_buffer_object *bo)
>   static inline void ttm_bo_unpin(struct ttm_buffer_object *bo)
>   {
>   	dma_resv_assert_held(bo->base.resv);
> -	WARN_ON_ONCE(!bo->pin_count);
>   	WARN_ON_ONCE(!kref_read(&bo->kref));
> -	--bo->pin_count;
> +	if (bo->pin_count)
> +		--bo->pin_count;
> +	else
> +		WARN_ON_ONCE(true);
>   }
>   
>   int ttm_mem_evict_first(struct ttm_device *bdev,

Since I now have been staring for half a year at the code of the driver 
that made pinning an art, I have a couple of suggestions here, Could we 
use an atomic for pin_count, allowing unlocked unpinning but require the 
lock only for pin_count transition 0->1, (but unlocked for 
pin_if_already_pinned). In particular I think vmwgfx would benefit from 
unlocked unpins. Also if the atomic were a refcount_t, that would 
probably give you the above behaviour?

/Thomas


_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] drm/ttm: make ttm_bo_unpin more defensive
  2021-03-13 18:29 ` Thomas Hellström (Intel)
@ 2021-03-15  7:48   ` Thomas Zimmermann
  2021-03-15 17:10     ` Thomas Hellström (Intel)
  2021-03-15 10:26   ` Christian König
  1 sibling, 1 reply; 17+ messages in thread
From: Thomas Zimmermann @ 2021-03-15  7:48 UTC (permalink / raw)
  To: Thomas Hellström (Intel), Christian König, dri-devel


[-- Attachment #1.1.1: Type: text/plain, Size: 2271 bytes --]

Hi

Am 13.03.21 um 19:29 schrieb Thomas Hellström (Intel):
> Hi, Christian
> 
> On 3/12/21 10:38 AM, Christian König wrote:
>> We seem to have some more driver bugs than thought.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   include/drm/ttm/ttm_bo_api.h | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h
>> index 4fb523dfab32..df9fe596e7c5 100644
>> --- a/include/drm/ttm/ttm_bo_api.h
>> +++ b/include/drm/ttm/ttm_bo_api.h
>> @@ -603,9 +603,11 @@ static inline void ttm_bo_pin(struct 
>> ttm_buffer_object *bo)
>>   static inline void ttm_bo_unpin(struct ttm_buffer_object *bo)
>>   {
>>       dma_resv_assert_held(bo->base.resv);
>> -    WARN_ON_ONCE(!bo->pin_count);
>>       WARN_ON_ONCE(!kref_read(&bo->kref));
>> -    --bo->pin_count;
>> +    if (bo->pin_count)
>> +        --bo->pin_count;
>> +    else
>> +        WARN_ON_ONCE(true);
>>   }
>>   int ttm_mem_evict_first(struct ttm_device *bdev,
> 
> Since I now have been staring for half a year at the code of the driver 
> that made pinning an art, I have a couple of suggestions here, Could we 
> use an atomic for pin_count, allowing unlocked unpinning but require the 
> lock only for pin_count transition 0->1, (but unlocked for 
> pin_if_already_pinned). In particular I think vmwgfx would benefit from 
> unlocked unpins. Also if the atomic were a refcount_t, that would 
> probably give you the above behaviour?

What's the benefit?

I'm asking because, there's been talk about streamlining all the GEM 
locking and actually allowing dma-buf resv locking in pin and vmap 
operations. Atomic ops might not contradict this, but also might not be 
useful in the long term.

Best regards
Thomas

> 
> /Thomas
> 
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 840 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] drm/ttm: make ttm_bo_unpin more defensive
  2021-03-13 18:29 ` Thomas Hellström (Intel)
  2021-03-15  7:48   ` Thomas Zimmermann
@ 2021-03-15 10:26   ` Christian König
  2021-03-15 17:08     ` Thomas Hellström (Intel)
  1 sibling, 1 reply; 17+ messages in thread
From: Christian König @ 2021-03-15 10:26 UTC (permalink / raw)
  To: Thomas Hellström (Intel), dri-devel



Am 13.03.21 um 19:29 schrieb Thomas Hellström (Intel):
> Hi, Christian
>
> On 3/12/21 10:38 AM, Christian König wrote:
>> We seem to have some more driver bugs than thought.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   include/drm/ttm/ttm_bo_api.h | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h
>> index 4fb523dfab32..df9fe596e7c5 100644
>> --- a/include/drm/ttm/ttm_bo_api.h
>> +++ b/include/drm/ttm/ttm_bo_api.h
>> @@ -603,9 +603,11 @@ static inline void ttm_bo_pin(struct 
>> ttm_buffer_object *bo)
>>   static inline void ttm_bo_unpin(struct ttm_buffer_object *bo)
>>   {
>>       dma_resv_assert_held(bo->base.resv);
>> -    WARN_ON_ONCE(!bo->pin_count);
>>       WARN_ON_ONCE(!kref_read(&bo->kref));
>> -    --bo->pin_count;
>> +    if (bo->pin_count)
>> +        --bo->pin_count;
>> +    else
>> +        WARN_ON_ONCE(true);
>>   }
>>     int ttm_mem_evict_first(struct ttm_device *bdev,
>
> Since I now have been staring for half a year at the code of the 
> driver that made pinning an art, I have a couple of suggestions here, 
> Could we use an atomic for pin_count, allowing unlocked unpinning but 
> require the lock only for pin_count transition 0->1, (but unlocked for 
> pin_if_already_pinned). In particular I think vmwgfx would benefit 
> from unlocked unpins. Also if the atomic were a refcount_t, that would 
> probably give you the above behaviour?

Nope, I've considered this as well while moving the pin count into TTM.

The problem is that you not only need the BO reserved for 0->1 
transitions, but also for 1->0 transitions to move the BO on the LRU 
correctly.

Christian.

>
> /Thomas
>
>

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] drm/ttm: make ttm_bo_unpin more defensive
  2021-03-15 10:26   ` Christian König
@ 2021-03-15 17:08     ` Thomas Hellström (Intel)
  2021-03-15 18:47       ` Christian König
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-15 17:08 UTC (permalink / raw)
  To: Christian König, dri-devel


On 3/15/21 11:26 AM, Christian König wrote:
>
>
> Am 13.03.21 um 19:29 schrieb Thomas Hellström (Intel):
>> Hi, Christian
>>
>> On 3/12/21 10:38 AM, Christian König wrote:
>>> We seem to have some more driver bugs than thought.
>>>
>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>> ---
>>>   include/drm/ttm/ttm_bo_api.h | 6 ++++--
>>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/drm/ttm/ttm_bo_api.h 
>>> b/include/drm/ttm/ttm_bo_api.h
>>> index 4fb523dfab32..df9fe596e7c5 100644
>>> --- a/include/drm/ttm/ttm_bo_api.h
>>> +++ b/include/drm/ttm/ttm_bo_api.h
>>> @@ -603,9 +603,11 @@ static inline void ttm_bo_pin(struct 
>>> ttm_buffer_object *bo)
>>>   static inline void ttm_bo_unpin(struct ttm_buffer_object *bo)
>>>   {
>>>       dma_resv_assert_held(bo->base.resv);
>>> -    WARN_ON_ONCE(!bo->pin_count);
>>>       WARN_ON_ONCE(!kref_read(&bo->kref));
>>> -    --bo->pin_count;
>>> +    if (bo->pin_count)
>>> +        --bo->pin_count;
>>> +    else
>>> +        WARN_ON_ONCE(true);
>>>   }
>>>     int ttm_mem_evict_first(struct ttm_device *bdev,
>>
>> Since I now have been staring for half a year at the code of the 
>> driver that made pinning an art, I have a couple of suggestions here, 
>> Could we use an atomic for pin_count, allowing unlocked unpinning but 
>> require the lock only for pin_count transition 0->1, (but unlocked 
>> for pin_if_already_pinned). In particular I think vmwgfx would 
>> benefit from unlocked unpins. Also if the atomic were a refcount_t, 
>> that would probably give you the above behaviour?
>
> Nope, I've considered this as well while moving the pin count into TTM.
>
> The problem is that you not only need the BO reserved for 0->1 
> transitions, but also for 1->0 transitions to move the BO on the LRU 
> correctly.

Ah, and there is no way to have us know the correct LRU list without 
reservation?

/Thomas


>
> Christian.
>
>>
>> /Thomas
>>
>>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] drm/ttm: make ttm_bo_unpin more defensive
  2021-03-15  7:48   ` Thomas Zimmermann
@ 2021-03-15 17:10     ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 17+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-15 17:10 UTC (permalink / raw)
  To: Thomas Zimmermann, Christian König, dri-devel


On 3/15/21 8:48 AM, Thomas Zimmermann wrote:
> Hi
>
> Am 13.03.21 um 19:29 schrieb Thomas Hellström (Intel):
>> Hi, Christian
>>
>> On 3/12/21 10:38 AM, Christian König wrote:
>>> We seem to have some more driver bugs than thought.
>>>
>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>> ---
>>>   include/drm/ttm/ttm_bo_api.h | 6 ++++--
>>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/drm/ttm/ttm_bo_api.h 
>>> b/include/drm/ttm/ttm_bo_api.h
>>> index 4fb523dfab32..df9fe596e7c5 100644
>>> --- a/include/drm/ttm/ttm_bo_api.h
>>> +++ b/include/drm/ttm/ttm_bo_api.h
>>> @@ -603,9 +603,11 @@ static inline void ttm_bo_pin(struct 
>>> ttm_buffer_object *bo)
>>>   static inline void ttm_bo_unpin(struct ttm_buffer_object *bo)
>>>   {
>>>       dma_resv_assert_held(bo->base.resv);
>>> -    WARN_ON_ONCE(!bo->pin_count);
>>>       WARN_ON_ONCE(!kref_read(&bo->kref));
>>> -    --bo->pin_count;
>>> +    if (bo->pin_count)
>>> +        --bo->pin_count;
>>> +    else
>>> +        WARN_ON_ONCE(true);
>>>   }
>>>   int ttm_mem_evict_first(struct ttm_device *bdev,
>>
>> Since I now have been staring for half a year at the code of the 
>> driver that made pinning an art, I have a couple of suggestions here, 
>> Could we use an atomic for pin_count, allowing unlocked unpinning but 
>> require the lock only for pin_count transition 0->1, (but unlocked 
>> for pin_if_already_pinned). In particular I think vmwgfx would 
>> benefit from unlocked unpins. Also if the atomic were a refcount_t, 
>> that would probably give you the above behaviour?
>
> What's the benefit?
>
> I'm asking because, there's been talk about streamlining all the GEM 
> locking and actually allowing dma-buf resv locking in pin and vmap 
> operations. Atomic ops might not contradict this, but also might not 
> be useful in the long term.

The benefit would be that at unpinning time it might be tricky to take 
the reservation lock, because you might be already in a ww transaction.

But Christian pointed out the LRU complications...

/Thomas


_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] drm/ttm: make ttm_bo_unpin more defensive
  2021-03-15 17:08     ` Thomas Hellström (Intel)
@ 2021-03-15 18:47       ` Christian König
  2021-03-15 19:00         ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 17+ messages in thread
From: Christian König @ 2021-03-15 18:47 UTC (permalink / raw)
  To: Thomas Hellström (Intel), dri-devel



Am 15.03.21 um 18:08 schrieb Thomas Hellström (Intel):
>
> On 3/15/21 11:26 AM, Christian König wrote:
>>
>>
>> Am 13.03.21 um 19:29 schrieb Thomas Hellström (Intel):
>>> Hi, Christian
>>>
>>> On 3/12/21 10:38 AM, Christian König wrote:
>>>> We seem to have some more driver bugs than thought.
>>>>
>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>> ---
>>>>   include/drm/ttm/ttm_bo_api.h | 6 ++++--
>>>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/drm/ttm/ttm_bo_api.h 
>>>> b/include/drm/ttm/ttm_bo_api.h
>>>> index 4fb523dfab32..df9fe596e7c5 100644
>>>> --- a/include/drm/ttm/ttm_bo_api.h
>>>> +++ b/include/drm/ttm/ttm_bo_api.h
>>>> @@ -603,9 +603,11 @@ static inline void ttm_bo_pin(struct 
>>>> ttm_buffer_object *bo)
>>>>   static inline void ttm_bo_unpin(struct ttm_buffer_object *bo)
>>>>   {
>>>>       dma_resv_assert_held(bo->base.resv);
>>>> -    WARN_ON_ONCE(!bo->pin_count);
>>>>       WARN_ON_ONCE(!kref_read(&bo->kref));
>>>> -    --bo->pin_count;
>>>> +    if (bo->pin_count)
>>>> +        --bo->pin_count;
>>>> +    else
>>>> +        WARN_ON_ONCE(true);
>>>>   }
>>>>     int ttm_mem_evict_first(struct ttm_device *bdev,
>>>
>>> Since I now have been staring for half a year at the code of the 
>>> driver that made pinning an art, I have a couple of suggestions 
>>> here, Could we use an atomic for pin_count, allowing unlocked 
>>> unpinning but require the lock only for pin_count transition 0->1, 
>>> (but unlocked for pin_if_already_pinned). In particular I think 
>>> vmwgfx would benefit from unlocked unpins. Also if the atomic were a 
>>> refcount_t, that would probably give you the above behaviour?
>>
>> Nope, I've considered this as well while moving the pin count into TTM.
>>
>> The problem is that you not only need the BO reserved for 0->1 
>> transitions, but also for 1->0 transitions to move the BO on the LRU 
>> correctly.
>
> Ah, and there is no way to have us know the correct LRU list without 
> reservation?

Not really, there is always the chance that CPU A is reducing the count 
from 1->0 while CPU B is doing 0->1 and you end up with a LRU status 
which doesn't match the pin count.

We could try to do something like a loop updating the LRU status until 
it matches the pin count, but the implications of that are usually 
really nasty.

Christian.

>
> /Thomas
>
>
>>
>> Christian.
>>
>>>
>>> /Thomas
>>>
>>>

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] drm/ttm: make ttm_bo_unpin more defensive
  2021-03-15 18:47       ` Christian König
@ 2021-03-15 19:00         ` Thomas Hellström (Intel)
  2021-03-16  9:27           ` Daniel Vetter
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-15 19:00 UTC (permalink / raw)
  To: Christian König, dri-devel


On 3/15/21 7:47 PM, Christian König wrote:
>
>
> Am 15.03.21 um 18:08 schrieb Thomas Hellström (Intel):
>>
>> On 3/15/21 11:26 AM, Christian König wrote:
>>>
>>>
>>> Am 13.03.21 um 19:29 schrieb Thomas Hellström (Intel):
>>>> Hi, Christian
>>>>
>>>> On 3/12/21 10:38 AM, Christian König wrote:
>>>>> We seem to have some more driver bugs than thought.
>>>>>
>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>> ---
>>>>>   include/drm/ttm/ttm_bo_api.h | 6 ++++--
>>>>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/drm/ttm/ttm_bo_api.h 
>>>>> b/include/drm/ttm/ttm_bo_api.h
>>>>> index 4fb523dfab32..df9fe596e7c5 100644
>>>>> --- a/include/drm/ttm/ttm_bo_api.h
>>>>> +++ b/include/drm/ttm/ttm_bo_api.h
>>>>> @@ -603,9 +603,11 @@ static inline void ttm_bo_pin(struct 
>>>>> ttm_buffer_object *bo)
>>>>>   static inline void ttm_bo_unpin(struct ttm_buffer_object *bo)
>>>>>   {
>>>>>       dma_resv_assert_held(bo->base.resv);
>>>>> -    WARN_ON_ONCE(!bo->pin_count);
>>>>>       WARN_ON_ONCE(!kref_read(&bo->kref));
>>>>> -    --bo->pin_count;
>>>>> +    if (bo->pin_count)
>>>>> +        --bo->pin_count;
>>>>> +    else
>>>>> +        WARN_ON_ONCE(true);
>>>>>   }
>>>>>     int ttm_mem_evict_first(struct ttm_device *bdev,
>>>>
>>>> Since I now have been staring for half a year at the code of the 
>>>> driver that made pinning an art, I have a couple of suggestions 
>>>> here, Could we use an atomic for pin_count, allowing unlocked 
>>>> unpinning but require the lock only for pin_count transition 0->1, 
>>>> (but unlocked for pin_if_already_pinned). In particular I think 
>>>> vmwgfx would benefit from unlocked unpins. Also if the atomic were 
>>>> a refcount_t, that would probably give you the above behaviour?
>>>
>>> Nope, I've considered this as well while moving the pin count into TTM.
>>>
>>> The problem is that you not only need the BO reserved for 0->1 
>>> transitions, but also for 1->0 transitions to move the BO on the LRU 
>>> correctly.
>>
>> Ah, and there is no way to have us know the correct LRU list without 
>> reservation?
>
> Not really, there is always the chance that CPU A is reducing the 
> count from 1->0 while CPU B is doing 0->1 and you end up with a LRU 
> status which doesn't match the pin count.
>
> We could try to do something like a loop updating the LRU status until 
> it matches the pin count, but the implications of that are usually 
> really nasty.
>
OK, yeah I was more thinking along the lines of protecting the LRU 
status with the global lru lock and unpin would then be

if (refcount_dec_and_lock(&bo->pin_count, &ttm_glob.lru_lock)) {
     add_to_relevant_lrus(bo, bo->lru_status);
     spin_unlock(&ttm_glob.lru_lock);
}

But looking at ttm_bo_move_to_lru_tail() I realize that's not really 
trivial anymore. I hope it's doable at some point though.

But meanwhile, perhaps TTM needs to accept and pave over that drivers 
are in fact destroying pinned buffers?

/Thomas





> Christian.
>
>>
>> /Thomas
>>
>>
>>>
>>> Christian.
>>>
>>>>
>>>> /Thomas
>>>>
>>>>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] drm/ttm: make ttm_bo_unpin more defensive
  2021-03-15 19:00         ` Thomas Hellström (Intel)
@ 2021-03-16  9:27           ` Daniel Vetter
  2021-03-16 10:38             ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Vetter @ 2021-03-16  9:27 UTC (permalink / raw)
  To: Thomas Hellström (Intel); +Cc: Christian König, dri-devel

On Mon, Mar 15, 2021 at 08:00:30PM +0100, Thomas Hellström (Intel) wrote:
> 
> On 3/15/21 7:47 PM, Christian König wrote:
> > 
> > 
> > Am 15.03.21 um 18:08 schrieb Thomas Hellström (Intel):
> > > 
> > > On 3/15/21 11:26 AM, Christian König wrote:
> > > > 
> > > > 
> > > > Am 13.03.21 um 19:29 schrieb Thomas Hellström (Intel):
> > > > > Hi, Christian
> > > > > 
> > > > > On 3/12/21 10:38 AM, Christian König wrote:
> > > > > > We seem to have some more driver bugs than thought.
> > > > > > 
> > > > > > Signed-off-by: Christian König <christian.koenig@amd.com>
> > > > > > ---
> > > > > >   include/drm/ttm/ttm_bo_api.h | 6 ++++--
> > > > > >   1 file changed, 4 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/include/drm/ttm/ttm_bo_api.h
> > > > > > b/include/drm/ttm/ttm_bo_api.h
> > > > > > index 4fb523dfab32..df9fe596e7c5 100644
> > > > > > --- a/include/drm/ttm/ttm_bo_api.h
> > > > > > +++ b/include/drm/ttm/ttm_bo_api.h
> > > > > > @@ -603,9 +603,11 @@ static inline void
> > > > > > ttm_bo_pin(struct ttm_buffer_object *bo)
> > > > > >   static inline void ttm_bo_unpin(struct ttm_buffer_object *bo)
> > > > > >   {
> > > > > >       dma_resv_assert_held(bo->base.resv);
> > > > > > -    WARN_ON_ONCE(!bo->pin_count);
> > > > > >       WARN_ON_ONCE(!kref_read(&bo->kref));
> > > > > > -    --bo->pin_count;
> > > > > > +    if (bo->pin_count)
> > > > > > +        --bo->pin_count;
> > > > > > +    else
> > > > > > +        WARN_ON_ONCE(true);
> > > > > >   }
> > > > > >     int ttm_mem_evict_first(struct ttm_device *bdev,
> > > > > 
> > > > > Since I now have been staring for half a year at the code of
> > > > > the driver that made pinning an art, I have a couple of
> > > > > suggestions here, Could we use an atomic for pin_count,
> > > > > allowing unlocked unpinning but require the lock only for
> > > > > pin_count transition 0->1, (but unlocked for
> > > > > pin_if_already_pinned). In particular I think vmwgfx would
> > > > > benefit from unlocked unpins. Also if the atomic were a
> > > > > refcount_t, that would probably give you the above
> > > > > behaviour?
> > > > 
> > > > Nope, I've considered this as well while moving the pin count into TTM.
> > > > 
> > > > The problem is that you not only need the BO reserved for 0->1
> > > > transitions, but also for 1->0 transitions to move the BO on the
> > > > LRU correctly.
> > > 
> > > Ah, and there is no way to have us know the correct LRU list without
> > > reservation?
> > 
> > Not really, there is always the chance that CPU A is reducing the count
> > from 1->0 while CPU B is doing 0->1 and you end up with a LRU status
> > which doesn't match the pin count.
> > 
> > We could try to do something like a loop updating the LRU status until
> > it matches the pin count, but the implications of that are usually
> > really nasty.
> > 
> OK, yeah I was more thinking along the lines of protecting the LRU status
> with the global lru lock and unpin would then be
> 
> if (refcount_dec_and_lock(&bo->pin_count, &ttm_glob.lru_lock)) {
>     add_to_relevant_lrus(bo, bo->lru_status);
>     spin_unlock(&ttm_glob.lru_lock);
> }
> 
> But looking at ttm_bo_move_to_lru_tail() I realize that's not really trivial
> anymore. I hope it's doable at some point though.
> 
> But meanwhile, perhaps TTM needs to accept and pave over that drivers are in
> fact destroying pinned buffers?

Do we have more trouble than the very fancy tricks vmwgfx does? If so I
think we could do a small helper that like ttm_dont_check_unpin() to shut
it up. Since vmwgfx drivers tend to not be loaded with any other drivers
that shouldn't reduce any test coverage. And allows vmwgfx folks to figure
out what to do - maybe you do need your own in-house pin/unpin for these
special bo?

I did try to parse your reply in the other thread, and tbh I didn't grock
it.

Also I think a comment why we need dma_resv (maybe in the kerneldoc for
pin count), i.e. that it's needed to keep lru state in sync with pin state
would be really good?
-Daniel

> 
> /Thomas
> 
> 
> 
> 
> 
> > Christian.
> > 
> > > 
> > > /Thomas
> > > 
> > > 
> > > > 
> > > > Christian.
> > > > 
> > > > > 
> > > > > /Thomas
> > > > > 
> > > > > 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] drm/ttm: make ttm_bo_unpin more defensive
  2021-03-16  9:27           ` Daniel Vetter
@ 2021-03-16 10:38             ` Thomas Hellström (Intel)
  2021-03-16 11:06               ` Daniel Vetter
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-16 10:38 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Christian König, dri-devel

Hi,

On 3/16/21 10:27 AM, Daniel Vetter wrote:
> On Mon, Mar 15, 2021 at 08:00:30PM +0100, Thomas Hellström (Intel) wrote:
>> On 3/15/21 7:47 PM, Christian König wrote:
>>>
>>> Am 15.03.21 um 18:08 schrieb Thomas Hellström (Intel):
>>>> On 3/15/21 11:26 AM, Christian König wrote:
>>>>>
>>>>> Am 13.03.21 um 19:29 schrieb Thomas Hellström (Intel):
>>>>>> Hi, Christian
>>>>>>
>>>>>> On 3/12/21 10:38 AM, Christian König wrote:
>>>>>>> We seem to have some more driver bugs than thought.
>>>>>>>
>>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>>>> ---
>>>>>>>    include/drm/ttm/ttm_bo_api.h | 6 ++++--
>>>>>>>    1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/drm/ttm/ttm_bo_api.h
>>>>>>> b/include/drm/ttm/ttm_bo_api.h
>>>>>>> index 4fb523dfab32..df9fe596e7c5 100644
>>>>>>> --- a/include/drm/ttm/ttm_bo_api.h
>>>>>>> +++ b/include/drm/ttm/ttm_bo_api.h
>>>>>>> @@ -603,9 +603,11 @@ static inline void
>>>>>>> ttm_bo_pin(struct ttm_buffer_object *bo)
>>>>>>>    static inline void ttm_bo_unpin(struct ttm_buffer_object *bo)
>>>>>>>    {
>>>>>>>        dma_resv_assert_held(bo->base.resv);
>>>>>>> -    WARN_ON_ONCE(!bo->pin_count);
>>>>>>>        WARN_ON_ONCE(!kref_read(&bo->kref));
>>>>>>> -    --bo->pin_count;
>>>>>>> +    if (bo->pin_count)
>>>>>>> +        --bo->pin_count;
>>>>>>> +    else
>>>>>>> +        WARN_ON_ONCE(true);
>>>>>>>    }
>>>>>>>      int ttm_mem_evict_first(struct ttm_device *bdev,
>>>>>> Since I now have been staring for half a year at the code of
>>>>>> the driver that made pinning an art, I have a couple of
>>>>>> suggestions here, Could we use an atomic for pin_count,
>>>>>> allowing unlocked unpinning but require the lock only for
>>>>>> pin_count transition 0->1, (but unlocked for
>>>>>> pin_if_already_pinned). In particular I think vmwgfx would
>>>>>> benefit from unlocked unpins. Also if the atomic were a
>>>>>> refcount_t, that would probably give you the above
>>>>>> behaviour?
>>>>> Nope, I've considered this as well while moving the pin count into TTM.
>>>>>
>>>>> The problem is that you not only need the BO reserved for 0->1
>>>>> transitions, but also for 1->0 transitions to move the BO on the
>>>>> LRU correctly.
>>>> Ah, and there is no way to have us know the correct LRU list without
>>>> reservation?
>>> Not really, there is always the chance that CPU A is reducing the count
>>> from 1->0 while CPU B is doing 0->1 and you end up with a LRU status
>>> which doesn't match the pin count.
>>>
>>> We could try to do something like a loop updating the LRU status until
>>> it matches the pin count, but the implications of that are usually
>>> really nasty.
>>>
>> OK, yeah I was more thinking along the lines of protecting the LRU status
>> with the global lru lock and unpin would then be
>>
>> if (refcount_dec_and_lock(&bo->pin_count, &ttm_glob.lru_lock)) {
>>      add_to_relevant_lrus(bo, bo->lru_status);
>>      spin_unlock(&ttm_glob.lru_lock);
>> }
>>
>> But looking at ttm_bo_move_to_lru_tail() I realize that's not really trivial
>> anymore. I hope it's doable at some point though.
>>
>> But meanwhile, perhaps TTM needs to accept and pave over that drivers are in
>> fact destroying pinned buffers?
> Do we have more trouble than the very fancy tricks vmwgfx does? If so I
> think we could do a small helper that like ttm_dont_check_unpin() to shut
> it up. Since vmwgfx drivers tend to not be loaded with any other drivers
> that shouldn't reduce any test coverage. And allows vmwgfx folks to figure
> out what to do - maybe you do need your own in-house pin/unpin for these
> special bo?
>
> I did try to parse your reply in the other thread, and tbh I didn't grock
> it.

Not sure if you mean the description was unclear or if you thought it 
was a bad idea, but in case the former, what I mean is

static void ttm_bo_pin(struct ttm_buffer_object *bo)
{

dma_resv_assert_held()                        // No surprises if an 
evictor determined that this object is not pinned.

     if (!refcount_inc_not_zero(&bo->pin_count)) { // Could be made 
ttm_bo_pin_if_pinned() and exported if there are users
         spin_lock(&ttm_glob.lru_lock);            // Don't race with 
unpin 1->0
         if (refcount_read(&bo->pin_count) == 0 && bo->lru)
             ttm_bo_del_from_lru(bo);
         refcount_inc(&bo->pin_count);
         spin_unlock(&ttm_glob.lru_lock);
     }
}

static void ttm_bo_unpin(struct ttm_buffer_object *bo)
{
     if (refcount_dec_and_lock(&bo->pin_count, &ttm_glob.lru_lock)) {
         ttm_bo_move_to_lru_tail(bo, bo->lru_mem_type, bo->lru_prio,
                     NULL);
         spin_unlock(&ttm_glob.lru_lock);
     }
}

In addition, bo->lru_mem_type and bo->lru_prio would need to be 
protected by the lru lock, and updated together with the LRU list 
position, which would be the extra complexity in fastpaths. Wouldn't 
that resolve the pin - lru inconsistency?

But yeah if vmwgfx is the only driver hitting trouble because of this, 
then I agree let's leave it for when / if it becomes needed. Having had 
that requirement in the Intel driver would have complicated the dma_resv 
work quite a lot.

/Thomas


>
> Also I think a comment why we need dma_resv (maybe in the kerneldoc for
> pin count), i.e. that it's needed to keep lru state in sync with pin state
> would be really good?
> -Daniel
>
>> /Thomas
>>
>>
>>
>>
>>
>>> Christian.
>>>
>>>> /Thomas
>>>>
>>>>
>>>>> Christian.
>>>>>
>>>>>> /Thomas
>>>>>>
>>>>>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] drm/ttm: make ttm_bo_unpin more defensive
  2021-03-16 10:38             ` Thomas Hellström (Intel)
@ 2021-03-16 11:06               ` Daniel Vetter
  2021-03-16 11:23                 ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Vetter @ 2021-03-16 11:06 UTC (permalink / raw)
  To: Thomas Hellström (Intel); +Cc: Christian König, dri-devel

On Tue, Mar 16, 2021 at 11:38:53AM +0100, Thomas Hellström (Intel) wrote:
> Hi,
> 
> On 3/16/21 10:27 AM, Daniel Vetter wrote:
> > On Mon, Mar 15, 2021 at 08:00:30PM +0100, Thomas Hellström (Intel) wrote:
> > > On 3/15/21 7:47 PM, Christian König wrote:
> > > > 
> > > > Am 15.03.21 um 18:08 schrieb Thomas Hellström (Intel):
> > > > > On 3/15/21 11:26 AM, Christian König wrote:
> > > > > > 
> > > > > > Am 13.03.21 um 19:29 schrieb Thomas Hellström (Intel):
> > > > > > > Hi, Christian
> > > > > > > 
> > > > > > > On 3/12/21 10:38 AM, Christian König wrote:
> > > > > > > > We seem to have some more driver bugs than thought.
> > > > > > > > 
> > > > > > > > Signed-off-by: Christian König <christian.koenig@amd.com>
> > > > > > > > ---
> > > > > > > >    include/drm/ttm/ttm_bo_api.h | 6 ++++--
> > > > > > > >    1 file changed, 4 insertions(+), 2 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/include/drm/ttm/ttm_bo_api.h
> > > > > > > > b/include/drm/ttm/ttm_bo_api.h
> > > > > > > > index 4fb523dfab32..df9fe596e7c5 100644
> > > > > > > > --- a/include/drm/ttm/ttm_bo_api.h
> > > > > > > > +++ b/include/drm/ttm/ttm_bo_api.h
> > > > > > > > @@ -603,9 +603,11 @@ static inline void
> > > > > > > > ttm_bo_pin(struct ttm_buffer_object *bo)
> > > > > > > >    static inline void ttm_bo_unpin(struct ttm_buffer_object *bo)
> > > > > > > >    {
> > > > > > > >        dma_resv_assert_held(bo->base.resv);
> > > > > > > > -    WARN_ON_ONCE(!bo->pin_count);
> > > > > > > >        WARN_ON_ONCE(!kref_read(&bo->kref));
> > > > > > > > -    --bo->pin_count;
> > > > > > > > +    if (bo->pin_count)
> > > > > > > > +        --bo->pin_count;
> > > > > > > > +    else
> > > > > > > > +        WARN_ON_ONCE(true);
> > > > > > > >    }
> > > > > > > >      int ttm_mem_evict_first(struct ttm_device *bdev,
> > > > > > > Since I now have been staring for half a year at the code of
> > > > > > > the driver that made pinning an art, I have a couple of
> > > > > > > suggestions here, Could we use an atomic for pin_count,
> > > > > > > allowing unlocked unpinning but require the lock only for
> > > > > > > pin_count transition 0->1, (but unlocked for
> > > > > > > pin_if_already_pinned). In particular I think vmwgfx would
> > > > > > > benefit from unlocked unpins. Also if the atomic were a
> > > > > > > refcount_t, that would probably give you the above
> > > > > > > behaviour?
> > > > > > Nope, I've considered this as well while moving the pin count into TTM.
> > > > > > 
> > > > > > The problem is that you not only need the BO reserved for 0->1
> > > > > > transitions, but also for 1->0 transitions to move the BO on the
> > > > > > LRU correctly.
> > > > > Ah, and there is no way to have us know the correct LRU list without
> > > > > reservation?
> > > > Not really, there is always the chance that CPU A is reducing the count
> > > > from 1->0 while CPU B is doing 0->1 and you end up with a LRU status
> > > > which doesn't match the pin count.
> > > > 
> > > > We could try to do something like a loop updating the LRU status until
> > > > it matches the pin count, but the implications of that are usually
> > > > really nasty.
> > > > 
> > > OK, yeah I was more thinking along the lines of protecting the LRU status
> > > with the global lru lock and unpin would then be
> > > 
> > > if (refcount_dec_and_lock(&bo->pin_count, &ttm_glob.lru_lock)) {
> > >      add_to_relevant_lrus(bo, bo->lru_status);
> > >      spin_unlock(&ttm_glob.lru_lock);
> > > }
> > > 
> > > But looking at ttm_bo_move_to_lru_tail() I realize that's not really trivial
> > > anymore. I hope it's doable at some point though.
> > > 
> > > But meanwhile, perhaps TTM needs to accept and pave over that drivers are in
> > > fact destroying pinned buffers?
> > Do we have more trouble than the very fancy tricks vmwgfx does? If so I
> > think we could do a small helper that like ttm_dont_check_unpin() to shut
> > it up. Since vmwgfx drivers tend to not be loaded with any other drivers
> > that shouldn't reduce any test coverage. And allows vmwgfx folks to figure
> > out what to do - maybe you do need your own in-house pin/unpin for these
> > special bo?
> > 
> > I did try to parse your reply in the other thread, and tbh I didn't grock
> > it.
> 
> Not sure if you mean the description was unclear or if you thought it was a
> bad idea, but in case the former, what I mean is

My unclarity is on what you explained in the vmwgfx thread about how
vmwgfx uses its pin/unpin and why. That was full of vmwgfx concepts I
don't know about. This here looks reasonably clear to me, but it does have
the race Christian sees I think.

> static void ttm_bo_pin(struct ttm_buffer_object *bo)
> {
> 
> dma_resv_assert_held()                        // No surprises if an evictor
> determined that this object is not pinned.
> 
>     if (!refcount_inc_not_zero(&bo->pin_count)) { // Could be made
> ttm_bo_pin_if_pinned() and exported if there are users
>         spin_lock(&ttm_glob.lru_lock);            // Don't race with unpin
> 1->0
>         if (refcount_read(&bo->pin_count) == 0 && bo->lru)
>             ttm_bo_del_from_lru(bo);
>         refcount_inc(&bo->pin_count);
>         spin_unlock(&ttm_glob.lru_lock);
>     }
> }
> 
> static void ttm_bo_unpin(struct ttm_buffer_object *bo)
> {
>     if (refcount_dec_and_lock(&bo->pin_count, &ttm_glob.lru_lock)) {
>         ttm_bo_move_to_lru_tail(bo, bo->lru_mem_type, bo->lru_prio,
>                     NULL);
>         spin_unlock(&ttm_glob.lru_lock);
>     }
> }
> 
> In addition, bo->lru_mem_type and bo->lru_prio would need to be protected by
> the lru lock, and updated together with the LRU list position, which would
> be the extra complexity in fastpaths. Wouldn't that resolve the pin - lru
> inconsistency?
> 
> But yeah if vmwgfx is the only driver hitting trouble because of this, then
> I agree let's leave it for when / if it becomes needed. Having had that
> requirement in the Intel driver would have complicated the dma_resv work
> quite a lot.

Yeah I know i915 does a lot of unpin in interesting places, and that's
part of why I'm worried. I've seen some bugfixes fly by that dropped
dma_resv_lock around unpin to fix really scary looking stalls
(rcu_synchronize vs other stuff and lockdep did not catch it). And once I
see that kind of stuff I'm heavily leaning towards simpler locking we can
grasp, just to be able to stay on top of the bugs. Because the bugfix did
not come with any clear explanation for why not taking dma_resv_lock
actually helps, or any other clear and in-depth locking analysis.

I think we also still have some temporary pin/unpin flying about, but
maybe those are all gone now.

So i915 gem code and what I've seen there is part of the reasons why I
think we shouldn't make unpin lockless. It just doesn't look like it leads
to very managable code. Plus for most drivers pin/unpin really is only
something for display, and the locking shouldn't matter at all from a perf
pov.
-Daniel


> 
> /Thomas
> 
> 
> > 
> > Also I think a comment why we need dma_resv (maybe in the kerneldoc for
> > pin count), i.e. that it's needed to keep lru state in sync with pin state
> > would be really good?
> > -Daniel
> > 
> > > /Thomas
> > > 
> > > 
> > > 
> > > 
> > > 
> > > > Christian.
> > > > 
> > > > > /Thomas
> > > > > 
> > > > > 
> > > > > > Christian.
> > > > > > 
> > > > > > > /Thomas
> > > > > > > 
> > > > > > > 
> > > _______________________________________________
> > > dri-devel mailing list
> > > dri-devel@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] drm/ttm: make ttm_bo_unpin more defensive
  2021-03-16 11:06               ` Daniel Vetter
@ 2021-03-16 11:23                 ` Thomas Hellström (Intel)
  2021-03-16 14:07                   ` Daniel Vetter
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-16 11:23 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Christian König, dri-devel


On 3/16/21 12:06 PM, Daniel Vetter wrote:
> On Tue, Mar 16, 2021 at 11:38:53AM +0100, Thomas Hellström (Intel) wrote:
>> Hi,
>>
>> On 3/16/21 10:27 AM, Daniel Vetter wrote:
>>> On Mon, Mar 15, 2021 at 08:00:30PM +0100, Thomas Hellström (Intel) wrote:
>>>> On 3/15/21 7:47 PM, Christian König wrote:
>>>>> Am 15.03.21 um 18:08 schrieb Thomas Hellström (Intel):
>>>>>> On 3/15/21 11:26 AM, Christian König wrote:
>>>>>>> Am 13.03.21 um 19:29 schrieb Thomas Hellström (Intel):
>>>>>>>> Hi, Christian
>>>>>>>>
>>>>>>>> On 3/12/21 10:38 AM, Christian König wrote:
>>>>>>>>> We seem to have some more driver bugs than thought.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>>>>>> ---
>>>>>>>>>     include/drm/ttm/ttm_bo_api.h | 6 ++++--
>>>>>>>>>     1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/include/drm/ttm/ttm_bo_api.h
>>>>>>>>> b/include/drm/ttm/ttm_bo_api.h
>>>>>>>>> index 4fb523dfab32..df9fe596e7c5 100644
>>>>>>>>> --- a/include/drm/ttm/ttm_bo_api.h
>>>>>>>>> +++ b/include/drm/ttm/ttm_bo_api.h
>>>>>>>>> @@ -603,9 +603,11 @@ static inline void
>>>>>>>>> ttm_bo_pin(struct ttm_buffer_object *bo)
>>>>>>>>>     static inline void ttm_bo_unpin(struct ttm_buffer_object *bo)
>>>>>>>>>     {
>>>>>>>>>         dma_resv_assert_held(bo->base.resv);
>>>>>>>>> -    WARN_ON_ONCE(!bo->pin_count);
>>>>>>>>>         WARN_ON_ONCE(!kref_read(&bo->kref));
>>>>>>>>> -    --bo->pin_count;
>>>>>>>>> +    if (bo->pin_count)
>>>>>>>>> +        --bo->pin_count;
>>>>>>>>> +    else
>>>>>>>>> +        WARN_ON_ONCE(true);
>>>>>>>>>     }
>>>>>>>>>       int ttm_mem_evict_first(struct ttm_device *bdev,
>>>>>>>> Since I now have been staring for half a year at the code of
>>>>>>>> the driver that made pinning an art, I have a couple of
>>>>>>>> suggestions here, Could we use an atomic for pin_count,
>>>>>>>> allowing unlocked unpinning but require the lock only for
>>>>>>>> pin_count transition 0->1, (but unlocked for
>>>>>>>> pin_if_already_pinned). In particular I think vmwgfx would
>>>>>>>> benefit from unlocked unpins. Also if the atomic were a
>>>>>>>> refcount_t, that would probably give you the above
>>>>>>>> behaviour?
>>>>>>> Nope, I've considered this as well while moving the pin count into TTM.
>>>>>>>
>>>>>>> The problem is that you not only need the BO reserved for 0->1
>>>>>>> transitions, but also for 1->0 transitions to move the BO on the
>>>>>>> LRU correctly.
>>>>>> Ah, and there is no way to have us know the correct LRU list without
>>>>>> reservation?
>>>>> Not really, there is always the chance that CPU A is reducing the count
>>>>> from 1->0 while CPU B is doing 0->1 and you end up with a LRU status
>>>>> which doesn't match the pin count.
>>>>>
>>>>> We could try to do something like a loop updating the LRU status until
>>>>> it matches the pin count, but the implications of that are usually
>>>>> really nasty.
>>>>>
>>>> OK, yeah I was more thinking along the lines of protecting the LRU status
>>>> with the global lru lock and unpin would then be
>>>>
>>>> if (refcount_dec_and_lock(&bo->pin_count, &ttm_glob.lru_lock)) {
>>>>       add_to_relevant_lrus(bo, bo->lru_status);
>>>>       spin_unlock(&ttm_glob.lru_lock);
>>>> }
>>>>
>>>> But looking at ttm_bo_move_to_lru_tail() I realize that's not really trivial
>>>> anymore. I hope it's doable at some point though.
>>>>
>>>> But meanwhile, perhaps TTM needs to accept and pave over that drivers are in
>>>> fact destroying pinned buffers?
>>> Do we have more trouble than the very fancy tricks vmwgfx does? If so I
>>> think we could do a small helper that like ttm_dont_check_unpin() to shut
>>> it up. Since vmwgfx drivers tend to not be loaded with any other drivers
>>> that shouldn't reduce any test coverage. And allows vmwgfx folks to figure
>>> out what to do - maybe you do need your own in-house pin/unpin for these
>>> special bo?
>>>
>>> I did try to parse your reply in the other thread, and tbh I didn't grock
>>> it.
>> Not sure if you mean the description was unclear or if you thought it was a
>> bad idea, but in case the former, what I mean is
> My unclarity is on what you explained in the vmwgfx thread about how
> vmwgfx uses its pin/unpin and why. That was full of vmwgfx concepts I
> don't know about. This here looks reasonably clear to me, but it does have
> the race Christian sees I think.

Hmm, OK, I thought transition 0->1 under the LRU lock would have 
resolved that...

>> static void ttm_bo_pin(struct ttm_buffer_object *bo)
>> {
>>
>> dma_resv_assert_held()                        // No surprises if an evictor
>> determined that this object is not pinned.
>>
>>      if (!refcount_inc_not_zero(&bo->pin_count)) { // Could be made
>> ttm_bo_pin_if_pinned() and exported if there are users
>>          spin_lock(&ttm_glob.lru_lock);            // Don't race with unpin
>> 1->0
>>          if (refcount_read(&bo->pin_count) == 0 && bo->lru)
>>              ttm_bo_del_from_lru(bo);
>>          refcount_inc(&bo->pin_count);
>>          spin_unlock(&ttm_glob.lru_lock);
>>      }
>> }
>>
>> static void ttm_bo_unpin(struct ttm_buffer_object *bo)
>> {
>>      if (refcount_dec_and_lock(&bo->pin_count, &ttm_glob.lru_lock)) {
>>          ttm_bo_move_to_lru_tail(bo, bo->lru_mem_type, bo->lru_prio,
>>                      NULL);
>>          spin_unlock(&ttm_glob.lru_lock);
>>      }
>> }
>>
>> In addition, bo->lru_mem_type and bo->lru_prio would need to be protected by
>> the lru lock, and updated together with the LRU list position, which would
>> be the extra complexity in fastpaths. Wouldn't that resolve the pin - lru
>> inconsistency?
>>
>> But yeah if vmwgfx is the only driver hitting trouble because of this, then
>> I agree let's leave it for when / if it becomes needed. Having had that
>> requirement in the Intel driver would have complicated the dma_resv work
>> quite a lot.
> Yeah I know i915 does a lot of unpin in interesting places, and that's
> part of why I'm worried. I've seen some bugfixes fly by that dropped
> dma_resv_lock around unpin to fix really scary looking stalls
> (rcu_synchronize vs other stuff and lockdep did not catch it). And once I
> see that kind of stuff I'm heavily leaning towards simpler locking we can
> grasp, just to be able to stay on top of the bugs. Because the bugfix did
> not come with any clear explanation for why not taking dma_resv_lock
> actually helps, or any other clear and in-depth locking analysis.
>
> I think we also still have some temporary pin/unpin flying about, but
> maybe those are all gone now.

No, unfortunately they aren't and yes, I now see your point. And yes, if 
we can reduce the pinning to only display kinds of things this wouldn't 
really matter. But if not, I'm afraid the alternative is that drivers 
will have to start resorting to assuming perhaps incorrectly that an 
object is isolated (only a single user), and that therefore tryreserve 
for unpinning will always succeed, which is fragile at best. (This is 
what vmwgfx does for the pagetables). And pinning pagetables in that 
case makes sense since it reduces the number of buffers to validate and 
fence by 50%...

/Thomas



_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] drm/ttm: make ttm_bo_unpin more defensive
  2021-03-16 11:23                 ` Thomas Hellström (Intel)
@ 2021-03-16 14:07                   ` Daniel Vetter
  2021-03-16 18:18                     ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Vetter @ 2021-03-16 14:07 UTC (permalink / raw)
  To: Thomas Hellström (Intel); +Cc: Christian König, dri-devel

On Tue, Mar 16, 2021 at 12:24 PM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 3/16/21 12:06 PM, Daniel Vetter wrote:
> > On Tue, Mar 16, 2021 at 11:38:53AM +0100, Thomas Hellström (Intel) wrote:
> >> Hi,
> >>
> >> On 3/16/21 10:27 AM, Daniel Vetter wrote:
> >>> On Mon, Mar 15, 2021 at 08:00:30PM +0100, Thomas Hellström (Intel) wrote:
> >>>> On 3/15/21 7:47 PM, Christian König wrote:
> >>>>> Am 15.03.21 um 18:08 schrieb Thomas Hellström (Intel):
> >>>>>> On 3/15/21 11:26 AM, Christian König wrote:
> >>>>>>> Am 13.03.21 um 19:29 schrieb Thomas Hellström (Intel):
> >>>>>>>> Hi, Christian
> >>>>>>>>
> >>>>>>>> On 3/12/21 10:38 AM, Christian König wrote:
> >>>>>>>>> We seem to have some more driver bugs than thought.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
> >>>>>>>>> ---
> >>>>>>>>>     include/drm/ttm/ttm_bo_api.h | 6 ++++--
> >>>>>>>>>     1 file changed, 4 insertions(+), 2 deletions(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/include/drm/ttm/ttm_bo_api.h
> >>>>>>>>> b/include/drm/ttm/ttm_bo_api.h
> >>>>>>>>> index 4fb523dfab32..df9fe596e7c5 100644
> >>>>>>>>> --- a/include/drm/ttm/ttm_bo_api.h
> >>>>>>>>> +++ b/include/drm/ttm/ttm_bo_api.h
> >>>>>>>>> @@ -603,9 +603,11 @@ static inline void
> >>>>>>>>> ttm_bo_pin(struct ttm_buffer_object *bo)
> >>>>>>>>>     static inline void ttm_bo_unpin(struct ttm_buffer_object *bo)
> >>>>>>>>>     {
> >>>>>>>>>         dma_resv_assert_held(bo->base.resv);
> >>>>>>>>> -    WARN_ON_ONCE(!bo->pin_count);
> >>>>>>>>>         WARN_ON_ONCE(!kref_read(&bo->kref));
> >>>>>>>>> -    --bo->pin_count;
> >>>>>>>>> +    if (bo->pin_count)
> >>>>>>>>> +        --bo->pin_count;
> >>>>>>>>> +    else
> >>>>>>>>> +        WARN_ON_ONCE(true);
> >>>>>>>>>     }
> >>>>>>>>>       int ttm_mem_evict_first(struct ttm_device *bdev,
> >>>>>>>> Since I now have been staring for half a year at the code of
> >>>>>>>> the driver that made pinning an art, I have a couple of
> >>>>>>>> suggestions here, Could we use an atomic for pin_count,
> >>>>>>>> allowing unlocked unpinning but require the lock only for
> >>>>>>>> pin_count transition 0->1, (but unlocked for
> >>>>>>>> pin_if_already_pinned). In particular I think vmwgfx would
> >>>>>>>> benefit from unlocked unpins. Also if the atomic were a
> >>>>>>>> refcount_t, that would probably give you the above
> >>>>>>>> behaviour?
> >>>>>>> Nope, I've considered this as well while moving the pin count into TTM.
> >>>>>>>
> >>>>>>> The problem is that you not only need the BO reserved for 0->1
> >>>>>>> transitions, but also for 1->0 transitions to move the BO on the
> >>>>>>> LRU correctly.
> >>>>>> Ah, and there is no way to have us know the correct LRU list without
> >>>>>> reservation?
> >>>>> Not really, there is always the chance that CPU A is reducing the count
> >>>>> from 1->0 while CPU B is doing 0->1 and you end up with a LRU status
> >>>>> which doesn't match the pin count.
> >>>>>
> >>>>> We could try to do something like a loop updating the LRU status until
> >>>>> it matches the pin count, but the implications of that are usually
> >>>>> really nasty.
> >>>>>
> >>>> OK, yeah I was more thinking along the lines of protecting the LRU status
> >>>> with the global lru lock and unpin would then be
> >>>>
> >>>> if (refcount_dec_and_lock(&bo->pin_count, &ttm_glob.lru_lock)) {
> >>>>       add_to_relevant_lrus(bo, bo->lru_status);
> >>>>       spin_unlock(&ttm_glob.lru_lock);
> >>>> }
> >>>>
> >>>> But looking at ttm_bo_move_to_lru_tail() I realize that's not really trivial
> >>>> anymore. I hope it's doable at some point though.
> >>>>
> >>>> But meanwhile, perhaps TTM needs to accept and pave over that drivers are in
> >>>> fact destroying pinned buffers?
> >>> Do we have more trouble than the very fancy tricks vmwgfx does? If so I
> >>> think we could do a small helper that like ttm_dont_check_unpin() to shut
> >>> it up. Since vmwgfx drivers tend to not be loaded with any other drivers
> >>> that shouldn't reduce any test coverage. And allows vmwgfx folks to figure
> >>> out what to do - maybe you do need your own in-house pin/unpin for these
> >>> special bo?
> >>>
> >>> I did try to parse your reply in the other thread, and tbh I didn't grock
> >>> it.
> >> Not sure if you mean the description was unclear or if you thought it was a
> >> bad idea, but in case the former, what I mean is
> > My unclarity is on what you explained in the vmwgfx thread about how
> > vmwgfx uses its pin/unpin and why. That was full of vmwgfx concepts I
> > don't know about. This here looks reasonably clear to me, but it does have
> > the race Christian sees I think.
>
> Hmm, OK, I thought transition 0->1 under the LRU lock would have
> resolved that...

Hm now you make me question how this all works again, and when it
stops working. I do think yours works, but it triggers another icky
thing I have with refcounts and locking. And at this point pin_count
is more a refcount really ...

So imo the way to do a weak refcount is kref_get_unless_zero, because
that way you do not have to hold the lock on the final unref, but
instead you can remove all weakly referenced entries (lookup caches,
lru, whatever) simply in your cleanup code (and there still grab the
right lock ofc), and the refcount_dec isn't special.

The other approach is the refcount_dec_and_lock, and this works, but
only for a single lock. So as soon as you have multiple weak
references, or possible weak references, then it falls apart, because
you can't take all the locks for all of them for the final decrement.

Now ofc our pin_count here isn't a weak reference, but I think it's
easier to analyze with that concept in mind. But conclusion is the
same, as soon as we need multiple lru locks for the final unpin, we're
toast. And Christian wants to move to a per-memory region lru lock, so
that would just dig us in in another corner.

So that does work now that I look closer, but it kinda worries me for
other reasons. And handling weak references with the refcount_and_lock
has cause a lot of pain in the past (e.g. just moving the final gem_bo
unreference from under drm_device->struct_mutex was a multi-year
effort full of pain). drm_prime import/export caches also had similar
problems, and the same applies to some of the things we do with kms
objects. It's become a bit an anti-pattern for handling refcount_t and
locks protecting weak references for me at least.

> >> static void ttm_bo_pin(struct ttm_buffer_object *bo)
> >> {
> >>
> >> dma_resv_assert_held()                        // No surprises if an evictor
> >> determined that this object is not pinned.
> >>
> >>      if (!refcount_inc_not_zero(&bo->pin_count)) { // Could be made
> >> ttm_bo_pin_if_pinned() and exported if there are users
> >>          spin_lock(&ttm_glob.lru_lock);            // Don't race with unpin
> >> 1->0
> >>          if (refcount_read(&bo->pin_count) == 0 && bo->lru)
> >>              ttm_bo_del_from_lru(bo);
> >>          refcount_inc(&bo->pin_count);
> >>          spin_unlock(&ttm_glob.lru_lock);
> >>      }
> >> }
> >>
> >> static void ttm_bo_unpin(struct ttm_buffer_object *bo)
> >> {
> >>      if (refcount_dec_and_lock(&bo->pin_count, &ttm_glob.lru_lock)) {
> >>          ttm_bo_move_to_lru_tail(bo, bo->lru_mem_type, bo->lru_prio,
> >>                      NULL);
> >>          spin_unlock(&ttm_glob.lru_lock);
> >>      }
> >> }
> >>
> >> In addition, bo->lru_mem_type and bo->lru_prio would need to be protected by
> >> the lru lock, and updated together with the LRU list position, which would
> >> be the extra complexity in fastpaths. Wouldn't that resolve the pin - lru
> >> inconsistency?
> >>
> >> But yeah if vmwgfx is the only driver hitting trouble because of this, then
> >> I agree let's leave it for when / if it becomes needed. Having had that
> >> requirement in the Intel driver would have complicated the dma_resv work
> >> quite a lot.
> > Yeah I know i915 does a lot of unpin in interesting places, and that's
> > part of why I'm worried. I've seen some bugfixes fly by that dropped
> > dma_resv_lock around unpin to fix really scary looking stalls
> > (rcu_synchronize vs other stuff and lockdep did not catch it). And once I
> > see that kind of stuff I'm heavily leaning towards simpler locking we can
> > grasp, just to be able to stay on top of the bugs. Because the bugfix did
> > not come with any clear explanation for why not taking dma_resv_lock
> > actually helps, or any other clear and in-depth locking analysis.
> >
> > I think we also still have some temporary pin/unpin flying about, but
> > maybe those are all gone now.
>
> No, unfortunately they aren't and yes, I now see your point. And yes, if
> we can reduce the pinning to only display kinds of things this wouldn't
> really matter. But if not, I'm afraid the alternative is that drivers
> will have to start resorting to assuming perhaps incorrectly that an
> object is isolated (only a single user), and that therefore tryreserve
> for unpinning will always succeed, which is fragile at best. (This is
> what vmwgfx does for the pagetables). And pinning pagetables in that
> case makes sense since it reduces the number of buffers to validate and
> fence by 50%...

Yeah WARN_ON(tryreserve) is also a very fragile approach.

Now for pagetables I think amdgpu has a very nice approach.
Essentially any pagetable object you assign to the same dma_resv
object that you have on your vm structure. That way you dma_resv_lock
your vm, and all your vm related buffer objects (all the pagetables
really) are locked and can be reserved. Also, activity tracking (if
you don't have tlb flushes through mmio, only through in-line command
submission) with dma_fences also becomes O(1).

Now even that isn't good enough since you still have to
O(#pagetable_objects) reserve your buffers. But again amdgpu has a
trick for that: For these private objects they're essentially treated
as always bound for that vm (because really, they are), so you
optimize for the unbound case: When one of these vm objects gets
evicted we put them on the vm->evicted_bo list. Since we know these
objects are private we now the single list entry per bo isn't a
problem, these objects are never used by any other vm.

The final trick is the lru, and that's solved with the bulk move
tricks. End result: No need to pin pagetables while still getting O(1)
execbuf for them. It's really neat.

On top of that amdgpu also has an allocation flag where userspace
promises to never share a buffer beyond that single VM it's allocated
for. That allows them to also handle all the various state buffers and
anything else that's guaranteed to be unshared with the same trick.
Which for many buffers essentially gives you O(1) execbuf for what
we're trying to do with our VM_BIND. All while still making it look to
the driver (and ttm) at large that the buffer is fully managed with
dma_resv and fences, running around on the lru like anything else. And
so integrates into the eviction code with zero special cases.

Last time I looked our own VM_BIND isn't quite there yet with that
much finesse in the implementation.

But also, I'm kinda diverging a bit from the topic :-)

Back to i915: I do think we need to give all our unpin a serious
review. A lot of them are motivated by lock dropping tricks to come
from the good old days of everything is protected by
dev->struct_mutex. So essentially we've hand-rolled a fairly bad
semaphore with our pin_count (you can't even wait on all the temp pins
to drop out!). And imo part of the longer term switch to dma_resv
should be to check which of these lock dropping tricks we really still
need, and which we can sunset.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] drm/ttm: make ttm_bo_unpin more defensive
  2021-03-16 14:07                   ` Daniel Vetter
@ 2021-03-16 18:18                     ` Thomas Hellström (Intel)
  2021-03-16 18:28                       ` Daniel Vetter
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-16 18:18 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Christian König, dri-devel


On 3/16/21 3:07 PM, Daniel Vetter wrote:
> On Tue, Mar 16, 2021 at 12:24 PM Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>>
>> On 3/16/21 12:06 PM, Daniel Vetter wrote:
>>> On Tue, Mar 16, 2021 at 11:38:53AM +0100, Thomas Hellström (Intel) wrote:
>>>> Hi,
>>>>
>>>> On 3/16/21 10:27 AM, Daniel Vetter wrote:
>>>>> On Mon, Mar 15, 2021 at 08:00:30PM +0100, Thomas Hellström (Intel) wrote:
>>>>>> On 3/15/21 7:47 PM, Christian König wrote:
>>>>>>> Am 15.03.21 um 18:08 schrieb Thomas Hellström (Intel):
>>>>>>>> On 3/15/21 11:26 AM, Christian König wrote:
>>>>>>>>> Am 13.03.21 um 19:29 schrieb Thomas Hellström (Intel):
>>>>>>>>>> Hi, Christian
>>>>>>>>>>
>>>>>>>>>> On 3/12/21 10:38 AM, Christian König wrote:
>>>>>>>>>>> We seem to have some more driver bugs than thought.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>> ---
>>>>>>>>>>>      include/drm/ttm/ttm_bo_api.h | 6 ++++--
>>>>>>>>>>>      1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/include/drm/ttm/ttm_bo_api.h
>>>>>>>>>>> b/include/drm/ttm/ttm_bo_api.h
>>>>>>>>>>> index 4fb523dfab32..df9fe596e7c5 100644
>>>>>>>>>>> --- a/include/drm/ttm/ttm_bo_api.h
>>>>>>>>>>> +++ b/include/drm/ttm/ttm_bo_api.h
>>>>>>>>>>> @@ -603,9 +603,11 @@ static inline void
>>>>>>>>>>> ttm_bo_pin(struct ttm_buffer_object *bo)
>>>>>>>>>>>      static inline void ttm_bo_unpin(struct ttm_buffer_object *bo)
>>>>>>>>>>>      {
>>>>>>>>>>>          dma_resv_assert_held(bo->base.resv);
>>>>>>>>>>> -    WARN_ON_ONCE(!bo->pin_count);
>>>>>>>>>>>          WARN_ON_ONCE(!kref_read(&bo->kref));
>>>>>>>>>>> -    --bo->pin_count;
>>>>>>>>>>> +    if (bo->pin_count)
>>>>>>>>>>> +        --bo->pin_count;
>>>>>>>>>>> +    else
>>>>>>>>>>> +        WARN_ON_ONCE(true);
>>>>>>>>>>>      }
>>>>>>>>>>>        int ttm_mem_evict_first(struct ttm_device *bdev,
>>>>>>>>>> Since I now have been staring for half a year at the code of
>>>>>>>>>> the driver that made pinning an art, I have a couple of
>>>>>>>>>> suggestions here, Could we use an atomic for pin_count,
>>>>>>>>>> allowing unlocked unpinning but require the lock only for
>>>>>>>>>> pin_count transition 0->1, (but unlocked for
>>>>>>>>>> pin_if_already_pinned). In particular I think vmwgfx would
>>>>>>>>>> benefit from unlocked unpins. Also if the atomic were a
>>>>>>>>>> refcount_t, that would probably give you the above
>>>>>>>>>> behaviour?
>>>>>>>>> Nope, I've considered this as well while moving the pin count into TTM.
>>>>>>>>>
>>>>>>>>> The problem is that you not only need the BO reserved for 0->1
>>>>>>>>> transitions, but also for 1->0 transitions to move the BO on the
>>>>>>>>> LRU correctly.
>>>>>>>> Ah, and there is no way to have us know the correct LRU list without
>>>>>>>> reservation?
>>>>>>> Not really, there is always the chance that CPU A is reducing the count
>>>>>>> from 1->0 while CPU B is doing 0->1 and you end up with a LRU status
>>>>>>> which doesn't match the pin count.
>>>>>>>
>>>>>>> We could try to do something like a loop updating the LRU status until
>>>>>>> it matches the pin count, but the implications of that are usually
>>>>>>> really nasty.
>>>>>>>
>>>>>> OK, yeah I was more thinking along the lines of protecting the LRU status
>>>>>> with the global lru lock and unpin would then be
>>>>>>
>>>>>> if (refcount_dec_and_lock(&bo->pin_count, &ttm_glob.lru_lock)) {
>>>>>>        add_to_relevant_lrus(bo, bo->lru_status);
>>>>>>        spin_unlock(&ttm_glob.lru_lock);
>>>>>> }
>>>>>>
>>>>>> But looking at ttm_bo_move_to_lru_tail() I realize that's not really trivial
>>>>>> anymore. I hope it's doable at some point though.
>>>>>>
>>>>>> But meanwhile, perhaps TTM needs to accept and pave over that drivers are in
>>>>>> fact destroying pinned buffers?
>>>>> Do we have more trouble than the very fancy tricks vmwgfx does? If so I
>>>>> think we could do a small helper that like ttm_dont_check_unpin() to shut
>>>>> it up. Since vmwgfx drivers tend to not be loaded with any other drivers
>>>>> that shouldn't reduce any test coverage. And allows vmwgfx folks to figure
>>>>> out what to do - maybe you do need your own in-house pin/unpin for these
>>>>> special bo?
>>>>>
>>>>> I did try to parse your reply in the other thread, and tbh I didn't grock
>>>>> it.
>>>> Not sure if you mean the description was unclear or if you thought it was a
>>>> bad idea, but in case the former, what I mean is
>>> My unclarity is on what you explained in the vmwgfx thread about how
>>> vmwgfx uses its pin/unpin and why. That was full of vmwgfx concepts I
>>> don't know about. This here looks reasonably clear to me, but it does have
>>> the race Christian sees I think.
>> Hmm, OK, I thought transition 0->1 under the LRU lock would have
>> resolved that...
> Hm now you make me question how this all works again, and when it
> stops working. I do think yours works, but it triggers another icky
> thing I have with refcounts and locking. And at this point pin_count
> is more a refcount really ...
>
> So imo the way to do a weak refcount is kref_get_unless_zero, because
> that way you do not have to hold the lock on the final unref, but
> instead you can remove all weakly referenced entries (lookup caches,
> lru, whatever) simply in your cleanup code (and there still grab the
> right lock ofc), and the refcount_dec isn't special.
>
> The other approach is the refcount_dec_and_lock, and this works, but
> only for a single lock. So as soon as you have multiple weak
> references, or possible weak references, then it falls apart, because
> you can't take all the locks for all of them for the final decrement.
>
> Now ofc our pin_count here isn't a weak reference, but I think it's
> easier to analyze with that concept in mind. But conclusion is the
> same, as soon as we need multiple lru locks for the final unpin, we're
> toast. And Christian wants to move to a per-memory region lru lock, so
> that would just dig us in in another corner.
Well as long as there is only one lock at the time protecting the bo LRU 
state, that's doable (a lock pointer becomes part of the bo LRU state 
and can only be updated with the lock it currently points to held, 
similar to locking i915 engines), but then we bump the level of 
complexity even higher.
>
> So that does work now that I look closer, but it kinda worries me for
> other reasons. And handling weak references with the refcount_and_lock
> has cause a lot of pain in the past (e.g. just moving the final gem_bo
> unreference from under drm_device->struct_mutex was a multi-year
> effort full of pain). drm_prime import/export caches also had similar
> problems, and the same applies to some of the things we do with kms
> objects. It's become a bit an anti-pattern for handling refcount_t and
> locks protecting weak references for me at least.

Yeah the difference compared to weak references and kref_get_unless_zero 
is that we now need to handle and never race with the pincount 
transition 0->1 which can never happen in the weak reference case. So we 
need locking around those transitions.

>
>>>> static void ttm_bo_pin(struct ttm_buffer_object *bo)
>>>> {
>>>>
>>>> dma_resv_assert_held()                        // No surprises if an evictor
>>>> determined that this object is not pinned.
>>>>
>>>>       if (!refcount_inc_not_zero(&bo->pin_count)) { // Could be made
>>>> ttm_bo_pin_if_pinned() and exported if there are users
>>>>           spin_lock(&ttm_glob.lru_lock);            // Don't race with unpin
>>>> 1->0
>>>>           if (refcount_read(&bo->pin_count) == 0 && bo->lru)
>>>>               ttm_bo_del_from_lru(bo);
>>>>           refcount_inc(&bo->pin_count);
>>>>           spin_unlock(&ttm_glob.lru_lock);
>>>>       }
>>>> }
>>>>
>>>> static void ttm_bo_unpin(struct ttm_buffer_object *bo)
>>>> {
>>>>       if (refcount_dec_and_lock(&bo->pin_count, &ttm_glob.lru_lock)) {
>>>>           ttm_bo_move_to_lru_tail(bo, bo->lru_mem_type, bo->lru_prio,
>>>>                       NULL);
>>>>           spin_unlock(&ttm_glob.lru_lock);
>>>>       }
>>>> }
>>>>
>>>> In addition, bo->lru_mem_type and bo->lru_prio would need to be protected by
>>>> the lru lock, and updated together with the LRU list position, which would
>>>> be the extra complexity in fastpaths. Wouldn't that resolve the pin - lru
>>>> inconsistency?
>>>>
>>>> But yeah if vmwgfx is the only driver hitting trouble because of this, then
>>>> I agree let's leave it for when / if it becomes needed. Having had that
>>>> requirement in the Intel driver would have complicated the dma_resv work
>>>> quite a lot.
>>> Yeah I know i915 does a lot of unpin in interesting places, and that's
>>> part of why I'm worried. I've seen some bugfixes fly by that dropped
>>> dma_resv_lock around unpin to fix really scary looking stalls
>>> (rcu_synchronize vs other stuff and lockdep did not catch it). And once I
>>> see that kind of stuff I'm heavily leaning towards simpler locking we can
>>> grasp, just to be able to stay on top of the bugs. Because the bugfix did
>>> not come with any clear explanation for why not taking dma_resv_lock
>>> actually helps, or any other clear and in-depth locking analysis.
>>>
>>> I think we also still have some temporary pin/unpin flying about, but
>>> maybe those are all gone now.
>> No, unfortunately they aren't and yes, I now see your point. And yes, if
>> we can reduce the pinning to only display kinds of things this wouldn't
>> really matter. But if not, I'm afraid the alternative is that drivers
>> will have to start resorting to assuming perhaps incorrectly that an
>> object is isolated (only a single user), and that therefore tryreserve
>> for unpinning will always succeed, which is fragile at best. (This is
>> what vmwgfx does for the pagetables). And pinning pagetables in that
>> case makes sense since it reduces the number of buffers to validate and
>> fence by 50%...
> Yeah WARN_ON(tryreserve) is also a very fragile approach.
>
> Now for pagetables I think amdgpu has a very nice approach.
> Essentially any pagetable object you assign to the same dma_resv
> object that you have on your vm structure. That way you dma_resv_lock
> your vm, and all your vm related buffer objects (all the pagetables
> really) are locked and can be reserved. Also, activity tracking (if
> you don't have tlb flushes through mmio, only through in-line command
> submission) with dma_fences also becomes O(1).
>
> Now even that isn't good enough since you still have to
> O(#pagetable_objects) reserve your buffers. But again amdgpu has a
> trick for that: For these private objects they're essentially treated
> as always bound for that vm (because really, they are), so you
> optimize for the unbound case: When one of these vm objects gets
> evicted we put them on the vm->evicted_bo list. Since we know these
> objects are private we now the single list entry per bo isn't a
> problem, these objects are never used by any other vm.
>
> The final trick is the lru, and that's solved with the bulk move
> tricks. End result: No need to pin pagetables while still getting O(1)
> execbuf for them. It's really neat.
>
> On top of that amdgpu also has an allocation flag where userspace
> promises to never share a buffer beyond that single VM it's allocated
> for. That allows them to also handle all the various state buffers and
> anything else that's guaranteed to be unshared with the same trick.
> Which for many buffers essentially gives you O(1) execbuf for what
> we're trying to do with our VM_BIND. All while still making it look to
> the driver (and ttm) at large that the buffer is fully managed with
> dma_resv and fences, running around on the lru like anything else. And
> so integrates into the eviction code with zero special cases.
>
> Last time I looked our own VM_BIND isn't quite there yet with that
> much finesse in the implementation.
>
> But also, I'm kinda diverging a bit from the topic :-)
>
> Back to i915: I do think we need to give all our unpin a serious
> review. A lot of them are motivated by lock dropping tricks to come
> from the good old days of everything is protected by
> dev->struct_mutex. So essentially we've hand-rolled a fairly bad
> semaphore with our pin_count (you can't even wait on all the temp pins
> to drop out!). And imo part of the longer term switch to dma_resv
> should be to check which of these lock dropping tricks we really still
> need, and which we can sunset.

Agreed.
And if the reserve at unpin really becomes a sever pain, I think we at 
least have a way out.

/Thomas



> -Daniel
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] drm/ttm: make ttm_bo_unpin more defensive
  2021-03-16 18:18                     ` Thomas Hellström (Intel)
@ 2021-03-16 18:28                       ` Daniel Vetter
  2021-03-16 18:58                         ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Vetter @ 2021-03-16 18:28 UTC (permalink / raw)
  To: Thomas Hellström (Intel); +Cc: Christian König, dri-devel

On Tue, Mar 16, 2021 at 7:18 PM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 3/16/21 3:07 PM, Daniel Vetter wrote:
> > On Tue, Mar 16, 2021 at 12:24 PM Thomas Hellström (Intel)
> > <thomas_os@shipmail.org> wrote:
> >>
> >> On 3/16/21 12:06 PM, Daniel Vetter wrote:
> >>> On Tue, Mar 16, 2021 at 11:38:53AM +0100, Thomas Hellström (Intel) wrote:
> >>>> Hi,
> >>>>
> >>>> On 3/16/21 10:27 AM, Daniel Vetter wrote:
> >>>>> On Mon, Mar 15, 2021 at 08:00:30PM +0100, Thomas Hellström (Intel) wrote:
> >>>>>> On 3/15/21 7:47 PM, Christian König wrote:
> >>>>>>> Am 15.03.21 um 18:08 schrieb Thomas Hellström (Intel):
> >>>>>>>> On 3/15/21 11:26 AM, Christian König wrote:
> >>>>>>>>> Am 13.03.21 um 19:29 schrieb Thomas Hellström (Intel):
> >>>>>>>>>> Hi, Christian
> >>>>>>>>>>
> >>>>>>>>>> On 3/12/21 10:38 AM, Christian König wrote:
> >>>>>>>>>>> We seem to have some more driver bugs than thought.
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
> >>>>>>>>>>> ---
> >>>>>>>>>>>      include/drm/ttm/ttm_bo_api.h | 6 ++++--
> >>>>>>>>>>>      1 file changed, 4 insertions(+), 2 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/include/drm/ttm/ttm_bo_api.h
> >>>>>>>>>>> b/include/drm/ttm/ttm_bo_api.h
> >>>>>>>>>>> index 4fb523dfab32..df9fe596e7c5 100644
> >>>>>>>>>>> --- a/include/drm/ttm/ttm_bo_api.h
> >>>>>>>>>>> +++ b/include/drm/ttm/ttm_bo_api.h
> >>>>>>>>>>> @@ -603,9 +603,11 @@ static inline void
> >>>>>>>>>>> ttm_bo_pin(struct ttm_buffer_object *bo)
> >>>>>>>>>>>      static inline void ttm_bo_unpin(struct ttm_buffer_object *bo)
> >>>>>>>>>>>      {
> >>>>>>>>>>>          dma_resv_assert_held(bo->base.resv);
> >>>>>>>>>>> -    WARN_ON_ONCE(!bo->pin_count);
> >>>>>>>>>>>          WARN_ON_ONCE(!kref_read(&bo->kref));
> >>>>>>>>>>> -    --bo->pin_count;
> >>>>>>>>>>> +    if (bo->pin_count)
> >>>>>>>>>>> +        --bo->pin_count;
> >>>>>>>>>>> +    else
> >>>>>>>>>>> +        WARN_ON_ONCE(true);
> >>>>>>>>>>>      }
> >>>>>>>>>>>        int ttm_mem_evict_first(struct ttm_device *bdev,
> >>>>>>>>>> Since I now have been staring for half a year at the code of
> >>>>>>>>>> the driver that made pinning an art, I have a couple of
> >>>>>>>>>> suggestions here, Could we use an atomic for pin_count,
> >>>>>>>>>> allowing unlocked unpinning but require the lock only for
> >>>>>>>>>> pin_count transition 0->1, (but unlocked for
> >>>>>>>>>> pin_if_already_pinned). In particular I think vmwgfx would
> >>>>>>>>>> benefit from unlocked unpins. Also if the atomic were a
> >>>>>>>>>> refcount_t, that would probably give you the above
> >>>>>>>>>> behaviour?
> >>>>>>>>> Nope, I've considered this as well while moving the pin count into TTM.
> >>>>>>>>>
> >>>>>>>>> The problem is that you not only need the BO reserved for 0->1
> >>>>>>>>> transitions, but also for 1->0 transitions to move the BO on the
> >>>>>>>>> LRU correctly.
> >>>>>>>> Ah, and there is no way to have us know the correct LRU list without
> >>>>>>>> reservation?
> >>>>>>> Not really, there is always the chance that CPU A is reducing the count
> >>>>>>> from 1->0 while CPU B is doing 0->1 and you end up with a LRU status
> >>>>>>> which doesn't match the pin count.
> >>>>>>>
> >>>>>>> We could try to do something like a loop updating the LRU status until
> >>>>>>> it matches the pin count, but the implications of that are usually
> >>>>>>> really nasty.
> >>>>>>>
> >>>>>> OK, yeah I was more thinking along the lines of protecting the LRU status
> >>>>>> with the global lru lock and unpin would then be
> >>>>>>
> >>>>>> if (refcount_dec_and_lock(&bo->pin_count, &ttm_glob.lru_lock)) {
> >>>>>>        add_to_relevant_lrus(bo, bo->lru_status);
> >>>>>>        spin_unlock(&ttm_glob.lru_lock);
> >>>>>> }
> >>>>>>
> >>>>>> But looking at ttm_bo_move_to_lru_tail() I realize that's not really trivial
> >>>>>> anymore. I hope it's doable at some point though.
> >>>>>>
> >>>>>> But meanwhile, perhaps TTM needs to accept and pave over that drivers are in
> >>>>>> fact destroying pinned buffers?
> >>>>> Do we have more trouble than the very fancy tricks vmwgfx does? If so I
> >>>>> think we could do a small helper that like ttm_dont_check_unpin() to shut
> >>>>> it up. Since vmwgfx drivers tend to not be loaded with any other drivers
> >>>>> that shouldn't reduce any test coverage. And allows vmwgfx folks to figure
> >>>>> out what to do - maybe you do need your own in-house pin/unpin for these
> >>>>> special bo?
> >>>>>
> >>>>> I did try to parse your reply in the other thread, and tbh I didn't grock
> >>>>> it.
> >>>> Not sure if you mean the description was unclear or if you thought it was a
> >>>> bad idea, but in case the former, what I mean is
> >>> My unclarity is on what you explained in the vmwgfx thread about how
> >>> vmwgfx uses its pin/unpin and why. That was full of vmwgfx concepts I
> >>> don't know about. This here looks reasonably clear to me, but it does have
> >>> the race Christian sees I think.
> >> Hmm, OK, I thought transition 0->1 under the LRU lock would have
> >> resolved that...
> > Hm now you make me question how this all works again, and when it
> > stops working. I do think yours works, but it triggers another icky
> > thing I have with refcounts and locking. And at this point pin_count
> > is more a refcount really ...
> >
> > So imo the way to do a weak refcount is kref_get_unless_zero, because
> > that way you do not have to hold the lock on the final unref, but
> > instead you can remove all weakly referenced entries (lookup caches,
> > lru, whatever) simply in your cleanup code (and there still grab the
> > right lock ofc), and the refcount_dec isn't special.
> >
> > The other approach is the refcount_dec_and_lock, and this works, but
> > only for a single lock. So as soon as you have multiple weak
> > references, or possible weak references, then it falls apart, because
> > you can't take all the locks for all of them for the final decrement.
> >
> > Now ofc our pin_count here isn't a weak reference, but I think it's
> > easier to analyze with that concept in mind. But conclusion is the
> > same, as soon as we need multiple lru locks for the final unpin, we're
> > toast. And Christian wants to move to a per-memory region lru lock, so
> > that would just dig us in in another corner.
> Well as long as there is only one lock at the time protecting the bo LRU
> state, that's doable (a lock pointer becomes part of the bo LRU state
> and can only be updated with the lock it currently points to held,
> similar to locking i915 engines), but then we bump the level of
> complexity even higher.

Yeah, but how do we then protect the pointer to the spinlock? So you
need to throw in some lockless/retry magic for that too. It gets real
fancy real fast I think :-/

That's "pointer to the lock you need isn't protected" has killed a lot
of really cool locking designs in the past. E.g the trouble with the
amdgpu dma_resv object sharing for truly private objects is that you
can never unshare that lock. Because you simply have no control over
what all the other drivers in the system are doing, and with dynamic
dma-buf we've definitely reached the point where we've leaked that
detail all across drivers.

> > So that does work now that I look closer, but it kinda worries me for
> > other reasons. And handling weak references with the refcount_and_lock
> > has cause a lot of pain in the past (e.g. just moving the final gem_bo
> > unreference from under drm_device->struct_mutex was a multi-year
> > effort full of pain). drm_prime import/export caches also had similar
> > problems, and the same applies to some of the things we do with kms
> > objects. It's become a bit an anti-pattern for handling refcount_t and
> > locks protecting weak references for me at least.
>
> Yeah the difference compared to weak references and kref_get_unless_zero
> is that we now need to handle and never race with the pincount
> transition 0->1 which can never happen in the weak reference case. So we
> need locking around those transitions.

Ah yes good insight, but also means we can't use that for a pin_count.

> >>>> static void ttm_bo_pin(struct ttm_buffer_object *bo)
> >>>> {
> >>>>
> >>>> dma_resv_assert_held()                        // No surprises if an evictor
> >>>> determined that this object is not pinned.
> >>>>
> >>>>       if (!refcount_inc_not_zero(&bo->pin_count)) { // Could be made
> >>>> ttm_bo_pin_if_pinned() and exported if there are users
> >>>>           spin_lock(&ttm_glob.lru_lock);            // Don't race with unpin
> >>>> 1->0
> >>>>           if (refcount_read(&bo->pin_count) == 0 && bo->lru)
> >>>>               ttm_bo_del_from_lru(bo);
> >>>>           refcount_inc(&bo->pin_count);
> >>>>           spin_unlock(&ttm_glob.lru_lock);
> >>>>       }
> >>>> }
> >>>>
> >>>> static void ttm_bo_unpin(struct ttm_buffer_object *bo)
> >>>> {
> >>>>       if (refcount_dec_and_lock(&bo->pin_count, &ttm_glob.lru_lock)) {
> >>>>           ttm_bo_move_to_lru_tail(bo, bo->lru_mem_type, bo->lru_prio,
> >>>>                       NULL);
> >>>>           spin_unlock(&ttm_glob.lru_lock);
> >>>>       }
> >>>> }
> >>>>
> >>>> In addition, bo->lru_mem_type and bo->lru_prio would need to be protected by
> >>>> the lru lock, and updated together with the LRU list position, which would
> >>>> be the extra complexity in fastpaths. Wouldn't that resolve the pin - lru
> >>>> inconsistency?
> >>>>
> >>>> But yeah if vmwgfx is the only driver hitting trouble because of this, then
> >>>> I agree let's leave it for when / if it becomes needed. Having had that
> >>>> requirement in the Intel driver would have complicated the dma_resv work
> >>>> quite a lot.
> >>> Yeah I know i915 does a lot of unpin in interesting places, and that's
> >>> part of why I'm worried. I've seen some bugfixes fly by that dropped
> >>> dma_resv_lock around unpin to fix really scary looking stalls
> >>> (rcu_synchronize vs other stuff and lockdep did not catch it). And once I
> >>> see that kind of stuff I'm heavily leaning towards simpler locking we can
> >>> grasp, just to be able to stay on top of the bugs. Because the bugfix did
> >>> not come with any clear explanation for why not taking dma_resv_lock
> >>> actually helps, or any other clear and in-depth locking analysis.
> >>>
> >>> I think we also still have some temporary pin/unpin flying about, but
> >>> maybe those are all gone now.
> >> No, unfortunately they aren't and yes, I now see your point. And yes, if
> >> we can reduce the pinning to only display kinds of things this wouldn't
> >> really matter. But if not, I'm afraid the alternative is that drivers
> >> will have to start resorting to assuming perhaps incorrectly that an
> >> object is isolated (only a single user), and that therefore tryreserve
> >> for unpinning will always succeed, which is fragile at best. (This is
> >> what vmwgfx does for the pagetables). And pinning pagetables in that
> >> case makes sense since it reduces the number of buffers to validate and
> >> fence by 50%...
> > Yeah WARN_ON(tryreserve) is also a very fragile approach.
> >
> > Now for pagetables I think amdgpu has a very nice approach.
> > Essentially any pagetable object you assign to the same dma_resv
> > object that you have on your vm structure. That way you dma_resv_lock
> > your vm, and all your vm related buffer objects (all the pagetables
> > really) are locked and can be reserved. Also, activity tracking (if
> > you don't have tlb flushes through mmio, only through in-line command
> > submission) with dma_fences also becomes O(1).
> >
> > Now even that isn't good enough since you still have to
> > O(#pagetable_objects) reserve your buffers. But again amdgpu has a
> > trick for that: For these private objects they're essentially treated
> > as always bound for that vm (because really, they are), so you
> > optimize for the unbound case: When one of these vm objects gets
> > evicted we put them on the vm->evicted_bo list. Since we know these
> > objects are private we now the single list entry per bo isn't a
> > problem, these objects are never used by any other vm.
> >
> > The final trick is the lru, and that's solved with the bulk move
> > tricks. End result: No need to pin pagetables while still getting O(1)
> > execbuf for them. It's really neat.
> >
> > On top of that amdgpu also has an allocation flag where userspace
> > promises to never share a buffer beyond that single VM it's allocated
> > for. That allows them to also handle all the various state buffers and
> > anything else that's guaranteed to be unshared with the same trick.
> > Which for many buffers essentially gives you O(1) execbuf for what
> > we're trying to do with our VM_BIND. All while still making it look to
> > the driver (and ttm) at large that the buffer is fully managed with
> > dma_resv and fences, running around on the lru like anything else. And
> > so integrates into the eviction code with zero special cases.
> >
> > Last time I looked our own VM_BIND isn't quite there yet with that
> > much finesse in the implementation.
> >
> > But also, I'm kinda diverging a bit from the topic :-)
> >
> > Back to i915: I do think we need to give all our unpin a serious
> > review. A lot of them are motivated by lock dropping tricks to come
> > from the good old days of everything is protected by
> > dev->struct_mutex. So essentially we've hand-rolled a fairly bad
> > semaphore with our pin_count (you can't even wait on all the temp pins
> > to drop out!). And imo part of the longer term switch to dma_resv
> > should be to check which of these lock dropping tricks we really still
> > need, and which we can sunset.
>
> Agreed.
> And if the reserve at unpin really becomes a sever pain, I think we at
> least have a way out.

Yeah I really do like these discussions, I always learn new stuff and
usually we get a slightly better understanding of what the tradeoffs
in complexity and the fallback options are.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] drm/ttm: make ttm_bo_unpin more defensive
  2021-03-16 18:28                       ` Daniel Vetter
@ 2021-03-16 18:58                         ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 17+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-16 18:58 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Christian König, dri-devel


On 3/16/21 7:28 PM, Daniel Vetter wrote:
> On Tue, Mar 16, 2021 at 7:18 PM Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>>
>> On 3/16/21 3:07 PM, Daniel Vetter wrote:
>>> On Tue, Mar 16, 2021 at 12:24 PM Thomas Hellström (Intel)
>>> <thomas_os@shipmail.org> wrote:
>>>> On 3/16/21 12:06 PM, Daniel Vetter wrote:
>>>>> On Tue, Mar 16, 2021 at 11:38:53AM +0100, Thomas Hellström (Intel) wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 3/16/21 10:27 AM, Daniel Vetter wrote:
>>>>>>> On Mon, Mar 15, 2021 at 08:00:30PM +0100, Thomas Hellström (Intel) wrote:
>>>>>>>> On 3/15/21 7:47 PM, Christian König wrote:
>>>>>>>>> Am 15.03.21 um 18:08 schrieb Thomas Hellström (Intel):
>>>>>>>>>> On 3/15/21 11:26 AM, Christian König wrote:
>>>>>>>>>>> Am 13.03.21 um 19:29 schrieb Thomas Hellström (Intel):
>>>>>>>>>>>> Hi, Christian
>>>>>>>>>>>>
>>>>>>>>>>>> On 3/12/21 10:38 AM, Christian König wrote:
>>>>>>>>>>>>> We seem to have some more driver bugs than thought.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>       include/drm/ttm/ttm_bo_api.h | 6 ++++--
>>>>>>>>>>>>>       1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/include/drm/ttm/ttm_bo_api.h
>>>>>>>>>>>>> b/include/drm/ttm/ttm_bo_api.h
>>>>>>>>>>>>> index 4fb523dfab32..df9fe596e7c5 100644
>>>>>>>>>>>>> --- a/include/drm/ttm/ttm_bo_api.h
>>>>>>>>>>>>> +++ b/include/drm/ttm/ttm_bo_api.h
>>>>>>>>>>>>> @@ -603,9 +603,11 @@ static inline void
>>>>>>>>>>>>> ttm_bo_pin(struct ttm_buffer_object *bo)
>>>>>>>>>>>>>       static inline void ttm_bo_unpin(struct ttm_buffer_object *bo)
>>>>>>>>>>>>>       {
>>>>>>>>>>>>>           dma_resv_assert_held(bo->base.resv);
>>>>>>>>>>>>> -    WARN_ON_ONCE(!bo->pin_count);
>>>>>>>>>>>>>           WARN_ON_ONCE(!kref_read(&bo->kref));
>>>>>>>>>>>>> -    --bo->pin_count;
>>>>>>>>>>>>> +    if (bo->pin_count)
>>>>>>>>>>>>> +        --bo->pin_count;
>>>>>>>>>>>>> +    else
>>>>>>>>>>>>> +        WARN_ON_ONCE(true);
>>>>>>>>>>>>>       }
>>>>>>>>>>>>>         int ttm_mem_evict_first(struct ttm_device *bdev,
>>>>>>>>>>>> Since I now have been staring for half a year at the code of
>>>>>>>>>>>> the driver that made pinning an art, I have a couple of
>>>>>>>>>>>> suggestions here, Could we use an atomic for pin_count,
>>>>>>>>>>>> allowing unlocked unpinning but require the lock only for
>>>>>>>>>>>> pin_count transition 0->1, (but unlocked for
>>>>>>>>>>>> pin_if_already_pinned). In particular I think vmwgfx would
>>>>>>>>>>>> benefit from unlocked unpins. Also if the atomic were a
>>>>>>>>>>>> refcount_t, that would probably give you the above
>>>>>>>>>>>> behaviour?
>>>>>>>>>>> Nope, I've considered this as well while moving the pin count into TTM.
>>>>>>>>>>>
>>>>>>>>>>> The problem is that you not only need the BO reserved for 0->1
>>>>>>>>>>> transitions, but also for 1->0 transitions to move the BO on the
>>>>>>>>>>> LRU correctly.
>>>>>>>>>> Ah, and there is no way to have us know the correct LRU list without
>>>>>>>>>> reservation?
>>>>>>>>> Not really, there is always the chance that CPU A is reducing the count
>>>>>>>>> from 1->0 while CPU B is doing 0->1 and you end up with a LRU status
>>>>>>>>> which doesn't match the pin count.
>>>>>>>>>
>>>>>>>>> We could try to do something like a loop updating the LRU status until
>>>>>>>>> it matches the pin count, but the implications of that are usually
>>>>>>>>> really nasty.
>>>>>>>>>
>>>>>>>> OK, yeah I was more thinking along the lines of protecting the LRU status
>>>>>>>> with the global lru lock and unpin would then be
>>>>>>>>
>>>>>>>> if (refcount_dec_and_lock(&bo->pin_count, &ttm_glob.lru_lock)) {
>>>>>>>>         add_to_relevant_lrus(bo, bo->lru_status);
>>>>>>>>         spin_unlock(&ttm_glob.lru_lock);
>>>>>>>> }
>>>>>>>>
>>>>>>>> But looking at ttm_bo_move_to_lru_tail() I realize that's not really trivial
>>>>>>>> anymore. I hope it's doable at some point though.
>>>>>>>>
>>>>>>>> But meanwhile, perhaps TTM needs to accept and pave over that drivers are in
>>>>>>>> fact destroying pinned buffers?
>>>>>>> Do we have more trouble than the very fancy tricks vmwgfx does? If so I
>>>>>>> think we could do a small helper that like ttm_dont_check_unpin() to shut
>>>>>>> it up. Since vmwgfx drivers tend to not be loaded with any other drivers
>>>>>>> that shouldn't reduce any test coverage. And allows vmwgfx folks to figure
>>>>>>> out what to do - maybe you do need your own in-house pin/unpin for these
>>>>>>> special bo?
>>>>>>>
>>>>>>> I did try to parse your reply in the other thread, and tbh I didn't grock
>>>>>>> it.
>>>>>> Not sure if you mean the description was unclear or if you thought it was a
>>>>>> bad idea, but in case the former, what I mean is
>>>>> My unclarity is on what you explained in the vmwgfx thread about how
>>>>> vmwgfx uses its pin/unpin and why. That was full of vmwgfx concepts I
>>>>> don't know about. This here looks reasonably clear to me, but it does have
>>>>> the race Christian sees I think.
>>>> Hmm, OK, I thought transition 0->1 under the LRU lock would have
>>>> resolved that...
>>> Hm now you make me question how this all works again, and when it
>>> stops working. I do think yours works, but it triggers another icky
>>> thing I have with refcounts and locking. And at this point pin_count
>>> is more a refcount really ...
>>>
>>> So imo the way to do a weak refcount is kref_get_unless_zero, because
>>> that way you do not have to hold the lock on the final unref, but
>>> instead you can remove all weakly referenced entries (lookup caches,
>>> lru, whatever) simply in your cleanup code (and there still grab the
>>> right lock ofc), and the refcount_dec isn't special.
>>>
>>> The other approach is the refcount_dec_and_lock, and this works, but
>>> only for a single lock. So as soon as you have multiple weak
>>> references, or possible weak references, then it falls apart, because
>>> you can't take all the locks for all of them for the final decrement.
>>>
>>> Now ofc our pin_count here isn't a weak reference, but I think it's
>>> easier to analyze with that concept in mind. But conclusion is the
>>> same, as soon as we need multiple lru locks for the final unpin, we're
>>> toast. And Christian wants to move to a per-memory region lru lock, so
>>> that would just dig us in in another corner.
>> Well as long as there is only one lock at the time protecting the bo LRU
>> state, that's doable (a lock pointer becomes part of the bo LRU state
>> and can only be updated with the lock it currently points to held,
>> similar to locking i915 engines), but then we bump the level of
>> complexity even higher.
> Yeah, but how do we then protect the pointer to the spinlock? So you
> need to throw in some lockless/retry magic for that too. It gets real
> fancy real fast I think :-/

Yes, the pointer to the spinlock is, being part of the LRU state, 
protected by the lock it points to. So if you want to update the 
pointer, you need (with the old lock held) do smp_store_release() to the 
pointer and then consider the bo LRU state unlocked, (unless you already 
hold the new lock). Similarly a locker would grab the lock the pointer 
points to, and then verify that the pointer still points to the same 
lock. If not, release and retry. But as you say, to make that 
understandable, the amount of comments would probably have to outweigh 
the amount of code :)

>
> That's "pointer to the lock you need isn't protected" has killed a lot
> of really cool locking designs in the past. E.g the trouble with the
> amdgpu dma_resv object sharing for truly private objects is that you
> can never unshare that lock. Because you simply have no control over
> what all the other drivers in the system are doing, and with dynamic
> dma-buf we've definitely reached the point where we've leaked that
> detail all across drivers.
Yeah, i915 DG1 code does similar lock sharing but to a less extent with 
similar problems.
>
>>> So that does work now that I look closer, but it kinda worries me for
>>> other reasons. And handling weak references with the refcount_and_lock
>>> has cause a lot of pain in the past (e.g. just moving the final gem_bo
>>> unreference from under drm_device->struct_mutex was a multi-year
>>> effort full of pain). drm_prime import/export caches also had similar
>>> problems, and the same applies to some of the things we do with kms
>>> objects. It's become a bit an anti-pattern for handling refcount_t and
>>> locks protecting weak references for me at least.
>> Yeah the difference compared to weak references and kref_get_unless_zero
>> is that we now need to handle and never race with the pincount
>> transition 0->1 which can never happen in the weak reference case. So we
>> need locking around those transitions.
> Ah yes good insight, but also means we can't use that for a pin_count.

You mean weak referencing? No we can't.

>
>>>>>> static void ttm_bo_pin(struct ttm_buffer_object *bo)
>>>>>> {
>>>>>>
>>>>>> dma_resv_assert_held()                        // No surprises if an evictor
>>>>>> determined that this object is not pinned.
>>>>>>
>>>>>>        if (!refcount_inc_not_zero(&bo->pin_count)) { // Could be made
>>>>>> ttm_bo_pin_if_pinned() and exported if there are users
>>>>>>            spin_lock(&ttm_glob.lru_lock);            // Don't race with unpin
>>>>>> 1->0
>>>>>>            if (refcount_read(&bo->pin_count) == 0 && bo->lru)
>>>>>>                ttm_bo_del_from_lru(bo);
>>>>>>            refcount_inc(&bo->pin_count);
>>>>>>            spin_unlock(&ttm_glob.lru_lock);
>>>>>>        }
>>>>>> }
>>>>>>
>>>>>> static void ttm_bo_unpin(struct ttm_buffer_object *bo)
>>>>>> {
>>>>>>        if (refcount_dec_and_lock(&bo->pin_count, &ttm_glob.lru_lock)) {
>>>>>>            ttm_bo_move_to_lru_tail(bo, bo->lru_mem_type, bo->lru_prio,
>>>>>>                        NULL);
>>>>>>            spin_unlock(&ttm_glob.lru_lock);
>>>>>>        }
>>>>>> }
>>>>>>
>>>>>> In addition, bo->lru_mem_type and bo->lru_prio would need to be protected by
>>>>>> the lru lock, and updated together with the LRU list position, which would
>>>>>> be the extra complexity in fastpaths. Wouldn't that resolve the pin - lru
>>>>>> inconsistency?
>>>>>>
>>>>>> But yeah if vmwgfx is the only driver hitting trouble because of this, then
>>>>>> I agree let's leave it for when / if it becomes needed. Having had that
>>>>>> requirement in the Intel driver would have complicated the dma_resv work
>>>>>> quite a lot.
>>>>> Yeah I know i915 does a lot of unpin in interesting places, and that's
>>>>> part of why I'm worried. I've seen some bugfixes fly by that dropped
>>>>> dma_resv_lock around unpin to fix really scary looking stalls
>>>>> (rcu_synchronize vs other stuff and lockdep did not catch it). And once I
>>>>> see that kind of stuff I'm heavily leaning towards simpler locking we can
>>>>> grasp, just to be able to stay on top of the bugs. Because the bugfix did
>>>>> not come with any clear explanation for why not taking dma_resv_lock
>>>>> actually helps, or any other clear and in-depth locking analysis.
>>>>>
>>>>> I think we also still have some temporary pin/unpin flying about, but
>>>>> maybe those are all gone now.
>>>> No, unfortunately they aren't and yes, I now see your point. And yes, if
>>>> we can reduce the pinning to only display kinds of things this wouldn't
>>>> really matter. But if not, I'm afraid the alternative is that drivers
>>>> will have to start resorting to assuming perhaps incorrectly that an
>>>> object is isolated (only a single user), and that therefore tryreserve
>>>> for unpinning will always succeed, which is fragile at best. (This is
>>>> what vmwgfx does for the pagetables). And pinning pagetables in that
>>>> case makes sense since it reduces the number of buffers to validate and
>>>> fence by 50%...
>>> Yeah WARN_ON(tryreserve) is also a very fragile approach.
>>>
>>> Now for pagetables I think amdgpu has a very nice approach.
>>> Essentially any pagetable object you assign to the same dma_resv
>>> object that you have on your vm structure. That way you dma_resv_lock
>>> your vm, and all your vm related buffer objects (all the pagetables
>>> really) are locked and can be reserved. Also, activity tracking (if
>>> you don't have tlb flushes through mmio, only through in-line command
>>> submission) with dma_fences also becomes O(1).
>>>
>>> Now even that isn't good enough since you still have to
>>> O(#pagetable_objects) reserve your buffers. But again amdgpu has a
>>> trick for that: For these private objects they're essentially treated
>>> as always bound for that vm (because really, they are), so you
>>> optimize for the unbound case: When one of these vm objects gets
>>> evicted we put them on the vm->evicted_bo list. Since we know these
>>> objects are private we now the single list entry per bo isn't a
>>> problem, these objects are never used by any other vm.
>>>
>>> The final trick is the lru, and that's solved with the bulk move
>>> tricks. End result: No need to pin pagetables while still getting O(1)
>>> execbuf for them. It's really neat.
>>>
>>> On top of that amdgpu also has an allocation flag where userspace
>>> promises to never share a buffer beyond that single VM it's allocated
>>> for. That allows them to also handle all the various state buffers and
>>> anything else that's guaranteed to be unshared with the same trick.
>>> Which for many buffers essentially gives you O(1) execbuf for what
>>> we're trying to do with our VM_BIND. All while still making it look to
>>> the driver (and ttm) at large that the buffer is fully managed with
>>> dma_resv and fences, running around on the lru like anything else. And
>>> so integrates into the eviction code with zero special cases.
>>>
>>> Last time I looked our own VM_BIND isn't quite there yet with that
>>> much finesse in the implementation.
>>>
>>> But also, I'm kinda diverging a bit from the topic :-)
>>>
>>> Back to i915: I do think we need to give all our unpin a serious
>>> review. A lot of them are motivated by lock dropping tricks to come
>>> from the good old days of everything is protected by
>>> dev->struct_mutex. So essentially we've hand-rolled a fairly bad
>>> semaphore with our pin_count (you can't even wait on all the temp pins
>>> to drop out!). And imo part of the longer term switch to dma_resv
>>> should be to check which of these lock dropping tricks we really still
>>> need, and which we can sunset.
>> Agreed.
>> And if the reserve at unpin really becomes a sever pain, I think we at
>> least have a way out.
> Yeah I really do like these discussions, I always learn new stuff and
> usually we get a slightly better understanding of what the tradeoffs
> in complexity and the fallback options are.
> -Daniel
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-03-16 18:58 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-12  9:38 [PATCH] drm/ttm: make ttm_bo_unpin more defensive Christian König
2021-03-12 10:32 ` Matthew Auld
2021-03-13 18:29 ` Thomas Hellström (Intel)
2021-03-15  7:48   ` Thomas Zimmermann
2021-03-15 17:10     ` Thomas Hellström (Intel)
2021-03-15 10:26   ` Christian König
2021-03-15 17:08     ` Thomas Hellström (Intel)
2021-03-15 18:47       ` Christian König
2021-03-15 19:00         ` Thomas Hellström (Intel)
2021-03-16  9:27           ` Daniel Vetter
2021-03-16 10:38             ` Thomas Hellström (Intel)
2021-03-16 11:06               ` Daniel Vetter
2021-03-16 11:23                 ` Thomas Hellström (Intel)
2021-03-16 14:07                   ` Daniel Vetter
2021-03-16 18:18                     ` Thomas Hellström (Intel)
2021-03-16 18:28                       ` Daniel Vetter
2021-03-16 18:58                         ` Thomas Hellström (Intel)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.