Re: [PATCH] dma-buf: Fix missing excl fence waiting

From: "Christian König" <christian.koenig@amd.com>
To: "Pan, Xinhui" <Xinhui.Pan@amd.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linaro-mm-sig@lists.linaro.org" <linaro-mm-sig@lists.linaro.org>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"linux-media@vger.kernel.org" <linux-media@vger.kernel.org>,
	"sumit.semwal@linaro.org" <sumit.semwal@linaro.org>
Subject: Re: [PATCH] dma-buf: Fix missing excl fence waiting
Date: Tue, 25 Feb 2020 20:11:51 +0100	[thread overview]
Message-ID: <484ce316-55f2-b85e-a1e4-730db94f3fe3@amd.com> (raw)
In-Reply-To: <20200225172355.GO2363188@phenom.ffwll.local>

Am 25.02.20 um 18:23 schrieb Daniel Vetter:
> On Sun, Feb 23, 2020 at 01:04:15PM +0100, Christian König wrote:
>> Am 23.02.20 um 12:56 schrieb Pan, Xinhui:
>>> If shared fence list is not empty, even we want to test all fences, excl fence is ignored.
>>> That is abviously wrong, so fix it.
>> Yeah that is a known issue and I completely agree with you, but other
>> disagree.
>>
>> See the shared fences are meant to depend on the exclusive fence. So all
>> shared fences must finish only after the exclusive one has finished as well.
>>
>> The problem now is that for error handling this isn't necessary true. In
>> other words when a shared fence completes with an error it is perfectly
>> possible that he does this before the exclusive fence is finished.
>>
>> I'm trying to convince Daniel that this is a problem for years :)
> I thought the consensus is that reasonable gpu schedulers and gpu reset
> code should try to make really, really sure it only completes stuff in
> sequence? That's at least my take away from the syncobj timeline
> discussion, where you convinced me we shouldn't just crash&burn.
>
> I think as long as your scheduler is competent and your gpu reset tries to
> limit damage (i.e. kill offending ctx terminally, mark everything else
> that didn't complete for re-running) we should end up with everything
> completing in sequence. I guess if you do kill a lot more stuff, then
> you'd have to push these through your scheduler as dummy jobs, i.e. they
> still wait for their dependencies, but then all they do is set the
> dma_fence error and complete it. Maybe something the common scheduler
> could do.

Yes, that's exactly how we currently implement it. But I still think 
that this is not necessary the best approach :)

Anyway Xinhui's problem turned out to be deeper. We somehow add an old 
stale fence to the dma_resv object sometimes and that can result in 
quite a bunch of problems.

I'm currently trying to hunt down what's going wrong here in more detail.

Regards,
Christian.

> -Daniel
>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>> ---
>>>    drivers/dma-buf/dma-resv.c | 9 +++++----
>>>    1 file changed, 5 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
>>> index 4264e64788c4..44dc64c547c6 100644
>>> --- a/drivers/dma-buf/dma-resv.c
>>> +++ b/drivers/dma-buf/dma-resv.c
>>> @@ -632,14 +632,14 @@ static inline int dma_resv_test_signaled_single(struct dma_fence *passed_fence)
>>>     */
>>>    bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all)
>>>    {
>>> -	unsigned seq, shared_count;
>>> +	unsigned int seq, shared_count, left;
>>>    	int ret;
>>>    	rcu_read_lock();
>>>    retry:
>>>    	ret = true;
>>>    	shared_count = 0;
>>> -	seq = read_seqcount_begin(&obj->seq);
>>> +	left = seq = read_seqcount_begin(&obj->seq);
>>>    	if (test_all) {
>>>    		unsigned i;
>>> @@ -647,7 +647,7 @@ bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all)
>>>    		struct dma_resv_list *fobj = rcu_dereference(obj->fence);
>>>    		if (fobj)
>>> -			shared_count = fobj->shared_count;
>>> +			left = shared_count = fobj->shared_count;
>>>    		for (i = 0; i < shared_count; ++i) {
>>>    			struct dma_fence *fence = rcu_dereference(fobj->shared[i]);
>>> @@ -657,13 +657,14 @@ bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all)
>>>    				goto retry;
>>>    			else if (!ret)
>>>    				break;
>>> +			left--;
>>>    		}
>>>    		if (read_seqcount_retry(&obj->seq, seq))
>>>    			goto retry;
>>>    	}
>>> -	if (!shared_count) {
>>> +	if (!left) {
>>>    		struct dma_fence *fence_excl = rcu_dereference(obj->fence_excl);
>>>    		if (fence_excl) {