From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: Jingwen Chen <Jingwen.Chen2@amd.com>, amd-gfx@lists.freedesktop.org
Cc: ckoenig.leichtzumerken@gmail.com, horace.chen@amd.com,
	"jingwen.chen2@amd.com Jack Zhang" <Jack.Zhang1@amd.com>,
	monk.liu@amd.com
Subject: Re: [PATCH 2/2] drm: add tdr support for embedded hw_fence
Date: Thu, 22 Jul 2021 10:45:40 -0400
Message-ID: <0699ff85-4aec-0e33-711b-665307fbbc24@amd.com>
In-Reply-To: <20210722104525.okcnb43goxg6uqej@wayne-build>


On 2021-07-22 6:45 a.m., Jingwen Chen wrote:
> On Wed Jul 21, 2021 at 12:53:51PM -0400, Andrey Grodzovsky wrote:
>> On 2021-07-20 11:13 p.m., Jingwen Chen wrote:
>>> [Why]
>>> After embedding the hw_fence into amdgpu_job, we need to add TDR
>>> support for this feature.
>>>
>>> [How]
>>> 1. Add a resubmit_flag for resubmitted jobs.
>>> 2. Clear job fences from the RCU store and force-complete vm flush
>>>      fences in pre_asic_reset.
>>> 3. Skip dma_fence_get for resubmitted jobs and add a dma_fence_put
>>>      for guilty jobs.
>>>
>>> Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com>
>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++++++++++-
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 16 +++++++++++-----
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 +++-
>>>    drivers/gpu/drm/scheduler/sched_main.c     |  1 +
>>>    include/drm/gpu_scheduler.h                |  1 +
>>>    5 files changed, 27 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 40461547701a..fe0237f72a09 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4382,7 +4382,7 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev)
>>>    int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>>    				 struct amdgpu_reset_context *reset_context)
>>>    {
>>> -	int i, r = 0;
>>> +	int i, j, r = 0;
>>>    	struct amdgpu_job *job = NULL;
>>>    	bool need_full_reset =
>>>    		test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
>>> @@ -4406,6 +4406,16 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>>    		if (!ring || !ring->sched.thread)
>>>    			continue;
>>> +		/* clear job fences from fence drv to avoid force_completion
>>> +		 * leaving NULL and vm flush fences in fence drv */
>>> +		for (j = 0; j <= ring->fence_drv.num_fences_mask; j++) {
>>> +			struct dma_fence *old, **ptr;
>>> +			ptr = &ring->fence_drv.fences[j];
>>> +			old = rcu_dereference_protected(*ptr, 1);
>>> +			if (old && test_bit(DMA_FENCE_FLAG_USER_BITS, &old->flags)) {
>>> +				RCU_INIT_POINTER(*ptr, NULL);
>>> +			}
>>
>> Is this to avoid a premature job free because of the dma_fence_put
>> inside amdgpu_fence_process?
>> I can't currently remember why, but we probably want all the HW fences
>> currently in the ring to be force-signaled - maybe better to test for
>> DMA_FENCE_FLAG_USER_BITS inside amdgpu_fence_process and still do the
>> signaling, but skip the dma_fence_put part.
>>
>> Andrey
> Hi Andrey,
>
> This is to avoid signaling the same fence twice. If we still do the
> signaling, the job fence in the pending list will be signaled first in
> force_completion and then signaled again on resubmit. This hits the
> BUG() in amdgpu_fence_process.
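
For context, the signaling loop in amdgpu_fence_process() looks roughly
like this (paraphrased, not the exact driver code); dma_fence_signal()
returns -EINVAL for an already-signaled fence, which is what trips the
BUG():

	do {
		struct dma_fence *fence, **ptr;

		++last_seq;
		last_seq &= drv->num_fences_mask;
		ptr = &drv->fences[last_seq];

		/* take the fence out of its slot and signal it once */
		fence = rcu_dereference_protected(*ptr, 1);
		RCU_INIT_POINTER(*ptr, NULL);

		if (!fence)
			continue;

		r = dma_fence_signal(fence);
		if (r)
			BUG();	/* fence was already signaled */

		dma_fence_put(fence);
	} while (last_seq != seq);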


Oh, I see. How about just adding a 'skip' flag to amdgpu_ring, setting
it before calling amdgpu_fence_driver_force_completion and resetting it
after? Then inside amdgpu_fence_driver_force_completion you can skip
the signaling part with this flag for fences that have
DMA_FENCE_FLAG_USER_BITS set.
Fewer lines of code, at least.
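
Something along these lines (an untested sketch - the flag name here is
made up):

	/* in amdgpu_device_pre_asic_reset(), around the forced completion */
	ring->skip_job_fence_signal = true;	/* hypothetical flag */
	amdgpu_fence_driver_force_completion(ring);
	ring->skip_job_fence_signal = false;

	/* and inside the amdgpu_fence_process() signaling loop */
	if (ring->skip_job_fence_signal &&
	    test_bit(DMA_FENCE_FLAG_USER_BITS, &fence->flags))
		continue;	/* leave job-embedded fences for resubmission;
				 * their refcounting would need care here */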

Andrey


>
>>> +		}
>>>    		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
>>>    		amdgpu_fence_driver_force_completion(ring);
>>>    	}
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index eecf21d8ec33..815776c9a013 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -156,11 +156,17 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct amd
>>>    		job->ring = ring;
>>>    	}
>>> -	seq = ++ring->fence_drv.sync_seq;
>>> -	dma_fence_init(fence, &amdgpu_fence_ops,
>>> -		       &ring->fence_drv.lock,
>>> -		       adev->fence_context + ring->idx,
>>> -		       seq);
>>> +	if (job != NULL && job->base.resubmit_flag == 1) {
>>> +		/* reinit seq for resubmitted jobs */
>>> +		seq = ++ring->fence_drv.sync_seq;
>>> +		fence->seqno = seq;
>>> +	} else {
>>> +		seq = ++ring->fence_drv.sync_seq;
>>
>> Seems like you could do the above line only once, above the if-else,
>> as it was before.
> Sure, I will modify this.
>
>
> Best Regards,
> JingWen Chen
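
In other words, something like this (untested sketch) would keep a
single increment:

	seq = ++ring->fence_drv.sync_seq;
	if (job && job->base.resubmit_flag) {
		/* resubmitted job: the embedded fence is already
		 * initialized, just give it a fresh seqno */
		fence->seqno = seq;
	} else {
		dma_fence_init(fence, &amdgpu_fence_ops,
			       &ring->fence_drv.lock,
			       adev->fence_context + ring->idx,
			       seq);
	}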
>>> +		dma_fence_init(fence, &amdgpu_fence_ops,
>>> +				&ring->fence_drv.lock,
>>> +				adev->fence_context + ring->idx,
>>> +				seq);
>>> +	}
>>>    	if (job != NULL) {
>>>    		/* mark this fence has a parent job */
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index 7c426e225b24..d6f848adc3f4 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -241,6 +241,7 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>>>    		dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if VRAM lost */
>>>    	if (finished->error < 0) {
>>> +		dma_fence_put(&job->hw_fence);
>>>    		DRM_INFO("Skip scheduling IBs!\n");
>>>    	} else {
>>>    		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
>>> @@ -249,7 +250,8 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>>>    			DRM_ERROR("Error scheduling IBs (%d)\n", r);
>>>    	}
>>> -	dma_fence_get(fence);
>>> +	if (!job->base.resubmit_flag)
>>> +		dma_fence_get(fence);
>>>    	amdgpu_job_free_resources(job);
>>>    	fence = r ? ERR_PTR(r) : fence;
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index f4f474944169..5a36ab5aea2d 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -544,6 +544,7 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max)
>>>    			dma_fence_set_error(&s_fence->finished, -ECANCELED);
>>>    		dma_fence_put(s_job->s_fence->parent);
>>> +		s_job->resubmit_flag = 1;
>>>    		fence = sched->ops->run_job(s_job);
>>>    		i++;
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index 4ea8606d91fe..06c101af1f71 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -198,6 +198,7 @@ struct drm_sched_job {
>>>    	enum drm_sched_priority		s_priority;
>>>    	struct drm_sched_entity         *entity;
>>>    	struct dma_fence_cb		cb;
>>> +	int resubmit_flag;
>>>    };
>>>    static inline bool drm_sched_invalidate_job(struct drm_sched_job *s_job,