All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Christian König" <ckoenig.leichtzumerken-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: "Liu, Monk" <Monk.Liu-5C7GfCeVMHo@public.gmane.org>,
	"Koenig,
	Christian" <Christian.Koenig-5C7GfCeVMHo@public.gmane.org>,
	"amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
	<amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
Subject: Re: [PATCH 10/12] drm/amdgpu/sriov:implement guilty ctx for loose reset
Date: Mon, 9 Oct 2017 11:24:12 +0200	[thread overview]
Message-ID: <35d8a631-e7fb-0746-6890-c6f8fbdf7ae8@gmail.com> (raw)
In-Reply-To: <BLUPR12MB04498EE183C86C2B93DDA85484740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>

> Can you illustrate which points breaks MESA ?
It doesn't break Mesa, but there is not handling in Mesa for -ENODEV.

So I will block adding any kernel functionality which returns -ENODEV 
before Mesa gets a proper handling for this.

We need to implement feature in Mesa first, it is our primary user space 
client. Without having handling there we can't submit anything upstream.

Regards,
Christian.

Am 09.10.2017 um 11:14 schrieb Liu, Monk:
> I can assure you this loose mode is 100% fit with current MESA,
>
> Can you illustrate which points breaks MESA ?
>
> You can see the whole logic only interested in the guilty ctx, and only the guilty ctx would receive the -ENODEV error
>
> All innocent/regular running MESA client like X server and compositor eve didn't aware of a gpu reset at all, they just keep running
>
>
> BR  Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2017年10月9日 17:04
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 10/12] drm/amdgpu/sriov:implement guilty ctx for loose reset
>
> Well I'm not saying that the app needs to repeatedly call amdgpu_ctx_query, but rather that we need a complete concept.
>
> See, the upstream kernel driver is made for Mesa and not the closed source driver stack.
>
> I can't 100% judge if this approach wouldn't work with Mesa because we haven't implemented it there, but it strongly looks like that stuff won't work.
>
> So I need a solution which works with Mesa and the open source components before we can push it upstream.
>
> Regards,
> Christian.
>
> Am 09.10.2017 um 10:39 schrieb Liu, Monk:
>> How APP/UMD aware that a context is guilty or triggered too much loops of hang ??
>>
>> Why APP/UMD voluntarily call amdgpu_ctx_query() to check whether gpu reset occurred or not ?
>>
>> Please be aware that for another CSP customer this "loose mode" is
>> 100% welcome and wanted by they, and more important
>>
>> This approach won't cross X server at all, only the guilty
>> process/context is rejected upon its submitting
>>
>>
>> I don't agree that we should rely on ctx_query(), no one is
>> responsible to call it from time to time
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: 2017年10月9日 16:28
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 10/12] drm/amdgpu/sriov:implement guilty ctx for
>> loose reset
>>
>> Am 30.09.2017 um 08:03 schrieb Monk Liu:
>>> Change-Id: I7904f362aa0f578a5cbf5d40c7a242c2c6680a92
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> NAK, if a context is guilty of a GPU reset should be determined in
>> amdgpu_ctx_query() by looking at the fences in the ring buffer.
>>
>> Not when the GPU reset itself occurs.
>>
>> Regards,
>> Christian.
>>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  1 +
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 16 +++++++++-------
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       |  1 +
>>>     drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 22 ++++++++++++++++++++++
>>>     drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  1 +
>>>     5 files changed, 34 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index b40d4ba..b63e602 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -737,6 +737,7 @@ struct amdgpu_ctx {
>>>     	struct dma_fence	**fences;
>>>     	struct amdgpu_ctx_ring	rings[AMDGPU_MAX_RINGS];
>>>     	bool preamble_presented;
>>> +	bool guilty;
>>>     };
>>>     
>>>     struct amdgpu_ctx_mgr {
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> index 6a1515e..f92962e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> @@ -79,16 +79,19 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>>     	if (cs->in.num_chunks == 0)
>>>     		return 0;
>>>     
>>> +	p->ctx = amdgpu_ctx_get(fpriv, cs->in.ctx_id);
>>> +	if (!p->ctx)
>>> +		return -EINVAL;
>>> +
>>> +	if (amdgpu_sriov_vf(p->adev) &&
>>> +		amdgpu_sriov_reset_level == 0 &&
>>> +		p->ctx->guilty)
>>> +		return -ENODEV;
>>> +
>>>     	chunk_array = kmalloc_array(cs->in.num_chunks, sizeof(uint64_t), GFP_KERNEL);
>>>     	if (!chunk_array)
>>>     		return -ENOMEM;
>>>     
>>> -	p->ctx = amdgpu_ctx_get(fpriv, cs->in.ctx_id);
>>> -	if (!p->ctx) {
>>> -		ret = -EINVAL;
>>> -		goto free_chunk;
>>> -	}
>>> -
>>>     	/* get chunks */
>>>     	chunk_array_user = u64_to_user_ptr(cs->in.chunks);
>>>     	if (copy_from_user(chunk_array, chunk_array_user, @@ -184,7
>>> +187,6 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>>     	p->nchunks = 0;
>>>     put_ctx:
>>>     	amdgpu_ctx_put(p->ctx);
>>> -free_chunk:
>>>     	kfree(chunk_array);
>>>     
>>>     	return ret;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> index 75c933b..028e9f1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> @@ -60,6 +60,7 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>     					  rq, amdgpu_sched_jobs);
>>>     		if (r)
>>>     			goto failed;
>>> +		ctx->rings[i].entity.guilty = &ctx->guilty;
>>>     	}
>>>     
>>>     	r = amdgpu_queue_mgr_init(adev, &ctx->queue_mgr); diff --git
>>> a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> index 12c3092..89b0573 100644
>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> @@ -493,10 +493,32 @@ void amd_sched_set_queue_hang(struct amd_gpu_scheduler *sched)
>>>     void amd_sched_job_kickout(struct amd_sched_job *s_job)
>>>     {
>>>     	struct amd_gpu_scheduler *sched = s_job->sched;
>>> +	struct amd_sched_entity *entity, *tmp;
>>> +	struct amd_sched_rq *rq;
>>> +	int i;
>>> +	bool found;
>>>     
>>>     	spin_lock(&sched->job_list_lock);
>>>     	list_del_init(&s_job->node);
>>>     	spin_unlock(&sched->job_list_lock);
>>> +
>>> +	dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
>>> +
>>> +	for (i = AMD_SCHED_PRIORITY_MIN; i < AMD_SCHED_PRIORITY_KERNEL; i++) {
>>> +		rq = &sched->sched_rq[i];
>>> +
>>> +		spin_lock(&rq->lock);
>>> +		list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>> +			if (s_job->s_entity == entity && entity->guilty) {
>>> +				*entity->guilty = true;
>>> +				found = true;
>>> +				break;
>>> +			}
>>> +		}
>>> +		spin_unlock(&rq->lock);
>>> +		if (found)
>>> +			break;
>>> +	}
>>>     }
>>>     
>>>     void amd_sched_job_recovery(struct amd_gpu_scheduler *sched) diff
>>> --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>> index f0242aa..16c2244 100644
>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>>>     
>>>     	struct dma_fence		*dependency;
>>>     	struct dma_fence_cb		cb;
>>> +	bool *guilty; /* this points to ctx's guilty */
>>>     };
>>>     
>>>     /**
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

  parent reply	other threads:[~2017-10-09  9:24 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-09-30  6:03 [PATCH 00/12] *** SRIOV GPU RESET PATCHES *** Monk Liu
     [not found] ` <1506751432-21789-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-09-30  6:03   ` [PATCH 01/12] drm/amdgpu/sriov:now must reinit psp Monk Liu
2017-09-30  6:03   ` [PATCH 02/12] drm/amdgpu/sriov:fix memory leak in psp_load_fw Monk Liu
2017-09-30  6:03   ` [PATCH 03/12] drm/amdgpu/sriov:use atomic type for sriov_reset Monk Liu
2017-09-30  6:03   ` [PATCH 04/12] drm/amdgpu/sriov:cleanup gpu rest mlock Monk Liu
2017-09-30  6:03   ` [PATCH 05/12] drm/amdgpu/sriov:accurate description for sriov_gpu_reset Monk Liu
2017-09-30  6:03   ` [PATCH 06/12] drm/amdgpu/sriov:handle more jobs hang in different ring case Monk Liu
     [not found]     ` <1506751432-21789-7-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-09  8:18       ` Christian König
2017-09-30  6:03   ` [PATCH 07/12] drm/amdgpu/sriov:implement strict gpu reset Monk Liu
     [not found]     ` <1506751432-21789-8-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-09  8:20       ` Christian König
     [not found]         ` <250ce10a-cca0-0193-b2ed-cc2f04e80d0c-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-09  8:30           ` Liu, Monk
2017-10-09 10:58       ` Nicolai Hähnle
2017-09-30  6:03   ` [PATCH 08/12] drm/amdgpu:explicitly call fence_process Monk Liu
     [not found]     ` <1506751432-21789-9-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-09  8:23       ` Christian König
     [not found]         ` <5cb1ae43-ec3a-2b0b-b78b-91cefd575672-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-09  8:32           ` Liu, Monk
     [not found]             ` <BLUPR12MB04491DDBC8ACFE2FB43D0F0084740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-09  8:40               ` Christian König
     [not found]                 ` <62bb9496-b29f-0230-8fa4-0bad470c12c8-5C7GfCeVMHo@public.gmane.org>
2017-10-09  8:51                   ` Liu, Monk
     [not found]                     ` <BLUPR12MB0449E49C10230F350B9BD3B284740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-09  8:52                       ` Liu, Monk
     [not found]                         ` <BLUPR12MB04495DD27084790E5B219D7384740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-09  8:58                           ` Christian König
2017-09-30  6:03   ` [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted Monk Liu
     [not found]     ` <1506751432-21789-10-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-09  8:25       ` Christian König
     [not found]         ` <6e81d8b0-267a-1ea8-b228-93286fc6a954-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-09  8:35           ` Liu, Monk
     [not found]             ` <BLUPR12MB0449531313F50BE080F7746D84740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-09  8:54               ` Christian König
2017-10-09 11:01               ` Nicolai Hähnle
     [not found]                 ` <71b411c8-21a6-fe9b-ed33-7928571a88da-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-10  4:26                   ` Liu, Monk
     [not found]                     ` <BLUPR12MB04492B28DF57EACE2149562D84750-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-10  6:58                       ` Christian König
     [not found]                         ` <85c67ae9-bfe2-390a-79d0-6e5872b9be62-5C7GfCeVMHo@public.gmane.org>
2017-10-10  7:12                           ` Liu, Monk
     [not found]                             ` <BLUPR12MB04497B5442F66C969861742584750-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-10  7:25                               ` Christian König
     [not found]                                 ` <f06b80fa-fc96-a93c-59b7-2460dba95e94-5C7GfCeVMHo@public.gmane.org>
2017-10-10  8:21                                   ` Liu, Monk
     [not found]                                     ` <BLUPR12MB0449B68E81C778A9D07FB38584750-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-10  8:57                                       ` Nicolai Hähnle
2017-10-10  7:19                           ` Liu, Monk
2017-10-10  7:47                           ` Michel Dänzer
     [not found]                             ` <0c91bb14-a874-9ee6-8756-2a31eb41d5b2-otUistvHUpPR7s880joybQ@public.gmane.org>
2017-10-10  7:57                               ` Christian König
     [not found]                                 ` <36f5b680-c881-3b4f-0784-3cd624064004-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-10  8:17                                   ` Michel Dänzer
2017-09-30  6:03   ` [PATCH 10/12] drm/amdgpu/sriov:implement guilty ctx for loose reset Monk Liu
     [not found]     ` <1506751432-21789-11-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-09  8:27       ` Christian König
     [not found]         ` <e4c96014-b4f4-e013-a966-9e2e03b9a62b-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-09  8:39           ` Liu, Monk
     [not found]             ` <BLUPR12MB0449C8E878F09AE59BA816E284740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-09  9:03               ` Christian König
     [not found]                 ` <d249cc75-29e3-713f-fc5a-2f26f555500b-5C7GfCeVMHo@public.gmane.org>
2017-10-09  9:14                   ` Liu, Monk
     [not found]                     ` <BLUPR12MB04498EE183C86C2B93DDA85484740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-09  9:24                       ` Christian König [this message]
2017-09-30  6:03   ` [PATCH 11/12] drm/amdgpu/sriov:show error if ib test failed Monk Liu
     [not found]     ` <1506751432-21789-12-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-09  8:29       ` Christian König
2017-09-30  6:03   ` [PATCH 12/12] drm/amdgpu/sriov:no shadow buffer recovery Monk Liu
     [not found]     ` <1506751432-21789-13-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-01  9:32       ` Christian König
2017-10-01  9:36       ` Christian König
     [not found]         ` <e767c6f2-4050-c697-2075-c3d744e6b379-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-04  9:41           ` Liu, Monk
     [not found]             ` <BLUPR12MB0449346A746E70A7BE88FEA084730-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-04 10:56               ` Christian König
     [not found]                 ` <9b08e030-1a47-39ef-8010-64c51d4560e8-5C7GfCeVMHo@public.gmane.org>
2017-10-09  4:12                   ` Liu, Monk
2017-10-01  9:31   ` [PATCH 00/12] *** SRIOV GPU RESET PATCHES *** Christian König

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=35d8a631-e7fb-0746-6890-c6f8fbdf7ae8@gmail.com \
    --to=ckoenig.leichtzumerken-re5jqeeqqe8avxtiumwx3w@public.gmane.org \
    --cc=Christian.Koenig-5C7GfCeVMHo@public.gmane.org \
    --cc=Monk.Liu-5C7GfCeVMHo@public.gmane.org \
    --cc=amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.