* Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout
  [not found] <20210119122236.8039-1-horace.chen@amd.com>
@ 2021-01-19 14:33 ` Andrey Grodzovsky
  2021-01-19 16:39   ` Re: " Chen, Horace
  [not found] ` <20210119122236.8039-2-horace.chen@amd.com>
  1 sibling, 1 reply; 9+ messages in thread
From: Andrey Grodzovsky @ 2021-01-19 14:33 UTC (permalink / raw)
To: Horace Chen, amd-gfx
Cc: Jack Xiao, Feifei Xu, Kevin Wang, Xiaojie Yuan, Tuikov Luben,
    Deucher Alexander, Evan Quan, Christian König, Monk Liu,
    Hawking Zhang

On 1/19/21 7:22 AM, Horace Chen wrote:
> Fix a racing issue when jobs on 2 rings time out simultaneously.
>
> If 2 rings time out at the same time, amdgpu_device_gpu_recover
> will be re-entered. Then adev->gmc.xgmi.head will be grabbed by
> 2 local linked lists, which may cause a wild-pointer issue during
> iteration.
>
> Lock the device early to prevent the node from being added to 2
> different lists.
>
> Signed-off-by: Horace Chen <horace.chen@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 42 +++++++++++++++-------
>   1 file changed, 30 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 4d434803fb49..9574da3abc32 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4540,6 +4540,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>           int i, r = 0;
>           bool need_emergency_restart = false;
>           bool audio_suspended = false;
> +         bool get_dev_lock = false;
>
>           /*
>            * Special case: RAS triggered and full reset isn't supported
> @@ -4582,28 +4583,45 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>            * Build list of devices to reset.
>            * In case we are in XGMI hive mode, resort the device list
>            * to put adev in the 1st position.
> +          *
> +          * Lock the device before we try to operate on the linked list;
> +          * if we didn't get the device lock, don't touch the linked list,
> +          * since others may be iterating over it.
>            */
>           INIT_LIST_HEAD(&device_list);
>           if (adev->gmc.xgmi.num_physical_nodes > 1) {
>                   if (!hive)
>                           return -ENODEV;
> -                 if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
> -                         list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
> -                 device_list_handle = &hive->device_list;
> +
> +                 list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
> +                         get_dev_lock = amdgpu_device_lock_adev(tmp_adev, hive);
> +                         if (!get_dev_lock)
> +                                 break;

What about unlocking all the devices you already locked if the break
happens in the middle of the iteration? Note that at skip_recovery: we
don't do it. BTW, I see this issue is already in the current code.

Also, maybe now it's better to separate the actual locking in
amdgpu_device_lock_adev from the other stuff going on there, since I
don't think you would want to toggle state like adev->mp1_state back
and forth, and the function name isn't descriptive of that other stuff
anyway.
Andrey

> +                 }
> +                 if (get_dev_lock) {
> +                         if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
> +                                 list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
> +                         device_list_handle = &hive->device_list;
> +                 }
>           } else {
> -                 list_add_tail(&adev->gmc.xgmi.head, &device_list);
> -                 device_list_handle = &device_list;
> +                 get_dev_lock = amdgpu_device_lock_adev(adev, hive);
> +                 tmp_adev = adev;
> +                 if (get_dev_lock) {
> +                         list_add_tail(&adev->gmc.xgmi.head, &device_list);
> +                         device_list_handle = &device_list;
> +                 }
> +         }
> +
> +         if (!get_dev_lock) {
> +                 dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
> +                           job ? job->base.id : -1);
> +                 r = 0;
> +                 /* even if we skipped this reset, we still need to set the job to guilty */
> +                 goto skip_recovery;
> +         }
>
>           /* block all schedulers and reset given job's ring */
>           list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> -                 if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
> -                         dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
> -                                   job ? job->base.id : -1);
> -                         r = 0;
> -                         goto skip_recovery;
> -                 }
> -
>                   /*
>                    * Try to put the audio codec into suspend state
>                    * before gpu reset started.
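For illustration, here is a minimal sketch of the unwind Andrey is
asking for. It assumes amdgpu_device_unlock_adev() is the matching
release helper for amdgpu_device_lock_adev(); the wrapper name
amdgpu_hive_lock_all() is invented for this example and is not part of
the patch.

#include "amdgpu.h"
#include "amdgpu_xgmi.h"

/*
 * Sketch only: lock every device in the hive, and if one of them
 * fails to lock, release the devices that were already locked
 * before reporting failure, instead of silently breaking out.
 */
static bool amdgpu_hive_lock_all(struct amdgpu_hive_info *hive)
{
        struct amdgpu_device *tmp_adev;

        list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
                if (!amdgpu_device_lock_adev(tmp_adev, hive))
                        goto unwind;
        }
        return true;

unwind:
        /*
         * Walk backwards starting from the entry before the device
         * that failed to lock; this visits exactly the devices that
         * were locked above.
         */
        list_for_each_entry_continue_reverse(tmp_adev, &hive->device_list,
                                             gmc.xgmi.head)
                amdgpu_device_unlock_adev(tmp_adev);
        return false;
}

list_for_each_entry_continue_reverse() resumes from the node where the
forward walk stopped, so the failing device itself is skipped and
nothing is left held on the skip_recovery: path.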
* Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout
  2021-01-19 14:33 ` [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout Andrey Grodzovsky
@ 2021-01-19 16:39 ` Chen, Horace
  2021-01-19 16:58   ` Andrey Grodzovsky
  0 siblings, 1 reply; 9+ messages in thread
From: Chen, Horace @ 2021-01-19 16:39 UTC (permalink / raw)
To: Grodzovsky, Andrey, amd-gfx
Cc: Xiao, Jack, Xu, Feifei, Wang, Kevin(Yang), Xiaojie Yuan,
    Tuikov, Luben, Deucher, Alexander, Quan, Evan, Koenig, Christian,
    Liu, Monk, Zhang, Hawking

Hi Andrey,

I think the list in the XGMI hive won't be broken in the middle if we
lock the device before we change the list, because if 2 devices in 1
hive enter the function, they will follow the same sequence when
locking the devices, so one of them will definitely break at the first
device. I added iterating over the devices here just to lock every
device in the hive, since we will change the device sequence in the
hive soon after.

The reason to break the iteration in the middle is that the list is
changed during the iteration without taking any lock, which is quite
bad, and I'm fixing one such issue here. For the XGMI hive there are 2
locks protecting the list: one is the device lock I changed here; the
other one, right before my change, is the hive->lock protecting the
hive.

Even if the bad thing really happened, I think moving back through the
list is also very dangerous since we don't know what the list will
finally be, unless we stack the devices we have iterated over in a
mirrored list. That could be a big change.

I'm OK to separate the locking in amdgpu_device_lock_adev here; I'll
do some tests and update the code later.

Thanks & Regards,
Horace.

________________________________
From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Sent: January 19, 2021 22:33
To: Chen, Horace <Horace.Chen@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Quan, Evan <Evan.Quan@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>;
    Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander
    <Alexander.Deucher@amd.com>; Xiao, Jack <Jack.Xiao@amd.com>;
    Zhang, Hawking <Hawking.Zhang@amd.com>; Liu, Monk <Monk.Liu@amd.com>;
    Xu, Feifei <Feifei.Xu@amd.com>; Wang, Kevin(Yang) <Kevin1.Wang@amd.com>;
    Xiaojie Yuan <xiaojie.yuan@amd.com>
Subject: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout

On 1/19/21 7:22 AM, Horace Chen wrote:
> Fix a racing issue when jobs on 2 rings time out simultaneously.
>
> If 2 rings time out at the same time, amdgpu_device_gpu_recover
> will be re-entered. Then adev->gmc.xgmi.head will be grabbed by
> 2 local linked lists, which may cause a wild-pointer issue during
> iteration.
>
> Lock the device early to prevent the node from being added to 2
> different lists.
>
> [...]
> +                 list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
> +                         get_dev_lock = amdgpu_device_lock_adev(tmp_adev, hive);
> +                         if (!get_dev_lock)
> +                                 break;

What about unlocking all the devices you already locked if the break
happens in the middle of the iteration? Note that at skip_recovery: we
don't do it. BTW, I see this issue is already in the current code.

Also, maybe now it's better to separate the actual locking in
amdgpu_device_lock_adev from the other stuff going on there, since I
don't think you would want to toggle state like adev->mp1_state back
and forth, and the function name isn't descriptive of that other stuff
anyway.

Andrey

> [...]
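As a rough sketch of the separation being agreed to here, assuming the
per-device lock is the adev->in_gpu_reset flag and that
amdgpu_device_lock_adev() currently also sets pre-reset state such as
adev->mp1_state; both split-out names below are hypothetical.

/* Hypothetical split, part 1: take only the per-device reset lock. */
static bool amdgpu_device_trylock_adev(struct amdgpu_device *adev)
{
        return atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) == 0;
}

/*
 * Hypothetical split, part 2: the stateful preparation that should
 * not be toggled back and forth, run only once every device in the
 * hive is known to be locked.  PP_MP1_STATE_RESET is one example
 * value; the real code would pick the state per reset method.
 */
static void amdgpu_device_prepare_reset_state(struct amdgpu_device *adev)
{
        adev->mp1_state = PP_MP1_STATE_RESET;
}

With such a split, amdgpu_device_gpu_recover() could first trylock
every device in the hive and only then, knowing the whole hive is
held, apply the pre-reset state once per device.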
* Re: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout
  2021-01-19 14:33 ` [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout Andrey Grodzovsky
  2021-01-19 16:39   ` Re: " Chen, Horace
@ 2021-01-19 16:58     ` Andrey Grodzovsky
  2021-01-19 17:09       ` Re: " Chen, Horace
  0 siblings, 1 reply; 9+ messages in thread
From: Andrey Grodzovsky @ 2021-01-19 16:58 UTC (permalink / raw)
To: Chen, Horace, amd-gfx
Cc: Xiao, Jack, Xu, Feifei, Wang, Kevin(Yang), Xiaojie Yuan,
    Tuikov, Luben, Deucher, Alexander, Quan, Evan, Koenig, Christian,
    Liu, Monk, Zhang, Hawking

On 1/19/21 11:39 AM, Chen, Horace wrote:
> Hi Andrey,
>
> I think the list in the XGMI hive won't be broken in the middle if we
> lock the device before we change the list, because if 2 devices in 1
> hive enter the function, they will follow the same sequence when
> locking the devices, so one of them will definitely break at the
> first device. I added iterating over the devices here just to lock
> every device in the hive, since we will change the device sequence in
> the hive soon after.

I didn't mean break in the sense of breaking the list itself, I just
meant the literal 'break' instruction that terminates the iteration
once you fail to lock a particular device.

> The reason to break the iteration in the middle is that the list is
> changed during the iteration without taking any lock, which is quite
> bad, and I'm fixing one such issue here. For the XGMI hive there are
> 2 locks protecting the list: one is the device lock I changed here;
> the other one, right before my change, is the hive->lock protecting
> the hive.
>
> Even if the bad thing really happened, I think moving back through
> the list is also very dangerous since we don't know what the list
> will finally be, unless we stack the devices we have iterated over in
> a mirrored list. That could be a big change.

Not sure we are on the same page. My concern is: say your XGMI hive
consists of 2 devices, and you manage to successfully call
amdgpu_device_lock_adev for dev1 but then fail for dev2; in this case
you will bail out without releasing dev1, no?

Andrey

> I'm OK to separate the locking in amdgpu_device_lock_adev here; I'll
> do some tests and update the code later.
>
> Thanks & Regards,
> Horace.
>
> [...]
* Re: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout
  2021-01-19 16:58 ` Andrey Grodzovsky
@ 2021-01-19 17:09 ` Chen, Horace
  2021-01-19 17:15   ` Andrey Grodzovsky
  0 siblings, 1 reply; 9+ messages in thread
From: Chen, Horace @ 2021-01-19 17:09 UTC (permalink / raw)
To: Grodzovsky, Andrey, amd-gfx
Cc: Xiao, Jack, Xu, Feifei, Wang, Kevin(Yang), Tuikov, Luben,
    Deucher, Alexander, Quan, Evan, Koenig, Christian, Liu, Monk,
    Zhang, Hawking

OK, I understand. You mean one device in the hive may be locked
independently without locking up the whole hive.

It could happen; I'll change my code.

Thanks & Regards,
Horace.

________________________________
From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Sent: January 20, 2021 0:58
To: Chen, Horace <Horace.Chen@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Quan, Evan <Evan.Quan@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>;
    Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander
    <Alexander.Deucher@amd.com>; Xiao, Jack <Jack.Xiao@amd.com>;
    Zhang, Hawking <Hawking.Zhang@amd.com>; Liu, Monk <Monk.Liu@amd.com>;
    Xu, Feifei <Feifei.Xu@amd.com>; Wang, Kevin(Yang) <Kevin1.Wang@amd.com>;
    Xiaojie Yuan <xiaojie.yuan@amd.com>
Subject: Re: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout

[...]
* Re: Re: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout
  2021-01-19 17:09 ` Re: " Chen, Horace
@ 2021-01-19 17:15 ` Andrey Grodzovsky
  2021-01-19 17:22   ` Lazar, Lijo
  0 siblings, 1 reply; 9+ messages in thread
From: Andrey Grodzovsky @ 2021-01-19 17:15 UTC (permalink / raw)
To: Chen, Horace, amd-gfx
Cc: Xiao, Jack, Xu, Feifei, Wang, Kevin(Yang), Tuikov, Luben,
    Deucher, Alexander, Quan, Evan, Koenig, Christian, Liu, Monk,
    Zhang, Hawking

Well, it shouldn't happen with the hive locked, as far as I can tell
from browsing the code, but then your code should reflect that: if you
do fail to lock a particular adev AFTER the hive is locked, you should
not silently break the iteration but throw an error, WARN_ON or BUG_ON
then. Or alternatively bail out, unlocking all the already locked
devices.

Andrey

On 1/19/21 12:09 PM, Chen, Horace wrote:
> OK, I understand. You mean one device in the hive may be locked
> independently without locking up the whole hive.
>
> It could happen; I'll change my code.
>
> Thanks & Regards,
> Horace.
>
> [...]
>> > */ >> > INIT_LIST_HEAD(&device_list); >> > if (adev->gmc.xgmi.num_physical_nodes > 1) { >> > if (!hive) >> > return -ENODEV; >> > - if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list)) >> > - list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list); >> > - device_list_handle = &hive->device_list; >> > + >> > + list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) { >> > + get_dev_lock = amdgpu_device_lock_adev(tmp_adev, hive); >> > + if (!get_dev_lock) >> > + break; >> >> >> What about unlocking back all the devices you already locked if the break >> happens in the middle of the iteration ? >> Note that at skip_recovery: we don't do it. BTW, i see this issue is already in >> the current code. >> >> Also, maybe now it's better to separate the actual locking in >> amdgpu_device_lock_adev >> from the other stuff going on there since I don't think you would wont to toggle >> stuff >> like adev->mp1_state back and forth and also the function name is not >> descriptive of >> the other stuff going on there anyway. >> >> Andrey >> >> >> > + } >> > + if (get_dev_lock) { >> > + if (!list_is_first(&adev->gmc.xgmi.head, >> &hive->device_list)) >> > + list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list); >> > + device_list_handle = &hive->device_list; >> > + } >> > } else { >> > - list_add_tail(&adev->gmc.xgmi.head, &device_list); >> > - device_list_handle = &device_list; >> > + get_dev_lock = amdgpu_device_lock_adev(adev, hive); >> > + tmp_adev = adev; >> > + if (get_dev_lock) { >> > + list_add_tail(&adev->gmc.xgmi.head, &device_list); >> > + device_list_handle = &device_list; >> > + } >> > + } >> > + >> > + if (!get_dev_lock) { >> > + dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as >> another already in progress", >> > + job ? job->base.id : -1); >> > + r = 0; >> > + /* even we skipped this reset, still need to set the job to >> guilty */ >> > + goto skip_recovery; >> > } >> > >> > /* block all schedulers and reset given job's ring */ >> > list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) { >> > - if (!amdgpu_device_lock_adev(tmp_adev, hive)) { >> > - dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another >> already in progress", >> > - job ? job->base.id : -1); >> > - r = 0; >> > - goto skip_recovery; >> > - } >> > - >> > /* >> > * Try to put the audio codec into suspend state >> > * before gpu reset started. [-- Attachment #1.2: Type: text/html, Size: 27641 bytes --] [-- Attachment #2: Type: text/plain, Size: 154 bytes --] _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: Re: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout
  2021-01-19 17:15 ` Andrey Grodzovsky
@ 2021-01-19 17:22 ` Lazar, Lijo
  0 siblings, 0 replies; 9+ messages in thread
From: Lazar, Lijo @ 2021-01-19 17:22 UTC (permalink / raw)
To: Grodzovsky, Andrey, Chen, Horace, amd-gfx
Cc: Xiao, Jack, Xu, Feifei, Wang, Kevin(Yang), Tuikov, Luben,
    Deucher, Alexander, Quan, Evan, Koenig, Christian, Liu, Monk,
    Zhang, Hawking

What about changing the lock-hive logic like:

    if (this device locked)
        return;
    lock hive -> lock this device;

In the regular flow, lock everything in the list except this device.

Thanks,
Lijo

From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Andrey Grodzovsky
Sent: Tuesday, January 19, 2021 10:45 PM
To: Chen, Horace <Horace.Chen@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Xiao, Jack <Jack.Xiao@amd.com>; Xu, Feifei <Feifei.Xu@amd.com>;
    Wang, Kevin(Yang) <Kevin1.Wang@amd.com>; Tuikov, Luben
    <Luben.Tuikov@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>;
    Quan, Evan <Evan.Quan@amd.com>; Koenig, Christian
    <Christian.Koenig@amd.com>; Liu, Monk <Monk.Liu@amd.com>;
    Zhang, Hawking <Hawking.Zhang@amd.com>
Subject: Re: Re: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout

Well, it shouldn't happen with the hive locked, as far as I can tell
from browsing the code, but then your code should reflect that: if you
do fail to lock a particular adev AFTER the hive is locked, you should
not silently break the iteration but throw an error, WARN_ON or BUG_ON
then. Or alternatively bail out, unlocking all the already locked
devices.

Andrey

[...]
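One hedged reading of Lijo's proposal in code; the function name and
the -EBUSY return are invented for this sketch, and the resulting lock
ordering (per-device lock taken before hive_lock) would still need
auditing against the other users of the hive list.

/*
 * Sketch only: lock this adev first, so a reentrant recovery on the
 * same device bails out immediately, then take the hive lock and
 * lock the remaining devices in the hive.
 */
static int amdgpu_recover_lock_hive(struct amdgpu_device *adev,
                                    struct amdgpu_hive_info *hive)
{
        struct amdgpu_device *tmp_adev;

        if (!amdgpu_device_lock_adev(adev, hive))
                return -EBUSY;  /* this device is already in reset */

        mutex_lock(&hive->hive_lock);

        list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
                if (tmp_adev == adev)
                        continue;       /* locked above, before hive_lock */
                if (!amdgpu_device_lock_adev(tmp_adev, hive))
                        goto unwind;
        }
        return 0;

unwind:
        /* release the devices locked so far, skipping adev */
        list_for_each_entry_continue_reverse(tmp_adev, &hive->device_list,
                                             gmc.xgmi.head) {
                if (tmp_adev != adev)
                        amdgpu_device_unlock_adev(tmp_adev);
        }
        mutex_unlock(&hive->hive_lock);
        amdgpu_device_unlock_adev(adev);
        return -EBUSY;
}

Taking this adev's lock first keeps the common bail-out path (another
TDR already in progress on this device) cheap, matching the
"if (this device locked) return" part of the proposal.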
[parent not found: <20210119122236.8039-2-horace.chen@amd.com>]
* Re: [PATCH 2/2] drm/amdgpu: set job guilty if reset skipped
  [not found] ` <20210119122236.8039-2-horace.chen@amd.com>
@ 2021-01-19 14:55 ` Andrey Grodzovsky
  0 siblings, 0 replies; 9+ messages in thread
From: Andrey Grodzovsky @ 2021-01-19 14:55 UTC (permalink / raw)
To: Horace Chen, amd-gfx
Cc: Jack Xiao, Feifei Xu, Kevin Wang, Xiaojie Yuan, Tuikov Luben,
    Deucher Alexander, Evan Quan, Christian König, Monk Liu,
    Hawking Zhang

Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Andrey

On 1/19/21 7:22 AM, Horace Chen wrote:
> If 2 jobs on 2 different rings time out within a very short
> period, the reset for the second job will be skipped because
> a reset is already in progress.
>
> But that doesn't mean the second job is not guilty, since it
> also timed out and can be a bad job. So before skipping out of
> the reset, we need to increase karma for this job too.
>
> Signed-off-by: Horace Chen <horace.chen@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 9574da3abc32..1d6ff9fe37de 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4574,6 +4574,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>                   DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
>                           job ? job->base.id : -1, hive->hive_id);
>                   amdgpu_put_xgmi_hive(hive);
> +                 if (job)
> +                         drm_sched_increase_karma(&job->base);
>                   return 0;
>           }
>           mutex_lock(&hive->hive_lock);
> @@ -4617,6 +4619,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>                           job ? job->base.id : -1);
>                   r = 0;
>                   /* even if we skipped this reset, we still need to set the job to guilty */
> +                 if (job)
> +                         drm_sched_increase_karma(&job->base);
>                   goto skip_recovery;
>           }
[parent not found: <20210114133729.24169-1-horace.chen@amd.com>]
[parent not found: <20210114133729.24169-2-horace.chen@amd.com>]
* Re: [PATCH 2/2] drm/amdgpu: set job guilty if reset skipped
  [not found] ` <20210114133729.24169-2-horace.chen@amd.com>
@ 2021-01-14 14:48 ` Andrey Grodzovsky
  2021-01-14 17:25   ` Alex Deucher
  0 siblings, 1 reply; 9+ messages in thread
From: Andrey Grodzovsky @ 2021-01-14 14:48 UTC (permalink / raw)
To: Horace Chen, amd-gfx
Cc: Jack Xiao, Feifei Xu, Kevin Wang, Xiaojie Yuan, Tuikov Luben,
    Deucher Alexander, Evan Quan, Christian König, Monk Liu,
    Hawking Zhang

Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Andrey

On 1/14/21 8:37 AM, Horace Chen wrote:
> If 2 jobs on 2 different rings time out within a very short
> period, the reset for the second job will be skipped because
> a reset is already in progress.
>
> But that doesn't mean the second job is not guilty, since it
> also timed out and can be a bad job. So before skipping out of
> the reset, we need to increase karma for this job too.
>
> Signed-off-by: Horace Chen <horace.chen@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index a28e138ac72c..d1112e29c8b4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4572,6 +4572,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>           if (atomic_cmpxchg(&hive->in_reset, 0, 1) != 0) {
>                   DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
>                           job ? job->base.id : -1, hive->hive_id);
> +                 if(job)
> +                         drm_sched_increase_karma(&job->base);
>                   amdgpu_put_xgmi_hive(hive);
>                   return 0;
>           }
> @@ -4596,6 +4598,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>                   dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
>                           job ? job->base.id : -1);
>                   r = 0;
> +                 if(job)
> +                         drm_sched_increase_karma(&job->base);
>                   goto skip_recovery;
>           }
* Re: [PATCH 2/2] drm/amdgpu: set job guilty if reset skipped
  2021-01-14 14:48 ` Andrey Grodzovsky
@ 2021-01-14 17:25 ` Alex Deucher
  0 siblings, 0 replies; 9+ messages in thread
From: Alex Deucher @ 2021-01-14 17:25 UTC (permalink / raw)
To: Andrey Grodzovsky
Cc: Jack Xiao, Feifei Xu, Horace Chen, Kevin Wang, amd-gfx list,
    Hawking Zhang, Tuikov Luben, Deucher Alexander, Evan Quan,
    Christian König, Monk Liu, Xiaojie Yuan

On Thu, Jan 14, 2021 at 9:48 AM Andrey Grodzovsky
<Andrey.Grodzovsky@amd.com> wrote:
>
> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>
> Andrey
>
> On 1/14/21 8:37 AM, Horace Chen wrote:
> > If 2 jobs on 2 different rings time out within a very short
> > period, the reset for the second job will be skipped because
> > a reset is already in progress.
> >
> > But that doesn't mean the second job is not guilty, since it
> > also timed out and can be a bad job. So before skipping out of
> > the reset, we need to increase karma for this job too.
> >
> > Signed-off-by: Horace Chen <horace.chen@amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
> >   1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index a28e138ac72c..d1112e29c8b4 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -4572,6 +4572,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >           if (atomic_cmpxchg(&hive->in_reset, 0, 1) != 0) {
> >                   DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
> >                           job ? job->base.id : -1, hive->hive_id);
> > +                 if(job)

Space between the if and the (. E.g., if (job)

> > +                         drm_sched_increase_karma(&job->base);
> >                   amdgpu_put_xgmi_hive(hive);
> >                   return 0;
> >           }
> > @@ -4596,6 +4598,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >                   dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
> >                           job ? job->base.id : -1);
> >                   r = 0;
> > +                 if(job)

Same here.

Alex

> > +                         drm_sched_increase_karma(&job->base);
> >                   goto skip_recovery;
> >           }
end of thread, other threads: [~2021-01-19 19:24 UTC | newest]

Thread overview: 9+ messages
[not found] <20210119122236.8039-1-horace.chen@amd.com>
2021-01-19 14:33 ` [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout Andrey Grodzovsky
2021-01-19 16:39   ` Re: " Chen, Horace
2021-01-19 16:58     ` Andrey Grodzovsky
2021-01-19 17:09       ` Re: " Chen, Horace
2021-01-19 17:15         ` Andrey Grodzovsky
2021-01-19 17:22           ` Lazar, Lijo
[not found] ` <20210119122236.8039-2-horace.chen@amd.com>
2021-01-19 14:55   ` [PATCH 2/2] drm/amdgpu: set job guilty if reset skipped Andrey Grodzovsky
[not found] <20210114133729.24169-1-horace.chen@amd.com>
[not found] ` <20210114133729.24169-2-horace.chen@amd.com>
2021-01-14 14:48   ` Andrey Grodzovsky
2021-01-14 17:25     ` Alex Deucher