All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] dmr/amdgpu: Fix wrongly unref of BO
@ 2017-04-13 21:34 Alex Xie
       [not found] ` <1492119298-8526-1-git-send-email-AlexBin.Xie-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Xie @ 2017-04-13 21:34 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Alex Xie

According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
can return with -ERESTARTSYS. When this function was interrupted
by a signal, BO should not be unref. Otherwise the BO might be
released while is kmapped and pinned, or BO MIGHT be deref
multiple times, etc.

Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
Signed-off-by: Alex Xie <AlexBin.Xie@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 53996e3..1dcc2d1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct amdgpu_device *adev)
 		amdgpu_bo_kunmap(adev->vram_scratch.robj);
 		amdgpu_bo_unpin(adev->vram_scratch.robj);
 		amdgpu_bo_unreserve(adev->vram_scratch.robj);
+		amdgpu_bo_unref(&adev->vram_scratch.robj);
 	}
-	amdgpu_bo_unref(&adev->vram_scratch.robj);
 }
 
 /**
-- 
1.9.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
       [not found] ` <1492119298-8526-1-git-send-email-AlexBin.Xie-5C7GfCeVMHo@public.gmane.org>
@ 2017-04-14  4:00   ` zhoucm1
       [not found]     ` <58F0495D.60607-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: zhoucm1 @ 2017-04-14  4:00 UTC (permalink / raw)
  To: Alex Xie, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



On 2017年04月14日 05:34, Alex Xie wrote:
> According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
> can return with -ERESTARTSYS. When this function was interrupted
> by a signal, BO should not be unref. Otherwise the BO might be
> released while is kmapped and pinned, or BO MIGHT be deref
> multiple times, etc.
         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
we have specified interruptible to false, so -ERESTARTSYS isn't possible 
here.

Thanks,
David Zhou
>
> Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
> Signed-off-by: Alex Xie <AlexBin.Xie@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 53996e3..1dcc2d1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct amdgpu_device *adev)
>   		amdgpu_bo_kunmap(adev->vram_scratch.robj);
>   		amdgpu_bo_unpin(adev->vram_scratch.robj);
>   		amdgpu_bo_unreserve(adev->vram_scratch.robj);
> +		amdgpu_bo_unref(&adev->vram_scratch.robj);
>   	}
> -	amdgpu_bo_unref(&adev->vram_scratch.robj);
>   }
>   
>   /**

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
       [not found]     ` <58F0495D.60607-5C7GfCeVMHo@public.gmane.org>
@ 2017-04-17 14:54       ` Xie, AlexBin
       [not found]         ` <MWHPR1201MB0045DE55E81D0FF5CE76671FF2060-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Xie, AlexBin @ 2017-04-17 14:54 UTC (permalink / raw)
  To: Zhou, David(ChunMing), amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 2208 bytes --]

Hi David,


Thanks for the comments. However, please have look at amdgpu_bo_reserve definition.

static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool no_intr)


When we call this function like the following:

         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
The false means interruptible.


On the other hand,  when amdgpu_bo_reserve function return error, why do we unref BO without kunmap and unpin the BO? This is wrong implementation when amdgpu_bo_reserve return any error.


Thanks,
Alex Bin Xie

________________________________
From: Zhou, David(ChunMing)
Sent: Friday, April 14, 2017 12:00 AM
To: Xie, AlexBin; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO



On 2017年04月14日 05:34, Alex Xie wrote:
> According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
> can return with -ERESTARTSYS. When this function was interrupted
> by a signal, BO should not be unref. Otherwise the BO might be
> released while is kmapped and pinned, or BO MIGHT be deref
> multiple times, etc.
         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
we have specified interruptible to false, so -ERESTARTSYS isn't possible
here.

Thanks,
David Zhou
>
> Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
> Signed-off-by: Alex Xie <AlexBin.Xie-5C7GfCeVMHo@public.gmane.org>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 53996e3..1dcc2d1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct amdgpu_device *adev)
>                amdgpu_bo_kunmap(adev->vram_scratch.robj);
>                amdgpu_bo_unpin(adev->vram_scratch.robj);
>                amdgpu_bo_unreserve(adev->vram_scratch.robj);
> +             amdgpu_bo_unref(&adev->vram_scratch.robj);
>        }
> -     amdgpu_bo_unref(&adev->vram_scratch.robj);
>   }
>
>   /**


[-- Attachment #1.2: Type: text/html, Size: 5254 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
       [not found]         ` <MWHPR1201MB0045DE55E81D0FF5CE76671FF2060-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-04-18  2:38           ` zhoucm1
       [not found]             ` <58F57C14.20403-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: zhoucm1 @ 2017-04-18  2:38 UTC (permalink / raw)
  To: Xie, AlexBin, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 2562 bytes --]



On 2017年04月17日 22:54, Xie, AlexBin wrote:
>
> Hi David,
>
>
> Thanks for the comments. However, please have look at 
> amdgpu_bo_reserve definition.
>
> static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool no_intr)
>
Ah, this is a wired wrapper for ttm_bo_reserve.

>
> When we call this function like the following:
>
>          r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
> The false means interruptible.
>
>
> On the other hand,  when amdgpu_bo_reserve function return error, why 
> do we unref BO without kunmap and unpin the BO? This is wrong 
> implementation when amdgpu_bo_reserve return any error.
>
Yeah, I see your mean, it's in driver un-loading, How about changing to 
no interruptible? Your patch will make a memleak if bo_reserve fails, 
but it seems not matter. I have no strong preference.

Regards,
David Zhou
>
>
> Thanks,
> Alex Bin Xie
>
> ------------------------------------------------------------------------
> *From:* Zhou, David(ChunMing)
> *Sent:* Friday, April 14, 2017 12:00 AM
> *To:* Xie, AlexBin; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>
>
> On 2017年04月14日 05:34, Alex Xie wrote:
> > According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
> > can return with -ERESTARTSYS. When this function was interrupted
> > by a signal, BO should not be unref. Otherwise the BO might be
> > released while is kmapped and pinned, or BO MIGHT be deref
> > multiple times, etc.
>          r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
> we have specified interruptible to false, so -ERESTARTSYS isn't possible
> here.
>
> Thanks,
> David Zhou
> >
> > Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
> > Signed-off-by: Alex Xie <AlexBin.Xie-5C7GfCeVMHo@public.gmane.org>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 53996e3..1dcc2d1 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct 
> amdgpu_device *adev)
> > amdgpu_bo_kunmap(adev->vram_scratch.robj);
> > amdgpu_bo_unpin(adev->vram_scratch.robj);
> > amdgpu_bo_unreserve(adev->vram_scratch.robj);
> > + amdgpu_bo_unref(&adev->vram_scratch.robj);
> >        }
> > - amdgpu_bo_unref(&adev->vram_scratch.robj);
> >   }
> >
> >   /**
>


[-- Attachment #1.2: Type: text/html, Size: 8196 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
       [not found]             ` <58F57C14.20403-5C7GfCeVMHo@public.gmane.org>
@ 2017-04-18 15:17               ` Xie, AlexBin
       [not found]                 ` <MWHPR1201MB00452889FF2C7FDC55630B4BF2190-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Xie, AlexBin @ 2017-04-18 15:17 UTC (permalink / raw)
  To: Zhou, David(ChunMing), amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 3649 bytes --]

Hi David,


When amdgpu_bo_reserve return errors, we cannot release the BO. This is not a memory leak. General speaking, memory leak is unnoticed and unintentional.


The caller of function amdgpu_vram_scratch_fini ignores the return error value...


The "memory leak" is not caused by my patch. It is caused because reserving BO fails.


This patch only aim to make function amdgpu_vram_scratch_fini behave correctly.


To follow up, we can add a warning message when amdgpu_bo_reserve error happens in a different patch.


If function call amdgpu_bo_reserve is changed to uninterruptible, this changes driver behaviour. Without a substantial issue, I would be cautious for such a change.


Thanks,

Alex Bin Xie

________________________________
From: Zhou, David(ChunMing)
Sent: Monday, April 17, 2017 10:38 PM
To: Xie, AlexBin; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO



On 2017年04月17日 22:54, Xie, AlexBin wrote:

Hi David,


Thanks for the comments. However, please have look at amdgpu_bo_reserve definition.

static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool no_intr)

Ah, this is a wired wrapper for ttm_bo_reserve.



When we call this function like the following:

         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
The false means interruptible.


On the other hand,  when amdgpu_bo_reserve function return error, why do we unref BO without kunmap and unpin the BO? This is wrong implementation when amdgpu_bo_reserve return any error.

Yeah, I see your mean, it's in driver un-loading, How about changing to no interruptible? Your patch will make a memleak if bo_reserve fails, but it seems not matter. I have no strong preference.

Regards,
David Zhou


Thanks,
Alex Bin Xie

________________________________
From: Zhou, David(ChunMing)
Sent: Friday, April 14, 2017 12:00 AM
To: Xie, AlexBin; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org<mailto:amd-gfx-PD4FTy7X32lMiVNPc3mojA@public.gmane.orgsktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO



On 2017年04月14日 05:34, Alex Xie wrote:
> According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
> can return with -ERESTARTSYS. When this function was interrupted
> by a signal, BO should not be unref. Otherwise the BO might be
> released while is kmapped and pinned, or BO MIGHT be deref
> multiple times, etc.
         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
we have specified interruptible to false, so -ERESTARTSYS isn't possible
here.

Thanks,
David Zhou
>
> Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
> Signed-off-by: Alex Xie <AlexBin.Xie-5C7GfCeVMHo@public.gmane.org><mailto:AlexBin.Xie-5C7GfCeVMHo@public.gmane.org>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 53996e3..1dcc2d1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct amdgpu_device *adev)
>                amdgpu_bo_kunmap(adev->vram_scratch.robj);
>                amdgpu_bo_unpin(adev->vram_scratch.robj);
>                amdgpu_bo_unreserve(adev->vram_scratch.robj);
> +             amdgpu_bo_unref(&adev->vram_scratch.robj);
>        }
> -     amdgpu_bo_unref(&adev->vram_scratch.robj);
>   }
>
>   /**



[-- Attachment #1.2: Type: text/html, Size: 8235 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
       [not found]                 ` <MWHPR1201MB00452889FF2C7FDC55630B4BF2190-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-04-18 17:46                   ` Christian König
       [not found]                     ` <7807726d-7318-b72f-1a14-2b37a73988bf-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2017-04-18 17:46 UTC (permalink / raw)
  To: Xie, AlexBin, Zhou, David(ChunMing),
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 4315 bytes --]

Hi AlexBin,

No, David is right. This is a very common coding pattern in the kernel 
module.

Freeing up a BO while there still exists a kernel mapping is perfectly 
ok, the mapping will just be released a bit later on.

So this code is actually perfectly ok and just an optimization, but your 
patch breaks it and creates a memory leak.

Regards,
Christian.

Am 18.04.2017 um 17:17 schrieb Xie, AlexBin:
>
> Hi David,
>
>
> When amdgpu_bo_reserve return errors, we cannot release the BO. This 
> is not a memory leak. General speaking, memory leak is unnoticed and 
> unintentional.
>
>
> The caller of function amdgpu_vram_scratch_fini ignores the return 
> error value...
>
>
> The "memory leak" is not caused by my patch. It is caused because 
> reserving BO fails.
>
>
> This patch only aim to make function amdgpu_vram_scratch_fini behave 
> correctly.
>
>
> To follow up, we can add a warning message when amdgpu_bo_reserve 
> error happens in a different patch.
>
>
> If function call amdgpu_bo_reserve is changed to uninterruptible, this 
> changes driver behaviour. Without a substantial issue, I would be 
> cautious for such a change.
>
>
> Thanks,
>
> Alex Bin Xie
>
>
> ------------------------------------------------------------------------
> *From:* Zhou, David(ChunMing)
> *Sent:* Monday, April 17, 2017 10:38 PM
> *To:* Xie, AlexBin; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>
>
> On 2017年04月17日 22:54, Xie, AlexBin wrote:
>>
>> Hi David,
>>
>>
>> Thanks for the comments. However, please have look at 
>> amdgpu_bo_reserve definition.
>>
>> static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool no_intr)
>>
> Ah, this is a wired wrapper for ttm_bo_reserve.
>
>>
>> When we call this function like the following:
>>
>>          r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>> The false means interruptible.
>>
>>
>> On the other hand,  when amdgpu_bo_reserve function return error, why 
>> do we unref BO without kunmap and unpin the BO? This is wrong 
>> implementation when amdgpu_bo_reserve return any error.
>>
> Yeah, I see your mean, it's in driver un-loading, How about changing 
> to no interruptible? Your patch will make a memleak if bo_reserve 
> fails, but it seems not matter. I have no strong preference.
>
> Regards,
> David Zhou
>>
>>
>> Thanks,
>> Alex Bin Xie
>>
>> ------------------------------------------------------------------------
>> *From:* Zhou, David(ChunMing)
>> *Sent:* Friday, April 14, 2017 12:00 AM
>> *To:* Xie, AlexBin; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>
>>
>> On 2017年04月14日 05:34, Alex Xie wrote:
>> > According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
>> > can return with -ERESTARTSYS. When this function was interrupted
>> > by a signal, BO should not be unref. Otherwise the BO might be
>> > released while is kmapped and pinned, or BO MIGHT be deref
>> > multiple times, etc.
>>          r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>> we have specified interruptible to false, so -ERESTARTSYS isn't possible
>> here.
>>
>> Thanks,
>> David Zhou
>> >
>> > Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
>> > Signed-off-by: Alex Xie <AlexBin.Xie-5C7GfCeVMHo@public.gmane.org>
>> > ---
>> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>> >   1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> > index 53996e3..1dcc2d1 100644
>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> > @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct 
>> amdgpu_device *adev)
>> > amdgpu_bo_kunmap(adev->vram_scratch.robj);
>> > amdgpu_bo_unpin(adev->vram_scratch.robj);
>> > amdgpu_bo_unreserve(adev->vram_scratch.robj);
>> > + amdgpu_bo_unref(&adev->vram_scratch.robj);
>> >        }
>> > - amdgpu_bo_unref(&adev->vram_scratch.robj);
>> >   }
>> >
>> >   /**
>>
>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx



[-- Attachment #1.2: Type: text/html, Size: 13038 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
       [not found]                     ` <7807726d-7318-b72f-1a14-2b37a73988bf-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2017-04-18 18:04                       ` Xie, AlexBin
       [not found]                         ` <MWHPR1201MB00451DFD2A06BBA3CD6F928EF2190-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Xie, AlexBin @ 2017-04-18 18:04 UTC (permalink / raw)
  To: Christian König, Zhou, David(ChunMing),
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 4772 bytes --]

Hi Christian,


Would you point out where/when will kunmap happen for this BO when release? It must be somewhere in some function calls.


I checked before I asked for review. But I did not see such obvious kunmap function call.


If so, there should be a comment in function amdgpu_vram_scratch_fini to avoid future confusion.

Thanks,
Alex Bin Xie
________________________________
From: Christian König <deathsimple@vodafone.de>
Sent: Tuesday, April 18, 2017 1:46 PM
To: Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Hi AlexBin,

No, David is right. This is a very common coding pattern in the kernel module.

Freeing up a BO while there still exists a kernel mapping is perfectly ok, the mapping will just be released a bit later on.

So this code is actually perfectly ok and just an optimization, but your patch breaks it and creates a memory leak.

Regards,
Christian.

Am 18.04.2017 um 17:17 schrieb Xie, AlexBin:

Hi David,


When amdgpu_bo_reserve return errors, we cannot release the BO. This is not a memory leak. General speaking, memory leak is unnoticed and unintentional.


The caller of function amdgpu_vram_scratch_fini ignores the return error value...


The "memory leak" is not caused by my patch. It is caused because reserving BO fails.


This patch only aim to make function amdgpu_vram_scratch_fini behave correctly.


To follow up, we can add a warning message when amdgpu_bo_reserve error happens in a different patch.


If function call amdgpu_bo_reserve is changed to uninterruptible, this changes driver behaviour. Without a substantial issue, I would be cautious for such a change.


Thanks,

Alex Bin Xie

________________________________
From: Zhou, David(ChunMing)
Sent: Monday, April 17, 2017 10:38 PM
To: Xie, AlexBin; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO



On 2017年04月17日 22:54, Xie, AlexBin wrote:

Hi David,


Thanks for the comments. However, please have look at amdgpu_bo_reserve definition.

static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool no_intr)

Ah, this is a wired wrapper for ttm_bo_reserve.



When we call this function like the following:

         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
The false means interruptible.


On the other hand,  when amdgpu_bo_reserve function return error, why do we unref BO without kunmap and unpin the BO? This is wrong implementation when amdgpu_bo_reserve return any error.

Yeah, I see your mean, it's in driver un-loading, How about changing to no interruptible? Your patch will make a memleak if bo_reserve fails, but it seems not matter. I have no strong preference.

Regards,
David Zhou


Thanks,
Alex Bin Xie

________________________________
From: Zhou, David(ChunMing)
Sent: Friday, April 14, 2017 12:00 AM
To: Xie, AlexBin; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO



On 2017年04月14日 05:34, Alex Xie wrote:
> According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
> can return with -ERESTARTSYS. When this function was interrupted
> by a signal, BO should not be unref. Otherwise the BO might be
> released while is kmapped and pinned, or BO MIGHT be deref
> multiple times, etc.
         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
we have specified interruptible to false, so -ERESTARTSYS isn't possible
here.

Thanks,
David Zhou
>
> Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
> Signed-off-by: Alex Xie <AlexBin.Xie@amd.com><mailto:AlexBin.Xie@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 53996e3..1dcc2d1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct amdgpu_device *adev)
>                amdgpu_bo_kunmap(adev->vram_scratch.robj);
>                amdgpu_bo_unpin(adev->vram_scratch.robj);
>                amdgpu_bo_unreserve(adev->vram_scratch.robj);
> +             amdgpu_bo_unref(&adev->vram_scratch.robj);
>        }
> -     amdgpu_bo_unref(&adev->vram_scratch.robj);
>   }
>
>   /**





_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://lists.freedesktop.org/mailman/listinfo/amd-gfx



[-- Attachment #1.2: Type: text/html, Size: 10464 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
       [not found]                         ` <MWHPR1201MB00451DFD2A06BBA3CD6F928EF2190-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-04-18 18:54                           ` Xie, AlexBin
       [not found]                             ` <MWHPR1201MB004553768BCAC951D28A347AF2190-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Xie, AlexBin @ 2017-04-18 18:54 UTC (permalink / raw)
  To: Christian König, Zhou, David(ChunMing),
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 5909 bytes --]

Hi Christian,

Have you found how/where/when? When you said "mapping will just be released a bit later on", you must know the answer.

It is difficult to prove something does not exist. Anyway, I will give it a try to prove such "later on" does not exist.

Function ttm_bo_kunmap is the only function to unmap. To prove this, search ttm_bo_map_iomap, only ttm_bo_kunmap use this enum to correctly kunmap.

Function ttm_bo_kunmap is not called by ttm itself. This is a hint that all TTM delay delete mechanism or unref mechanism will NOT kunmap BO later on.

Function ttm_bo_kunmap is called by AMDGPU function amdgpu_bo_kunmap and amdgpu_gem_prime_vunmap.

Search AMDGPU for amdgpu_bo_kunmap. All matches do not kunmap for scratch VRAM BO.  amdgpu_bo_free_kernel is a suspect but the answer is still NO.

So all possibilities are searched. Did I miss anything?

Thanks,
Alex Bin Xie

________________________________
From: Xie, AlexBin
Sent: Tuesday, April 18, 2017 2:04:33 PM
To: Christian König; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO


Hi Christian,


Would you point out where/when will kunmap happen for this BO when release? It must be somewhere in some function calls.


I checked before I asked for review. But I did not see such obvious kunmap function call.


If so, there should be a comment in function amdgpu_vram_scratch_fini to avoid future confusion.

Thanks,
Alex Bin Xie
________________________________
From: Christian König <deathsimple@vodafone.de>
Sent: Tuesday, April 18, 2017 1:46 PM
To: Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Hi AlexBin,

No, David is right. This is a very common coding pattern in the kernel module.

Freeing up a BO while there still exists a kernel mapping is perfectly ok, the mapping will just be released a bit later on.

So this code is actually perfectly ok and just an optimization, but your patch breaks it and creates a memory leak.

Regards,
Christian.

Am 18.04.2017 um 17:17 schrieb Xie, AlexBin:

Hi David,


When amdgpu_bo_reserve return errors, we cannot release the BO. This is not a memory leak. General speaking, memory leak is unnoticed and unintentional.


The caller of function amdgpu_vram_scratch_fini ignores the return error value...


The "memory leak" is not caused by my patch. It is caused because reserving BO fails.


This patch only aim to make function amdgpu_vram_scratch_fini behave correctly.


To follow up, we can add a warning message when amdgpu_bo_reserve error happens in a different patch.


If function call amdgpu_bo_reserve is changed to uninterruptible, this changes driver behaviour. Without a substantial issue, I would be cautious for such a change.


Thanks,

Alex Bin Xie

________________________________
From: Zhou, David(ChunMing)
Sent: Monday, April 17, 2017 10:38 PM
To: Xie, AlexBin; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO



On 2017年04月17日 22:54, Xie, AlexBin wrote:

Hi David,


Thanks for the comments. However, please have look at amdgpu_bo_reserve definition.

static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool no_intr)

Ah, this is a wired wrapper for ttm_bo_reserve.



When we call this function like the following:

         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
The false means interruptible.


On the other hand,  when amdgpu_bo_reserve function return error, why do we unref BO without kunmap and unpin the BO? This is wrong implementation when amdgpu_bo_reserve return any error.

Yeah, I see your mean, it's in driver un-loading, How about changing to no interruptible? Your patch will make a memleak if bo_reserve fails, but it seems not matter. I have no strong preference.

Regards,
David Zhou


Thanks,
Alex Bin Xie

________________________________
From: Zhou, David(ChunMing)
Sent: Friday, April 14, 2017 12:00 AM
To: Xie, AlexBin; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO



On 2017年04月14日 05:34, Alex Xie wrote:
> According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
> can return with -ERESTARTSYS. When this function was interrupted
> by a signal, BO should not be unref. Otherwise the BO might be
> released while is kmapped and pinned, or BO MIGHT be deref
> multiple times, etc.
         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
we have specified interruptible to false, so -ERESTARTSYS isn't possible
here.

Thanks,
David Zhou
>
> Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
> Signed-off-by: Alex Xie <AlexBin.Xie@amd.com><mailto:AlexBin.Xie@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 53996e3..1dcc2d1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct amdgpu_device *adev)
>                amdgpu_bo_kunmap(adev->vram_scratch.robj);
>                amdgpu_bo_unpin(adev->vram_scratch.robj);
>                amdgpu_bo_unreserve(adev->vram_scratch.robj);
> +             amdgpu_bo_unref(&adev->vram_scratch.robj);
>        }
> -     amdgpu_bo_unref(&adev->vram_scratch.robj);
>   }
>
>   /**





_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://lists.freedesktop.org/mailman/listinfo/amd-gfx



[-- Attachment #1.2: Type: text/html, Size: 15574 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
       [not found]                             ` <MWHPR1201MB004553768BCAC951D28A347AF2190-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-04-19  6:35                               ` Christian König
       [not found]                                 ` <4509f79a-ffd5-eb40-117c-84ea02cbf5cd-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2017-04-19  6:35 UTC (permalink / raw)
  To: Xie, AlexBin, Zhou, David(ChunMing),
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 7290 bytes --]

Hi AlexBin,

the answer is ttm_bo_kunmap isn't called at all and yes in the case of 
an iomap we leak the address space reserved.

But that is completely harmless on a 64bit system compared to leaking 
the memory backing the address space.

Using amdgpu_bo_free_kernel() instead of openly coding it here is 
probably a good idea.

Additional to that it's probably a good idea to set the no_intr flag 
when reserving kernel BOs. It is impossible to receive a signal during 
module load/unload, but it's probably better to document that in the 
code as well.

Regards,
Christian.

Am 18.04.2017 um 20:54 schrieb Xie, AlexBin:
> Hi Christian,
>
> Have you found how/where/when? When you said "mapping will just be 
> released a bit later on", you must know the answer.
>
> It is difficult to prove something does not exist. Anyway, I will give 
> it a try to prove such "later on" does not exist.
>
> Function ttm_bo_kunmap is the only function to unmap. To prove this, 
> search ttm_bo_map_iomap, only ttm_bo_kunmap use this enum to correctly 
> kunmap.
>
> Function ttm_bo_kunmap is not called by ttm itself. This is a hint 
> that all TTM delay delete mechanism or unref mechanism will NOT kunmap 
> BO later on.
>
> Function ttm_bo_kunmap is called by AMDGPU function amdgpu_bo_kunmap 
> and amdgpu_gem_prime_vunmap.
>
> Search AMDGPU for amdgpu_bo_kunmap. All matches do not kunmap for 
> scratch VRAM BO. amdgpu_bo_free_kernel is a suspect but the answer is 
> still NO.
>
> So all possibilities are searched. Did I miss anything?
>
> Thanks,
> Alex Bin Xie
>
> ------------------------------------------------------------------------
> *From:* Xie, AlexBin
> *Sent:* Tuesday, April 18, 2017 2:04:33 PM
> *To:* Christian König; Zhou, David(ChunMing); 
> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>
> Hi Christian,
>
>
> Would you point out where/when will kunmap happen for this BO when 
> release? It must be somewhere in some function calls.
>
>
> I checked before I asked for review. But I did not see such obvious 
> kunmap function call.
>
>
> If so, there should be a comment in function amdgpu_vram_scratch_fini 
> to avoid future confusion.
>
>
> Thanks,
> Alex Bin Xie
> ------------------------------------------------------------------------
> *From:* Christian König <deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
> *Sent:* Tuesday, April 18, 2017 1:46 PM
> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
> Hi AlexBin,
>
> No, David is right. This is a very common coding pattern in the kernel 
> module.
>
> Freeing up a BO while there still exists a kernel mapping is perfectly 
> ok, the mapping will just be released a bit later on.
>
> So this code is actually perfectly ok and just an optimization, but 
> your patch breaks it and creates a memory leak.
>
> Regards,
> Christian.
>
> Am 18.04.2017 um 17:17 schrieb Xie, AlexBin:
>>
>> Hi David,
>>
>>
>> When amdgpu_bo_reserve return errors, we cannot release the BO. This 
>> is not a memory leak. General speaking, memory leak is unnoticed and 
>> unintentional.
>>
>>
>> The caller of function amdgpu_vram_scratch_fini ignores the return 
>> error value...
>>
>>
>> The "memory leak" is not caused by my patch. It is caused because 
>> reserving BO fails.
>>
>>
>> This patch only aim to make function amdgpu_vram_scratch_fini behave 
>> correctly.
>>
>>
>> To follow up, we can add a warning message when amdgpu_bo_reserve 
>> error happens in a different patch.
>>
>>
>> If function call amdgpu_bo_reserve is changed to uninterruptible, 
>> this changes driver behaviour. Without a substantial issue, I would 
>> be cautious for such a change.
>>
>>
>> Thanks,
>>
>> Alex Bin Xie
>>
>>
>> ------------------------------------------------------------------------
>> *From:* Zhou, David(ChunMing)
>> *Sent:* Monday, April 17, 2017 10:38 PM
>> *To:* Xie, AlexBin; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>
>>
>> On 2017年04月17日 22:54, Xie, AlexBin wrote:
>>>
>>> Hi David,
>>>
>>>
>>> Thanks for the comments. However, please have look at 
>>> amdgpu_bo_reserve definition.
>>>
>>> static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool no_intr)
>>>
>> Ah, this is a wired wrapper for ttm_bo_reserve.
>>
>>>
>>> When we call this function like the following:
>>>
>>>          r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>> The false means interruptible.
>>>
>>>
>>> On the other hand,  when amdgpu_bo_reserve function return error, 
>>> why do we unref BO without kunmap and unpin the BO? This is wrong 
>>> implementation when amdgpu_bo_reserve return any error.
>>>
>> Yeah, I see your mean, it's in driver un-loading, How about changing 
>> to no interruptible? Your patch will make a memleak if bo_reserve 
>> fails, but it seems not matter. I have no strong preference.
>>
>> Regards,
>> David Zhou
>>>
>>>
>>> Thanks,
>>> Alex Bin Xie
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Zhou, David(ChunMing)
>>> *Sent:* Friday, April 14, 2017 12:00 AM
>>> *To:* Xie, AlexBin; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>
>>>
>>> On 2017年04月14日 05:34, Alex Xie wrote:
>>> > According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
>>> > can return with -ERESTARTSYS. When this function was interrupted
>>> > by a signal, BO should not be unref. Otherwise the BO might be
>>> > released while is kmapped and pinned, or BO MIGHT be deref
>>> > multiple times, etc.
>>>          r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>> we have specified interruptible to false, so -ERESTARTSYS isn't 
>>> possible
>>> here.
>>>
>>> Thanks,
>>> David Zhou
>>> >
>>> > Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
>>> > Signed-off-by: Alex Xie <AlexBin.Xie-5C7GfCeVMHo@public.gmane.org>
>>> > ---
>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>> >   1 file changed, 1 insertion(+), 1 deletion(-)
>>> >
>>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> > index 53996e3..1dcc2d1 100644
>>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> > @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct 
>>> amdgpu_device *adev)
>>> > amdgpu_bo_kunmap(adev->vram_scratch.robj);
>>> > amdgpu_bo_unpin(adev->vram_scratch.robj);
>>> > amdgpu_bo_unreserve(adev->vram_scratch.robj);
>>> > + amdgpu_bo_unref(&adev->vram_scratch.robj);
>>> >        }
>>> > - amdgpu_bo_unref(&adev->vram_scratch.robj);
>>> >   }
>>> >
>>> >   /**
>>>
>>
>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx



[-- Attachment #1.2: Type: text/html, Size: 27093 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
       [not found]                                 ` <4509f79a-ffd5-eb40-117c-84ea02cbf5cd-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2017-04-19 11:37                                   ` Xie, AlexBin
       [not found]                                     ` <MWHPR1201MB0045EC51F6CE7C6C4283888CF2180-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Xie, AlexBin @ 2017-04-19 11:37 UTC (permalink / raw)
  To: Christian König, Zhou, David(ChunMing),
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 8127 bytes --]

Hi Christian,


Without correctly kunmap, page table is corrupted. Page entries point to wrong memory locations. You might call it completely harmless. But I think this is a severe bug. Leaking memory is better than a corrupted page table. Think security.


Would you provide any document and reference by saying" It is impossible to receive a signal during module load/unload"? For example, if the unload stuck in a lock, can CTRL+C stop the unload?


If "It is impossible to receive a signal during module load/unload", interruptible waiting is fine too, because function amdgpu_bo_reserve will return successfully.


What about there is some other return error? What about in future somebody improve amdgpu_bo_reserve to return other errors, then function amdgpu_vram_scratch_fini becomes buggy?


While I am thinking whether there is a better way for the current situation, would you give a real world example that my patch really not working? Then we can address it.


Thanks,

Alex Bin

________________________________
From: Christian König <deathsimple@vodafone.de>
Sent: Wednesday, April 19, 2017 2:35 AM
To: Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Hi AlexBin,

the answer is ttm_bo_kunmap isn't called at all and yes in the case of an iomap we leak the address space reserved.

But that is completely harmless on a 64bit system compared to leaking the memory backing the address space.

Using amdgpu_bo_free_kernel() instead of openly coding it here is probably a good idea.

Additional to that it's probably a good idea to set the no_intr flag when reserving kernel BOs. It is impossible to receive a signal during module load/unload, but it's probably better to document that in the code as well.

Regards,
Christian.

Am 18.04.2017 um 20:54 schrieb Xie, AlexBin:
Hi Christian,

Have you found how/where/when? When you said "mapping will just be released a bit later on", you must know the answer.

It is difficult to prove something does not exist. Anyway, I will give it a try to prove such "later on" does not exist.

Function ttm_bo_kunmap is the only function to unmap. To prove this, search ttm_bo_map_iomap, only ttm_bo_kunmap use this enum to correctly kunmap.

Function ttm_bo_kunmap is not called by ttm itself. This is a hint that all TTM delay delete mechanism or unref mechanism will NOT kunmap BO later on.

Function ttm_bo_kunmap is called by AMDGPU function amdgpu_bo_kunmap and amdgpu_gem_prime_vunmap.

Search AMDGPU for amdgpu_bo_kunmap. All matches do not kunmap for scratch VRAM BO.  amdgpu_bo_free_kernel is a suspect but the answer is still NO.

So all possibilities are searched. Did I miss anything?

Thanks,
Alex Bin Xie

________________________________
From: Xie, AlexBin
Sent: Tuesday, April 18, 2017 2:04:33 PM
To: Christian König; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO


Hi Christian,


Would you point out where/when will kunmap happen for this BO when release? It must be somewhere in some function calls.


I checked before I asked for review. But I did not see such obvious kunmap function call.


If so, there should be a comment in function amdgpu_vram_scratch_fini to avoid future confusion.

Thanks,
Alex Bin Xie
________________________________
From: Christian König <deathsimple@vodafone.de><mailto:deathsimple@vodafone.de>
Sent: Tuesday, April 18, 2017 1:46 PM
To: Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Hi AlexBin,

No, David is right. This is a very common coding pattern in the kernel module.

Freeing up a BO while there still exists a kernel mapping is perfectly ok, the mapping will just be released a bit later on.

So this code is actually perfectly ok and just an optimization, but your patch breaks it and creates a memory leak.

Regards,
Christian.

Am 18.04.2017 um 17:17 schrieb Xie, AlexBin:

Hi David,


When amdgpu_bo_reserve return errors, we cannot release the BO. This is not a memory leak. General speaking, memory leak is unnoticed and unintentional.


The caller of function amdgpu_vram_scratch_fini ignores the return error value...


The "memory leak" is not caused by my patch. It is caused because reserving BO fails.


This patch only aim to make function amdgpu_vram_scratch_fini behave correctly.


To follow up, we can add a warning message when amdgpu_bo_reserve error happens in a different patch.


If function call amdgpu_bo_reserve is changed to uninterruptible, this changes driver behaviour. Without a substantial issue, I would be cautious for such a change.


Thanks,

Alex Bin Xie

________________________________
From: Zhou, David(ChunMing)
Sent: Monday, April 17, 2017 10:38 PM
To: Xie, AlexBin; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO



On 2017年04月17日 22:54, Xie, AlexBin wrote:

Hi David,


Thanks for the comments. However, please have look at amdgpu_bo_reserve definition.

static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool no_intr)

Ah, this is a wired wrapper for ttm_bo_reserve.



When we call this function like the following:

         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
The false means interruptible.


On the other hand,  when amdgpu_bo_reserve function return error, why do we unref BO without kunmap and unpin the BO? This is wrong implementation when amdgpu_bo_reserve return any error.

Yeah, I see your mean, it's in driver un-loading, How about changing to no interruptible? Your patch will make a memleak if bo_reserve fails, but it seems not matter. I have no strong preference.

Regards,
David Zhou


Thanks,
Alex Bin Xie

________________________________
From: Zhou, David(ChunMing)
Sent: Friday, April 14, 2017 12:00 AM
To: Xie, AlexBin; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO



On 2017年04月14日 05:34, Alex Xie wrote:
> According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
> can return with -ERESTARTSYS. When this function was interrupted
> by a signal, BO should not be unref. Otherwise the BO might be
> released while is kmapped and pinned, or BO MIGHT be deref
> multiple times, etc.
         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
we have specified interruptible to false, so -ERESTARTSYS isn't possible
here.

Thanks,
David Zhou
>
> Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
> Signed-off-by: Alex Xie <AlexBin.Xie@amd.com><mailto:AlexBin.Xie@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 53996e3..1dcc2d1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct amdgpu_device *adev)
>                amdgpu_bo_kunmap(adev->vram_scratch.robj);
>                amdgpu_bo_unpin(adev->vram_scratch.robj);
>                amdgpu_bo_unreserve(adev->vram_scratch.robj);
> +             amdgpu_bo_unref(&adev->vram_scratch.robj);
>        }
> -     amdgpu_bo_unref(&adev->vram_scratch.robj);
>   }
>
>   /**





_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://lists.freedesktop.org/mailman/listinfo/amd-gfx





_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://lists.freedesktop.org/mailman/listinfo/amd-gfx



[-- Attachment #1.2: Type: text/html, Size: 17886 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
       [not found]                                     ` <MWHPR1201MB0045EC51F6CE7C6C4283888CF2180-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-04-19 11:50                                       ` Christian König
       [not found]                                         ` <2b43e2a9-0c79-f945-c835-0cfc486304c8-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2017-04-19 11:50 UTC (permalink / raw)
  To: Xie, AlexBin, Zhou, David(ChunMing),
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 10654 bytes --]

> Without correctly kunmap, page table is corrupted. Page entries point 
> to wrong memory locations. You might call it completely harmless. But 
> I think this is a severe bug. Leaking memory is better than a 
> corrupted page table. Think security.
We are talking about the page tables for the vmalloc area in the kernel 
here, so no security problem. Leaking memory is much more problematic.

> Would you provide any document and reference by saying" It is 
> impossible to receive a signal during module load/unload"? For 
> example, if the unload stuck in a lock, can CTRL+C stop the unload?
>
No, CTRL+C doesn't abort module load/unload. There have been patches to 
changes this a while ago, but IIRC it broke a whole bunch of driver 
relying on this.

> What about there is some other return error? What about in future 
> somebody improve amdgpu_bo_reserve to return other errors, 
> then function amdgpu_vram_scratch_fini becomes buggy?
>
Yes, that is indeed an issue. For example -EDEADLK is possible as well. 
That's why I said we should use amdgpu_bo_free_kernel() instead.

> While I am thinking whether there is a better way for the current 
> situation, would you give a real world example that my patch really 
> not working? Then we can address it.
>
I don't think there is because the driver can't receive a signal during 
load/unload, but the problem is rather that the patch doesn't improve 
the situation at all.

Regards,
Christian.

Am 19.04.2017 um 13:37 schrieb Xie, AlexBin:
>
> Hi Christian,
>
>
> Without correctly kunmap, page table is corrupted. Page entries point 
> to wrong memory locations. You might call it completely harmless. But 
> I think this is a severe bug. Leaking memory is better than a 
> corrupted page table. Think security.
>
>
> Would you provide any document and reference by saying" It is 
> impossible to receive a signal during module load/unload"? For 
> example, if the unload stuck in a lock, can CTRL+C stop the unload?
>
>
> If "It is impossible to receive a signal during module load/unload", 
> interruptible waiting is fine too, because function amdgpu_bo_reserve 
> will return successfully.
>
>
> What about there is some other return error? What about in future 
> somebody improve amdgpu_bo_reserve to return other errors, 
> then function amdgpu_vram_scratch_fini becomes buggy?
>
>
> While I am thinking whether there is a better way for the current 
> situation, would you give a real world example that my patch really 
> not working? Then we can address it.
>
>
> Thanks,
>
> Alex Bin
>
>
> ------------------------------------------------------------------------
> *From:* Christian König <deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
> *Sent:* Wednesday, April 19, 2017 2:35 AM
> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
> Hi AlexBin,
>
> the answer is ttm_bo_kunmap isn't called at all and yes in the case of 
> an iomap we leak the address space reserved.
>
> But that is completely harmless on a 64bit system compared to leaking 
> the memory backing the address space.
>
> Using amdgpu_bo_free_kernel() instead of openly coding it here is 
> probably a good idea.
>
> Additional to that it's probably a good idea to set the no_intr flag 
> when reserving kernel BOs. It is impossible to receive a signal during 
> module load/unload, but it's probably better to document that in the 
> code as well.
>
> Regards,
> Christian.
>
> Am 18.04.2017 um 20:54 schrieb Xie, AlexBin:
>> Hi Christian,
>>
>> Have you found how/where/when? When you said "mapping will just be 
>> released a bit later on", you must know the answer.
>>
>> It is difficult to prove something does not exist. Anyway, I will 
>> give it a try to prove such "later on" does not exist.
>>
>> Function ttm_bo_kunmap is the only function to unmap. To prove this, 
>> search ttm_bo_map_iomap, only ttm_bo_kunmap use this enum to 
>> correctly kunmap.
>>
>> Function ttm_bo_kunmap is not called by ttm itself. This is a hint 
>> that all TTM delay delete mechanism or unref mechanism will NOT 
>> kunmap BO later on.
>>
>> Function ttm_bo_kunmap is called by AMDGPU function amdgpu_bo_kunmap 
>> and amdgpu_gem_prime_vunmap.
>>
>> Search AMDGPU for amdgpu_bo_kunmap. All matches do not kunmap for 
>> scratch VRAM BO. amdgpu_bo_free_kernel is a suspect but the answer is 
>> still NO.
>>
>> So all possibilities are searched. Did I miss anything?
>>
>> Thanks,
>> Alex Bin Xie
>>
>> ------------------------------------------------------------------------
>> *From:* Xie, AlexBin
>> *Sent:* Tuesday, April 18, 2017 2:04:33 PM
>> *To:* Christian König; Zhou, David(ChunMing); 
>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>
>> Hi Christian,
>>
>>
>> Would you point out where/when will kunmap happen for this BO when 
>> release? It must be somewhere in some function calls.
>>
>>
>> I checked before I asked for review. But I did not see such obvious 
>> kunmap function call.
>>
>>
>> If so, there should be a comment in function amdgpu_vram_scratch_fini 
>> to avoid future confusion.
>>
>>
>> Thanks,
>> Alex Bin Xie
>> ------------------------------------------------------------------------
>> *From:* Christian König <deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
>> *Sent:* Tuesday, April 18, 2017 1:46 PM
>> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>> Hi AlexBin,
>>
>> No, David is right. This is a very common coding pattern in the 
>> kernel module.
>>
>> Freeing up a BO while there still exists a kernel mapping is 
>> perfectly ok, the mapping will just be released a bit later on.
>>
>> So this code is actually perfectly ok and just an optimization, but 
>> your patch breaks it and creates a memory leak.
>>
>> Regards,
>> Christian.
>>
>> Am 18.04.2017 um 17:17 schrieb Xie, AlexBin:
>>>
>>> Hi David,
>>>
>>>
>>> When amdgpu_bo_reserve return errors, we cannot release the BO. This 
>>> is not a memory leak. General speaking, memory leak is unnoticed and 
>>> unintentional.
>>>
>>>
>>> The caller of function amdgpu_vram_scratch_fini ignores the return 
>>> error value...
>>>
>>>
>>> The "memory leak" is not caused by my patch. It is caused because 
>>> reserving BO fails.
>>>
>>>
>>> This patch only aim to make function amdgpu_vram_scratch_fini behave 
>>> correctly.
>>>
>>>
>>> To follow up, we can add a warning message when amdgpu_bo_reserve 
>>> error happens in a different patch.
>>>
>>>
>>> If function call amdgpu_bo_reserve is changed to uninterruptible, 
>>> this changes driver behaviour. Without a substantial issue, I would 
>>> be cautious for such a change.
>>>
>>>
>>> Thanks,
>>>
>>> Alex Bin Xie
>>>
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Zhou, David(ChunMing)
>>> *Sent:* Monday, April 17, 2017 10:38 PM
>>> *To:* Xie, AlexBin; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>
>>>
>>> On 2017年04月17日 22:54, Xie, AlexBin wrote:
>>>>
>>>> Hi David,
>>>>
>>>>
>>>> Thanks for the comments. However, please have look at 
>>>> amdgpu_bo_reserve definition.
>>>>
>>>> static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool no_intr)
>>>>
>>> Ah, this is a wired wrapper for ttm_bo_reserve.
>>>
>>>>
>>>> When we call this function like the following:
>>>>
>>>>          r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>>> The false means interruptible.
>>>>
>>>>
>>>> On the other hand,  when amdgpu_bo_reserve function return error, 
>>>> why do we unref BO without kunmap and unpin the BO? This is wrong 
>>>> implementation when amdgpu_bo_reserve return any error.
>>>>
>>> Yeah, I see your mean, it's in driver un-loading, How about changing 
>>> to no interruptible? Your patch will make a memleak if bo_reserve 
>>> fails, but it seems not matter. I have no strong preference.
>>>
>>> Regards,
>>> David Zhou
>>>>
>>>>
>>>> Thanks,
>>>> Alex Bin Xie
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Zhou, David(ChunMing)
>>>> *Sent:* Friday, April 14, 2017 12:00 AM
>>>> *To:* Xie, AlexBin; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>
>>>>
>>>> On 2017年04月14日 05:34, Alex Xie wrote:
>>>> > According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
>>>> > can return with -ERESTARTSYS. When this function was interrupted
>>>> > by a signal, BO should not be unref. Otherwise the BO might be
>>>> > released while is kmapped and pinned, or BO MIGHT be deref
>>>> > multiple times, etc.
>>>>          r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>>> we have specified interruptible to false, so -ERESTARTSYS isn't 
>>>> possible
>>>> here.
>>>>
>>>> Thanks,
>>>> David Zhou
>>>> >
>>>> > Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
>>>> > Signed-off-by: Alex Xie <AlexBin.Xie-5C7GfCeVMHo@public.gmane.org>
>>>> > ---
>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>> >   1 file changed, 1 insertion(+), 1 deletion(-)
>>>> >
>>>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> > index 53996e3..1dcc2d1 100644
>>>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> > @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct 
>>>> amdgpu_device *adev)
>>>> > amdgpu_bo_kunmap(adev->vram_scratch.robj);
>>>> > amdgpu_bo_unpin(adev->vram_scratch.robj);
>>>> > amdgpu_bo_unreserve(adev->vram_scratch.robj);
>>>> > + amdgpu_bo_unref(&adev->vram_scratch.robj);
>>>> >        }
>>>> > - amdgpu_bo_unref(&adev->vram_scratch.robj);
>>>> >   }
>>>> >
>>>> >   /**
>>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>>
>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx



[-- Attachment #1.2: Type: text/html, Size: 35268 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
       [not found]                                         ` <2b43e2a9-0c79-f945-c835-0cfc486304c8-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2017-04-19 19:30                                           ` Xie, AlexBin
       [not found]                                             ` <MWHPR1201MB0045B04F77E1F1C6D37D24FBF2180-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Xie, AlexBin @ 2017-04-19 19:30 UTC (permalink / raw)
  To: Christian König, Zhou, David(ChunMing),
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 11261 bytes --]

Hi Christian,


Missing kunmap mapping in vmalloc will make kernel master page table incorrect. I would not call such issue as completely harmless. Please note that AMD graphic driver can run in 32 bit system. In 32 bit system, vmalloc address space is much smaller than size of most GPU VRAM.


amdgpu_bo_free_kernel has same issue as amdgpu_vram_scratch_fini. 1. It calls amdgpu_bo_reserve interruptible too. 2. It misses kunmap when amdgpu_bo_reserve returns error too. As result, kernel master page table can become incorrect, or as you call it "completely harmless vmalloc space leaking".


Because amdgpu_bo_free_kernel is used in more places, such as psp command submission, there will be bigger chance to have other usage where signal is not blocked. This will become a real bug.


I am thinking that we may fix the issue completely when TTM releases BO. But that is a bigger change.


It is good that you agree that there is no real world bad example caused by my patch. I will not discuss whether it is an improvement or not now to save time for both of us.


Thanks,

Alex Bin Xie


________________________________
From: Christian König <deathsimple@vodafone.de>
Sent: Wednesday, April 19, 2017 7:50 AM
To: Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Without correctly kunmap, page table is corrupted. Page entries point to wrong memory locations. You might call it completely harmless. But I think this is a severe bug. Leaking memory is better than a corrupted page table. Think security.
We are talking about the page tables for the vmalloc area in the kernel here, so no security problem. Leaking memory is much more problematic.


Would you provide any document and reference by saying" It is impossible to receive a signal during module load/unload"? For example, if the unload stuck in a lock, can CTRL+C stop the unload?

No, CTRL+C doesn't abort module load/unload. There have been patches to changes this a while ago, but IIRC it broke a whole bunch of driver relying on this.


What about there is some other return error? What about in future somebody improve amdgpu_bo_reserve to return other errors, then function amdgpu_vram_scratch_fini becomes buggy?

Yes, that is indeed an issue. For example -EDEADLK is possible as well. That's why I said we should use amdgpu_bo_free_kernel() instead.


While I am thinking whether there is a better way for the current situation, would you give a real world example that my patch really not working? Then we can address it.

I don't think there is because the driver can't receive a signal during load/unload, but the problem is rather that the patch doesn't improve the situation at all.

Regards,
Christian.

Am 19.04.2017 um 13:37 schrieb Xie, AlexBin:

Hi Christian,


Without correctly kunmap, page table is corrupted. Page entries point to wrong memory locations. You might call it completely harmless. But I think this is a severe bug. Leaking memory is better than a corrupted page table. Think security.


Would you provide any document and reference by saying" It is impossible to receive a signal during module load/unload"? For example, if the unload stuck in a lock, can CTRL+C stop the unload?


If "It is impossible to receive a signal during module load/unload", interruptible waiting is fine too, because function amdgpu_bo_reserve will return successfully.


What about there is some other return error? What about in future somebody improve amdgpu_bo_reserve to return other errors, then function amdgpu_vram_scratch_fini becomes buggy?


While I am thinking whether there is a better way for the current situation, would you give a real world example that my patch really not working? Then we can address it.


Thanks,

Alex Bin

________________________________
From: Christian König <deathsimple@vodafone.de><mailto:deathsimple@vodafone.de>
Sent: Wednesday, April 19, 2017 2:35 AM
To: Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Hi AlexBin,

the answer is ttm_bo_kunmap isn't called at all and yes in the case of an iomap we leak the address space reserved.

But that is completely harmless on a 64bit system compared to leaking the memory backing the address space.

Using amdgpu_bo_free_kernel() instead of openly coding it here is probably a good idea.

Additional to that it's probably a good idea to set the no_intr flag when reserving kernel BOs. It is impossible to receive a signal during module load/unload, but it's probably better to document that in the code as well.

Regards,
Christian.

Am 18.04.2017 um 20:54 schrieb Xie, AlexBin:
Hi Christian,

Have you found how/where/when? When you said "mapping will just be released a bit later on", you must know the answer.

It is difficult to prove something does not exist. Anyway, I will give it a try to prove such "later on" does not exist.

Function ttm_bo_kunmap is the only function to unmap. To prove this, search ttm_bo_map_iomap, only ttm_bo_kunmap use this enum to correctly kunmap.

Function ttm_bo_kunmap is not called by ttm itself. This is a hint that all TTM delay delete mechanism or unref mechanism will NOT kunmap BO later on.

Function ttm_bo_kunmap is called by AMDGPU function amdgpu_bo_kunmap and amdgpu_gem_prime_vunmap.

Search AMDGPU for amdgpu_bo_kunmap. All matches do not kunmap for scratch VRAM BO.  amdgpu_bo_free_kernel is a suspect but the answer is still NO.

So all possibilities are searched. Did I miss anything?

Thanks,
Alex Bin Xie

________________________________
From: Xie, AlexBin
Sent: Tuesday, April 18, 2017 2:04:33 PM
To: Christian König; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO


Hi Christian,


Would you point out where/when will kunmap happen for this BO when release? It must be somewhere in some function calls.


I checked before I asked for review. But I did not see such obvious kunmap function call.


If so, there should be a comment in function amdgpu_vram_scratch_fini to avoid future confusion.

Thanks,
Alex Bin Xie
________________________________
From: Christian König <deathsimple@vodafone.de><mailto:deathsimple@vodafone.de>
Sent: Tuesday, April 18, 2017 1:46 PM
To: Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Hi AlexBin,

No, David is right. This is a very common coding pattern in the kernel module.

Freeing up a BO while there still exists a kernel mapping is perfectly ok, the mapping will just be released a bit later on.

So this code is actually perfectly ok and just an optimization, but your patch breaks it and creates a memory leak.

Regards,
Christian.

Am 18.04.2017 um 17:17 schrieb Xie, AlexBin:

Hi David,


When amdgpu_bo_reserve return errors, we cannot release the BO. This is not a memory leak. General speaking, memory leak is unnoticed and unintentional.


The caller of function amdgpu_vram_scratch_fini ignores the return error value...


The "memory leak" is not caused by my patch. It is caused because reserving BO fails.


This patch only aim to make function amdgpu_vram_scratch_fini behave correctly.


To follow up, we can add a warning message when amdgpu_bo_reserve error happens in a different patch.


If function call amdgpu_bo_reserve is changed to uninterruptible, this changes driver behaviour. Without a substantial issue, I would be cautious for such a change.


Thanks,

Alex Bin Xie

________________________________
From: Zhou, David(ChunMing)
Sent: Monday, April 17, 2017 10:38 PM
To: Xie, AlexBin; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO



On 2017年04月17日 22:54, Xie, AlexBin wrote:

Hi David,


Thanks for the comments. However, please have look at amdgpu_bo_reserve definition.

static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool no_intr)

Ah, this is a wired wrapper for ttm_bo_reserve.



When we call this function like the following:

         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
The false means interruptible.


On the other hand,  when amdgpu_bo_reserve function return error, why do we unref BO without kunmap and unpin the BO? This is wrong implementation when amdgpu_bo_reserve return any error.

Yeah, I see your mean, it's in driver un-loading, How about changing to no interruptible? Your patch will make a memleak if bo_reserve fails, but it seems not matter. I have no strong preference.

Regards,
David Zhou


Thanks,
Alex Bin Xie

________________________________
From: Zhou, David(ChunMing)
Sent: Friday, April 14, 2017 12:00 AM
To: Xie, AlexBin; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO



On 2017年04月14日 05:34, Alex Xie wrote:
> According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
> can return with -ERESTARTSYS. When this function was interrupted
> by a signal, BO should not be unref. Otherwise the BO might be
> released while is kmapped and pinned, or BO MIGHT be deref
> multiple times, etc.
         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
we have specified interruptible to false, so -ERESTARTSYS isn't possible
here.

Thanks,
David Zhou
>
> Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
> Signed-off-by: Alex Xie <AlexBin.Xie@amd.com><mailto:AlexBin.Xie@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 53996e3..1dcc2d1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct amdgpu_device *adev)
>                amdgpu_bo_kunmap(adev->vram_scratch.robj);
>                amdgpu_bo_unpin(adev->vram_scratch.robj);
>                amdgpu_bo_unreserve(adev->vram_scratch.robj);
> +             amdgpu_bo_unref(&adev->vram_scratch.robj);
>        }
> -     amdgpu_bo_unref(&adev->vram_scratch.robj);
>   }
>
>   /**





_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://lists.freedesktop.org/mailman/listinfo/amd-gfx





_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://lists.freedesktop.org/mailman/listinfo/amd-gfx





_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://lists.freedesktop.org/mailman/listinfo/amd-gfx



[-- Attachment #1.2: Type: text/html, Size: 22839 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
       [not found]                                             ` <MWHPR1201MB0045B04F77E1F1C6D37D24FBF2180-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-04-20  8:43                                               ` Christian König
       [not found]                                                 ` <96809715-3c78-3784-fbeb-57a00359e3f3-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2017-04-20  8:43 UTC (permalink / raw)
  To: Xie, AlexBin, Zhou, David(ChunMing),
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 13403 bytes --]

Hi AlexBin,

> Missing kunmap mapping in vmalloc will make kernel master page table 
> incorrect.
That's what I tried to explain yesterday, but unfortunately didn't had 
time to do so. There is not corruption of the kernel master page table 
in this case!

The call of ttm_bo_kunmap is completely optional, take a look at 
amdgpu_ttm_io_mem_reserve() and amdgpu_ttm_io_mem_free().

The aperture is kept mapped into the page tables for the whole time the 
driver is loaded. So this is a complete no-op and only done for consistency.

> It is good that you agree that there is no real world bad example 
> caused by my patch. I will not discuss whether it is an improvement or 
> not now to save time for both of us.
>
Great at least we can now agree to completely drop this patch.

Thanks,
Christian.

Am 19.04.2017 um 21:30 schrieb Xie, AlexBin:
>
> Hi Christian,
>
>
> Missing kunmap mapping in vmalloc will make kernel master page table 
> incorrect. I would not call such issue as completely harmless. Please 
> note that AMD graphic driver can run in 32 bit system. In 32 bit 
> system, vmalloc address space is much smaller than size of most GPU VRAM.
>
>
> amdgpu_bo_free_kernel has same issue as amdgpu_vram_scratch_fini. 1. 
> It calls amdgpu_bo_reserve interruptible too. 2. It misses kunmap when 
> amdgpu_bo_reserve returns error too. As result, kernel master page 
> table can become incorrect, or as you call it "completely harmless 
> vmalloc space leaking".
>
>
> Because amdgpu_bo_free_kernel is used in more places, such as psp 
> command submission, there will be bigger chance to have other usage 
> where signal is not blocked. This will become a real bug.
>
>
> I am thinking that we may fix the issue completely when TTM releases 
> BO. But that is a bigger change.
>
>
> It is good that you agree that there is no real world bad example 
> caused by my patch. I will not discuss whether it is an improvement or 
> not now to save time for both of us.
>
>
> Thanks,
>
> Alex Bin Xie
>
>
> ------------------------------------------------------------------------
> *From:* Christian König <deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
> *Sent:* Wednesday, April 19, 2017 7:50 AM
> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>> Without correctly kunmap, page table is corrupted. Page entries point 
>> to wrong memory locations. You might call it completely harmless. But 
>> I think this is a severe bug. Leaking memory is better than a 
>> corrupted page table. Think security.
> We are talking about the page tables for the vmalloc area in the 
> kernel here, so no security problem. Leaking memory is much more 
> problematic.
>
>> Would you provide any document and reference by saying" It is 
>> impossible to receive a signal during module load/unload"? For 
>> example, if the unload stuck in a lock, can CTRL+C stop the unload?
>>
> No, CTRL+C doesn't abort module load/unload. There have been patches 
> to changes this a while ago, but IIRC it broke a whole bunch of driver 
> relying on this.
>
>> What about there is some other return error? What about in future 
>> somebody improve amdgpu_bo_reserve to return other errors, 
>> then function amdgpu_vram_scratch_fini becomes buggy?
>>
> Yes, that is indeed an issue. For example -EDEADLK is possible as 
> well. That's why I said we should use amdgpu_bo_free_kernel() instead.
>
>> While I am thinking whether there is a better way for the current 
>> situation, would you give a real world example that my patch really 
>> not working? Then we can address it.
>>
> I don't think there is because the driver can't receive a signal 
> during load/unload, but the problem is rather that the patch doesn't 
> improve the situation at all.
>
> Regards,
> Christian.
>
> Am 19.04.2017 um 13:37 schrieb Xie, AlexBin:
>>
>> Hi Christian,
>>
>>
>> Without correctly kunmap, page table is corrupted. Page entries point 
>> to wrong memory locations. You might call it completely harmless. But 
>> I think this is a severe bug. Leaking memory is better than a 
>> corrupted page table. Think security.
>>
>>
>> Would you provide any document and reference by saying" It is 
>> impossible to receive a signal during module load/unload"? For 
>> example, if the unload stuck in a lock, can CTRL+C stop the unload?
>>
>>
>> If "It is impossible to receive a signal during module load/unload", 
>> interruptible waiting is fine too, because function amdgpu_bo_reserve 
>> will return successfully.
>>
>>
>> What about there is some other return error? What about in future 
>> somebody improve amdgpu_bo_reserve to return other errors, 
>> then function amdgpu_vram_scratch_fini becomes buggy?
>>
>>
>> While I am thinking whether there is a better way for the current 
>> situation, would you give a real world example that my patch really 
>> not working? Then we can address it.
>>
>>
>> Thanks,
>>
>> Alex Bin
>>
>>
>> ------------------------------------------------------------------------
>> *From:* Christian König <deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
>> *Sent:* Wednesday, April 19, 2017 2:35 AM
>> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>> Hi AlexBin,
>>
>> the answer is ttm_bo_kunmap isn't called at all and yes in the case 
>> of an iomap we leak the address space reserved.
>>
>> But that is completely harmless on a 64bit system compared to leaking 
>> the memory backing the address space.
>>
>> Using amdgpu_bo_free_kernel() instead of openly coding it here is 
>> probably a good idea.
>>
>> Additional to that it's probably a good idea to set the no_intr flag 
>> when reserving kernel BOs. It is impossible to receive a signal 
>> during module load/unload, but it's probably better to document that 
>> in the code as well.
>>
>> Regards,
>> Christian.
>>
>> Am 18.04.2017 um 20:54 schrieb Xie, AlexBin:
>>> Hi Christian,
>>>
>>> Have you found how/where/when? When you said "mapping will just be 
>>> released a bit later on", you must know the answer.
>>>
>>> It is difficult to prove something does not exist. Anyway, I will 
>>> give it a try to prove such "later on" does not exist.
>>>
>>> Function ttm_bo_kunmap is the only function to unmap. To prove this, 
>>> search ttm_bo_map_iomap, only ttm_bo_kunmap use this enum to 
>>> correctly kunmap.
>>>
>>> Function ttm_bo_kunmap is not called by ttm itself. This is a hint 
>>> that all TTM delay delete mechanism or unref mechanism will NOT 
>>> kunmap BO later on.
>>>
>>> Function ttm_bo_kunmap is called by AMDGPU function amdgpu_bo_kunmap 
>>> and amdgpu_gem_prime_vunmap.
>>>
>>> Search AMDGPU for amdgpu_bo_kunmap. All matches do not kunmap for 
>>> scratch VRAM BO. amdgpu_bo_free_kernel is a suspect but the answer 
>>> is still NO.
>>>
>>> So all possibilities are searched. Did I miss anything?
>>>
>>> Thanks,
>>> Alex Bin Xie
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Xie, AlexBin
>>> *Sent:* Tuesday, April 18, 2017 2:04:33 PM
>>> *To:* Christian König; Zhou, David(ChunMing); 
>>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>
>>> Hi Christian,
>>>
>>>
>>> Would you point out where/when will kunmap happen for this BO when 
>>> release? It must be somewhere in some function calls.
>>>
>>>
>>> I checked before I asked for review. But I did not see such obvious 
>>> kunmap function call.
>>>
>>>
>>> If so, there should be a comment in function 
>>> amdgpu_vram_scratch_fini to avoid future confusion.
>>>
>>>
>>> Thanks,
>>> Alex Bin Xie
>>> ------------------------------------------------------------------------
>>> *From:* Christian König <deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
>>> *Sent:* Tuesday, April 18, 2017 1:46 PM
>>> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>> Hi AlexBin,
>>>
>>> No, David is right. This is a very common coding pattern in the 
>>> kernel module.
>>>
>>> Freeing up a BO while there still exists a kernel mapping is 
>>> perfectly ok, the mapping will just be released a bit later on.
>>>
>>> So this code is actually perfectly ok and just an optimization, but 
>>> your patch breaks it and creates a memory leak.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 18.04.2017 um 17:17 schrieb Xie, AlexBin:
>>>>
>>>> Hi David,
>>>>
>>>>
>>>> When amdgpu_bo_reserve return errors, we cannot release the BO. 
>>>> This is not a memory leak. General speaking, memory leak 
>>>> is unnoticed and unintentional.
>>>>
>>>>
>>>> The caller of function amdgpu_vram_scratch_fini ignores the return 
>>>> error value...
>>>>
>>>>
>>>> The "memory leak" is not caused by my patch. It is caused because 
>>>> reserving BO fails.
>>>>
>>>>
>>>> This patch only aim to make function amdgpu_vram_scratch_fini 
>>>> behave correctly.
>>>>
>>>>
>>>> To follow up, we can add a warning message when amdgpu_bo_reserve 
>>>> error happens in a different patch.
>>>>
>>>>
>>>> If function call amdgpu_bo_reserve is changed to uninterruptible, 
>>>> this changes driver behaviour. Without a substantial issue, I would 
>>>> be cautious for such a change.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Alex Bin Xie
>>>>
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Zhou, David(ChunMing)
>>>> *Sent:* Monday, April 17, 2017 10:38 PM
>>>> *To:* Xie, AlexBin; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>
>>>>
>>>> On 2017年04月17日 22:54, Xie, AlexBin wrote:
>>>>>
>>>>> Hi David,
>>>>>
>>>>>
>>>>> Thanks for the comments. However, please have look at 
>>>>> amdgpu_bo_reserve definition.
>>>>>
>>>>> static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool 
>>>>> no_intr)
>>>>>
>>>> Ah, this is a wired wrapper for ttm_bo_reserve.
>>>>
>>>>>
>>>>> When we call this function like the following:
>>>>>
>>>>>      r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>>>> The false means interruptible.
>>>>>
>>>>>
>>>>> On the other hand,  when amdgpu_bo_reserve function return error, 
>>>>> why do we unref BO without kunmap and unpin the BO? This is wrong 
>>>>> implementation when amdgpu_bo_reserve return any error.
>>>>>
>>>> Yeah, I see your mean, it's in driver un-loading, How about 
>>>> changing to no interruptible? Your patch will make a memleak if 
>>>> bo_reserve fails, but it seems not matter. I have no strong preference.
>>>>
>>>> Regards,
>>>> David Zhou
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Alex Bin Xie
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> *From:* Zhou, David(ChunMing)
>>>>> *Sent:* Friday, April 14, 2017 12:00 AM
>>>>> *To:* Xie, AlexBin; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>>
>>>>>
>>>>> On 2017年04月14日 05:34, Alex Xie wrote:
>>>>> > According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
>>>>> > can return with -ERESTARTSYS. When this function was interrupted
>>>>> > by a signal, BO should not be unref. Otherwise the BO might be
>>>>> > released while is kmapped and pinned, or BO MIGHT be deref
>>>>> > multiple times, etc.
>>>>>          r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>>>> we have specified interruptible to false, so -ERESTARTSYS isn't 
>>>>> possible
>>>>> here.
>>>>>
>>>>> Thanks,
>>>>> David Zhou
>>>>> >
>>>>> > Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
>>>>> > Signed-off-by: Alex Xie <AlexBin.Xie-5C7GfCeVMHo@public.gmane.org>
>>>>> > ---
>>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>>> >   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>> >
>>>>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> > index 53996e3..1dcc2d1 100644
>>>>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> > @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct 
>>>>> amdgpu_device *adev)
>>>>> >                amdgpu_bo_kunmap(adev->vram_scratch.robj);
>>>>> >                amdgpu_bo_unpin(adev->vram_scratch.robj);
>>>>> >                amdgpu_bo_unreserve(adev->vram_scratch.robj);
>>>>> > + amdgpu_bo_unref(&adev->vram_scratch.robj);
>>>>> >        }
>>>>> > - amdgpu_bo_unref(&adev->vram_scratch.robj);
>>>>> >   }
>>>>> >
>>>>> >   /**
>>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>>
>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>


[-- Attachment #1.2: Type: text/html, Size: 46494 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
       [not found]                                                 ` <96809715-3c78-3784-fbeb-57a00359e3f3-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2017-04-20 21:56                                                   ` Xie, AlexBin
       [not found]                                                     ` <MWHPR1201MB004568A33CBF287710759301F21B0-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Xie, AlexBin @ 2017-04-20 21:56 UTC (permalink / raw)
  To: Christian König, Zhou, David(ChunMing),
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 19254 bytes --]

Hi Christian,


I read amdgpu_ttm_io_mem_reserve() and amdgpu_ttm_io_mem_free() and relevant codes from amdgpu_vram_scratch_init and amdgpu_vram_scratch_fini.


No. For the current source code, I think the premap and no-op is not working.


I add printk to prove. You can see bo_kmap_type is an invalid value. ioremap_wc is really called to map the IO range into vmalloc space.


...

Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759623] entering amdgpu_vram_scratch_init.
Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759631] scratch ioremap_wc
Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759631] bo_kmap_type = 129
Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759632] bus address =           (null)
Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759632] is_iomem = 1
Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759633] leaving amdgpu_vram_scratch_init.
...

I don't have log on rmmod AMDGPU yet. There are quite some settings to make that happen in my computer.
I think they are symemtric. Both mapping and unmapping are not no-op.

Here is the printk patch for your reference.

From 25f95239c23225008e4b59f30b9b5363fb321f94 Mon Sep 17 00:00:00 2001
From: Alex Xie <AlexBin.Xie@amd.com>
Date: Thu, 20 Apr 2017 17:48:49 -0400
Subject: [PATCH] A hack to trace premap and noop.

Change-Id: I61fbbdbd82f62433e229b2e7e97be7a780ea8afa
Signed-off-by: Alex Xie <AlexBin.Xie@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++++
 drivers/gpu/drm/ttm/ttm_bo.c               |  1 +
 drivers/gpu/drm/ttm/ttm_bo_util.c          | 29 ++++++++++++++++++++++++++---
 include/drm/ttm/ttm_bo_api.h               |  1 +
 4 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index fbb4afb..a537ea1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -313,6 +313,7 @@ static void amdgpu_block_invalid_wreg(struct amdgpu_device *adev,
 static int amdgpu_vram_scratch_init(struct amdgpu_device *adev)
 {
  int r;
+ printk("entering amdgpu_vram_scratch_init.");

  if (adev->vram_scratch.robj == NULL) {
  r = amdgpu_bo_create(adev, AMDGPU_GPU_PAGE_SIZE,
@@ -340,16 +341,36 @@ static int amdgpu_vram_scratch_init(struct amdgpu_device *adev)
  amdgpu_bo_unpin(adev->vram_scratch.robj);
  amdgpu_bo_unreserve(adev->vram_scratch.robj);

+ /* Would like a printk to see if map / unmap is noop*/
+ adev->vram_scratch.robj->tbo.mem.bus.printk = true;
+
+ if (adev->vram_scratch.robj->kmap.bo_kmap_type == ttm_bo_map_premapped)
+ printk("amdgpu_vram_scratch premapped!");
+
+ printk("bo_kmap_type = %d", adev->vram_scratch.robj->kmap.bo_kmap_type);
+ printk("bus address = %p", adev->vram_scratch.robj->tbo.mem.bus.addr);
+ printk("is_iomem = %d", adev->vram_scratch.robj->tbo.mem.bus.is_iomem);
+ printk("leaving amdgpu_vram_scratch_init.");
+
  return r;
 }

 static void amdgpu_vram_scratch_fini(struct amdgpu_device *adev)
 {
  int r;
+ printk("entering amdgpu_vram_scratch_fini.");

  if (adev->vram_scratch.robj == NULL) {
  return;
  }
+
+ if (adev->vram_scratch.robj->kmap.bo_kmap_type == ttm_bo_map_premapped)
+ printk("amdgpu_vram_scratch premapped!");
+
+ printk("bo_kmap_type = %d", adev->vram_scratch.robj->kmap.bo_kmap_type);
+ printk("bus address = %p", adev->vram_scratch.robj->tbo.mem.bus.addr);
+ printk("is_iomem = %d", adev->vram_scratch.robj->tbo.mem.bus.is_iomem);
+
  r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
  if (likely(r == 0)) {
  amdgpu_bo_kunmap(adev->vram_scratch.robj);
@@ -357,6 +378,7 @@ static void amdgpu_vram_scratch_fini(struct amdgpu_device *adev)
  amdgpu_bo_unreserve(adev->vram_scratch.robj);
  }
  amdgpu_bo_unref(&adev->vram_scratch.robj);
+ printk("leaving amdgpu_vram_scratch_fini.");
 }

 /**
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 989b98b..9b05d54 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -1178,6 +1178,7 @@ int ttm_bo_init_reserved(struct ttm_bo_device *bdev,
  bo->mem.page_alignment = page_alignment;
  bo->mem.bus.io_reserved_vm = false;
  bo->mem.bus.io_reserved_count = 0;
+ bo->mem.bus.printk = false;
  bo->moving = NULL;
  bo->mem.placement = (TTM_PL_FLAG_SYSTEM | TTM_PL_FLAG_CACHED);
  bo->persistent_swap_storage = persistent_swap_storage;
diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index bf6e216..9d06952 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -526,14 +526,24 @@ static int ttm_bo_ioremap(struct ttm_buffer_object *bo,
  if (bo->mem.bus.addr) {
  map->bo_kmap_type = ttm_bo_map_premapped;
  map->virtual = (void *)(((u8 *)bo->mem.bus.addr) + offset);
+ if (bo->mem.bus.printk)
+ printk ("scratch premapping");
+
  } else {
  map->bo_kmap_type = ttm_bo_map_iomap;
- if (mem->placement & TTM_PL_FLAG_WC)
+ if (mem->placement & TTM_PL_FLAG_WC) {
  map->virtual = ioremap_wc(bo->mem.bus.base + bo->mem.bus.offset + offset,
   size);
- else
+ if (bo->mem.bus.printk)
+ printk ("scratch ioremap_wc");
+
+ }
+ else {
  map->virtual = ioremap_nocache(bo->mem.bus.base + bo->mem.bus.offset + offset,
        size);
+ if (bo->mem.bus.printk)
+ printk ("scratch ioremap_nocache");
+ }
  }
  return (!map->virtual) ? -ENOMEM : 0;
 }
@@ -618,21 +628,34 @@ void ttm_bo_kunmap(struct ttm_bo_kmap_obj *map)
  struct ttm_mem_type_manager *man =
  &bo->bdev->man[bo->mem.mem_type];

- if (!map->virtual)
+ if (!map->virtual) {
+ if (bo->mem.bus.printk)
+ printk ("scratch unmap return earlier");
  return;
+ }
  switch (map->bo_kmap_type) {
  case ttm_bo_map_iomap:
+ if (bo->mem.bus.printk)
+ printk ("scratch iounmap");
  iounmap(map->virtual);
  break;
  case ttm_bo_map_vmap:
+ if (bo->mem.bus.printk)
+ printk ("scratch vunmap");
  vunmap(map->virtual);
  break;
  case ttm_bo_map_kmap:
+ if (bo->mem.bus.printk)
+ printk ("scratch kunmap");
  kunmap(map->page);
  break;
  case ttm_bo_map_premapped:
+ if (bo->mem.bus.printk)
+ printk ("scratch unmap ttm_bo_map_premapped");
  break;
  default:
+ if (bo->mem.bus.printk)
+ printk ("scratch unmap bug");
  BUG();
  }
  (void) ttm_mem_io_lock(man, false);
diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h
index 2d0f63e..f74a554 100644
--- a/include/drm/ttm/ttm_bo_api.h
+++ b/include/drm/ttm/ttm_bo_api.h
@@ -70,6 +70,7 @@ struct ttm_bus_placement {
  bool is_iomem;
  bool io_reserved_vm;
  uint64_t        io_reserved_count;
+ bool            printk;
 };


--
2.7.4




Thanks,

Alex Bin

________________________________
From: Christian König <deathsimple@vodafone.de>
Sent: Thursday, April 20, 2017 4:43 AM
To: Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Hi AlexBin,

Missing kunmap mapping in vmalloc will make kernel master page table incorrect.
That's what I tried to explain yesterday, but unfortunately didn't had time to do so. There is not corruption of the kernel master page table in this case!

The call of ttm_bo_kunmap is completely optional, take a look at amdgpu_ttm_io_mem_reserve() and amdgpu_ttm_io_mem_free().

The aperture is kept mapped into the page tables for the whole time the driver is loaded. So this is a complete no-op and only done for consistency.


It is good that you agree that there is no real world bad example caused by my patch. I will not discuss whether it is an improvement or not now to save time for both of us.

Great at least we can now agree to completely drop this patch.

Thanks,
Christian.

Am 19.04.2017 um 21:30 schrieb Xie, AlexBin:

Hi Christian,


Missing kunmap mapping in vmalloc will make kernel master page table incorrect. I would not call such issue as completely harmless. Please note that AMD graphic driver can run in 32 bit system. In 32 bit system, vmalloc address space is much smaller than size of most GPU VRAM.


amdgpu_bo_free_kernel has same issue as amdgpu_vram_scratch_fini. 1. It calls amdgpu_bo_reserve interruptible too. 2. It misses kunmap when amdgpu_bo_reserve returns error too. As result, kernel master page table can become incorrect, or as you call it "completely harmless vmalloc space leaking".


Because amdgpu_bo_free_kernel is used in more places, such as psp command submission, there will be bigger chance to have other usage where signal is not blocked. This will become a real bug.


I am thinking that we may fix the issue completely when TTM releases BO. But that is a bigger change.


It is good that you agree that there is no real world bad example caused by my patch. I will not discuss whether it is an improvement or not now to save time for both of us.


Thanks,

Alex Bin Xie


________________________________
From: Christian König <deathsimple@vodafone.de><mailto:deathsimple@vodafone.de>
Sent: Wednesday, April 19, 2017 7:50 AM
To: Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Without correctly kunmap, page table is corrupted. Page entries point to wrong memory locations. You might call it completely harmless. But I think this is a severe bug. Leaking memory is better than a corrupted page table. Think security.
We are talking about the page tables for the vmalloc area in the kernel here, so no security problem. Leaking memory is much more problematic.


Would you provide any document and reference by saying" It is impossible to receive a signal during module load/unload"? For example, if the unload stuck in a lock, can CTRL+C stop the unload?

No, CTRL+C doesn't abort module load/unload. There have been patches to changes this a while ago, but IIRC it broke a whole bunch of driver relying on this.


What about there is some other return error? What about in future somebody improve amdgpu_bo_reserve to return other errors, then function amdgpu_vram_scratch_fini becomes buggy?

Yes, that is indeed an issue. For example -EDEADLK is possible as well. That's why I said we should use amdgpu_bo_free_kernel() instead.


While I am thinking whether there is a better way for the current situation, would you give a real world example that my patch really not working? Then we can address it.

I don't think there is because the driver can't receive a signal during load/unload, but the problem is rather that the patch doesn't improve the situation at all.

Regards,
Christian.

Am 19.04.2017 um 13:37 schrieb Xie, AlexBin:

Hi Christian,


Without correctly kunmap, page table is corrupted. Page entries point to wrong memory locations. You might call it completely harmless. But I think this is a severe bug. Leaking memory is better than a corrupted page table. Think security.


Would you provide any document and reference by saying" It is impossible to receive a signal during module load/unload"? For example, if the unload stuck in a lock, can CTRL+C stop the unload?


If "It is impossible to receive a signal during module load/unload", interruptible waiting is fine too, because function amdgpu_bo_reserve will return successfully.


What about there is some other return error? What about in future somebody improve amdgpu_bo_reserve to return other errors, then function amdgpu_vram_scratch_fini becomes buggy?


While I am thinking whether there is a better way for the current situation, would you give a real world example that my patch really not working? Then we can address it.


Thanks,

Alex Bin

________________________________
From: Christian König <deathsimple@vodafone.de><mailto:deathsimple@vodafone.de>
Sent: Wednesday, April 19, 2017 2:35 AM
To: Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Hi AlexBin,

the answer is ttm_bo_kunmap isn't called at all and yes in the case of an iomap we leak the address space reserved.

But that is completely harmless on a 64bit system compared to leaking the memory backing the address space.

Using amdgpu_bo_free_kernel() instead of openly coding it here is probably a good idea.

Additional to that it's probably a good idea to set the no_intr flag when reserving kernel BOs. It is impossible to receive a signal during module load/unload, but it's probably better to document that in the code as well.

Regards,
Christian.

Am 18.04.2017 um 20:54 schrieb Xie, AlexBin:
Hi Christian,

Have you found how/where/when? When you said "mapping will just be released a bit later on", you must know the answer.

It is difficult to prove something does not exist. Anyway, I will give it a try to prove such "later on" does not exist.

Function ttm_bo_kunmap is the only function to unmap. To prove this, search ttm_bo_map_iomap, only ttm_bo_kunmap use this enum to correctly kunmap.

Function ttm_bo_kunmap is not called by ttm itself. This is a hint that all TTM delay delete mechanism or unref mechanism will NOT kunmap BO later on.

Function ttm_bo_kunmap is called by AMDGPU function amdgpu_bo_kunmap and amdgpu_gem_prime_vunmap.

Search AMDGPU for amdgpu_bo_kunmap. All matches do not kunmap for scratch VRAM BO.  amdgpu_bo_free_kernel is a suspect but the answer is still NO.

So all possibilities are searched. Did I miss anything?

Thanks,
Alex Bin Xie

________________________________
From: Xie, AlexBin
Sent: Tuesday, April 18, 2017 2:04:33 PM
To: Christian König; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO


Hi Christian,


Would you point out where/when will kunmap happen for this BO when release? It must be somewhere in some function calls.


I checked before I asked for review. But I did not see such obvious kunmap function call.


If so, there should be a comment in function amdgpu_vram_scratch_fini to avoid future confusion.

Thanks,
Alex Bin Xie
________________________________
From: Christian König <deathsimple@vodafone.de><mailto:deathsimple@vodafone.de>
Sent: Tuesday, April 18, 2017 1:46 PM
To: Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Hi AlexBin,

No, David is right. This is a very common coding pattern in the kernel module.

Freeing up a BO while there still exists a kernel mapping is perfectly ok, the mapping will just be released a bit later on.

So this code is actually perfectly ok and just an optimization, but your patch breaks it and creates a memory leak.

Regards,
Christian.

Am 18.04.2017 um 17:17 schrieb Xie, AlexBin:

Hi David,


When amdgpu_bo_reserve return errors, we cannot release the BO. This is not a memory leak. General speaking, memory leak is unnoticed and unintentional.


The caller of function amdgpu_vram_scratch_fini ignores the return error value...


The "memory leak" is not caused by my patch. It is caused because reserving BO fails.


This patch only aim to make function amdgpu_vram_scratch_fini behave correctly.


To follow up, we can add a warning message when amdgpu_bo_reserve error happens in a different patch.


If function call amdgpu_bo_reserve is changed to uninterruptible, this changes driver behaviour. Without a substantial issue, I would be cautious for such a change.


Thanks,

Alex Bin Xie

________________________________
From: Zhou, David(ChunMing)
Sent: Monday, April 17, 2017 10:38 PM
To: Xie, AlexBin; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO



On 2017年04月17日 22:54, Xie, AlexBin wrote:

Hi David,


Thanks for the comments. However, please have look at amdgpu_bo_reserve definition.

static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool no_intr)

Ah, this is a wired wrapper for ttm_bo_reserve.



When we call this function like the following:

         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
The false means interruptible.


On the other hand,  when amdgpu_bo_reserve function return error, why do we unref BO without kunmap and unpin the BO? This is wrong implementation when amdgpu_bo_reserve return any error.

Yeah, I see your mean, it's in driver un-loading, How about changing to no interruptible? Your patch will make a memleak if bo_reserve fails, but it seems not matter. I have no strong preference.

Regards,
David Zhou


Thanks,
Alex Bin Xie

________________________________
From: Zhou, David(ChunMing)
Sent: Friday, April 14, 2017 12:00 AM
To: Xie, AlexBin; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO



On 2017年04月14日 05:34, Alex Xie wrote:
> According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
> can return with -ERESTARTSYS. When this function was interrupted
> by a signal, BO should not be unref. Otherwise the BO might be
> released while is kmapped and pinned, or BO MIGHT be deref
> multiple times, etc.
         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
we have specified interruptible to false, so -ERESTARTSYS isn't possible
here.

Thanks,
David Zhou
>
> Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
> Signed-off-by: Alex Xie <AlexBin.Xie@amd.com><mailto:AlexBin.Xie@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 53996e3..1dcc2d1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct amdgpu_device *adev)
>                amdgpu_bo_kunmap(adev->vram_scratch.robj);
>                amdgpu_bo_unpin(adev->vram_scratch.robj);
>                amdgpu_bo_unreserve(adev->vram_scratch.robj);
> +             amdgpu_bo_unref(&adev->vram_scratch.robj);
>        }
> -     amdgpu_bo_unref(&adev->vram_scratch.robj);
>   }
>
>   /**





_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://lists.freedesktop.org/mailman/listinfo/amd-gfx





_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://lists.freedesktop.org/mailman/listinfo/amd-gfx





_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://lists.freedesktop.org/mailman/listinfo/amd-gfx




[-- Attachment #1.2: Type: text/html, Size: 41944 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
       [not found]                                                     ` <MWHPR1201MB004568A33CBF287710759301F21B0-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-04-21  7:11                                                       ` Christian König
       [not found]                                                         ` <af065740-e219-fccb-4dac-d46f3472f4aa-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2017-04-21  7:11 UTC (permalink / raw)
  To: Xie, AlexBin, Zhou, David(ChunMing),
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 21790 bytes --]

Hi Alex,

> No. For the current source code, I think the premap and no-op is not 
> working.
>
Indeed, we don't set bo->mem.bus.addr in amdgpu_ttm_io_mem_reserve() any 
more. Felix will probably want to fix that for the KFD branch.

Anyway, as I said not unmapping the page tables is harmless compared to 
not releasing the memory backing it.

So please just do as I told you and change the interruptible reservation 
to a non-interruptible.

Regards,
Christian.

Am 20.04.2017 um 23:56 schrieb Xie, AlexBin:
>
> Hi Christian,
>
>
> I read amdgpu_ttm_io_mem_reserve() and amdgpu_ttm_io_mem_free() and 
> relevant codes from amdgpu_vram_scratch_init and 
> amdgpu_vram_scratch_fini.
>
>
> No. For the current source code, I think the premap and no-op is not 
> working.
>
>
> I add printk to prove. You can see bo_kmap_type is an invalid value. 
> ioremap_wc is really called to map the IO range into vmalloc space.
>
>
> ...
>
> Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759623] 
> entering amdgpu_vram_scratch_init.
> Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759631] 
> scratch ioremap_wc
> Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759631] 
> bo_kmap_type = 129
> Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759632] bus 
> address =           (null)
> Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759632] 
> is_iomem = 1
> Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759633] 
> leaving amdgpu_vram_scratch_init.
> ...
>
> I don't have log on rmmod AMDGPU yet. There are quite some settings to 
> make that happen in my computer.
> I think they are symemtric. Both mapping and unmapping are not no-op.
>
> Here is the printk patch for your reference.
>
> From 25f95239c23225008e4b59f30b9b5363fb321f94 Mon Sep 17 00:00:00 2001
> From: Alex Xie <AlexBin.Xie-5C7GfCeVMHo@public.gmane.org>
> Date: Thu, 20 Apr 2017 17:48:49 -0400
> Subject: [PATCH] A hack to trace premap and noop.
>
> Change-Id: I61fbbdbd82f62433e229b2e7e97be7a780ea8afa
> Signed-off-by: Alex Xie <AlexBin.Xie-5C7GfCeVMHo@public.gmane.org>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++++
>  drivers/gpu/drm/ttm/ttm_bo.c               |  1 +
>  drivers/gpu/drm/ttm/ttm_bo_util.c          | 29 
> ++++++++++++++++++++++++++---
>  include/drm/ttm/ttm_bo_api.h               |  1 +
>  4 files changed, 50 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index fbb4afb..a537ea1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -313,6 +313,7 @@ static void amdgpu_block_invalid_wreg(struct 
> amdgpu_device *adev,
>  static int amdgpu_vram_scratch_init(struct amdgpu_device *adev)
>  {
> int r;
> +printk("entering amdgpu_vram_scratch_init.");
> if (adev->vram_scratch.robj == NULL) {
> r = amdgpu_bo_create(adev, AMDGPU_GPU_PAGE_SIZE,
> @@ -340,16 +341,36 @@ static int amdgpu_vram_scratch_init(struct 
> amdgpu_device *adev)
> amdgpu_bo_unpin(adev->vram_scratch.robj);
> amdgpu_bo_unreserve(adev->vram_scratch.robj);
> +/* Would like a printk to see if map / unmap is noop*/
> +adev->vram_scratch.robj->tbo.mem.bus.printk = true;
> +
> +if (adev->vram_scratch.robj->kmap.bo_kmap_type == ttm_bo_map_premapped)
> +printk("amdgpu_vram_scratch premapped!");
> +
> +printk("bo_kmap_type = %d", adev->vram_scratch.robj->kmap.bo_kmap_type);
> +printk("bus address = %p", adev->vram_scratch.robj->tbo.mem.bus.addr);
> +printk("is_iomem = %d", adev->vram_scratch.robj->tbo.mem.bus.is_iomem);
> +printk("leaving amdgpu_vram_scratch_init.");
> +
> return r;
>  }
>  static void amdgpu_vram_scratch_fini(struct amdgpu_device *adev)
>  {
> int r;
> +printk("entering amdgpu_vram_scratch_fini.");
> if (adev->vram_scratch.robj == NULL) {
> return;
> }
> +
> +if (adev->vram_scratch.robj->kmap.bo_kmap_type == ttm_bo_map_premapped)
> +printk("amdgpu_vram_scratch premapped!");
> +
> +printk("bo_kmap_type = %d", adev->vram_scratch.robj->kmap.bo_kmap_type);
> +printk("bus address = %p", adev->vram_scratch.robj->tbo.mem.bus.addr);
> +printk("is_iomem = %d", adev->vram_scratch.robj->tbo.mem.bus.is_iomem);
> +
> r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
> if (likely(r == 0)) {
> amdgpu_bo_kunmap(adev->vram_scratch.robj);
> @@ -357,6 +378,7 @@ static void amdgpu_vram_scratch_fini(struct 
> amdgpu_device *adev)
> amdgpu_bo_unreserve(adev->vram_scratch.robj);
> }
> amdgpu_bo_unref(&adev->vram_scratch.robj);
> +printk("leaving amdgpu_vram_scratch_fini.");
>  }
>  /**
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 989b98b..9b05d54 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -1178,6 +1178,7 @@ int ttm_bo_init_reserved(struct ttm_bo_device *bdev,
> bo->mem.page_alignment = page_alignment;
> bo->mem.bus.io_reserved_vm = false;
> bo->mem.bus.io_reserved_count = 0;
> +bo->mem.bus.printk = false;
> bo->moving = NULL;
> bo->mem.placement = (TTM_PL_FLAG_SYSTEM | TTM_PL_FLAG_CACHED);
> bo->persistent_swap_storage = persistent_swap_storage;
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c 
> b/drivers/gpu/drm/ttm/ttm_bo_util.c
> index bf6e216..9d06952 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
> @@ -526,14 +526,24 @@ static int ttm_bo_ioremap(struct 
> ttm_buffer_object *bo,
> if (bo->mem.bus.addr) {
> map->bo_kmap_type = ttm_bo_map_premapped;
> map->virtual = (void *)(((u8 *)bo->mem.bus.addr) + offset);
> +if (bo->mem.bus.printk)
> +printk ("scratch premapping");
> +
> } else {
> map->bo_kmap_type = ttm_bo_map_iomap;
> -if (mem->placement & TTM_PL_FLAG_WC)
> +if (mem->placement & TTM_PL_FLAG_WC) {
> map->virtual = ioremap_wc(bo->mem.bus.base + bo->mem.bus.offset + offset,
>  size);
> -else
> +if (bo->mem.bus.printk)
> +printk ("scratch ioremap_wc");
> +
> +}
> +else {
> map->virtual = ioremap_nocache(bo->mem.bus.base + bo->mem.bus.offset + 
> offset,
>     size);
> +if (bo->mem.bus.printk)
> +printk ("scratch ioremap_nocache");
> +}
> }
> return (!map->virtual) ? -ENOMEM : 0;
>  }
> @@ -618,21 +628,34 @@ void ttm_bo_kunmap(struct ttm_bo_kmap_obj *map)
> struct ttm_mem_type_manager *man =
> &bo->bdev->man[bo->mem.mem_type];
> -if (!map->virtual)
> +if (!map->virtual) {
> +if (bo->mem.bus.printk)
> +printk ("scratch unmap return earlier");
> return;
> +}
> switch (map->bo_kmap_type) {
> case ttm_bo_map_iomap:
> +if (bo->mem.bus.printk)
> +printk ("scratch iounmap");
> iounmap(map->virtual);
> break;
> case ttm_bo_map_vmap:
> +if (bo->mem.bus.printk)
> +printk ("scratch vunmap");
> vunmap(map->virtual);
> break;
> case ttm_bo_map_kmap:
> +if (bo->mem.bus.printk)
> +printk ("scratch kunmap");
> kunmap(map->page);
> break;
> case ttm_bo_map_premapped:
> +if (bo->mem.bus.printk)
> +printk ("scratch unmap ttm_bo_map_premapped");
> break;
> default:
> +if (bo->mem.bus.printk)
> +printk ("scratch unmap bug");
> BUG();
> }
> (void) ttm_mem_io_lock(man, false);
> diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h
> index 2d0f63e..f74a554 100644
> --- a/include/drm/ttm/ttm_bo_api.h
> +++ b/include/drm/ttm/ttm_bo_api.h
> @@ -70,6 +70,7 @@ struct ttm_bus_placement {
> boolis_iomem;
> boolio_reserved_vm;
> uint64_t        io_reserved_count;
> +bool            printk;
>  };
> -- 
> 2.7.4
>
>
>
>
> Thanks,
>
> Alex Bin
>
> ------------------------------------------------------------------------
> *From:* Christian König <deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
> *Sent:* Thursday, April 20, 2017 4:43 AM
> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
> Hi AlexBin,
>
>> Missing kunmap mapping in vmalloc will make kernel master page table 
>> incorrect.
> That's what I tried to explain yesterday, but unfortunately didn't had 
> time to do so. There is not corruption of the kernel master page table 
> in this case!
>
> The call of ttm_bo_kunmap is completely optional, take a look at 
> amdgpu_ttm_io_mem_reserve() and amdgpu_ttm_io_mem_free().
>
> The aperture is kept mapped into the page tables for the whole time 
> the driver is loaded. So this is a complete no-op and only done for 
> consistency.
>
>> It is good that you agree that there is no real world bad example 
>> caused by my patch. I will not discuss whether it is an improvement 
>> or not now to save time for both of us.
>>
> Great at least we can now agree to completely drop this patch.
>
> Thanks,
> Christian.
>
> Am 19.04.2017 um 21:30 schrieb Xie, AlexBin:
>>
>> Hi Christian,
>>
>>
>> Missing kunmap mapping in vmalloc will make kernel master page table 
>> incorrect. I would not call such issue as completely harmless. Please 
>> note that AMD graphic driver can run in 32 bit system. In 32 bit 
>> system, vmalloc address space is much smaller than size of most GPU VRAM.
>>
>>
>> amdgpu_bo_free_kernel has same issue as amdgpu_vram_scratch_fini. 1. 
>> It calls amdgpu_bo_reserve interruptible too. 2. It misses kunmap 
>> when amdgpu_bo_reserve returns error too. As result, kernel master 
>> page table can become incorrect, or as you call it "completely 
>> harmless vmalloc space leaking".
>>
>>
>> Because amdgpu_bo_free_kernel is used in more places, such as psp 
>> command submission, there will be bigger chance to have other usage 
>> where signal is not blocked. This will become a real bug.
>>
>>
>> I am thinking that we may fix the issue completely when TTM releases 
>> BO. But that is a bigger change.
>>
>>
>> It is good that you agree that there is no real world bad example 
>> caused by my patch. I will not discuss whether it is an improvement 
>> or not now to save time for both of us.
>>
>>
>> Thanks,
>>
>> Alex Bin Xie
>>
>>
>> ------------------------------------------------------------------------
>> *From:* Christian König <deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
>> *Sent:* Wednesday, April 19, 2017 7:50 AM
>> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>> Without correctly kunmap, page table is corrupted. Page entries 
>>> point to wrong memory locations. You might call it completely 
>>> harmless. But I think this is a severe bug. Leaking memory is better 
>>> than a corrupted page table. Think security.
>> We are talking about the page tables for the vmalloc area in the 
>> kernel here, so no security problem. Leaking memory is much more 
>> problematic.
>>
>>> Would you provide any document and reference by saying" It is 
>>> impossible to receive a signal during module load/unload"? For 
>>> example, if the unload stuck in a lock, can CTRL+C stop the unload?
>>>
>> No, CTRL+C doesn't abort module load/unload. There have been patches 
>> to changes this a while ago, but IIRC it broke a whole bunch of 
>> driver relying on this.
>>
>>> What about there is some other return error? What about in future 
>>> somebody improve amdgpu_bo_reserve to return other errors, 
>>> then function amdgpu_vram_scratch_fini becomes buggy?
>>>
>> Yes, that is indeed an issue. For example -EDEADLK is possible as 
>> well. That's why I said we should use amdgpu_bo_free_kernel() instead.
>>
>>> While I am thinking whether there is a better way for the current 
>>> situation, would you give a real world example that my patch really 
>>> not working? Then we can address it.
>>>
>> I don't think there is because the driver can't receive a signal 
>> during load/unload, but the problem is rather that the patch doesn't 
>> improve the situation at all.
>>
>> Regards,
>> Christian.
>>
>> Am 19.04.2017 um 13:37 schrieb Xie, AlexBin:
>>>
>>> Hi Christian,
>>>
>>>
>>> Without correctly kunmap, page table is corrupted. Page entries 
>>> point to wrong memory locations. You might call it completely 
>>> harmless. But I think this is a severe bug. Leaking memory is better 
>>> than a corrupted page table. Think security.
>>>
>>>
>>> Would you provide any document and reference by saying" It is 
>>> impossible to receive a signal during module load/unload"? For 
>>> example, if the unload stuck in a lock, can CTRL+C stop the unload?
>>>
>>>
>>> If "It is impossible to receive a signal during module load/unload", 
>>> interruptible waiting is fine too, because function 
>>> amdgpu_bo_reserve will return successfully.
>>>
>>>
>>> What about there is some other return error? What about in future 
>>> somebody improve amdgpu_bo_reserve to return other errors, 
>>> then function amdgpu_vram_scratch_fini becomes buggy?
>>>
>>>
>>> While I am thinking whether there is a better way for the current 
>>> situation, would you give a real world example that my patch really 
>>> not working? Then we can address it.
>>>
>>>
>>> Thanks,
>>>
>>> Alex Bin
>>>
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Christian König <deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
>>> *Sent:* Wednesday, April 19, 2017 2:35 AM
>>> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>> Hi AlexBin,
>>>
>>> the answer is ttm_bo_kunmap isn't called at all and yes in the case 
>>> of an iomap we leak the address space reserved.
>>>
>>> But that is completely harmless on a 64bit system compared to 
>>> leaking the memory backing the address space.
>>>
>>> Using amdgpu_bo_free_kernel() instead of openly coding it here is 
>>> probably a good idea.
>>>
>>> Additional to that it's probably a good idea to set the no_intr flag 
>>> when reserving kernel BOs. It is impossible to receive a signal 
>>> during module load/unload, but it's probably better to document that 
>>> in the code as well.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 18.04.2017 um 20:54 schrieb Xie, AlexBin:
>>>> Hi Christian,
>>>>
>>>> Have you found how/where/when? When you said "mapping will just be 
>>>> released a bit later on", you must know the answer.
>>>>
>>>> It is difficult to prove something does not exist. Anyway, I will 
>>>> give it a try to prove such "later on" does not exist.
>>>>
>>>> Function ttm_bo_kunmap is the only function to unmap. To prove 
>>>> this, search ttm_bo_map_iomap, only ttm_bo_kunmap use this enum to 
>>>> correctly kunmap.
>>>>
>>>> Function ttm_bo_kunmap is not called by ttm itself. This is a hint 
>>>> that all TTM delay delete mechanism or unref mechanism will NOT 
>>>> kunmap BO later on.
>>>>
>>>> Function ttm_bo_kunmap is called by AMDGPU function 
>>>> amdgpu_bo_kunmap and amdgpu_gem_prime_vunmap.
>>>>
>>>> Search AMDGPU for amdgpu_bo_kunmap. All matches do not kunmap for 
>>>> scratch VRAM BO. amdgpu_bo_free_kernel is a suspect but the answer 
>>>> is still NO.
>>>>
>>>> So all possibilities are searched. Did I miss anything?
>>>>
>>>> Thanks,
>>>> Alex Bin Xie
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Xie, AlexBin
>>>> *Sent:* Tuesday, April 18, 2017 2:04:33 PM
>>>> *To:* Christian König; Zhou, David(ChunMing); 
>>>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>
>>>> Hi Christian,
>>>>
>>>>
>>>> Would you point out where/when will kunmap happen for this BO when 
>>>> release? It must be somewhere in some function calls.
>>>>
>>>>
>>>> I checked before I asked for review. But I did not see such obvious 
>>>> kunmap function call.
>>>>
>>>>
>>>> If so, there should be a comment in function 
>>>> amdgpu_vram_scratch_fini to avoid future confusion.
>>>>
>>>>
>>>> Thanks,
>>>> Alex Bin Xie
>>>> ------------------------------------------------------------------------
>>>> *From:* Christian König <deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
>>>> *Sent:* Tuesday, April 18, 2017 1:46 PM
>>>> *To:* Xie, AlexBin; Zhou, David(ChunMing); 
>>>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>> Hi AlexBin,
>>>>
>>>> No, David is right. This is a very common coding pattern in the 
>>>> kernel module.
>>>>
>>>> Freeing up a BO while there still exists a kernel mapping is 
>>>> perfectly ok, the mapping will just be released a bit later on.
>>>>
>>>> So this code is actually perfectly ok and just an optimization, but 
>>>> your patch breaks it and creates a memory leak.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 18.04.2017 um 17:17 schrieb Xie, AlexBin:
>>>>>
>>>>> Hi David,
>>>>>
>>>>>
>>>>> When amdgpu_bo_reserve return errors, we cannot release the BO. 
>>>>> This is not a memory leak. General speaking, memory leak 
>>>>> is unnoticed and unintentional.
>>>>>
>>>>>
>>>>> The caller of function amdgpu_vram_scratch_fini ignores the return 
>>>>> error value...
>>>>>
>>>>>
>>>>> The "memory leak" is not caused by my patch. It is caused because 
>>>>> reserving BO fails.
>>>>>
>>>>>
>>>>> This patch only aim to make function amdgpu_vram_scratch_fini 
>>>>> behave correctly.
>>>>>
>>>>>
>>>>> To follow up, we can add a warning message when amdgpu_bo_reserve 
>>>>> error happens in a different patch.
>>>>>
>>>>>
>>>>> If function call amdgpu_bo_reserve is changed to uninterruptible, 
>>>>> this changes driver behaviour. Without a substantial issue, I 
>>>>> would be cautious for such a change.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Alex Bin Xie
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> *From:* Zhou, David(ChunMing)
>>>>> *Sent:* Monday, April 17, 2017 10:38 PM
>>>>> *To:* Xie, AlexBin; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>>
>>>>>
>>>>> On 2017年04月17日 22:54, Xie, AlexBin wrote:
>>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>>
>>>>>> Thanks for the comments. However, please have look at 
>>>>>> amdgpu_bo_reserve definition.
>>>>>>
>>>>>> static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool 
>>>>>> no_intr)
>>>>>>
>>>>> Ah, this is a wired wrapper for ttm_bo_reserve.
>>>>>
>>>>>>
>>>>>> When we call this function like the following:
>>>>>>
>>>>>>    r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>>>>> The false means interruptible.
>>>>>>
>>>>>>
>>>>>> On the other hand,  when amdgpu_bo_reserve function return error, 
>>>>>> why do we unref BO without kunmap and unpin the BO? This is wrong 
>>>>>> implementation when amdgpu_bo_reserve return any error.
>>>>>>
>>>>> Yeah, I see your mean, it's in driver un-loading, How about 
>>>>> changing to no interruptible? Your patch will make a memleak if 
>>>>> bo_reserve fails, but it seems not matter. I have no strong 
>>>>> preference.
>>>>>
>>>>> Regards,
>>>>> David Zhou
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Alex Bin Xie
>>>>>>
>>>>>> ------------------------------------------------------------------------
>>>>>> *From:* Zhou, David(ChunMing)
>>>>>> *Sent:* Friday, April 14, 2017 12:00 AM
>>>>>> *To:* Xie, AlexBin; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>>>
>>>>>>
>>>>>> On 2017年04月14日 05:34, Alex Xie wrote:
>>>>>> > According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
>>>>>> > can return with -ERESTARTSYS. When this function was interrupted
>>>>>> > by a signal, BO should not be unref. Otherwise the BO might be
>>>>>> > released while is kmapped and pinned, or BO MIGHT be deref
>>>>>> > multiple times, etc.
>>>>>>          r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>>>>> we have specified interruptible to false, so -ERESTARTSYS isn't 
>>>>>> possible
>>>>>> here.
>>>>>>
>>>>>> Thanks,
>>>>>> David Zhou
>>>>>> >
>>>>>> > Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
>>>>>> > Signed-off-by: Alex Xie <AlexBin.Xie-5C7GfCeVMHo@public.gmane.org>
>>>>>> > ---
>>>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>>>> >   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>> >
>>>>>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> > index 53996e3..1dcc2d1 100644
>>>>>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> > @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct 
>>>>>> amdgpu_device *adev)
>>>>>> >                amdgpu_bo_kunmap(adev->vram_scratch.robj);
>>>>>> >                amdgpu_bo_unpin(adev->vram_scratch.robj);
>>>>>> >                amdgpu_bo_unreserve(adev->vram_scratch.robj);
>>>>>> > + amdgpu_bo_unref(&adev->vram_scratch.robj);
>>>>>> >        }
>>>>>> > - amdgpu_bo_unref(&adev->vram_scratch.robj);
>>>>>> >   }
>>>>>> >
>>>>>> >   /**
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>>
>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx



[-- Attachment #1.2: Type: text/html, Size: 76773 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
       [not found]                                                         ` <af065740-e219-fccb-4dac-d46f3472f4aa-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2017-04-21 15:43                                                           ` Felix Kuehling
       [not found]                                                             ` <003d5e5f-ea99-4517-fd4a-de4df3ce1b89-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Felix Kuehling @ 2017-04-21 15:43 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


On 17-04-21 03:11 AM, Christian König wrote:
> Hi Alex,
>
>> No. For the current source code, I think the premap and no-op is not
>> working.
>>
> Indeed, we don't set bo->mem.bus.addr in amdgpu_ttm_io_mem_reserve()
> any more. Felix will probably want to fix that for the KFD branch.

I vaguely remember discussing this in the past: using mem->bus.addr to
keep memory permanently CPU-mapped and avoid redundant ioremap calls. As
I remember it, we weren't actually using this. It's only something we
considered at one point.

Regards,
  Felix

>
> Anyway, as I said not unmapping the page tables is harmless compared
> to not releasing the memory backing it.
>
> So please just do as I told you and change the interruptible
> reservation to a non-interruptible.
>
> Regards,
> Christian.
>
> Am 20.04.2017 um 23:56 schrieb Xie, AlexBin:
>>
>> Hi Christian,
>>
>>
>> I read amdgpu_ttm_io_mem_reserve() and amdgpu_ttm_io_mem_free() and
>> relevant codes from amdgpu_vram_scratch_init
>> and amdgpu_vram_scratch_fini. 
>>
>>
>> No. For the current source code, I think the premap and no-op is not
>> working.
>>
>>
>> I add printk to prove. You can see bo_kmap_type is an invalid
>> value. ioremap_wc is really called to map the IO range into vmalloc
>> space.
>>
>>
>> ...
>>
>> Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759623]
>> entering amdgpu_vram_scratch_init.
>> Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759631]
>> scratch ioremap_wc
>> Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759631]
>> bo_kmap_type = 129
>> Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759632] bus
>> address =           (null)
>> Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759632]
>> is_iomem = 1
>> Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759633]
>> leaving amdgpu_vram_scratch_init.
>> ...
>>
>> I don't have log on rmmod AMDGPU yet. There are quite some settings
>> to make that happen in my computer.
>> I think they are symemtric. Both mapping and unmapping are not no-op.
>>
>> Here is the printk patch for your reference.
>>
>> From 25f95239c23225008e4b59f30b9b5363fb321f94 Mon Sep 17 00:00:00 2001
>> From: Alex Xie <AlexBin.Xie@amd.com>
>> Date: Thu, 20 Apr 2017 17:48:49 -0400
>> Subject: [PATCH] A hack to trace premap and noop.
>>
>> Change-Id: I61fbbdbd82f62433e229b2e7e97be7a780ea8afa
>> Signed-off-by: Alex Xie <AlexBin.Xie@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++++
>>  drivers/gpu/drm/ttm/ttm_bo.c               |  1 +
>>  drivers/gpu/drm/ttm/ttm_bo_util.c          | 29
>> ++++++++++++++++++++++++++---
>>  include/drm/ttm/ttm_bo_api.h               |  1 +
>>  4 files changed, 50 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index fbb4afb..a537ea1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -313,6 +313,7 @@ static void amdgpu_block_invalid_wreg(struct
>> amdgpu_device *adev,
>>  static int amdgpu_vram_scratch_init(struct amdgpu_device *adev)
>>  {
>>  int r;
>> +printk("entering amdgpu_vram_scratch_init.");
>>  
>>  if (adev->vram_scratch.robj == NULL) {
>>  r = amdgpu_bo_create(adev, AMDGPU_GPU_PAGE_SIZE,
>> @@ -340,16 +341,36 @@ static int amdgpu_vram_scratch_init(struct
>> amdgpu_device *adev)
>>  amdgpu_bo_unpin(adev->vram_scratch.robj);
>>  amdgpu_bo_unreserve(adev->vram_scratch.robj);
>>  
>> +/* Would like a printk to see if map / unmap is noop*/
>> +adev->vram_scratch.robj->tbo.mem.bus.printk = true;
>> +
>> +if (adev->vram_scratch.robj->kmap.bo_kmap_type == ttm_bo_map_premapped)
>> +printk("amdgpu_vram_scratch premapped!");
>> +
>> +printk("bo_kmap_type = %d", adev->vram_scratch.robj->kmap.bo_kmap_type);
>> +printk("bus address = %p", adev->vram_scratch.robj->tbo.mem.bus.addr);
>> +printk("is_iomem = %d", adev->vram_scratch.robj->tbo.mem.bus.is_iomem);
>> +printk("leaving amdgpu_vram_scratch_init.");
>> +
>>  return r;
>>  }
>>  
>>  static void amdgpu_vram_scratch_fini(struct amdgpu_device *adev)
>>  {
>>  int r;
>> +printk("entering amdgpu_vram_scratch_fini.");
>>  
>>  if (adev->vram_scratch.robj == NULL) {
>>  return;
>>  }
>> +
>> +if (adev->vram_scratch.robj->kmap.bo_kmap_type == ttm_bo_map_premapped)
>> +printk("amdgpu_vram_scratch premapped!");
>> +
>> +printk("bo_kmap_type = %d", adev->vram_scratch.robj->kmap.bo_kmap_type);
>> +printk("bus address = %p", adev->vram_scratch.robj->tbo.mem.bus.addr);
>> +printk("is_iomem = %d", adev->vram_scratch.robj->tbo.mem.bus.is_iomem);
>> +
>>  r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>  if (likely(r == 0)) {
>>  amdgpu_bo_kunmap(adev->vram_scratch.robj);
>> @@ -357,6 +378,7 @@ static void amdgpu_vram_scratch_fini(struct
>> amdgpu_device *adev)
>>  amdgpu_bo_unreserve(adev->vram_scratch.robj);
>>  }
>>  amdgpu_bo_unref(&adev->vram_scratch.robj);
>> +printk("leaving amdgpu_vram_scratch_fini.");
>>  }
>>  
>>  /**
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
>> index 989b98b..9b05d54 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>> @@ -1178,6 +1178,7 @@ int ttm_bo_init_reserved(struct ttm_bo_device
>> *bdev,
>>  bo->mem.page_alignment = page_alignment;
>>  bo->mem.bus.io_reserved_vm = false;
>>  bo->mem.bus.io_reserved_count = 0;
>> +bo->mem.bus.printk = false;
>>  bo->moving = NULL;
>>  bo->mem.placement = (TTM_PL_FLAG_SYSTEM | TTM_PL_FLAG_CACHED);
>>  bo->persistent_swap_storage = persistent_swap_storage;
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c
>> b/drivers/gpu/drm/ttm/ttm_bo_util.c
>> index bf6e216..9d06952 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
>> @@ -526,14 +526,24 @@ static int ttm_bo_ioremap(struct
>> ttm_buffer_object *bo,
>>  if (bo->mem.bus.addr) {
>>  map->bo_kmap_type = ttm_bo_map_premapped;
>>  map->virtual = (void *)(((u8 *)bo->mem.bus.addr) + offset);
>> +if (bo->mem.bus.printk)
>> +printk ("scratch premapping");
>> +
>>  } else {
>>  map->bo_kmap_type = ttm_bo_map_iomap;
>> -if (mem->placement & TTM_PL_FLAG_WC)
>> +if (mem->placement & TTM_PL_FLAG_WC) {
>>  map->virtual = ioremap_wc(bo->mem.bus.base + bo->mem.bus.offset +
>> offset,
>>   size);
>> -else
>> +if (bo->mem.bus.printk)
>> +printk ("scratch ioremap_wc");
>> +
>> +}
>> +else {
>>  map->virtual = ioremap_nocache(bo->mem.bus.base + bo->mem.bus.offset
>> + offset,
>>        size);
>> +if (bo->mem.bus.printk)
>> +printk ("scratch ioremap_nocache");
>> +}
>>  }
>>  return (!map->virtual) ? -ENOMEM : 0;
>>  }
>> @@ -618,21 +628,34 @@ void ttm_bo_kunmap(struct ttm_bo_kmap_obj *map)
>>  struct ttm_mem_type_manager *man =
>>  &bo->bdev->man[bo->mem.mem_type];
>>  
>> -if (!map->virtual)
>> +if (!map->virtual) {
>> +if (bo->mem.bus.printk)
>> +printk ("scratch unmap return earlier");
>>  return;
>> +}
>>  switch (map->bo_kmap_type) {
>>  case ttm_bo_map_iomap:
>> +if (bo->mem.bus.printk)
>> +printk ("scratch iounmap");
>>  iounmap(map->virtual);
>>  break;
>>  case ttm_bo_map_vmap:
>> +if (bo->mem.bus.printk)
>> +printk ("scratch vunmap");
>>  vunmap(map->virtual);
>>  break;
>>  case ttm_bo_map_kmap:
>> +if (bo->mem.bus.printk)
>> +printk ("scratch kunmap");
>>  kunmap(map->page);
>>  break;
>>  case ttm_bo_map_premapped:
>> +if (bo->mem.bus.printk)
>> +printk ("scratch unmap ttm_bo_map_premapped");
>>  break;
>>  default:
>> +if (bo->mem.bus.printk)
>> +printk ("scratch unmap bug");
>>  BUG();
>>  }
>>  (void) ttm_mem_io_lock(man, false);
>> diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h
>> index 2d0f63e..f74a554 100644
>> --- a/include/drm/ttm/ttm_bo_api.h
>> +++ b/include/drm/ttm/ttm_bo_api.h
>> @@ -70,6 +70,7 @@ struct ttm_bus_placement {
>>  boolis_iomem;
>>  boolio_reserved_vm;
>>  uint64_t        io_reserved_count;
>> +bool            printk;
>>  };
>>  
>>  
>> -- 
>> 2.7.4
>>
>>
>>
>>
>> Thanks,
>>
>> Alex Bin
>>
>> ------------------------------------------------------------------------
>> *From:* Christian König <deathsimple@vodafone.de>
>> *Sent:* Thursday, April 20, 2017 4:43 AM
>> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org
>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>  
>> Hi AlexBin,
>>
>>> Missing kunmap mapping in vmalloc will make kernel master page table
>>> incorrect.
>> That's what I tried to explain yesterday, but unfortunately didn't
>> had time to do so. There is not corruption of the kernel master page
>> table in this case!
>>
>> The call of ttm_bo_kunmap is completely optional, take a look at
>> amdgpu_ttm_io_mem_reserve() and amdgpu_ttm_io_mem_free().
>>
>> The aperture is kept mapped into the page tables for the whole time
>> the driver is loaded. So this is a complete no-op and only done for
>> consistency.
>>
>>> It is good that you agree that there is no real world bad example
>>> caused by my patch. I will not discuss whether it is an improvement
>>> or not now to save time for both of us.
>>>
>> Great at least we can now agree to completely drop this patch.
>>
>> Thanks,
>> Christian.
>>
>> Am 19.04.2017 um 21:30 schrieb Xie, AlexBin:
>>>
>>> Hi Christian,
>>>
>>>
>>> Missing kunmap mapping in vmalloc will make kernel master page table
>>> incorrect. I would not call such issue as completely harmless.
>>> Please note that AMD graphic driver can run in 32 bit system. In 32
>>> bit system, vmalloc address space is much smaller than size of most
>>> GPU VRAM.
>>>
>>>
>>> amdgpu_bo_free_kernel has same issue as amdgpu_vram_scratch_fini. 1.
>>> It calls amdgpu_bo_reserve interruptible too. 2. It misses kunmap
>>> when amdgpu_bo_reserve returns error too. As result, kernel master
>>> page table can become incorrect, or as you call it "completely
>>> harmless vmalloc space leaking". 
>>>
>>>
>>> Because amdgpu_bo_free_kernel is used in more places, such as psp
>>> command submission, there will be bigger chance to have other usage
>>> where signal is not blocked. This will become a real bug.
>>>
>>>
>>> I am thinking that we may fix the issue completely when TTM releases
>>> BO. But that is a bigger change.
>>>
>>>
>>> It is good that you agree that there is no real world bad example
>>> caused by my patch. I will not discuss whether it is an improvement
>>> or not now to save time for both of us.
>>>
>>>
>>> Thanks,
>>>
>>> Alex Bin Xie
>>>
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Christian König <deathsimple@vodafone.de>
>>> *Sent:* Wednesday, April 19, 2017 7:50 AM
>>> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org
>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>  
>>>> Without correctly kunmap, page table is corrupted. Page entries
>>>> point to wrong memory locations. You might call it completely
>>>> harmless. But I think this is a severe bug. Leaking memory is
>>>> better than a corrupted page table. Think security.
>>> We are talking about the page tables for the vmalloc area in the
>>> kernel here, so no security problem. Leaking memory is much more
>>> problematic.
>>>
>>>> Would you provide any document and reference by saying" It is
>>>> impossible to receive a signal during module load/unload"? For
>>>> example, if the unload stuck in a lock, can CTRL+C stop the unload?
>>>>
>>> No, CTRL+C doesn't abort module load/unload. There have been patches
>>> to changes this a while ago, but IIRC it broke a whole bunch of
>>> driver relying on this.
>>>
>>>> What about there is some other return error? What about in future
>>>> somebody improve amdgpu_bo_reserve to return other errors,
>>>> then function amdgpu_vram_scratch_fini becomes buggy?
>>>>
>>> Yes, that is indeed an issue. For example -EDEADLK is possible as
>>> well. That's why I said we should use amdgpu_bo_free_kernel() instead.
>>>
>>>> While I am thinking whether there is a better way for the current
>>>> situation, would you give a real world example that my patch really
>>>> not working? Then we can address it.
>>>>
>>> I don't think there is because the driver can't receive a signal
>>> during load/unload, but the problem is rather that the patch doesn't
>>> improve the situation at all.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 19.04.2017 um 13:37 schrieb Xie, AlexBin:
>>>>
>>>> Hi Christian,
>>>>
>>>>
>>>> Without correctly kunmap, page table is corrupted. Page entries
>>>> point to wrong memory locations. You might call it completely
>>>> harmless. But I think this is a severe bug. Leaking memory is
>>>> better than a corrupted page table. Think security.
>>>>
>>>>
>>>> Would you provide any document and reference by saying" It is
>>>> impossible to receive a signal during module load/unload"? For
>>>> example, if the unload stuck in a lock, can CTRL+C stop the unload?
>>>>
>>>>
>>>> If "It is impossible to receive a signal during module
>>>> load/unload", interruptible waiting is fine too, because
>>>> function amdgpu_bo_reserve will return successfully.
>>>>
>>>>
>>>> What about there is some other return error? What about in future
>>>> somebody improve amdgpu_bo_reserve to return other errors,
>>>> then function amdgpu_vram_scratch_fini becomes buggy?
>>>>
>>>>
>>>> While I am thinking whether there is a better way for the current
>>>> situation, would you give a real world example that my patch really
>>>> not working? Then we can address it.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Alex Bin
>>>>
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Christian König <deathsimple@vodafone.de>
>>>> *Sent:* Wednesday, April 19, 2017 2:35 AM
>>>> *To:* Xie, AlexBin; Zhou, David(ChunMing);
>>>> amd-gfx@lists.freedesktop.org
>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>  
>>>> Hi AlexBin,
>>>>
>>>> the answer is ttm_bo_kunmap isn't called at all and yes in the case
>>>> of an iomap we leak the address space reserved.
>>>>
>>>> But that is completely harmless on a 64bit system compared to
>>>> leaking the memory backing the address space.
>>>>
>>>> Using amdgpu_bo_free_kernel() instead of openly coding it here is
>>>> probably a good idea.
>>>>
>>>> Additional to that it's probably a good idea to set the no_intr
>>>> flag when reserving kernel BOs. It is impossible to receive a
>>>> signal during module load/unload, but it's probably better to
>>>> document that in the code as well.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 18.04.2017 um 20:54 schrieb Xie, AlexBin:
>>>>> Hi Christian,
>>>>>
>>>>> Have you found how/where/when? When you said "mapping will just be
>>>>> released a bit later on", you must know the answer. 
>>>>>
>>>>> It is difficult to prove something does not exist. Anyway, I will
>>>>> give it a try to prove such "later on" does not exist.
>>>>>
>>>>> Function ttm_bo_kunmap is the only function to unmap. To prove
>>>>> this, search ttm_bo_map_iomap, only ttm_bo_kunmap use this enum to
>>>>> correctly kunmap.
>>>>>
>>>>> Function ttm_bo_kunmap is not called by ttm itself. This is a hint
>>>>> that all TTM delay delete mechanism or unref mechanism will NOT
>>>>> kunmap BO later on.
>>>>>
>>>>> Function ttm_bo_kunmap is called by AMDGPU
>>>>> function amdgpu_bo_kunmap and amdgpu_gem_prime_vunmap.
>>>>>
>>>>> Search AMDGPU for amdgpu_bo_kunmap. All matches do not kunmap for
>>>>> scratch VRAM BO.  amdgpu_bo_free_kernel is a suspect but the
>>>>> answer is still NO.
>>>>>
>>>>> So all possibilities are searched. Did I miss anything?
>>>>>
>>>>> Thanks,
>>>>> Alex Bin Xie
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> *From:* Xie, AlexBin
>>>>> *Sent:* Tuesday, April 18, 2017 2:04:33 PM
>>>>> *To:* Christian König; Zhou, David(ChunMing);
>>>>> amd-gfx@lists.freedesktop.org
>>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>>  
>>>>>
>>>>> Hi Christian,
>>>>>
>>>>>
>>>>> Would you point out where/when will kunmap happen for this BO when
>>>>> release? It must be somewhere in some function calls.
>>>>>
>>>>>
>>>>> I checked before I asked for review. But I did not see such
>>>>> obvious kunmap function call.
>>>>>
>>>>>
>>>>> If so, there should be a comment in
>>>>> function amdgpu_vram_scratch_fini to avoid future confusion.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Alex Bin Xie
>>>>> ------------------------------------------------------------------------
>>>>> *From:* Christian König <deathsimple@vodafone.de>
>>>>> *Sent:* Tuesday, April 18, 2017 1:46 PM
>>>>> *To:* Xie, AlexBin; Zhou, David(ChunMing);
>>>>> amd-gfx@lists.freedesktop.org
>>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>>  
>>>>> Hi AlexBin,
>>>>>
>>>>> No, David is right. This is a very common coding pattern in the
>>>>> kernel module.
>>>>>
>>>>> Freeing up a BO while there still exists a kernel mapping is
>>>>> perfectly ok, the mapping will just be released a bit later on.
>>>>>
>>>>> So this code is actually perfectly ok and just an optimization,
>>>>> but your patch breaks it and creates a memory leak.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 18.04.2017 um 17:17 schrieb Xie, AlexBin:
>>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>>
>>>>>> When amdgpu_bo_reserve return errors, we cannot release the BO.
>>>>>> This is not a memory leak. General speaking, memory leak
>>>>>> is unnoticed and unintentional.
>>>>>>
>>>>>>
>>>>>> The caller of function amdgpu_vram_scratch_fini ignores the
>>>>>> return error value...
>>>>>>
>>>>>>
>>>>>> The "memory leak" is not caused by my patch. It is caused because
>>>>>> reserving BO fails.
>>>>>>
>>>>>>
>>>>>> This patch only aim to make function amdgpu_vram_scratch_fini
>>>>>> behave correctly.
>>>>>>
>>>>>>
>>>>>> To follow up, we can add a warning message when amdgpu_bo_reserve
>>>>>> error happens in a different patch.
>>>>>>
>>>>>>
>>>>>> If function call amdgpu_bo_reserve is changed to uninterruptible,
>>>>>> this changes driver behaviour. Without a substantial issue, I
>>>>>> would be cautious for such a change.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Alex Bin Xie
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------------
>>>>>> *From:* Zhou, David(ChunMing)
>>>>>> *Sent:* Monday, April 17, 2017 10:38 PM
>>>>>> *To:* Xie, AlexBin; amd-gfx@lists.freedesktop.org
>>>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>>>  
>>>>>>
>>>>>>
>>>>>> On 2017年04月17日 22:54, Xie, AlexBin wrote:
>>>>>>>
>>>>>>> Hi David,
>>>>>>>
>>>>>>>
>>>>>>> Thanks for the comments. However, please have look at
>>>>>>> amdgpu_bo_reserve definition.
>>>>>>>
>>>>>>> static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool
>>>>>>> no_intr)
>>>>>>>
>>>>>> Ah, this is a wired wrapper for ttm_bo_reserve.
>>>>>>
>>>>>>>
>>>>>>> When we call this function like the following:
>>>>>>>
>>>>>>>          r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>>>>>> The false means interruptible.
>>>>>>>
>>>>>>>
>>>>>>> On the other hand,  when amdgpu_bo_reserve function return
>>>>>>> error, why do we unref BO without kunmap and unpin the BO? This
>>>>>>> is wrong implementation when amdgpu_bo_reserve return any error.
>>>>>>>
>>>>>> Yeah, I see your mean, it's in driver un-loading, How about
>>>>>> changing to no interruptible? Your patch will make a memleak if
>>>>>> bo_reserve fails, but it seems not matter. I have no strong
>>>>>> preference.
>>>>>>
>>>>>> Regards,
>>>>>> David Zhou
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Alex Bin Xie
>>>>>>>
>>>>>>> ------------------------------------------------------------------------
>>>>>>> *From:* Zhou, David(ChunMing)
>>>>>>> *Sent:* Friday, April 14, 2017 12:00 AM
>>>>>>> *To:* Xie, AlexBin; amd-gfx@lists.freedesktop.org
>>>>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>>>>  
>>>>>>>
>>>>>>>
>>>>>>> On 2017年04月14日 05:34, Alex Xie wrote:
>>>>>>> > According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
>>>>>>> > can return with -ERESTARTSYS. When this function was interrupted
>>>>>>> > by a signal, BO should not be unref. Otherwise the BO might be
>>>>>>> > released while is kmapped and pinned, or BO MIGHT be deref
>>>>>>> > multiple times, etc.
>>>>>>>          r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>>>>>> we have specified interruptible to false, so -ERESTARTSYS isn't
>>>>>>> possible
>>>>>>> here.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David Zhou
>>>>>>> >
>>>>>>> > Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
>>>>>>> > Signed-off-by: Alex Xie <AlexBin.Xie@amd.com>
>>>>>>> > ---
>>>>>>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>>>>> >   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>> >
>>>>>>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> > index 53996e3..1dcc2d1 100644
>>>>>>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> > @@ -355,8 +355,8 @@ static void
>>>>>>> amdgpu_vram_scratch_fini(struct amdgpu_device *adev)
>>>>>>> >                amdgpu_bo_kunmap(adev->vram_scratch.robj);
>>>>>>> >                amdgpu_bo_unpin(adev->vram_scratch.robj);
>>>>>>> >                amdgpu_bo_unreserve(adev->vram_scratch.robj);
>>>>>>> > +             amdgpu_bo_unref(&adev->vram_scratch.robj);
>>>>>>> >        }
>>>>>>> > -     amdgpu_bo_unref(&adev->vram_scratch.robj);
>>>>>>> >   }
>>>>>>> >  
>>>>>>> >   /**
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> amd-gfx mailing list
>>>>>> amd-gfx@lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>>>
>>
>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
       [not found]                                                             ` <003d5e5f-ea99-4517-fd4a-de4df3ce1b89-5C7GfCeVMHo@public.gmane.org>
@ 2017-04-21 17:01                                                               ` Christian König
       [not found]                                                                 ` <a8715aca-05ff-1262-e2f5-31279052e606-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2017-04-21 17:01 UTC (permalink / raw)
  To: Felix Kuehling, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Am 21.04.2017 um 17:43 schrieb Felix Kuehling:
> On 17-04-21 03:11 AM, Christian König wrote:
>> Hi Alex,
>>
>>> No. For the current source code, I think the premap and no-op is not
>>> working.
>>>
>> Indeed, we don't set bo->mem.bus.addr in amdgpu_ttm_io_mem_reserve()
>> any more. Felix will probably want to fix that for the KFD branch.
> I vaguely remember discussing this in the past: using mem->bus.addr to
> keep memory permanently CPU-mapped and avoid redundant ioremap calls. As
> I remember it, we weren't actually using this. It's only something we
> considered at one point.

Well that explains why my memory fooled me.

My last status was that we implemented this together with CPU based page 
table updates.

Otherwise mapping them for CPU access is rather heavy if we indeed 
update the page tables every time.

Christian.

>
> Regards,
>    Felix
>
>> Anyway, as I said not unmapping the page tables is harmless compared
>> to not releasing the memory backing it.
>>
>> So please just do as I told you and change the interruptible
>> reservation to a non-interruptible.
>>
>> Regards,
>> Christian.
>>
>> Am 20.04.2017 um 23:56 schrieb Xie, AlexBin:
>>> Hi Christian,
>>>
>>>
>>> I read amdgpu_ttm_io_mem_reserve() and amdgpu_ttm_io_mem_free() and
>>> relevant codes from amdgpu_vram_scratch_init
>>> and amdgpu_vram_scratch_fini.
>>>
>>>
>>> No. For the current source code, I think the premap and no-op is not
>>> working.
>>>
>>>
>>> I add printk to prove. You can see bo_kmap_type is an invalid
>>> value. ioremap_wc is really called to map the IO range into vmalloc
>>> space.
>>>
>>>
>>> ...
>>>
>>> Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759623]
>>> entering amdgpu_vram_scratch_init.
>>> Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759631]
>>> scratch ioremap_wc
>>> Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759631]
>>> bo_kmap_type = 129
>>> Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759632] bus
>>> address =           (null)
>>> Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759632]
>>> is_iomem = 1
>>> Apr 20 16:31:18 axie-System-Product-Name kernel: [  106.759633]
>>> leaving amdgpu_vram_scratch_init.
>>> ...
>>>
>>> I don't have log on rmmod AMDGPU yet. There are quite some settings
>>> to make that happen in my computer.
>>> I think they are symemtric. Both mapping and unmapping are not no-op.
>>>
>>> Here is the printk patch for your reference.
>>>
>>>  From 25f95239c23225008e4b59f30b9b5363fb321f94 Mon Sep 17 00:00:00 2001
>>> From: Alex Xie <AlexBin.Xie@amd.com>
>>> Date: Thu, 20 Apr 2017 17:48:49 -0400
>>> Subject: [PATCH] A hack to trace premap and noop.
>>>
>>> Change-Id: I61fbbdbd82f62433e229b2e7e97be7a780ea8afa
>>> Signed-off-by: Alex Xie <AlexBin.Xie@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++++
>>>   drivers/gpu/drm/ttm/ttm_bo.c               |  1 +
>>>   drivers/gpu/drm/ttm/ttm_bo_util.c          | 29
>>> ++++++++++++++++++++++++++---
>>>   include/drm/ttm/ttm_bo_api.h               |  1 +
>>>   4 files changed, 50 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index fbb4afb..a537ea1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -313,6 +313,7 @@ static void amdgpu_block_invalid_wreg(struct
>>> amdgpu_device *adev,
>>>   static int amdgpu_vram_scratch_init(struct amdgpu_device *adev)
>>>   {
>>>   int r;
>>> +printk("entering amdgpu_vram_scratch_init.");
>>>   
>>>   if (adev->vram_scratch.robj == NULL) {
>>>   r = amdgpu_bo_create(adev, AMDGPU_GPU_PAGE_SIZE,
>>> @@ -340,16 +341,36 @@ static int amdgpu_vram_scratch_init(struct
>>> amdgpu_device *adev)
>>>   amdgpu_bo_unpin(adev->vram_scratch.robj);
>>>   amdgpu_bo_unreserve(adev->vram_scratch.robj);
>>>   
>>> +/* Would like a printk to see if map / unmap is noop*/
>>> +adev->vram_scratch.robj->tbo.mem.bus.printk = true;
>>> +
>>> +if (adev->vram_scratch.robj->kmap.bo_kmap_type == ttm_bo_map_premapped)
>>> +printk("amdgpu_vram_scratch premapped!");
>>> +
>>> +printk("bo_kmap_type = %d", adev->vram_scratch.robj->kmap.bo_kmap_type);
>>> +printk("bus address = %p", adev->vram_scratch.robj->tbo.mem.bus.addr);
>>> +printk("is_iomem = %d", adev->vram_scratch.robj->tbo.mem.bus.is_iomem);
>>> +printk("leaving amdgpu_vram_scratch_init.");
>>> +
>>>   return r;
>>>   }
>>>   
>>>   static void amdgpu_vram_scratch_fini(struct amdgpu_device *adev)
>>>   {
>>>   int r;
>>> +printk("entering amdgpu_vram_scratch_fini.");
>>>   
>>>   if (adev->vram_scratch.robj == NULL) {
>>>   return;
>>>   }
>>> +
>>> +if (adev->vram_scratch.robj->kmap.bo_kmap_type == ttm_bo_map_premapped)
>>> +printk("amdgpu_vram_scratch premapped!");
>>> +
>>> +printk("bo_kmap_type = %d", adev->vram_scratch.robj->kmap.bo_kmap_type);
>>> +printk("bus address = %p", adev->vram_scratch.robj->tbo.mem.bus.addr);
>>> +printk("is_iomem = %d", adev->vram_scratch.robj->tbo.mem.bus.is_iomem);
>>> +
>>>   r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>>   if (likely(r == 0)) {
>>>   amdgpu_bo_kunmap(adev->vram_scratch.robj);
>>> @@ -357,6 +378,7 @@ static void amdgpu_vram_scratch_fini(struct
>>> amdgpu_device *adev)
>>>   amdgpu_bo_unreserve(adev->vram_scratch.robj);
>>>   }
>>>   amdgpu_bo_unref(&adev->vram_scratch.robj);
>>> +printk("leaving amdgpu_vram_scratch_fini.");
>>>   }
>>>   
>>>   /**
>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
>>> index 989b98b..9b05d54 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>>> @@ -1178,6 +1178,7 @@ int ttm_bo_init_reserved(struct ttm_bo_device
>>> *bdev,
>>>   bo->mem.page_alignment = page_alignment;
>>>   bo->mem.bus.io_reserved_vm = false;
>>>   bo->mem.bus.io_reserved_count = 0;
>>> +bo->mem.bus.printk = false;
>>>   bo->moving = NULL;
>>>   bo->mem.placement = (TTM_PL_FLAG_SYSTEM | TTM_PL_FLAG_CACHED);
>>>   bo->persistent_swap_storage = persistent_swap_storage;
>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c
>>> b/drivers/gpu/drm/ttm/ttm_bo_util.c
>>> index bf6e216..9d06952 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
>>> @@ -526,14 +526,24 @@ static int ttm_bo_ioremap(struct
>>> ttm_buffer_object *bo,
>>>   if (bo->mem.bus.addr) {
>>>   map->bo_kmap_type = ttm_bo_map_premapped;
>>>   map->virtual = (void *)(((u8 *)bo->mem.bus.addr) + offset);
>>> +if (bo->mem.bus.printk)
>>> +printk ("scratch premapping");
>>> +
>>>   } else {
>>>   map->bo_kmap_type = ttm_bo_map_iomap;
>>> -if (mem->placement & TTM_PL_FLAG_WC)
>>> +if (mem->placement & TTM_PL_FLAG_WC) {
>>>   map->virtual = ioremap_wc(bo->mem.bus.base + bo->mem.bus.offset +
>>> offset,
>>>    size);
>>> -else
>>> +if (bo->mem.bus.printk)
>>> +printk ("scratch ioremap_wc");
>>> +
>>> +}
>>> +else {
>>>   map->virtual = ioremap_nocache(bo->mem.bus.base + bo->mem.bus.offset
>>> + offset,
>>>         size);
>>> +if (bo->mem.bus.printk)
>>> +printk ("scratch ioremap_nocache");
>>> +}
>>>   }
>>>   return (!map->virtual) ? -ENOMEM : 0;
>>>   }
>>> @@ -618,21 +628,34 @@ void ttm_bo_kunmap(struct ttm_bo_kmap_obj *map)
>>>   struct ttm_mem_type_manager *man =
>>>   &bo->bdev->man[bo->mem.mem_type];
>>>   
>>> -if (!map->virtual)
>>> +if (!map->virtual) {
>>> +if (bo->mem.bus.printk)
>>> +printk ("scratch unmap return earlier");
>>>   return;
>>> +}
>>>   switch (map->bo_kmap_type) {
>>>   case ttm_bo_map_iomap:
>>> +if (bo->mem.bus.printk)
>>> +printk ("scratch iounmap");
>>>   iounmap(map->virtual);
>>>   break;
>>>   case ttm_bo_map_vmap:
>>> +if (bo->mem.bus.printk)
>>> +printk ("scratch vunmap");
>>>   vunmap(map->virtual);
>>>   break;
>>>   case ttm_bo_map_kmap:
>>> +if (bo->mem.bus.printk)
>>> +printk ("scratch kunmap");
>>>   kunmap(map->page);
>>>   break;
>>>   case ttm_bo_map_premapped:
>>> +if (bo->mem.bus.printk)
>>> +printk ("scratch unmap ttm_bo_map_premapped");
>>>   break;
>>>   default:
>>> +if (bo->mem.bus.printk)
>>> +printk ("scratch unmap bug");
>>>   BUG();
>>>   }
>>>   (void) ttm_mem_io_lock(man, false);
>>> diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h
>>> index 2d0f63e..f74a554 100644
>>> --- a/include/drm/ttm/ttm_bo_api.h
>>> +++ b/include/drm/ttm/ttm_bo_api.h
>>> @@ -70,6 +70,7 @@ struct ttm_bus_placement {
>>>   boolis_iomem;
>>>   boolio_reserved_vm;
>>>   uint64_t        io_reserved_count;
>>> +bool            printk;
>>>   };
>>>   
>>>   
>>> -- 
>>> 2.7.4
>>>
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Alex Bin
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Christian König <deathsimple@vodafone.de>
>>> *Sent:* Thursday, April 20, 2017 4:43 AM
>>> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org
>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>   
>>> Hi AlexBin,
>>>
>>>> Missing kunmap mapping in vmalloc will make kernel master page table
>>>> incorrect.
>>> That's what I tried to explain yesterday, but unfortunately didn't
>>> had time to do so. There is not corruption of the kernel master page
>>> table in this case!
>>>
>>> The call of ttm_bo_kunmap is completely optional, take a look at
>>> amdgpu_ttm_io_mem_reserve() and amdgpu_ttm_io_mem_free().
>>>
>>> The aperture is kept mapped into the page tables for the whole time
>>> the driver is loaded. So this is a complete no-op and only done for
>>> consistency.
>>>
>>>> It is good that you agree that there is no real world bad example
>>>> caused by my patch. I will not discuss whether it is an improvement
>>>> or not now to save time for both of us.
>>>>
>>> Great at least we can now agree to completely drop this patch.
>>>
>>> Thanks,
>>> Christian.
>>>
>>> Am 19.04.2017 um 21:30 schrieb Xie, AlexBin:
>>>> Hi Christian,
>>>>
>>>>
>>>> Missing kunmap mapping in vmalloc will make kernel master page table
>>>> incorrect. I would not call such issue as completely harmless.
>>>> Please note that AMD graphic driver can run in 32 bit system. In 32
>>>> bit system, vmalloc address space is much smaller than size of most
>>>> GPU VRAM.
>>>>
>>>>
>>>> amdgpu_bo_free_kernel has same issue as amdgpu_vram_scratch_fini. 1.
>>>> It calls amdgpu_bo_reserve interruptible too. 2. It misses kunmap
>>>> when amdgpu_bo_reserve returns error too. As result, kernel master
>>>> page table can become incorrect, or as you call it "completely
>>>> harmless vmalloc space leaking".
>>>>
>>>>
>>>> Because amdgpu_bo_free_kernel is used in more places, such as psp
>>>> command submission, there will be bigger chance to have other usage
>>>> where signal is not blocked. This will become a real bug.
>>>>
>>>>
>>>> I am thinking that we may fix the issue completely when TTM releases
>>>> BO. But that is a bigger change.
>>>>
>>>>
>>>> It is good that you agree that there is no real world bad example
>>>> caused by my patch. I will not discuss whether it is an improvement
>>>> or not now to save time for both of us.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Alex Bin Xie
>>>>
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Christian König <deathsimple@vodafone.de>
>>>> *Sent:* Wednesday, April 19, 2017 7:50 AM
>>>> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org
>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>   
>>>>> Without correctly kunmap, page table is corrupted. Page entries
>>>>> point to wrong memory locations. You might call it completely
>>>>> harmless. But I think this is a severe bug. Leaking memory is
>>>>> better than a corrupted page table. Think security.
>>>> We are talking about the page tables for the vmalloc area in the
>>>> kernel here, so no security problem. Leaking memory is much more
>>>> problematic.
>>>>
>>>>> Would you provide any document and reference by saying" It is
>>>>> impossible to receive a signal during module load/unload"? For
>>>>> example, if the unload stuck in a lock, can CTRL+C stop the unload?
>>>>>
>>>> No, CTRL+C doesn't abort module load/unload. There have been patches
>>>> to changes this a while ago, but IIRC it broke a whole bunch of
>>>> driver relying on this.
>>>>
>>>>> What about there is some other return error? What about in future
>>>>> somebody improve amdgpu_bo_reserve to return other errors,
>>>>> then function amdgpu_vram_scratch_fini becomes buggy?
>>>>>
>>>> Yes, that is indeed an issue. For example -EDEADLK is possible as
>>>> well. That's why I said we should use amdgpu_bo_free_kernel() instead.
>>>>
>>>>> While I am thinking whether there is a better way for the current
>>>>> situation, would you give a real world example that my patch really
>>>>> not working? Then we can address it.
>>>>>
>>>> I don't think there is because the driver can't receive a signal
>>>> during load/unload, but the problem is rather that the patch doesn't
>>>> improve the situation at all.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 19.04.2017 um 13:37 schrieb Xie, AlexBin:
>>>>> Hi Christian,
>>>>>
>>>>>
>>>>> Without correctly kunmap, page table is corrupted. Page entries
>>>>> point to wrong memory locations. You might call it completely
>>>>> harmless. But I think this is a severe bug. Leaking memory is
>>>>> better than a corrupted page table. Think security.
>>>>>
>>>>>
>>>>> Would you provide any document and reference by saying" It is
>>>>> impossible to receive a signal during module load/unload"? For
>>>>> example, if the unload stuck in a lock, can CTRL+C stop the unload?
>>>>>
>>>>>
>>>>> If "It is impossible to receive a signal during module
>>>>> load/unload", interruptible waiting is fine too, because
>>>>> function amdgpu_bo_reserve will return successfully.
>>>>>
>>>>>
>>>>> What about there is some other return error? What about in future
>>>>> somebody improve amdgpu_bo_reserve to return other errors,
>>>>> then function amdgpu_vram_scratch_fini becomes buggy?
>>>>>
>>>>>
>>>>> While I am thinking whether there is a better way for the current
>>>>> situation, would you give a real world example that my patch really
>>>>> not working? Then we can address it.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Alex Bin
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> *From:* Christian König <deathsimple@vodafone.de>
>>>>> *Sent:* Wednesday, April 19, 2017 2:35 AM
>>>>> *To:* Xie, AlexBin; Zhou, David(ChunMing);
>>>>> amd-gfx@lists.freedesktop.org
>>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>>   
>>>>> Hi AlexBin,
>>>>>
>>>>> the answer is ttm_bo_kunmap isn't called at all and yes in the case
>>>>> of an iomap we leak the address space reserved.
>>>>>
>>>>> But that is completely harmless on a 64bit system compared to
>>>>> leaking the memory backing the address space.
>>>>>
>>>>> Using amdgpu_bo_free_kernel() instead of openly coding it here is
>>>>> probably a good idea.
>>>>>
>>>>> Additional to that it's probably a good idea to set the no_intr
>>>>> flag when reserving kernel BOs. It is impossible to receive a
>>>>> signal during module load/unload, but it's probably better to
>>>>> document that in the code as well.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 18.04.2017 um 20:54 schrieb Xie, AlexBin:
>>>>>> Hi Christian,
>>>>>>
>>>>>> Have you found how/where/when? When you said "mapping will just be
>>>>>> released a bit later on", you must know the answer.
>>>>>>
>>>>>> It is difficult to prove something does not exist. Anyway, I will
>>>>>> give it a try to prove such "later on" does not exist.
>>>>>>
>>>>>> Function ttm_bo_kunmap is the only function to unmap. To prove
>>>>>> this, search ttm_bo_map_iomap, only ttm_bo_kunmap use this enum to
>>>>>> correctly kunmap.
>>>>>>
>>>>>> Function ttm_bo_kunmap is not called by ttm itself. This is a hint
>>>>>> that all TTM delay delete mechanism or unref mechanism will NOT
>>>>>> kunmap BO later on.
>>>>>>
>>>>>> Function ttm_bo_kunmap is called by AMDGPU
>>>>>> function amdgpu_bo_kunmap and amdgpu_gem_prime_vunmap.
>>>>>>
>>>>>> Search AMDGPU for amdgpu_bo_kunmap. All matches do not kunmap for
>>>>>> scratch VRAM BO.  amdgpu_bo_free_kernel is a suspect but the
>>>>>> answer is still NO.
>>>>>>
>>>>>> So all possibilities are searched. Did I miss anything?
>>>>>>
>>>>>> Thanks,
>>>>>> Alex Bin Xie
>>>>>>
>>>>>> ------------------------------------------------------------------------
>>>>>> *From:* Xie, AlexBin
>>>>>> *Sent:* Tuesday, April 18, 2017 2:04:33 PM
>>>>>> *To:* Christian König; Zhou, David(ChunMing);
>>>>>> amd-gfx@lists.freedesktop.org
>>>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>>>   
>>>>>>
>>>>>> Hi Christian,
>>>>>>
>>>>>>
>>>>>> Would you point out where/when will kunmap happen for this BO when
>>>>>> release? It must be somewhere in some function calls.
>>>>>>
>>>>>>
>>>>>> I checked before I asked for review. But I did not see such
>>>>>> obvious kunmap function call.
>>>>>>
>>>>>>
>>>>>> If so, there should be a comment in
>>>>>> function amdgpu_vram_scratch_fini to avoid future confusion.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Alex Bin Xie
>>>>>> ------------------------------------------------------------------------
>>>>>> *From:* Christian König <deathsimple@vodafone.de>
>>>>>> *Sent:* Tuesday, April 18, 2017 1:46 PM
>>>>>> *To:* Xie, AlexBin; Zhou, David(ChunMing);
>>>>>> amd-gfx@lists.freedesktop.org
>>>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>>>   
>>>>>> Hi AlexBin,
>>>>>>
>>>>>> No, David is right. This is a very common coding pattern in the
>>>>>> kernel module.
>>>>>>
>>>>>> Freeing up a BO while there still exists a kernel mapping is
>>>>>> perfectly ok, the mapping will just be released a bit later on.
>>>>>>
>>>>>> So this code is actually perfectly ok and just an optimization,
>>>>>> but your patch breaks it and creates a memory leak.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> Am 18.04.2017 um 17:17 schrieb Xie, AlexBin:
>>>>>>> Hi David,
>>>>>>>
>>>>>>>
>>>>>>> When amdgpu_bo_reserve return errors, we cannot release the BO.
>>>>>>> This is not a memory leak. General speaking, memory leak
>>>>>>> is unnoticed and unintentional.
>>>>>>>
>>>>>>>
>>>>>>> The caller of function amdgpu_vram_scratch_fini ignores the
>>>>>>> return error value...
>>>>>>>
>>>>>>>
>>>>>>> The "memory leak" is not caused by my patch. It is caused because
>>>>>>> reserving BO fails.
>>>>>>>
>>>>>>>
>>>>>>> This patch only aim to make function amdgpu_vram_scratch_fini
>>>>>>> behave correctly.
>>>>>>>
>>>>>>>
>>>>>>> To follow up, we can add a warning message when amdgpu_bo_reserve
>>>>>>> error happens in a different patch.
>>>>>>>
>>>>>>>
>>>>>>> If function call amdgpu_bo_reserve is changed to uninterruptible,
>>>>>>> this changes driver behaviour. Without a substantial issue, I
>>>>>>> would be cautious for such a change.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Alex Bin Xie
>>>>>>>
>>>>>>>
>>>>>>> ------------------------------------------------------------------------
>>>>>>> *From:* Zhou, David(ChunMing)
>>>>>>> *Sent:* Monday, April 17, 2017 10:38 PM
>>>>>>> *To:* Xie, AlexBin; amd-gfx@lists.freedesktop.org
>>>>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>>>>   
>>>>>>>
>>>>>>>
>>>>>>> On 2017年04月17日 22:54, Xie, AlexBin wrote:
>>>>>>>> Hi David,
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for the comments. However, please have look at
>>>>>>>> amdgpu_bo_reserve definition.
>>>>>>>>
>>>>>>>> static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool
>>>>>>>> no_intr)
>>>>>>>>
>>>>>>> Ah, this is a wired wrapper for ttm_bo_reserve.
>>>>>>>
>>>>>>>> When we call this function like the following:
>>>>>>>>
>>>>>>>>           r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>>>>>>> The false means interruptible.
>>>>>>>>
>>>>>>>>
>>>>>>>> On the other hand,  when amdgpu_bo_reserve function return
>>>>>>>> error, why do we unref BO without kunmap and unpin the BO? This
>>>>>>>> is wrong implementation when amdgpu_bo_reserve return any error.
>>>>>>>>
>>>>>>> Yeah, I see your mean, it's in driver un-loading, How about
>>>>>>> changing to no interruptible? Your patch will make a memleak if
>>>>>>> bo_reserve fails, but it seems not matter. I have no strong
>>>>>>> preference.
>>>>>>>
>>>>>>> Regards,
>>>>>>> David Zhou
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Alex Bin Xie
>>>>>>>>
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>> *From:* Zhou, David(ChunMing)
>>>>>>>> *Sent:* Friday, April 14, 2017 12:00 AM
>>>>>>>> *To:* Xie, AlexBin; amd-gfx@lists.freedesktop.org
>>>>>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>>>>>   
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2017年04月14日 05:34, Alex Xie wrote:
>>>>>>>>> According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
>>>>>>>>> can return with -ERESTARTSYS. When this function was interrupted
>>>>>>>>> by a signal, BO should not be unref. Otherwise the BO might be
>>>>>>>>> released while is kmapped and pinned, or BO MIGHT be deref
>>>>>>>>> multiple times, etc.
>>>>>>>>           r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>>>>>>> we have specified interruptible to false, so -ERESTARTSYS isn't
>>>>>>>> possible
>>>>>>>> here.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> David Zhou
>>>>>>>>> Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
>>>>>>>>> Signed-off-by: Alex Xie <AlexBin.Xie@amd.com>
>>>>>>>>> ---
>>>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>>>>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> index 53996e3..1dcc2d1 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> @@ -355,8 +355,8 @@ static void
>>>>>>>> amdgpu_vram_scratch_fini(struct amdgpu_device *adev)
>>>>>>>>>                 amdgpu_bo_kunmap(adev->vram_scratch.robj);
>>>>>>>>>                 amdgpu_bo_unpin(adev->vram_scratch.robj);
>>>>>>>>>                 amdgpu_bo_unreserve(adev->vram_scratch.robj);
>>>>>>>>> +             amdgpu_bo_unref(&adev->vram_scratch.robj);
>>>>>>>>>         }
>>>>>>>>> -     amdgpu_bo_unref(&adev->vram_scratch.robj);
>>>>>>>>>    }
>>>>>>>>>   
>>>>>>>>>    /**
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> amd-gfx mailing list
>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> amd-gfx mailing list
>>>>>> amd-gfx@lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>>
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
       [not found]                                                                 ` <a8715aca-05ff-1262-e2f5-31279052e606-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2017-04-21 17:53                                                                   ` Felix Kuehling
  0 siblings, 0 replies; 18+ messages in thread
From: Felix Kuehling @ 2017-04-21 17:53 UTC (permalink / raw)
  To: Christian König, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


On 17-04-21 01:01 PM, Christian König wrote:
> Am 21.04.2017 um 17:43 schrieb Felix Kuehling:
>> On 17-04-21 03:11 AM, Christian König wrote:
>>> Hi Alex,
>>>
>>>> No. For the current source code, I think the premap and no-op is not
>>>> working.
>>>>
>>> Indeed, we don't set bo->mem.bus.addr in amdgpu_ttm_io_mem_reserve()
>>> any more. Felix will probably want to fix that for the KFD branch.
>> I vaguely remember discussing this in the past: using mem->bus.addr to
>> keep memory permanently CPU-mapped and avoid redundant ioremap calls. As
>> I remember it, we weren't actually using this. It's only something we
>> considered at one point.
>
> Well that explains why my memory fooled me.
>
> My last status was that we implemented this together with CPU based
> page table updates.
>
> Otherwise mapping them for CPU access is rather heavy if we indeed
> update the page tables every time.

We also don't currently have CPU page table updates for KFD. When we
rebased on 4.9, the VM code was restructured so much that we essentially
had to go back and reimplement it. Harish is working on this right now.

Regards,
  Felix

>
> Christian.
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-04-21 17:53 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-13 21:34 [PATCH] dmr/amdgpu: Fix wrongly unref of BO Alex Xie
     [not found] ` <1492119298-8526-1-git-send-email-AlexBin.Xie-5C7GfCeVMHo@public.gmane.org>
2017-04-14  4:00   ` zhoucm1
     [not found]     ` <58F0495D.60607-5C7GfCeVMHo@public.gmane.org>
2017-04-17 14:54       ` Xie, AlexBin
     [not found]         ` <MWHPR1201MB0045DE55E81D0FF5CE76671FF2060-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-04-18  2:38           ` zhoucm1
     [not found]             ` <58F57C14.20403-5C7GfCeVMHo@public.gmane.org>
2017-04-18 15:17               ` Xie, AlexBin
     [not found]                 ` <MWHPR1201MB00452889FF2C7FDC55630B4BF2190-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-04-18 17:46                   ` Christian König
     [not found]                     ` <7807726d-7318-b72f-1a14-2b37a73988bf-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-04-18 18:04                       ` Xie, AlexBin
     [not found]                         ` <MWHPR1201MB00451DFD2A06BBA3CD6F928EF2190-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-04-18 18:54                           ` Xie, AlexBin
     [not found]                             ` <MWHPR1201MB004553768BCAC951D28A347AF2190-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-04-19  6:35                               ` Christian König
     [not found]                                 ` <4509f79a-ffd5-eb40-117c-84ea02cbf5cd-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-04-19 11:37                                   ` Xie, AlexBin
     [not found]                                     ` <MWHPR1201MB0045EC51F6CE7C6C4283888CF2180-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-04-19 11:50                                       ` Christian König
     [not found]                                         ` <2b43e2a9-0c79-f945-c835-0cfc486304c8-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-04-19 19:30                                           ` Xie, AlexBin
     [not found]                                             ` <MWHPR1201MB0045B04F77E1F1C6D37D24FBF2180-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-04-20  8:43                                               ` Christian König
     [not found]                                                 ` <96809715-3c78-3784-fbeb-57a00359e3f3-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-04-20 21:56                                                   ` Xie, AlexBin
     [not found]                                                     ` <MWHPR1201MB004568A33CBF287710759301F21B0-3iK1xFAIwjq45V35fqe+hGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-04-21  7:11                                                       ` Christian König
     [not found]                                                         ` <af065740-e219-fccb-4dac-d46f3472f4aa-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-04-21 15:43                                                           ` Felix Kuehling
     [not found]                                                             ` <003d5e5f-ea99-4517-fd4a-de4df3ce1b89-5C7GfCeVMHo@public.gmane.org>
2017-04-21 17:01                                                               ` Christian König
     [not found]                                                                 ` <a8715aca-05ff-1262-e2f5-31279052e606-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-04-21 17:53                                                                   ` Felix Kuehling

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.