Hi AlexBin, > Missing kunmap mapping in vmalloc will make kernel master page table > incorrect. That's what I tried to explain yesterday, but unfortunately didn't had time to do so. There is not corruption of the kernel master page table in this case! The call of ttm_bo_kunmap is completely optional, take a look at amdgpu_ttm_io_mem_reserve() and amdgpu_ttm_io_mem_free(). The aperture is kept mapped into the page tables for the whole time the driver is loaded. So this is a complete no-op and only done for consistency. > It is good that you agree that there is no real world bad example > caused by my patch. I will not discuss whether it is an improvement or > not now to save time for both of us. > Great at least we can now agree to completely drop this patch. Thanks, Christian. Am 19.04.2017 um 21:30 schrieb Xie, AlexBin: > > Hi Christian, > > > Missing kunmap mapping in vmalloc will make kernel master page table > incorrect. I would not call such issue as completely harmless. Please > note that AMD graphic driver can run in 32 bit system. In 32 bit > system, vmalloc address space is much smaller than size of most GPU VRAM. > > > amdgpu_bo_free_kernel has same issue as amdgpu_vram_scratch_fini. 1. > It calls amdgpu_bo_reserve interruptible too. 2. It misses kunmap when > amdgpu_bo_reserve returns error too. As result, kernel master page > table can become incorrect, or as you call it "completely harmless > vmalloc space leaking". > > > Because amdgpu_bo_free_kernel is used in more places, such as psp > command submission, there will be bigger chance to have other usage > where signal is not blocked. This will become a real bug. > > > I am thinking that we may fix the issue completely when TTM releases > BO. But that is a bigger change. > > > It is good that you agree that there is no real world bad example > caused by my patch. I will not discuss whether it is an improvement or > not now to save time for both of us. > > > Thanks, > > Alex Bin Xie > > > ------------------------------------------------------------------------ > *From:* Christian König > *Sent:* Wednesday, April 19, 2017 7:50 AM > *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org > *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO >> Without correctly kunmap, page table is corrupted. Page entries point >> to wrong memory locations. You might call it completely harmless. But >> I think this is a severe bug. Leaking memory is better than a >> corrupted page table. Think security. > We are talking about the page tables for the vmalloc area in the > kernel here, so no security problem. Leaking memory is much more > problematic. > >> Would you provide any document and reference by saying" It is >> impossible to receive a signal during module load/unload"? For >> example, if the unload stuck in a lock, can CTRL+C stop the unload? >> > No, CTRL+C doesn't abort module load/unload. There have been patches > to changes this a while ago, but IIRC it broke a whole bunch of driver > relying on this. > >> What about there is some other return error? What about in future >> somebody improve amdgpu_bo_reserve to return other errors, >> then function amdgpu_vram_scratch_fini becomes buggy? >> > Yes, that is indeed an issue. For example -EDEADLK is possible as > well. That's why I said we should use amdgpu_bo_free_kernel() instead. > >> While I am thinking whether there is a better way for the current >> situation, would you give a real world example that my patch really >> not working? Then we can address it. >> > I don't think there is because the driver can't receive a signal > during load/unload, but the problem is rather that the patch doesn't > improve the situation at all. > > Regards, > Christian. > > Am 19.04.2017 um 13:37 schrieb Xie, AlexBin: >> >> Hi Christian, >> >> >> Without correctly kunmap, page table is corrupted. Page entries point >> to wrong memory locations. You might call it completely harmless. But >> I think this is a severe bug. Leaking memory is better than a >> corrupted page table. Think security. >> >> >> Would you provide any document and reference by saying" It is >> impossible to receive a signal during module load/unload"? For >> example, if the unload stuck in a lock, can CTRL+C stop the unload? >> >> >> If "It is impossible to receive a signal during module load/unload", >> interruptible waiting is fine too, because function amdgpu_bo_reserve >> will return successfully. >> >> >> What about there is some other return error? What about in future >> somebody improve amdgpu_bo_reserve to return other errors, >> then function amdgpu_vram_scratch_fini becomes buggy? >> >> >> While I am thinking whether there is a better way for the current >> situation, would you give a real world example that my patch really >> not working? Then we can address it. >> >> >> Thanks, >> >> Alex Bin >> >> >> ------------------------------------------------------------------------ >> *From:* Christian König >> *Sent:* Wednesday, April 19, 2017 2:35 AM >> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org >> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO >> Hi AlexBin, >> >> the answer is ttm_bo_kunmap isn't called at all and yes in the case >> of an iomap we leak the address space reserved. >> >> But that is completely harmless on a 64bit system compared to leaking >> the memory backing the address space. >> >> Using amdgpu_bo_free_kernel() instead of openly coding it here is >> probably a good idea. >> >> Additional to that it's probably a good idea to set the no_intr flag >> when reserving kernel BOs. It is impossible to receive a signal >> during module load/unload, but it's probably better to document that >> in the code as well. >> >> Regards, >> Christian. >> >> Am 18.04.2017 um 20:54 schrieb Xie, AlexBin: >>> Hi Christian, >>> >>> Have you found how/where/when? When you said "mapping will just be >>> released a bit later on", you must know the answer. >>> >>> It is difficult to prove something does not exist. Anyway, I will >>> give it a try to prove such "later on" does not exist. >>> >>> Function ttm_bo_kunmap is the only function to unmap. To prove this, >>> search ttm_bo_map_iomap, only ttm_bo_kunmap use this enum to >>> correctly kunmap. >>> >>> Function ttm_bo_kunmap is not called by ttm itself. This is a hint >>> that all TTM delay delete mechanism or unref mechanism will NOT >>> kunmap BO later on. >>> >>> Function ttm_bo_kunmap is called by AMDGPU function amdgpu_bo_kunmap >>> and amdgpu_gem_prime_vunmap. >>> >>> Search AMDGPU for amdgpu_bo_kunmap. All matches do not kunmap for >>> scratch VRAM BO. amdgpu_bo_free_kernel is a suspect but the answer >>> is still NO. >>> >>> So all possibilities are searched. Did I miss anything? >>> >>> Thanks, >>> Alex Bin Xie >>> >>> ------------------------------------------------------------------------ >>> *From:* Xie, AlexBin >>> *Sent:* Tuesday, April 18, 2017 2:04:33 PM >>> *To:* Christian König; Zhou, David(ChunMing); >>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org >>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO >>> >>> Hi Christian, >>> >>> >>> Would you point out where/when will kunmap happen for this BO when >>> release? It must be somewhere in some function calls. >>> >>> >>> I checked before I asked for review. But I did not see such obvious >>> kunmap function call. >>> >>> >>> If so, there should be a comment in function >>> amdgpu_vram_scratch_fini to avoid future confusion. >>> >>> >>> Thanks, >>> Alex Bin Xie >>> ------------------------------------------------------------------------ >>> *From:* Christian König >>> *Sent:* Tuesday, April 18, 2017 1:46 PM >>> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org >>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO >>> Hi AlexBin, >>> >>> No, David is right. This is a very common coding pattern in the >>> kernel module. >>> >>> Freeing up a BO while there still exists a kernel mapping is >>> perfectly ok, the mapping will just be released a bit later on. >>> >>> So this code is actually perfectly ok and just an optimization, but >>> your patch breaks it and creates a memory leak. >>> >>> Regards, >>> Christian. >>> >>> Am 18.04.2017 um 17:17 schrieb Xie, AlexBin: >>>> >>>> Hi David, >>>> >>>> >>>> When amdgpu_bo_reserve return errors, we cannot release the BO. >>>> This is not a memory leak. General speaking, memory leak >>>> is unnoticed and unintentional. >>>> >>>> >>>> The caller of function amdgpu_vram_scratch_fini ignores the return >>>> error value... >>>> >>>> >>>> The "memory leak" is not caused by my patch. It is caused because >>>> reserving BO fails. >>>> >>>> >>>> This patch only aim to make function amdgpu_vram_scratch_fini >>>> behave correctly. >>>> >>>> >>>> To follow up, we can add a warning message when amdgpu_bo_reserve >>>> error happens in a different patch. >>>> >>>> >>>> If function call amdgpu_bo_reserve is changed to uninterruptible, >>>> this changes driver behaviour. Without a substantial issue, I would >>>> be cautious for such a change. >>>> >>>> >>>> Thanks, >>>> >>>> Alex Bin Xie >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> *From:* Zhou, David(ChunMing) >>>> *Sent:* Monday, April 17, 2017 10:38 PM >>>> *To:* Xie, AlexBin; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org >>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO >>>> >>>> >>>> On 2017年04月17日 22:54, Xie, AlexBin wrote: >>>>> >>>>> Hi David, >>>>> >>>>> >>>>> Thanks for the comments. However, please have look at >>>>> amdgpu_bo_reserve definition. >>>>> >>>>> static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool >>>>> no_intr) >>>>> >>>> Ah, this is a wired wrapper for ttm_bo_reserve. >>>> >>>>> >>>>> When we call this function like the following: >>>>> >>>>> r = amdgpu_bo_reserve(adev->vram_scratch.robj, false); >>>>> The false means interruptible. >>>>> >>>>> >>>>> On the other hand, when amdgpu_bo_reserve function return error, >>>>> why do we unref BO without kunmap and unpin the BO? This is wrong >>>>> implementation when amdgpu_bo_reserve return any error. >>>>> >>>> Yeah, I see your mean, it's in driver un-loading, How about >>>> changing to no interruptible? Your patch will make a memleak if >>>> bo_reserve fails, but it seems not matter. I have no strong preference. >>>> >>>> Regards, >>>> David Zhou >>>>> >>>>> >>>>> Thanks, >>>>> Alex Bin Xie >>>>> >>>>> ------------------------------------------------------------------------ >>>>> *From:* Zhou, David(ChunMing) >>>>> *Sent:* Friday, April 14, 2017 12:00 AM >>>>> *To:* Xie, AlexBin; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org >>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO >>>>> >>>>> >>>>> On 2017年04月14日 05:34, Alex Xie wrote: >>>>> > According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve >>>>> > can return with -ERESTARTSYS. When this function was interrupted >>>>> > by a signal, BO should not be unref. Otherwise the BO might be >>>>> > released while is kmapped and pinned, or BO MIGHT be deref >>>>> > multiple times, etc. >>>>> r = amdgpu_bo_reserve(adev->vram_scratch.robj, false); >>>>> we have specified interruptible to false, so -ERESTARTSYS isn't >>>>> possible >>>>> here. >>>>> >>>>> Thanks, >>>>> David Zhou >>>>> > >>>>> > Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621 >>>>> > Signed-off-by: Alex Xie >>>>> > --- >>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +- >>>>> > 1 file changed, 1 insertion(+), 1 deletion(-) >>>>> > >>>>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> > index 53996e3..1dcc2d1 100644 >>>>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> > @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct >>>>> amdgpu_device *adev) >>>>> > amdgpu_bo_kunmap(adev->vram_scratch.robj); >>>>> > amdgpu_bo_unpin(adev->vram_scratch.robj); >>>>> > amdgpu_bo_unreserve(adev->vram_scratch.robj); >>>>> > + amdgpu_bo_unref(&adev->vram_scratch.robj); >>>>> > } >>>>> > - amdgpu_bo_unref(&adev->vram_scratch.robj); >>>>> > } >>>>> > >>>>> > /** >>>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> amd-gfx mailing list >>>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org >>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>> >>> >>> >>> >>> _______________________________________________ >>> amd-gfx mailing list >>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >> >> >> >> >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > >