From: zhoucm1 <zhoucm1@amd.com>
To: "Chris Wilson" <chris@chris-wilson.co.uk>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Eric Anholt" <eric@anholt.net>,
	christian.koenig@amd.com
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2] drm: Revert syncobj timeline changes.
Date: Tue, 13 Nov 2018 13:57:27 +0800	[thread overview]
Message-ID: <cd0f89e7-1018-e4c3-3a81-4e796c384c1d@amd.com> (raw)
In-Reply-To: <154201973877.16646.5745251436337959698@skylake-alporthouse-com>



On 12 Nov 2018 18:48, Chris Wilson wrote:
> Quoting Christian König (2018-11-12 10:16:01)
>> On 09.11.18 at 23:26, Eric Anholt wrote:
>>
>>      Eric Anholt <eric@anholt.net> writes:
>>
>>
>>          [ Unknown signature status ]
>>          zhoucm1 <zhoucm1@amd.com> writes:
>>
>>
>>              On 9 Nov 2018 00:52, Christian König wrote:
>>
>>                  On 08.11.18 at 17:07, Koenig, Christian wrote:
>>
>>                      On 08.11.18 at 17:04, Eric Anholt wrote:
>>
>>                          Daniel suggested I submit this, since we're still seeing regressions
>>                          from it.  This is a revert to before 48197bc564c7 ("drm: add syncobj
>>                          timeline support v9") and its follow-on fixes.
>>
>>                      This is a harmless false positive from lockdep; Chunming and I are
>>                      already working on a fix.
>>
>>                  On the other hand we had enough trouble with that patch, so if it
>>                  really bothers you feel free to add my Acked-by: Christian König
>>                  <christian.koenig@amd.com> and push it.
>>
>>              NAK, please no, I don't think this is needed. The warning isn't
>>              related to the syncobj timeline at all; it is a fence-array
>>              implementation flaw that syncobj merely exposed.
>>              In addition, Christian already has a fix for this warning, which
>>              I've tested. Christian, please send it out for public review.
>>
>>          I backed out my revert of #2 (#1 still necessary) after adding the
>>          lockdep regression fix, and now my CTS run got OOM-killed after just a
>>          few hours, with these notable lines in the unreclaimable slab info list:
>>
>>          [ 6314.373099] drm_sched_fence        69095KB      69095KB
>>          [ 6314.373653] kmemleak_object       428249KB     428384KB
>>          [ 6314.373736] kmalloc-262144           256KB        256KB
>>          [ 6314.373743] kmalloc-131072           128KB        128KB
>>          [ 6314.373750] kmalloc-65536             64KB         64KB
>>          [ 6314.373756] kmalloc-32768           1472KB       1728KB
>>          [ 6314.373763] kmalloc-16384             64KB         64KB
>>          [ 6314.373770] kmalloc-8192             208KB        208KB
>>          [ 6314.373778] kmalloc-4096            2408KB       2408KB
>>          [ 6314.373784] kmalloc-2048             288KB        336KB
>>          [ 6314.373792] kmalloc-1024            1457KB       1512KB
>>          [ 6314.373800] kmalloc-512              854KB       1048KB
>>          [ 6314.373808] kmalloc-256              188KB        268KB
>>          [ 6314.373817] kmalloc-192            69141KB      69142KB
>>          [ 6314.373824] kmalloc-64             47703KB      47704KB
>>          [ 6314.373886] kmalloc-128            46396KB      46396KB
>>          [ 6314.373894] kmem_cache                31KB         35KB
>>
>>          No results from kmemleak, though.
>>
>>      OK, it looks like the #2 revert probably isn't related to the OOM issue.
>>      Running a single job on otherwise unused DRM, watching /proc/slabinfo
>>      every second for drm_sched_fence, I get:
>>
>>      drm_sched_fence        0      0    192   21    1 : tunables   32   16    8 : slabdata      0      0      0 : globalstat       0      0     0    0    0    0    0    0    0 : cpustat      0      0      0      0
>>      drm_sched_fence       16     21    192   21    1 : tunables   32   16    8 : slabdata      1      1      0 : globalstat      16     16     1    0    0    0    0    0    0 : cpustat      5      1      6      0
>>      drm_sched_fence       13     21    192   21    1 : tunables   32   16    8 : slabdata      1      1      0 : globalstat      16     16     1    0    0    0    0    0    0 : cpustat      5      1      6      0
>>      drm_sched_fence        6     21    192   21    1 : tunables   32   16    8 : slabdata      1      1      0 : globalstat      16     16     1    0    0    0    0    0    0 : cpustat      5      1      6      0
>>      drm_sched_fence        4     21    192   21    1 : tunables   32   16    8 : slabdata      1      1      0 : globalstat      16     16     1    0    0    0    0    0    0 : cpustat      5      1      6      0
>>      drm_sched_fence        2     21    192   21    1 : tunables   32   16    8 : slabdata      1      1      0 : globalstat      16     16     1    0    0    0    0    0    0 : cpustat      5      1      6      0
>>      drm_sched_fence        0     21    192   21    1 : tunables   32   16    8 : slabdata      0      1      0 : globalstat      16     16     1    0    0    0    0    0    0 : cpustat      5      1      6      0
>>
>>      So we generate a ton of fences, and I guess free them slowly because of
>>      RCU?  And presumably kmemleak was sucking up lots of memory because of
>>      how many of these objects were lying around.
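For context on the RCU theory: the scheduler fence really is only
returned to its slab after an RCU grace period, so recently released
fences keep showing up in slabinfo for a while.  A paraphrased sketch
of that release path, modeled on drivers/gpu/drm/scheduler/sched_fence.c
rather than quoted verbatim:

	/* The actual free is deferred until after an RCU grace period,
	 * so the slab's object count lags the number of live fences. */
	static void drm_sched_fence_free(struct rcu_head *rcu)
	{
		struct dma_fence *f = container_of(rcu, struct dma_fence, rcu);
		struct drm_sched_fence *fence = to_drm_sched_fence(f);

		kmem_cache_free(sched_fence_slab, fence);
	}

	/* ...and the fence release callback queues it: */
	call_rcu(&fence->finished.rcu, drm_sched_fence_free);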
>>
>>
>> That is certainly possible. Another possibility is that we don't drop the
>> references held by the dma-fence-array early enough.
>>
>> E.g. the dma-fence-array keeps the references to its fences until it is
>> destroyed, which is a bit late when you chain multiple dma-fence-array
>> objects together.
>>
>> David, can you take a look at this and propose a fix? That would probably
>> be good to have fixed in dma-fence-array separately from the timeline work.
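One possible shape of such a fix, sketched under stated assumptions
rather than taken from an actual patch: drop the reference to each
member fence as soon as it signals, instead of holding all of them
until dma_fence_array_release().  The slot-index recovery assumes the
callback structs sit directly behind the array allocation, and the
NULL-ing of the slot is hypothetical bookkeeping for this sketch:

	static void dma_fence_array_cb_func(struct dma_fence *f,
					    struct dma_fence_cb *cb)
	{
		struct dma_fence_array_cb *array_cb =
			container_of(cb, struct dma_fence_array_cb, cb);
		struct dma_fence_array *array = array_cb->array;
		/* assumption: cbs are allocated contiguously after *array */
		unsigned int idx =
			array_cb - (struct dma_fence_array_cb *)&array[1];

		/* Put our reference early and clear the slot so that
		 * dma_fence_array_release() does not put it again. */
		array->fences[idx] = NULL;
		dma_fence_put(f);

		if (atomic_dec_and_test(&array->num_pending))
			dma_fence_signal(&array->base);
		dma_fence_put(&array->base);
	}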
> Note that drm_syncobj_replace_fence() leaks any existing fence for
> !timeline syncobjs.
Hi Chris,

Isn't the existing fence collected as garbage?

Could you point out where/how the existing fence is leaked?

Thanks,
David
> Which, coupled with the linear search, ends up as a severe regression in
> both time and memory.
> -Chris
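A minimal illustration of the leak pattern Chris is pointing at, not
the actual drm_syncobj code: if the pointer to the previously installed
fence is overwritten without a dma_fence_put(), the reference taken
when it was installed is never dropped.

	/* Illustrative only; real code also needs locking/RCU around *slot. */
	static void replace_fence_leaky(struct dma_fence **slot,
					struct dma_fence *fence)
	{
		dma_fence_get(fence);
		*slot = fence;		/* old fence's reference is leaked */
	}

	static void replace_fence_fixed(struct dma_fence **slot,
					struct dma_fence *fence)
	{
		struct dma_fence *old = *slot;

		dma_fence_get(fence);
		*slot = fence;
		dma_fence_put(old);	/* dma_fence_put() tolerates NULL */
	}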


Thread overview: 32+ messages
2018-11-08 16:04 [PATCH 0/2] reverts to un-regress v3d Eric Anholt
2018-11-08 16:04 ` [PATCH 1/2] Revert "drm/sched: fix timeout handling v2" Eric Anholt
2018-11-08 16:10   ` Koenig, Christian
2018-11-08 16:19     ` Eric Anholt
2018-11-08 16:48       ` Koenig, Christian
2018-11-08 16:04 ` [PATCH 2/2] drm: Revert syncobj timeline changes Eric Anholt
2018-11-08 16:07   ` Koenig, Christian
2018-11-08 16:52     ` Christian König
2018-11-09  2:35       ` zhoucm1
2018-11-09 21:10         ` Eric Anholt
2018-11-09 22:26           ` Eric Anholt
2018-11-12 10:16             ` Christian König
2018-11-12 10:28               ` zhoucm1
2018-11-12 10:48               ` Chris Wilson
2018-11-12 11:47                 ` Koenig, Christian
2018-11-13  5:57                 ` zhoucm1 [this message]
2018-11-13  6:18               ` zhoucm1
     [not found]                 ` <84df39ac-e8a2-f0f5-6562-f2df25c110e8@amd.com>
     [not found]                   ` <8736s4w31m.fsf@anholt.net>
     [not found]                     ` <22b7eef1-cc13-5662-5656-d39aeb0c78e0@amd.com>
     [not found]                       ` <87bm6rb0cy.fsf@anholt.net>
2018-11-15 18:47                         ` Eric Anholt
2018-12-19 17:53   ` Dmitry Osipenko
2018-12-21 18:27     ` Christian König
2018-12-21 18:35       ` Dmitry Osipenko
2018-12-21 18:45         ` Koenig, Christian
2018-12-21 18:59           ` Dmitry Osipenko
