From: Neil Armstrong <narmstrong@baylibre.com>
To: "Grodzovsky, Andrey" <Andrey.Grodzovsky@amd.com>,
	"daniel@ffwll.ch" <daniel@ffwll.ch>,
	"airlied@linux.ie" <airlied@linux.ie>,
	"Koenig, Christian" <Christian.Koenig@amd.com>
Cc: Erico Nunes <nunes.erico@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"steven.price@arm.com" <steven.price@arm.com>,
	"dri-devel@lists.freedesktop.org" <dri-devel@lists.freedesktop.org>,
	Rob Herring <robh@kernel.org>,
	Tomeu Vizoso <tomeu.vizoso@collabora.com>,
	"open list:ARM/Amlogic Meson..." <linux-amlogic@lists.infradead.org>
Subject: Re: drm_sched with panfrost crash on T820
Date: Thu, 3 Oct 2019 10:36:44 +0200
Message-ID: <530d1549-367f-b387-7f89-be6221b864a9@baylibre.com>
In-Reply-To: <d5ceef14-b876-c102-d793-25289635cab1@amd.com>

On 02/10/2019 at 18:53, Grodzovsky, Andrey wrote:
>
> On 9/30/19 5:17 AM, Neil Armstrong wrote:
>> Hi Andrey,
>>
>> On 27/09/2019 22:55, Grodzovsky, Andrey wrote:
>>> Can you please use addr2line or gdb to pinpoint where in
>>> drm_sched_increase_karma you hit the NULL ptr ? It looks like the guilty
>>> job, but to be sure.
>> Did a new run from 5.3:
>>
>> [ 35.971972] Call trace:
>> [ 35.974391] drm_sched_increase_karma+0x5c/0xf0 ffff000010667f38 FFFF000010667F94 drivers/gpu/drm/scheduler/sched_main.c:335
>>
>>
>> The crashing line is :
>> 	if (bad->s_fence->scheduled.context ==
>> 	    entity->fence_context) {
>>
>> Doesn't seem related to guilty job.
>>
>> Neil
>
>
> Thanks Neil, by guilty i meant the 'bad' job. I reviewed the code and
> can't see anything suspicious for now. To help clarify could you please
> provide ftrace log for this ? All the dma_fence and gpu_scheduler traces
> can help. I usually just set them all up in one line using trace-cmd
> utility like this before starting the run. If you have any relevant
> traces in panfrost it also can be useful.
>
> sudo trace-cmd start -e dma_fence -e gpu_scheduler

Sure, but I'll need much more time to do this. In the meantime I did 10 runs
with your patch and it fixed the issue.

I'll try to generate the traces.

Neil

>
> Andrey
>
>
>>
>>> Andrey
>>>
>>> On 9/27/19 4:12 AM, Neil Armstrong wrote:
>>>> Hi Christian,
>>>>
>>>> In v5.3, running dEQP triggers the following kernel crash :
>>>>
>>>> [ 20.224982] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000038
>>>> [...]
>>>> [ 20.291064] Hardware name: Khadas VIM2 (DT)
>>>> [ 20.295217] Workqueue: events drm_sched_job_timedout
>>>> [...]
>>>> [ 20.304867] pc : drm_sched_increase_karma+0x5c/0xf0
>>>> [ 20.309696] lr : drm_sched_increase_karma+0x44/0xf0
>>>> [...]
>>>> [ 20.396720] Call trace:
>>>> [ 20.399138] drm_sched_increase_karma+0x5c/0xf0
>>>> [ 20.403623] panfrost_job_timedout+0x12c/0x1e0
>>>> [ 20.408021] drm_sched_job_timedout+0x48/0xa0
>>>> [ 20.412336] process_one_work+0x1e0/0x320
>>>> [ 20.416300] worker_thread+0x40/0x450
>>>> [ 20.419924] kthread+0x124/0x128
>>>> [ 20.423116] ret_from_fork+0x10/0x18
>>>> [ 20.426653] Code: f9400001 540001c0 f9400a83 f9402402 (f9401c64)
>>>> [ 20.432690] ---[ end trace bd02f890139096a7 ]---
>>>>
>>>> Which never happens, at all, on v5.2.
>>>>
>>>> I did a (very) long (7 days, ~100runs) bisect run using our LAVA lab (thanks tomeu !), but
>>>> bisecting was not easy since the bad commit landed on drm-misc-next after v5.1-rc6, and
>>>> then v5.2-rc1 was backmerged into drm-misc-next at:
>>>> [1] 374ed5429346 Merge drm/drm-next into drm-misc-next
>>>>
>>>> Thus bisecting between [1] and v5.2-rc1 leads to commit based on v5.2-rc1... where panfrost was
>>>> not enabled in the Khadas VIM2 DT.
>>>>
>>>> Anyway, I managed to identify 3 possibly breaking commits :
>>>> [2] 290764af7e36 drm/sched: Keep s_fence->parent pointer
>>>> [3] 5918045c4ed4 drm/scheduler: rework job destruction
>>>> [4] a5343b8a2ca5 drm/scheduler: Add flag to hint the release of guilty job.
>>>>
>>>> But [1] and [2] doesn't crash the same way :
>>>> [ 16.257912] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000060
>>>> [...]
>>>> [ 16.308307] CPU: 4 PID: 80 Comm: kworker/4:1 Not tainted 5.1.0-rc2-01185-g290764af7e36-dirty #378
>>>> [ 16.317099] Hardware name: Khadas VIM2 (DT)
>>>> [...]
>>>> [ 16.330907] pc : refcount_sub_and_test_checked+0x4/0xb0
>>>> [ 16.336078] lr : refcount_dec_and_test_checked+0x14/0x20
>>>> [...]
>>>> [ 16.423533] Process kworker/4:1 (pid: 80, stack limit = 0x(____ptrval____))
>>>> [ 16.430431] Call trace:
>>>> [ 16.432851] refcount_sub_and_test_checked+0x4/0xb0
>>>> [ 16.437681] drm_sched_job_cleanup+0x24/0x58
>>>> [ 16.441908] panfrost_job_free+0x14/0x28
>>>> [ 16.445787] drm_sched_job_timedout+0x6c/0xa0
>>>> [ 16.450102] process_one_work+0x1e0/0x320
>>>> [ 16.454067] worker_thread+0x40/0x450
>>>> [ 16.457690] kthread+0x124/0x128
>>>> [ 16.460882] ret_from_fork+0x10/0x18
>>>> [ 16.464421] Code: 52800000 d65f03c0 d503201f aa0103e3 (b9400021)
>>>> [ 16.470456] ---[ end trace 39a67412ee1b64b5 ]---
>>>>
>>>> and [3] fails like on v5.3 (in drm_sched_increase_karma):
>>>> [ 33.830080] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000038
>>>> [...]
>>>> [ 33.871946] Internal error: Oops: 96000004 [#1] PREEMPT SMP
>>>> [ 33.877450] Modules linked in:
>>>> [ 33.880474] CPU: 6 PID: 81 Comm: kworker/6:1 Not tainted 5.1.0-rc2-01186-ga5343b8a2ca5-dirty #380
>>>> [ 33.889265] Hardware name: Khadas VIM2 (DT)
>>>> [ 33.893419] Workqueue: events drm_sched_job_timedout
>>>> [...]
>>>> [ 33.903069] pc : drm_sched_increase_karma+0x5c/0xf0
>>>> [ 33.907898] lr : drm_sched_increase_karma+0x44/0xf0
>>>> [...]
>>>> [ 33.994924] Process kworker/6:1 (pid: 81, stack limit = 0x(____ptrval____))
>>>> [ 34.001822] Call trace:
>>>> [ 34.004242] drm_sched_increase_karma+0x5c/0xf0
>>>> [ 34.008726] panfrost_job_timedout+0x12c/0x1e0
>>>> [ 34.013122] drm_sched_job_timedout+0x48/0xa0
>>>> [ 34.017438] process_one_work+0x1e0/0x320
>>>> [ 34.021402] worker_thread+0x40/0x450
>>>> [ 34.025026] kthread+0x124/0x128
>>>> [ 34.028218] ret_from_fork+0x10/0x18
>>>> [ 34.031755] Code: f9400001 540001c0 f9400a83 f9402402 (f9401c64)
>>>> [ 34.037792] ---[ end trace be3fd6f77f4df267 ]---
>>>>
>>>>
>>>> When I revert [3] on [1], I get the same crash as [2], meaning
>>>> the commit [3] masks the failure [2] introduced.
>>>>
>>>> Do you know how to solve this ?
>>>>
>>>> Thanks,
>>>> Neil
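
Note on the addr2line step quoted above: the frame `drm_sched_increase_karma+0x5c/0xf0` gives an offset into the function and the function's size, and a tool such as addr2line needs an absolute address. The sketch below is only an illustration of that arithmetic. It assumes that the two addresses quoted in the 5.3 run are the symbol's base address and the base plus the offset, which is consistent with the numbers but is not stated in the thread. The helper names `parse_frame` and `frame_address` are hypothetical and not part of any kernel tooling.

```python
import re

# Matches a call-trace frame such as "drm_sched_increase_karma+0x5c/0xf0".
FRAME_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)"
)


def parse_frame(text):
    """Return (symbol, offset, size) for the first frame found in text."""
    match = FRAME_RE.search(text)
    if match is None:
        raise ValueError(f"no symbol+offset/size frame in {text!r}")
    return (match["symbol"], int(match["offset"], 16), int(match["size"], 16))


def frame_address(symbol_base, offset, size):
    """Add the frame offset to the symbol's base address (hypothetical helper)."""
    if offset >= size:
        raise ValueError("offset lies outside the function")
    return symbol_base + offset


# The base address is taken from the 5.3 run quoted in the message; treating it
# as the start of drm_sched_increase_karma is an assumption.
symbol, offset, size = parse_frame("[ 35.974391] drm_sched_increase_karma+0x5c/0xf0")
print(symbol, hex(frame_address(0xFFFF000010667F38, offset, size)))
```

The resulting address matches the second value in the quoted line, the one reported as resolving to drivers/gpu/drm/scheduler/sched_main.c:335.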