From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dave Airlie
Date: Fri, 7 Oct 2022 12:45:32 +1000
Subject: Re: [git pull] drm for 6.1-rc1
To: Linus Torvalds, Arvind.Yadav@amd.com
Cc: Alex Deucher, Alex Deucher, Christian König, Daniel Vetter, LKML, dri-devel
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 7 Oct 2022 at 09:45, Linus Torvalds wrote:
>
> On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie wrote:
> >
> >
> > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
> > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
>
> As far as I can tell, that's the line
>
>         struct drm_gpu_scheduler *sched = s_fence->sched;
>
> where 's_fence' is NULL. The code is
>
>    0: 0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>    5: 41 54                   push   %r12
>    7: 55                      push   %rbp
>    8: 53                      push   %rbx
>    9: 48 89 fb                mov    %rdi,%rbx
>    c:* 48 8b af 88 00 00 00   mov    0x88(%rdi),%rbp  <-- trapping instruction
>   13: f0 ff 8d f0 00 00 00    lock decl 0xf0(%rbp)
>   1a: 48 8b 85 80 01 00 00    mov    0x180(%rbp),%rax
>
> and that next 'lock decl' instruction would have been the
>
>         atomic_dec(&sched->hw_rq_count);
>
> at the top of drm_sched_job_done().
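As an illustration of the reasoning above (not part of the original thread): the faulting address 0x88 is what you would expect when a member at offset 0x88 is loaded through a NULL base pointer, which is what `mov 0x88(%rdi),%rbp` does with `%rdi` equal to zero. The sketch below is plain userspace C. `struct toy_sched_fence`, its padding array, and `member_load_address()` are hypothetical names, and the layout is an assumption chosen only so that `sched` lands at 0x88; it is not the kernel's real `struct drm_sched_fence`.

```c
#include <stddef.h>

/* Hypothetical stand-in for the scheduler type; never defined or used. */
struct toy_sched;

/*
 * Hypothetical stand-in for the fence structure. The 0x88 bytes of
 * padding are an assumption that places 'sched' at offset 0x88, to
 * match the displacement in the quoted disassembly. This is NOT the
 * kernel's actual layout.
 */
struct toy_sched_fence {
    unsigned char other_members[0x88]; /* placeholder for earlier fields */
    struct toy_sched *sched;
};

/*
 * Address the CPU would try to read for base->sched, computed with
 * plain arithmetic so that nothing is actually dereferenced.
 */
static size_t member_load_address(size_t base)
{
    return base + offsetof(struct toy_sched_fence, sched);
}
```

With a base of 0 the function returns 0x88, the same value as the address reported in the oops line, so the address alone already suggests "NULL struct pointer plus member offset" rather than a wild pointer.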
>
> Now, as to *why* you'd have a NULL s_fence, it would seem that
> drm_sched_job_cleanup() was called with an active job. Looking at that
> code, it does
>
>         if (kref_read(&job->s_fence->finished.refcount)) {
>                 /* drm_sched_job_arm() has been called */
>                 dma_fence_put(&job->s_fence->finished);
>         ...
>
> but then it does
>
>         job->s_fence = NULL;
>
> anyway, despite the job still being active. The logic of that kind of
> "fake refcount" escapes me. The above looks fundamentally racy, not to
> say pointless and wrong (a refcount is a _count_, not a flag, so there
> could be multiple references to it, what says that you can just
> decrement one of them and say "I'm done").
>
> Now, _why_ any of that happens, I have no idea. I'm just looking at
> the immediate "that pointer is NULL" thing, and reacting to what looks
> like a completely bogus refcount pattern.
>
> But that odd refcount pattern isn't new, so it's presumably some user
> on the amd gpu side that changed.
>
> The problem hasn't happened again for me, but that's not saying a lot,
> since it was very random to begin with.

I chased down the culprit to a drm sched patch, I'll send you a pull
with a revert in it.

commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86
Author: Arvind Yadav
Date:   Wed Sep 14 22:13:20 2022 +0530

    drm/sched: Use parent fence instead of finished

    Using the parent fence instead of the finished fence
    to get the job status. This change is to avoid GPU
    scheduler timeout error which can cause GPU reset.

    Signed-off-by: Arvind Yadav
    Reviewed-by: Andrey Grodzovsky
    Link: https://patchwork.freedesktop.org/patch/msgid/20220914164321.2156-6-Arvind.Yadav@amd.com
    Signed-off-by: Christian König

I'll let Arvind and Christian maybe work out what is going wrong there.

Dave.
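As an illustration of the "a refcount is a count, not a flag" objection quoted above (not part of the original thread), here is a minimal single-threaded sketch in userspace C. All `toy_*` names and `scenario_cleanup_while_active()` are hypothetical; the code only mirrors the shape of the quoted cleanup snippet and does not use the kernel's `kref` or `dma_fence` APIs. It assumes a second holder exists while cleanup runs, which is the situation the quoted analysis suspects.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy reference-counted object (hypothetical, not kref/dma_fence). */
struct toy_fence {
    int refcount;
};

struct toy_job {
    struct toy_fence *s_fence;
};

static void toy_fence_get(struct toy_fence *f)
{
    f->refcount++;
}

/* Returns true when the last reference was dropped. */
static bool toy_fence_put(struct toy_fence *f)
{
    return --f->refcount == 0;
}

/*
 * Same shape as the quoted cleanup: treat the count as a flag, drop
 * one reference, then clear the pointer whether or not other
 * references remain.
 */
static void toy_job_cleanup(struct toy_job *job)
{
    if (job->s_fence->refcount)
        toy_fence_put(job->s_fence);
    job->s_fence = NULL;
}

/*
 * Stand-in for a completion path that still expects job->s_fence to
 * be valid. Returns false where real code would dereference NULL.
 */
static bool toy_job_done(const struct toy_job *job)
{
    return job->s_fence != NULL;
}

/* Two holders exist; cleanup runs while the second is still live. */
static bool scenario_cleanup_while_active(int *remaining_refs)
{
    struct toy_fence fence = { .refcount = 1 };
    struct toy_job job = { .s_fence = &fence };

    toy_fence_get(&fence);      /* second holder, e.g. in-flight work */
    toy_job_cleanup(&job);      /* drops one ref, clears the pointer */
    *remaining_refs = fence.refcount;
    return toy_job_done(&job);  /* does the completion path see a fence? */
}
```

In this scenario the fence still has one reference after cleanup, yet the job's pointer to it is already NULL, so the completion path finds nothing to work with. The sketch only shows the broken invariant; it does not model the actual race or the scheduler's locking.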
>
> Linus