From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id D39B2C433F5 for ; Fri, 7 Oct 2022 09:28:28 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 8352B10E2F6; Fri, 7 Oct 2022 09:28:27 +0000 (UTC) Received: from mail-oi1-x230.google.com (mail-oi1-x230.google.com [IPv6:2607:f8b0:4864:20::230]) by gabe.freedesktop.org (Postfix) with ESMTPS id 235F410E2F6 for ; Fri, 7 Oct 2022 09:28:25 +0000 (UTC) Received: by mail-oi1-x230.google.com with SMTP id m130so4878499oif.6 for ; Fri, 07 Oct 2022 02:28:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ffwll.ch; s=google; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=JxLAH+bnod0VfWRdnLNFFaWOZptJDJqV0Gq0f68hNdc=; b=Lld1GrctRhHOLbNsUNrujKK+MA6Gux6LyL6cFvYmqezas+x2uXUh0neHffeKlflrpg dZkxx8hziJWp3N7O/T/3HPCuk+wbLZG0pbMfhDNLxG9i7Pulwx7P45z/gUbRlk7cHsWM OULqzB+Sd9lowQX+RG4FzxkPt0m5B6fPgGT1k= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=JxLAH+bnod0VfWRdnLNFFaWOZptJDJqV0Gq0f68hNdc=; b=0pNtZhFxwg511qSwrh+6r2WuAEBi2ZRGRw02HmqzE57Q8J4bUWKwQcxPxggcwNU7f5 XjWCjrRzCX0ieXfy/lsFYObKVsSBnuFlkH2OaqAEg25MU5K7djGvJ/cCkmqoSOiA6xA+ 4lPEv02+X53/G875onskOTvjJ5xDFTM8c2XWwwpaxCjEEc3eJtdgixKkp1FXvxErrlIZ HNfiOQJRlx/8N/LDb9N5tlIWoUmhTvvFoF3eNbKfSD61o651wwTbIz/CbtRdb8XfRDOd rTl/TX2KTx3f1rPoGuAlYMWzc5xTuMA111uz3E79tk0RfM7VDuAaPXmgCNDCl2qJPCOl /i8w== X-Gm-Message-State: ACrzQf3EuNnui4XdXMQHfB8QVj+9qR0unqMCQZgLpTTT2Wxkhv5c37Gz 3905KPt5sLxgDSYLeA1ORASti8IlH2pSqgj8rRSYjw== X-Google-Smtp-Source: AMsMyM4iCHqpCD5g3MdwdOFmIgnP809IboqOPrt054xZ791CkiY/TmbkuiTtOrcsRVDaXZe6VZnDY559987elRnrgWI= X-Received: by 2002:a05:6808:e8c:b0:354:2751:69ae with SMTP id k12-20020a0568080e8c00b00354275169aemr2402247oil.228.1665134904319; Fri, 07 Oct 2022 02:28:24 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Daniel Vetter Date: Fri, 7 Oct 2022 11:28:12 +0200 Message-ID: Subject: Re: [git pull] drm for 6.1-rc1 To: Linus Torvalds , Andrey Grodzovsky Content-Type: text/plain; charset="UTF-8" X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: LKML , dri-devel , Alex Deucher , =?UTF-8?Q?Christian_K=C3=B6nig?= Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" Forgot to add Andrey as scheduler maintainer. -Daniel On Fri, 7 Oct 2022 at 10:16, Daniel Vetter wrote: > > On Fri, 7 Oct 2022 at 01:45, Linus Torvalds > wrote: > > > > On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie wrote: > > > > > > > > > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088 > > > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched] > > > > As far as I can tell, that's the line > > > > struct drm_gpu_scheduler *sched = s_fence->sched; > > > > where 's_fence' is NULL. The code is > > > > 0: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) > > 5: 41 54 push %r12 > > 7: 55 push %rbp > > 8: 53 push %rbx > > 9: 48 89 fb mov %rdi,%rbx > > c:* 48 8b af 88 00 00 00 mov 0x88(%rdi),%rbp <-- trapping instruction > > 13: f0 ff 8d f0 00 00 00 lock decl 0xf0(%rbp) > > 1a: 48 8b 85 80 01 00 00 mov 0x180(%rbp),%rax > > > > and that next 'lock decl' instruction would have been the > > > > atomic_dec(&sched->hw_rq_count); > > > > at the top of drm_sched_job_done(). > > > > Now, as to *why* you'd have a NULL s_fence, it would seem that > > drm_sched_job_cleanup() was called with an active job. Looking at that > > code, it does > > > > if (kref_read(&job->s_fence->finished.refcount)) { > > /* drm_sched_job_arm() has been called */ > > dma_fence_put(&job->s_fence->finished); > > ... > > > > but then it does > > > > job->s_fence = NULL; > > > > anyway, despite the job still being active. The logic of that kind of > > "fake refcount" escapes me. The above looks fundamentally racy, not to > > say pointless and wrong (a refcount is a _count_, not a flag, so there > > could be multiple references to it, what says that you can just > > decrement one of them and say "I'm done"). > > Just figured I'll clarify this, because it's indeed a bit wtf and the > comment doesn't explain much. drm_sched_job_cleanup can be called both > when a real job is being cleaned up (which holds a full reference on > job->s_fence and needs to drop it) and to simplify error path in job > constructions (and the "is this refcount initialized already" signals > what exactly needs to be cleaned up or not). So no race, because the > only times this check goes different is when job construction has > failed before the job struct is visible by any other thread. > > But yeah the comment could actually explain what's going on here :-) > > And yeah the patch Dave reverted screws up the cascade of references > that ensures this all stays alive until drm_sched_job_cleanup is > called on active jobs, so looks all reasonable to me. Some Kunit tests > maybe to exercise these corners? Not the first time pure scheduler > code blew up, so proably worth the effort. > -Daniel > > > > > Now, _why_ any of that happens, I have no idea. I'm just looking at > > the immediate "that pointer is NULL" thing, and reacting to what looks > > like a completely bogus refcount pattern. > > > > But that odd refcount pattern isn't new, so it's presumably some user > > on the amd gpu side that changed. > > > > The problem hasn't happened again for me, but that's not saying a lot, > > since it was very random to begin with. > > > > Linus > > > > -- > Daniel Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch