Date: Mon, 30 Nov 2020 16:54:27 +0000
From: Mark Rutland
To: Marco Elver
Cc: paulmck@kernel.org, peterz@infradead.org, catalin.marinas@arm.com, james.morse@arm.com, linux-arm-kernel@lists.infradead.org, will@kernel.org, dvyukov@google.com
Subject: Re: [PATCH 00/11] arm64: entry lockdep/rcu/tracing fixes
Message-ID: <20201130165427.GD1251@C02TD0UTHF1T.local>
References: <20201126123602.23454-1-mark.rutland@arm.com> <20201130120305.GA1292961@elver.google.com> <20201130123810.GB1251@C02TD0UTHF1T.local> <20201130133245.GA1307615@elver.google.com>
In-Reply-To: <20201130133245.GA1307615@elver.google.com>

On Mon, Nov 30, 2020 at 02:32:45PM +0100, Marco Elver wrote:
> On Mon, Nov 30, 2020 at 12:38PM +0000, Mark Rutland wrote:
> > On Mon, Nov 30, 2020 at 01:03:05PM +0100, Marco Elver wrote:
> > > So, I was hoping that this would fix all the problems I was seeing when
> > > running the ftrace tests ... unfortunately, it didn't. :-( Perhaps the
> > > WIP version you had only worked because it ended up disabling lockdep
> > > early?
> >
> > Possibly, yes. Either that or the way we do / do-not treat debug
> > exceptions as true NMIs. Either way this appears to be a latent issue
> > rather than something introduced by this series.
> >
> > From the log below I see you're using:
> >
> >   5.10.0-rc4-next-20201119-00002-gc88aca8827ce #1 Not tainted
> >
> > ... and it's possible that the issue you're seeing now is a delta
> > between v5.10-rc3 and what's queued in linux-next -- I've been running
> > the ftrace tests locally without issue atop v5.10-rc3 and v5.10-rc5.
> >
> > Are you able to reproduce this on my branch alone? If so that gives us a
> > stable tree to investigate, and if not that gives us a stable base for a
> > bisect against linux-next.
>
> It's the same problem as before and that I've been reporting in the
> other thread [1]. We know mainline is fine, however, -next is broken. We
> also know that next-20201105 was still fine, and next-202010 started
> breaking:
>
>   https://lkml.kernel.org/r/20201111133813.GA81547@elver.google.com
>
> The recent tests have been on next-20201119 (including the logs from
> previous email).
>
> I tried bisection, but results are never conclusive (the closest I got
> was a -rcu merge commit). As discussed in the thread at [1] (and its
> ancestors) we never really got anywhere and really exhausted all options
> (several bisection attempts, etc.).

Ah; I'd lost track and missed that you'd already identified this was
introduced in linux-next, and that bisection wasn't getting anywhere.
Thanks for bearing with me! :)

> > This area is really sensitive to config options, so if you can reproduce
> > this on a stable base, could you share your exact config?
>
> No, it's not reproducible on mainline.
>
> Which might also mean that it's something else in -next and your work is
> unrelated.
>
> But I was surprised your WIP series fixed the problems on next-20201119
> (or so it seemed). So, given all the confusion in [1], I was really
> hoping this would be it...

The major difference between that and the version upstreamed is the way
debug exceptions (including BRKs) got handled as true NMIs, which hints
that there could be a subtle interaction in that area (or that the
lockdep disable calls in the NMI paths simply masked the problem).

One simple thing to try would be to hack the debug exception cases to
enter/exit as true NMIs and see whether that hides the issue again. If
so, we can start teasing that apart to narrow it down.

> > > I've attached the log and the symbolized report.
> >
> > Thanks for all this. I'll see if I can tickle this locally while waiting
> > for the above. If you could share your config from this time around
> > that'd be a great head-start!
>
> It's the same as I've been using for the work in
>
>   [1] https://lore.kernel.org/r/20201119193819.GA2601289@elver.google.com
>
> In summary, to repro:
>
>   1. Switch to next-20201119 (possibly even latest, but I haven't tested)
>
>   2. Apply provoke-bug.diff
>
>   3. Use the attached .config
>
>   4. Run with
>
>      qemu-system-aarch64 -kernel $KERNEL_WORKTREE/arch/arm64/boot/Image \
>        -append "console=ttyAMA0 root=/dev/sda debug earlycon earlyprintk=serial workqueue.watchdog_thresh=10" \
>        -nographic -smp 1 -machine virt -cpu cortex-a57 -m 2G

Thanks for the comprehensive repro information!
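
A minimal sketch of how the quoted invocation could be parameterised by
accelerator, so that the same kernel and command line can be compared
under TCG and KVM. The qemu_cmd helper is hypothetical, and the KVM
settings (-cpu host, gic-version=host) are assumptions about a typical
arm64 KVM host rather than anything taken from the repro steps; the
function only prints the command so it can be inspected before running.

```shell
#!/bin/sh
# Sketch only: print the QEMU command line for a given accelerator.
# qemu_cmd is a hypothetical helper; the KVM-specific options below are
# assumptions and are not part of the quoted repro steps.
qemu_cmd() {
    accel="$1"
    kernel="${KERNEL_WORKTREE:-.}/arch/arm64/boot/Image"
    append='console=ttyAMA0 root=/dev/sda debug earlycon earlyprintk=serial workqueue.watchdog_thresh=10'

    case "$accel" in
    # TCG: emulate the same CPU model as the quoted command.
    tcg) cpu="cortex-a57"; machine="virt" ;;
    # KVM: the guest CPU and GIC must match what the host provides.
    kvm) cpu="host"; machine="virt,gic-version=host" ;;
    *) echo "usage: qemu_cmd tcg|kvm" >&2; return 1 ;;
    esac

    printf '%s\n' "qemu-system-aarch64 -accel $accel -kernel $kernel -append \"$append\" -nographic -smp 1 -machine $machine -cpu $cpu -m 2G"
}
```

Running `qemu_cmd tcg` prints the quoted invocation with an explicit
`-accel tcg`, and `qemu_cmd kvm` prints the KVM variant; everything else
(kernel image, command line, memory, single CPU) is kept identical so
that only the accelerator differs between runs.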

I note that you're using QEMU in TCG mode, whereas I've been testing
with KVM acceleration. Those differ in speed by orders of magnitude, so
I wonder if the stalls you see are down to TCG simply being slow, and my
patches just happened to shuffle where that slowness was felt.

I gave the above a go, but I wasn't able to reproduce the issue under
either TCG or KVM acceleration after a few attempts. I'm not sure
whether this is intermittent and I'm just getting lucky, or if something
is different between our setups that's causing me to not hit this.

FWIW I'm testing on a ThunderX2 workstation running Debian 10.6, using
the packaged GCC 8.3.0-6, and a locally-built QEMU 5.1.50
(v5.1.0-2347-g1f3081f6de). The QEMU has a couple of test patches atop
upstream commit ba2a9a9e6318bfd93a2306dec40137e198205b86.

> The tests I ran on your WIP series and just now were applied on top of
> next-20201119+provoke-bug.diff. Your WIP series seemed to fix whatever
> it was we were debugging in [1] (but with some new warnings), but this
> latest series shows no difference and behaviour is unchanged again.
>
> I also want to emphasize it is really hard to say if your series here is
> related or the fact that the WIP series worked was some other
> side-effect we don't understand.

Sure; I think we're aligned on that understanding. There are a
sufficient number of moving parts here that the WIP might have been
masking a problem, or might have unintentionally solved a problem we
haven't realised exists.

> So I leave it to your judgement to decide to what extent this series
> could possibly help, because I wouldn't want to make you go down a
> rabbit hole that doesn't lead anywhere (as I had already done to
> somehow debug the problem in [1]).

I think as you say it's not at all clear, but I'd hope this series at
least removes a number of potential problems from the search space.

Thanks,
Mark.