From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D5648C6379F for ; Sun, 5 Feb 2023 20:39:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229551AbjBEUjx (ORCPT ); Sun, 5 Feb 2023 15:39:53 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52526 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229536AbjBEUjv (ORCPT ); Sun, 5 Feb 2023 15:39:51 -0500 Received: from zeniv.linux.org.uk (zeniv.linux.org.uk [IPv6:2a03:a000:7:0:5054:ff:fe1c:15ff]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EC1C718B02; Sun, 5 Feb 2023 12:39:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=linux.org.uk; s=zeniv-20220401; h=Sender:In-Reply-To:Content-Type: MIME-Version:References:Message-ID:Subject:Cc:To:From:Date:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description; bh=5fdXU73i1sPfwWSnl2Kp9uJVLtcRmjMeCcLubSidJE0=; b=fqMdlsGN8nHZgBmTtkX0uJosyn I2hmQQEMNued4xPYkfOmu6mTrqpWm5u8wUSETlQvuaE0pwJuhtMyx8lrr9dk8SSsTIGiRhLIuaT8X JAV2LzNUGi6risliJ6Xz3ASLXAtsDMGBIkHgbCtj3VHxswchT8DL1K5YCAGkg+pt5X56moDZSg6Sn jTebPrr1wvWwjngAk96JLfxOVBeOrY6e5na0+/pIlSMlYIhIRcz+Gl0wDJf4tFPbNNwOghuw/CBde X5YVBCZZR3+RnoAmiLu+gzrqL8S8e57gPdTh85kOi/LLec7Uu7Aksg6Z7lsrpJQ5DsYrfIS82sZwG Wx+K/UzQ==; Received: from viro by zeniv.linux.org.uk with local (Exim 4.96 #2 (Red Hat Linux)) id 1pOlnd-006PZ0-06; Sun, 05 Feb 2023 20:39:45 +0000 Date: Sun, 5 Feb 2023 20:39:44 +0000 From: Al Viro To: Finn Thain Cc: linux-arch@vger.kernel.org, linux-alpha@vger.kernel.org, linux-ia64@vger.kernel.org, linux-hexagon@vger.kernel.org, linux-m68k@lists.linux-m68k.org, Michal Simek , Dinh Nguyen , openrisc@lists.librecores.org, linux-parisc@vger.kernel.org, 
 linux-riscv@lists.infradead.org, sparclinux@vger.kernel.org, Linus Torvalds
Subject: Re: [PATCH 04/10] m68k: fix livelock in uaccess
Message-ID:
References: <92a4aa45-0a7c-a389-798a-2f3e3cfa516f@linux-m68k.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <92a4aa45-0a7c-a389-798a-2f3e3cfa516f@linux-m68k.org>
Sender: Al Viro
Precedence: bulk
List-ID:
X-Mailing-List: linux-parisc@vger.kernel.org

On Sun, Feb 05, 2023 at 05:18:08PM +1100, Finn Thain wrote:
> That could be a bug I was chasing back in 2021 but never found. The mmap
> stressors in stress-ng were triggering a crash on a Mac Quadras, though
> only rarely. Sometimes it would run all day without a failure.
>
> Last year when I started using GCC 12 to build the kernel, I saw the same
> workload fail again but the failure mode had become a silent hang/livelock
> instead of the oopses I got with GCC 6.
>
> When I press the NMI button after the livelock I always see
> do_page_fault() in the backtrace. So I've been testing your patch. I've
> been running the same stress-ng reproducer for about 12 hours now with no
> failures which looks promising.
>
> In case that stress-ng testing is of use:
> Tested-by: Finn Thain
>
> BTW, how did you identify that bug in do_page_fault()? If its the same bug
> I was chasing, it could be an old one. The stress-ng logs I collected last
> year include a crash from a v4.14 build.

Went to reread the current state of mm/gup.c, decided to reread
handle_mm_fault() and its callers, noticed fault_signal_pending() which
hadn't been there back when I last crawled through that area, realized
what it had replaced, went to check if everything had been converted
(arch/um got missed, BTW). Noticed the difference between the
architectures (the first hit was on alpha, without the "sod off to
no_context if it's a user fault" logics, the last - xtensa, with it).
Checked the log for xtensa, found the commit from 2021 adding that part;
looked on arm and arm64, found commits from 2017 doing the same thing,
then, on x86, Linus' commit from 2014 adding the x86 counterpart...

Figuring out what all of those had been for wasn't particularly hard, and
it was easy to check which architectures still needed the same thing...

BTW, since these patches would be much easier to backport than any
unification work, I think the right thing to do would be to have further
unification done on top of them.
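
For readers following the thread, the difference between the architectures mentioned above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code and not the patch itself: `run_uaccess()`, `enum outcome`, and the `max_faults` cutoff are hypothetical names made up for the illustration. It assumes that a kernel-mode access to a user page faults, that a pending signal makes the fault handler give up before mapping the page, and that the "no_context" path ends the access with an error instead of returning to the faulting instruction.

```c
#include <stdbool.h>

/* Hypothetical result of one simulated kernel-mode access to user memory. */
enum outcome {
    COPY_DONE,      /* page was mapped and the access completed */
    COPY_EFAULT,    /* handler took the no_context-style exit, access failed */
    COPY_LIVELOCK   /* still faulting after max_faults attempts */
};

/*
 * Model of a faulting uaccess (illustrative only).
 *
 * signal_pending: a signal is pending, so the fault cannot make progress.
 * with_fix:       the handler bails out to the fixup path when the fault
 *                 came from kernel mode with a signal pending.
 * max_faults:     cutoff standing in for "forever" in the livelock case.
 */
enum outcome run_uaccess(bool signal_pending, bool with_fix, int max_faults)
{
    bool mapped = false;

    for (int i = 0; i < max_faults; i++) {
        if (mapped)
            return COPY_DONE;       /* retried instruction succeeds */

        /* The access faults here. */
        if (!signal_pending) {
            mapped = true;          /* fault resolved, retry the access */
            continue;
        }

        if (with_fix)
            return COPY_EFAULT;     /* fixup path: stop retrying */

        /* No fix: return to the same instruction, which faults again. */
    }

    return mapped ? COPY_DONE : COPY_LIVELOCK;
}
```

In this model the only case that never terminates on its own is a pending signal combined with a handler lacking the bail-out, which matches the silent hang with do_page_fault() in the backtrace described in the quoted report.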