Date: Mon, 24 Apr 2023 08:43:00 +1200
Subject: Re: reliable reproducer, was Re: core dump analysis
To: Finn Thain
Cc: Andreas Schwab, debian-68k@lists.debian.org,
 linux-m68k@lists.linux-m68k.org
References: <4a9c1d0d-07aa-792e-921f-237d5a30fc44.ref@yahoo.com>
 <71af7b52-a1d4-581c-d5af-afce6991c48d@gmail.com>
 <7ea095ba-7df1-1ffe-e87d-12d46ebe72f6@gmail.com>
 <2fdc2819-526a-756f-19d0-ac1147f85b63@linux-m68k.org>
 <868b5214-fa13-dcf7-a671-9843169eea06@gmail.com>
 <87fs8sz6e9.fsf@igel.home>
 <878rekz0md.fsf@igel.home>
 <87o7nfyd7e.fsf@igel.home>
 <87jzy3y79y.fsf@igel.home>
 <5824d97d-683b-a354-3c39-cb0f54e50bc0@gmail.com>
 <06c14a4a-1679-31d6-0501-97e20741f88a@gmail.com>
 <13d36a79-5aae-d63c-5014-5503688f07bb@linux-m68k.org>
From: Michael Schmitz
In-Reply-To: <13d36a79-5aae-d63c-5014-5503688f07bb@linux-m68k.org>
X-Mailing-List: linux-m68k@vger.kernel.org

Hi Finn,

On 23/04/23 21:23, Finn Thain wrote:
> On Sun, 23 Apr 2023, Michael Schmitz wrote:
>
>> On 23.04.2023 at 13:41, Michael Schmitz wrote:
>>
>> Though the question remains - is this expected behaviour for programs
>> that do deep recursion on the stack while taking signals (and the reason
>> for the option to run signal handlers on an alternate stack)?
>>
> I don't understand how "deep recursion" can be used to explain this. We've
> seen crashes with only 1.8 MB of stack usage.

OK, it's not really deep (though I've managed to get the test case
aborted by the OOM killer once on my rather puny RAM). But it's putting
lots of frames on the stack in a short span while also utilizing the
stack for signal delivery.

> The best reason I can think of for having a signal stack would be that it
> may be better for signal delivery to fail than for the target process to
> fail. But I've no idea whether the kernel makes that kind of defensive
> programming possible (?)

I don't think there's any provision for signal delivery to fail - the
signal handler is started from the return-to-userspace code in entry.S,
and upon return from the handler, a sigreturn syscall is automatically
executed to clean up the stack. As long as the handler returns, all's
fine.

Not sure what happens if the process context that the handler runs in is
killed by the kernel - I suppose the entire process is killed and the
context removed, so the issue of parent process survival is moot.

But I'm sure we can place an illegal instruction in the handler as soon
as a stack overflow is spotted, get a dump and look at that.
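
For reference, the alternate signal stack mentioned above can be tried
from userspace along these lines. This is only a minimal sketch, not the
reproducer from this thread: the function name handler_ran_on_alt_stack(),
the 64 KiB stack size and the use of SIGUSR1 are assumptions made for
illustration. It registers a malloc'd alternate stack with sigaltstack(),
installs an sa_sigaction handler with SA_ONSTACK, and checks whether a
local variable in the handler lies inside that stack.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ALT_STACK_SIZE (64 * 1024)  /* arbitrary size chosen for this sketch */

static char *alt_base;                     /* start of the alternate stack */
static volatile sig_atomic_t handler_ran;  /* set when the handler is entered */
static volatile sig_atomic_t ran_on_alt;   /* 1 if the handler frame was on alt_base */

static void handler(int sig, siginfo_t *info, void *ucontext)
{
    char marker;  /* its address tells us which stack we are on */
    uintptr_t here = (uintptr_t)&marker;
    uintptr_t lo = (uintptr_t)alt_base;

    (void)sig;
    (void)info;
    (void)ucontext;
    handler_ran = 1;
    ran_on_alt = (here >= lo && here < lo + ALT_STACK_SIZE);
}

/* Hypothetical helper: returns 1 if the handler ran on the alternate
 * stack, 0 if it ran elsewhere, -1 if setup failed. The stack is left
 * allocated on purpose because it stays registered with the kernel. */
int handler_ran_on_alt_stack(void)
{
    stack_t ss;
    struct sigaction sa;

    alt_base = malloc(ALT_STACK_SIZE);
    if (alt_base == NULL)
        return -1;

    ss.ss_sp = alt_base;
    ss.ss_size = ALT_STACK_SIZE;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;  /* SA_ONSTACK selects the alt stack */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    raise(SIGUSR1);       /* returns only after the handler has returned */
    assert(handler_ran);
    return ran_on_alt;
}
```

Calling handler_ran_on_alt_stack() from main() and checking for a return
value of 1 would confirm that the signal frame and the handler's locals
stay off the normal process stack, which is the property under
discussion here.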

>> And why does this almost always appear to happen after bus error exceptions
>> (frame format b)? The extra exception stack information isn't even accounted
>> for in the above frame end address!
>>
>> Result with sa_sigaction handler:
>>
>> parent usp  : 0xef969e28
>> handler tos : 0xef969e6c
>> handler stack overwrote usp!
>> frame end   : 0xef969e7c
>> frame start : 0xef969b58
>> handler usp : 0xef969b40
>> signal usp  : 0xef969e04
>> signal pc   : 0x80000696
>> signal fmtv : 0x114
>>
>> parent usp  : 0xef955008
>> handler tos : 0xef955064
>> handler stack overwrote usp!
>> frame end   : 0xef955074
>> frame start : 0xef954d50
>> handler usp : 0xef954d38
>> signal usp  : 0xef954ffc
>> signal pc   : 0x80000680
>> signal fmtv : 0xb008
>>
>> parent usp  : 0xef945eb8
>> handler tos : 0xef945f0c
>> handler stack overwrote usp!
>> frame end   : 0xef945f1c
>> frame start : 0xef945bf8
>> handler usp : 0xef945be0
>> signal usp  : 0xef945ea8
>> signal pc   : 0xc009f37a
>> signal fmtv : 0x80
>>
>> parent usp  : 0xef933eb8
>> handler tos : 0xef933f0c
>> handler stack overwrote usp!
>> frame end   : 0xef933f1c
>> frame start : 0xef933bf8
>> handler usp : 0xef933be0
>> signal usp  : 0xef933ea8
>> signal pc   : 0xc009f37a
>> signal fmtv : 0x80
>>
>> parent usp  : 0xef921edc
>> handler tos : 0xef9aaca4
>> handler stack overwrote usp!
>> frame end   : 0xef9aacb4
>> frame start : 0xef9aa990
>> handler usp : 0xef9aa978
>> signal usp  : 0xef9aac40
>> signal pc   : 0x80000782
>> signal fmtv : 0x114
>>
>> Illegal instruction (core dumped)
>>
> I don't understand these results. If usp was really overwritten, the
> program would have crashed early, no?

I think we're still at the point where rec() is called recursively,
before any returns.

>> Exception right before crash was an interrupt in this case (only seen
>> that once in this context, though I've seen lots of those in the course
>> of the test runs). Frame start calculated from siginfo pointer value in
>> this case.
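
To make the "signal fmtv" values above easier to read: on 68k the
format/vector word of an exception frame carries the frame format in
bits 15-12 and the vector offset in bits 11-0, with the vector number
being the offset divided by 4. The sketch below decodes that word and
computes the frame size from the printed start and end addresses. The
names decode_fmtv(), frame_size() and struct fmtv_fields are made up for
this illustration and are not taken from the test program.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the decoded format/vector word. */
struct fmtv_fields {
    unsigned format;  /* bits 15-12: exception stack frame format */
    unsigned offset;  /* bits 11-0: vector offset */
    unsigned vector;  /* vector number, i.e. offset / 4 */
};

/* Split a 68k format/vector word into its fields. */
struct fmtv_fields decode_fmtv(uint16_t fmtv)
{
    struct fmtv_fields f;

    f.format = (fmtv >> 12) & 0xf;
    f.offset = fmtv & 0xfff;
    f.vector = f.offset >> 2;
    return f;
}

/* Size in bytes of the region between the printed frame start and end. */
uint32_t frame_size(uint32_t frame_start, uint32_t frame_end)
{
    assert(frame_end >= frame_start);
    return frame_end - frame_start;
}

/* Print one decoded word in the same "name : value" style as the log. */
void print_fmtv(uint16_t fmtv)
{
    struct fmtv_fields f = decode_fmtv(fmtv);

    printf("fmtv 0x%04x : format 0x%x, vector %u\n",
           (unsigned)fmtv, f.format, f.vector);
}
```

Applied to the log, 0xb008 is format 0xb with vector 2 (bus error), 0x80
is format 0 with vector 32 (trap #0, the syscall trap), and 0x114 is
format 0 with vector 69, which falls in the user-defined interrupt range.
The difference between frame end and frame start is 0x324 (804 bytes) in
all five blocks.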
>>
> I didn't realize that you could get a crash from a signal delivered
> following an interrupt. I'll try to modify the kernel such that signals
> are not delivered after page faults.

Yes, that was news to me, too. I've got swap enabled and probably see a
lot more disk I/O than on your machines.

Delaying signal return until the next syscall or interrupt after a page
fault ought not to be too hard - just replace the 'jra ret_from_exception'
by 'RESTORE_ALL' (though that would also defer rescheduling until the
next interrupt). For a proper solution, replicate exit_work without a
call to do_signal_return ...

Cheers,

    Michael