From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-11.6 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,NICE_REPLY_A, SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 52DC6C4338F for ; Tue, 24 Aug 2021 10:12:19 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2EEA9610D0 for ; Tue, 24 Aug 2021 10:12:19 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236017AbhHXKNB (ORCPT ); Tue, 24 Aug 2021 06:13:01 -0400 Received: from szxga01-in.huawei.com ([45.249.212.187]:18032 "EHLO szxga01-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236007AbhHXKMx (ORCPT ); Tue, 24 Aug 2021 06:12:53 -0400 Received: from dggemv703-chm.china.huawei.com (unknown [172.30.72.54]) by szxga01-in.huawei.com (SkyGuard) with ESMTP id 4Gv4Yr4Yrwzbhh1; Tue, 24 Aug 2021 18:08:16 +0800 (CST) Received: from dggema766-chm.china.huawei.com (10.1.198.208) by dggemv703-chm.china.huawei.com (10.3.19.46) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256) id 15.1.2176.2; Tue, 24 Aug 2021 18:11:59 +0800 Received: from [10.174.177.210] (10.174.177.210) by dggema766-chm.china.huawei.com (10.1.198.208) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256_P256) id 15.1.2176.2; Tue, 24 Aug 2021 18:11:59 +0800 Subject: Re: [PATCH] ext4: flush s_error_work before journal destroy in ext4_fill_super From: yangerkun To: Theodore Ts'o CC: , , References: <20210720062409.960734-1-yangerkun@huawei.com> Message-ID: <3296465e-8da2-99ac-32cf-9cde32f4de32@huawei.com> Date: Tue, 24 Aug 2021 18:11:59 +0800 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.10.2 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset="utf-8"; format=flowed Content-Transfer-Encoding: 8bit X-Originating-IP: [10.174.177.210] X-ClientProxiedBy: dggems705-chm.china.huawei.com (10.3.19.182) To dggema766-chm.china.huawei.com (10.1.198.208) X-CFilter-Loop: Reflected Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org Sorry for that this patch will lead a bug when mount failed on arm64(I have verify this patch on x86, and it seems ok...). I will try to fix the problem with another way.... [2021-08-24 17:43:29 INFO] [ 45.808127] Internal error: Oops: 96000004 [#1] SMP [2021-08-24 17:43:29 INFO] [ 45.812986] Modules linked in: realtek hclge hns3 megaraid_sas hisi_sas_v3_hw hibmc_drm hisi_sas_main host_edma_drv hnae3 [2021-08-24 17:43:29 INFO] [ 45.823896] Process mount (pid: 1187, stack limit = 0x00000000e0bd7181) [2021-08-24 17:43:29 INFO] [ 45.830481] CPU: 9 PID: 1187 Comm: mount Not tainted 4.19.90_4.19.90-2108.7.0+ #2 [2021-08-24 17:43:29 INFO] [ 45.837929] Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V3.26.01 06/14/2019 [2021-08-24 17:43:29 INFO] [ 45.846587] pstate: 10400009 (nzcV daif +PAN -UAO) [2021-08-24 17:43:29 INFO] [ 45.851359] pc : percpu_counter_add_batch+0x5c/0x358 [2021-08-24 17:43:29 INFO] [ 45.856303] lr : ext4_es_free_extent+0xd4/0x4a8 [2021-08-24 17:43:29 INFO] [ 45.860812] sp : ffff80262a60f170 [2021-08-24 17:43:29 INFO] [ 45.864112] x29: ffff80262a60f170 x28: dfff200000000000 [2021-08-24 17:43:29 INFO] [ 45.869399] x27: ffff8026403e65cc x26: 0000000000000000 [2021-08-24 17:43:29 INFO] [ 45.874686] x25: ffff80266c429e20 x24: 0000000000000100 [2021-08-24 17:43:29 INFO] [ 45.879973] x23: 1ffff004cd8853c4 x22: ffffffffffffffff [2021-08-24 17:43:29 INFO] [ 45.885260] x21: 0000000000000000 x20: 00006026b7a31000 [2021-08-24 17:43:29 INFO] [ 45.890547] x19: ffff80266c429e00 x18: 0000000000000000 [2021-08-24 17:43:29 INFO] [ 45.895833] x17: 0000000000000000 x16: 0000000000000000 [2021-08-24 17:43:29 INFO] [ 45.901120] x15: 1fffe4000504bd02 x14: ffff200023ba1adc [2021-08-24 17:43:29 INFO] [ 45.906406] x13: ffff200024332530 x12: ffff20002433240c [2021-08-24 17:43:29 INFO] [ 45.911693] x11: ffff2000243304f4 x10: ffff200024328f6c [2021-08-24 17:43:29 INFO] [ 45.916980] x9 : 1ffff004c807ccb9 x8 : ffff200023b75950 [2021-08-24 17:43:29 INFO] [ 45.922266] x7 : ffff200023ba1fe0 x6 : 0000000000000004 [2021-08-24 17:43:29 INFO] [ 45.927553] x5 : 0000000000000000 x4 : 0000000000000000 [2021-08-24 17:43:29 INFO] [ 45.932839] x3 : 00000c04d6f46200 x2 : dfff200000000000 [2021-08-24 17:43:29 INFO] [ 45.938126] x1 : 0000000000000003 x0 : 00006026b7a31000 [2021-08-24 17:43:29 INFO] [ 45.943413] Call trace: [2021-08-24 17:43:29 INFO] [ 45.945851] percpu_counter_add_batch+0x5c/0x358 [2021-08-24 17:43:29 INFO] [ 45.950446] ext4_es_free_extent+0xd4/0x4a8 [2021-08-24 17:43:29 INFO] [ 45.954610] __es_remove_extent+0x37c/0x598 [2021-08-24 17:43:29 INFO] [ 45.958773] ext4_es_remove_extent+0x6c/0x270 [2021-08-24 17:43:29 INFO] [ 45.963110] ext4_clear_inode+0x50/0x188 [2021-08-24 17:43:29 INFO] [ 45.967016] ext4_evict_inode+0x314/0x1450 [2021-08-24 17:43:29 INFO] [ 45.971094] evict+0x238/0x5a0 [2021-08-24 17:43:29 INFO] [ 45.974136] iput+0x2bc/0x6b8 [2021-08-24 17:43:29 INFO] [ 45.977090] jbd2_journal_destroy+0x408/0x800 [2021-08-24 17:43:29 INFO] [ 45.981428] ext4_fill_super+0x5a04/0x8cc0 [2021-08-24 17:43:29 INFO] [ 45.985505] mount_bdev+0x268/0x328 [2021-08-24 17:43:29 INFO] [ 45.988978] ext4_mount+0x44/0x58 [2021-08-24 17:43:29 INFO] [ 45.992277] mount_fs+0x68/0x390 [2021-08-24 17:43:29 INFO] [ 45.995492] vfs_kern_mount.part.2+0x54/0x388 [2021-08-24 17:43:29 INFO] [ 45.999830] do_mount+0xa5c/0x20d0 [2021-08-24 17:43:29 INFO] [ 46.003216] ksys_mount+0x9c/0x118 [2021-08-24 17:43:29 INFO] [ 46.006603] __arm64_sys_mount+0xa8/0x108 [2021-08-24 17:43:29 INFO] [ 46.010595] el0_svc_common+0x10c/0x4a0 [2021-08-24 17:43:29 INFO] [ 46.014415] el0_svc_handler+0x170/0x248 [2021-08-24 17:43:29 INFO] [ 46.018320] el0_svc+0x10/0x218 [2021-08-24 17:43:29 INFO] [ 46.021448] Code: f2fbffe2 d343fc03 92400801 11000c21 (38e26862) [2021-08-24 17:43:29 INFO] [ 46.027513] ---[ end trace 316fdb8833588599 ]--- [2021-08-24 17:43:29 INFO] [ 46.037646] Kernel panic - not syncing: Fatal exception [2021-08-24 17:43:29 INFO] [ 46.042848] SMP: stopping secondary CPUs [2021-08-24 17:43:29 INFO] [ 46.046779] Kernel Offset: 0x1baf0000 from 0xffff200008000000 [2021-08-24 17:43:29 INFO] [ 46.052500] CPU features: 0x12,a2a00a38 [2021-08-24 17:43:29 INFO] [ 46.056318] Memory Limit: none [2021-08-24 17:43:29 INFO] [ 46.064887] Rebooting in 1 seconds.. [2021-08-24 17:43:30 INFO] [ 47.068489] SMP: stopping secondary CPUs [2021-08-24 17:43:32 INFO] [ 48.158471] SMP: failed to stop secondary CPUs 0-127 在 2021/7/23 21:25, yangerkun 写道: > > > 在 2021/7/23 20:11, Theodore Ts'o 写道: >> On Tue, Jul 20, 2021 at 02:24:09PM +0800, yangerkun wrote: >>> 'commit c92dc856848f ("ext4: defer saving error info from atomic >>> context")' and '2d01ddc86606 ("ext4: save error info to sb through >>> journal >>> if available")' add s_error_work to fix checksum error problem. But the >>> error path in ext4_fill_super can lead the follow BUG_ON. >> >> Can you share with me your test case?  Your patch will result in the >> shrinker potentially not getting released in some error paths (which >> will cause other kernel panics), and in any case, once the journal is > > Hi Ted, > > The only logic we have changed is that we move the flush_work before we > call jbd2_journal_destory. I have not seen the problem you describe... > Can you help to explain more... > > Thanks, > Kun. > >> destroyed here: >> >>> @@ -5173,15 +5173,15 @@ static int ext4_fill_super(struct super_block >>> *sb, void *data, int silent) >>>       ext4_xattr_destroy_cache(sbi->s_ea_block_cache); >>>       sbi->s_ea_block_cache = NULL; >>> +failed_mount3a: >>> +    ext4_es_unregister_shrinker(sbi); >>> +failed_mount3: >>> +    flush_work(&sbi->s_error_work); >>>       if (sbi->s_journal) { >>>           jbd2_journal_destroy(sbi->s_journal); >>>           sbi->s_journal = NULL; >>>       } >>> -failed_mount3a: >>> -    ext4_es_unregister_shrinker(sbi); >>> -failed_mount3: >>> -    flush_work(&sbi->s_error_work); >> >> sbi->s_journal is set to NULL, which means that in >> flush_stashed_error_work(), journal will be NULL, which means we won't >> call start_this_handle and so this change will not make a difference >> given the kernel stack trace in the commit description. >> >> Thanks, >> >>                         - Ted >> . >> > .