Subject: Re: IO failure without other (device) error
To: Qu Wenruo, "linux-btrfs@vger.kernel.org"
References: <010201795331ffe5-6933accd-b72f-45f0-be86-c2832b6fe306-000000@eu-west-1.amazonses.com>
 <0102017a1fead031-e0b49bda-297e-42f8-8fde-5567c5cfdec9-000000@eu-west-1.amazonses.com>
 <0102017a5e38aa40-fb2774c8-5be1-4022-abfa-c59fe23f46a3-000000@eu-west-1.amazonses.com>
 <4e6c3598-92b4-30d6-3df8-6b70badbd893@gmx.com>
 <0102017a631abd46-c29f6d05-e5b2-44b1-a945-53f43026154f-000000@eu-west-1.amazonses.com>
 <6422009e-be80-f753-2243-2a13178a1763@gmx.com>
 <0102017a680d637c-4a958f96-dd8b-433d-84a6-8fc6a84abf47-000000@eu-west-1.amazonses.com>
From: Martin Raiber
Message-ID: <0102017a86e63e18-a5023d15-0b20-43e8-b71c-6dd241451179-000000@eu-west-1.amazonses.com>
Date: Thu, 8 Jul 2021 16:14:34 +0000
X-Mailing-List: linux-btrfs@vger.kernel.org

On 03.07.2021 00:46 Qu Wenruo wrote:
>
>
> On 2021/7/3 12:29 AM, Martin Raiber wrote:
>> On 02.07.2021 00:19 Qu Wenruo wrote:
>>>
>>>
>>> On 2021/7/2 1:25 AM, Martin Raiber wrote:
>>>> On 01.07.2021 03:40 Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2021/7/1 2:40 AM, Martin Raiber wrote:
>>>>>> On 18.06.2021 18:18 Martin Raiber wrote:
>>>>>>> On 10.05.2021 00:14 Martin Raiber wrote:
>>>>>>>> I get this (rare) issue where btrfs reports an IO error in run_delayed_refs or finish_ordered_io with no underlying device errors being reported. This is with 5.10.26 but with a few patches like the pcpu ENOMEM fix or work-arounds for btrfs ENOSPC issues:
>>>>>>>>
>>>>>>>> [1885197.101981] systemd-sysv-generator[2324776]: SysV service '/etc/init.d/exim4' lacks a native systemd unit file. Automatically generating a unit file for compatibility. Please update package to include a native systemd unit file, in order to make it more safe and robust.
>>>>>>>> [2260628.156893] BTRFS: error (device dm-0) in btrfs_finish_ordered_io:2736: errno=-5 IO failure
>>>>>>>> [2260628.156980] BTRFS info (device dm-0): forced readonly
>>>>>>>>
>>>>>>>> This issue occurred on two different machines now (on one twice). Both with ECC RAM. One bare metal (where dm-0 is on an NVMe) and one in a VM (where dm-0 is a ceph volume).
>>>>>>> Just got it again (5.10.43). So I guess the question is how I can trace where this error comes from... The error message points at btrfs_csum_file_blocks but nothing beyond that. Grep for EIO and put a WARN_ON at each location?
>>>>>>>
>>>>>> Added the WARN_ON -EIOs. And hit it. It points at read_extent_buffer_pages (this time), this part before unlock_exit:
>>>>>
>>>>> Well, this is quite different from your initial report.
>>>>>
>>>>> Your initial report is an EIO in btrfs_finish_ordered_io(), which happens
>>>>> after all data is written back to disk.
>>>>>
>>>>> But in this particular case, it happens before we submit the data to disk.
>>>>>
>>>>> In this case, we search the csum tree first, to find the csum for the range
>>>>> we want to read, before submitting the read bio.
>>>>>
>>>>> Thus they are on completely different paths.
>>>> Yes, it fails to read the csum, because read_extent_buffer_pages returns -EIO. I made the, I think, reasonable assumption that there is only one issue in btrfs where -EIO happens without an actual IO error on the underlying device. The original issue has line numbers that point at btrfs_csum_file_blocks, which calls btrfs_lookup_csum, which is in the call path of this issue. I can't confirm it's the same issue because the original report didn't have the WARN_ONs in there, so feel free to treat them as separate issues.
>>>>>
>>>>>>
>>>>>>        for (i = 0; i < num_pages; i++) {
>>>>>>            page = eb->pages[i];
>>>>>>            wait_on_page_locked(page);
>>>>>>            if (!PageUptodate(page))
>>>>>>                -->ret = -EIO;
>>>>>>        }
>>>>>>
>>>>>> Complete dmesg output. In this instance it seems to not be able to read a csum. It doesn't go read-only in this case... Maybe it should?
>>>>>>
>>>>>> [Wed Jun 30 10:31:11 2021] kernel log
>>>>>
>>>>> For this particular case, btrfs first can't find the csum for the range
>>>>> of the read, and just leaves the csum as all zeros and continues.
>>>>>
>>>>> Then the data read from disk will definitely cause a csum mismatch.
>>>>>
>>>>> This normally means a csum tree corruption.
>>>>>
>>>>> Can you run btrfs-check on that fs?
>>>>
>>>> It didn't "find" the csum because it got an -EIO error reading the extent where the csum is supposed to be stored. It is not a csum tree corruption, because that would cause different log messages, like transids not matching or the csum of tree nodes being wrong, I think.
>>>
>>> Yes, that's what I expect, and feel strange about.
>>>>
>>>> Sorry, the file is long deleted.
>>>> Scrub comes back as clean, and I guess the -EIO error causing the csum read failure was only transient anyway.
>>>>
>>>> I'm not sufficiently familiar with the btrfs/block device/mm subsystems, obviously, but here is one guess at what could be wrong.
>>>>
>>>> It waits for completion of the read of the extent buffer page like this:
>>>>
>>>> wait_on_page_locked(page);
>>>> if (!PageUptodate(page))
>>>>       ret = -EIO;
>>>>
>>>> while in filemap.c it reads a page like this:
>>>>
>>>> wait_on_page_locked(page);
>>>> if (PageUptodate(page))
>>>>       goto out;
>>>> lock_page(page);
>>>> if (!page->mapping) {
>>>>           unlock_page(page);
>>>>           put_page(page);
>>>>           goto repeat;
>>>> }
>>>
>>> Yes, that's what we do for the data read path, as each time a page gets
>>> unlocked, page invalidation can trigger for the page, and when we
>>> re-lock the page, it may have been invalidated.
>>>
>>> Although the above check has been updated to do extra checks, including
>>> the page->mapping and page->private checks, to be extra sure.
>>>
>>>> /* Someone else locked and filled the page in a very small window */
>>>> if (PageUptodate(page)) {
>>>>           unlock_page(page);
>>>>           goto out;
>>>>
>>>> }
>>>>
>>>> With the comment:
>>>>
>>>>> /*
>>>>> * Page is not up to date and may be locked due to one of the following
>>>>> * case a: Page is being filled and the page lock is held
>>>>> * case b: Read/write error clearing the page uptodate status
>>>>> * case c: Truncation in progress (page locked)
>>>>> * case d: Reclaim in progress
>>>>> *  [...]
>>>>> */
>>>> So maybe the extent buffer page gets e.g. reclaimed in the small window between the unlock and the PageUptodate check?
>>>
>>> But for the metadata case, unlike the data path, we have very limited ways
>>> to invalidate/release a page.
>>>
>>> Unlike the data path, a metadata page uses its page->private as a pointer
>>> to the extent buffer.
>>>
>>> And each time we want to drop a metadata page, we can only do that if
>>> the extent buffer owning the page can be removed from the extent buffer
>>> cache.
>>>
>>> Thus an unlocked metadata page getting released halfway is not expected
>>> behavior at all.
>> I see. It mainly checks extent_buffer->refs, and it increments that after allocating/finding the extent... No further idea what could be the problem...
>>>
>>>>
>>>> Another option is case b (read/write error), but the NVMe/dm subsystem doesn't log any error for some reason.
>>>
>>> I don't believe that's the case either, or we should have a csum mismatch
>>> report from btrfs.
>>>
>>>>
>>>> I guess I could add the lock and check for mapping and PageError(page) to narrow it down further?
>>>>
>>> If you have a proper way to reproduce the bug reliably, I could craft a
>>> diff for you to debug with (with everything output to the ftrace buffer).
>>
>> Unfortunately I can't reproduce it, and it is really rare. I could add further output at that location, e.g. checking for mapping presence, whether the page has an error, or whether it was submitted for read or was Uptodate before?
>>
>> For now I'll just extend the retry logic in btree_read_extent_buffer_pages to try the same mirror multiple times (the problematic btrfs filesystems have the single metadata profile). That should do as a work-around.
>>
> My recommendation for debugging is to add extra trace_printk() calls in
> btree_releasepage(), to see when a metadata page really gets freed, and in
> read_extent_buffer_pages(), to see what the possible race is.
>
> Another idea is to add extra debug output in end_bio_extent_readpage()
> for metadata pages.
>
> As I'm still wondering if there is something wrong that is detected by the
> btrfs module but without any error message.
>
> In that case, we definitely want to add a proper error message.

I guess this would be a proper error message:

--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -6505,8 +6505,14 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
        for (i = 0; i < num_pages; i++) {
                page = eb->pages[i];
                wait_on_page_locked(page);
-               if (!PageUptodate(page))
+               if (!PageUptodate(page)) {
+                       btrfs_err_rl(eb->fs_info,
+                               "error reading extent buffer "
+                               "start %llu len %lu PageError %d",
+                               eb->start, eb->len,
+                               PageError(page) ? 1 : 0);
                        ret = -EIO;
+               }
        }

        return ret;

I haven't added this to the kernel I'm running yet. It currently still has a WARN_ON instead of the error message. I also added the retry like this:

--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -385,6 +385,7 @@ static int btree_read_extent_buffer_pages(struct extent_buffer *eb,
        int num_copies = 0;
        int mirror_num = 0;
        int failed_mirror = 0;
+       int tries = 2;

        io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
        while (1) {
@@ -403,6 +404,10 @@ static int btree_read_extent_buffer_pages(struct extent_buffer *eb,
                num_copies = btrfs_num_copies(fs_info,
                                              eb->start, eb->len);
+
+               if (num_copies == 1 && tries-- > 0)
+                       continue;
+
                if (num_copies == 1)
                        break;

On Tuesday I got this:

[Tue Jul  6 14:49:17 2021] ------------[ cut here ]------------
[Tue Jul  6 14:49:17 2021] WARNING: CPU: 13 PID: 2265463 at fs/btrfs/extent_io.c:5597 read_extent_buffer_pages+0x346/0x360
[Tue Jul  6 14:49:17 2021] Modules linked in: zram bcache crc64 loop dm_crypt bfq xfs dm_mod st sr_mod cdrom bridge stp llc intel_powerclamp coretemp snd_pcm kvm_intel snd_timer snd mgag200 kvm soundcore drm_kms_helper iTCO_wdt dcdbas irqbypass pcspkr serio_raw iTCO_vendor_support i2c_algo_bit evdev joydev i7core_edac sg ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter button ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm configfs ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid0 multipath linear raid1 md_mod ses enclosure sd_mod hid_generic usbhid hid crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper ahci libahci uhci_hcd ehci_pci psmouse mpt3sas ehci_hcd lpc_ich raid_class libata nvme scsi_transport_sas mfd_core nvme_core usbcore t10_pi scsi_mod bnx2
[Tue Jul  6 14:49:17 2021] CPU: 13 PID: 2265463 Comm: btrfs Tainted: G          I       5.10.47 #1
[Tue Jul  6 14:49:17 2021] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.13.0 03/02/2018
[Tue Jul  6 14:49:17 2021] RIP: 0010:read_extent_buffer_pages+0x346/0x360
[Tue Jul  6 14:49:17 2021] Code: 48 8b 43 08 a8 01 48 8d 78 ff 48 0f 44 fb 31 f6 e8 cf d0 db ff 48 8b 43 08 48 8d 50 ff a8 01 48 0f 45 da 48 8b 03 a8 04 75 ab <0f> 0b 41 be fb ff ff ff eb a1 e8 4b af 56 00 66 66 2e 0f 1f 84 00
[Tue Jul  6 14:49:17 2021] RSP: 0018:ffffc9007562bb40 EFLAGS: 00010246
[Tue Jul  6 14:49:17 2021] RAX: 06ffff80000020e3 RBX: ffffea00095a5fc0 RCX: 0000000000000000
[Tue Jul  6 14:49:17 2021] RDX: dead0000000000ff RSI: 0000000000000020 RDI: ffff888713babd40
[Tue Jul  6 14:49:17 2021] RBP: ffff88869a7d6f78 R08: 0000f5a746752ed4 R09: 0000000000000483
[Tue Jul  6 14:49:17 2021] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88869a7d7020
[Tue Jul  6 14:49:17 2021] R13: ffffea0005f51f80 R14: 0000000000000000 R15: ffff88869a7d7020
[Tue Jul  6 14:49:17 2021] FS:  00007f748ddc18c0(0000) GS:ffff888713b80000(0000) knlGS:0000000000000000
[Tue Jul  6 14:49:17 2021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Jul  6 14:49:17 2021] CR2: 00005618b6037058 CR3: 000000011feca004 CR4: 00000000000206e0
[Tue Jul  6 14:49:17 2021] Call Trace:
[Tue Jul  6 14:49:17 2021]  btree_read_extent_buffer_pages+0x66/0x130
[Tue Jul  6 14:49:17 2021]  read_tree_block+0x36/0x60
[Tue Jul  6 14:49:17 2021]  btrfs_read_node_slot+0xc0/0x110
[Tue Jul  6 14:49:17 2021]  btrfs_search_forward+0x1db/0x350
[Tue Jul  6 14:49:17 2021]  search_ioctl+0x19e/0x250
[Tue Jul  6 14:49:17 2021]  btrfs_ioctl_tree_search+0x63/0xc0
[Tue Jul  6 14:49:17 2021]  btrfs_ioctl+0x1874/0x3060
[Tue Jul  6 14:49:17 2021]  ? page_add_new_anon_rmap+0xa3/0x1f0
[Tue Jul  6 14:49:17 2021]  ? handle_mm_fault+0xf6c/0x1950
[Tue Jul  6 14:49:17 2021]  ? __x64_sys_ioctl+0x83/0xb0
[Tue Jul  6 14:49:17 2021]  __x64_sys_ioctl+0x83/0xb0
[Tue Jul  6 14:49:17 2021]  do_syscall_64+0x33/0x80
[Tue Jul  6 14:49:17 2021]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Tue Jul  6 14:49:17 2021] RIP: 0033:0x7f748deb8cc7
[Tue Jul  6 14:49:17 2021] Code: 00 00 00 48 8b 05 c9 91 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 99 91 0c 00 f7 d8 64 89 01 48
[Tue Jul  6 14:49:17 2021] RSP: 002b:00007ffc7e07f098 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Tue Jul  6 14:49:17 2021] RAX: ffffffffffffffda RBX: 00007ffc7e0801d8 RCX: 00007f748deb8cc7
[Tue Jul  6 14:49:17 2021] RDX: 00007ffc7e07f0f8 RSI: 00000000d0009411 RDI: 0000000000000005
[Tue Jul  6 14:49:17 2021] RBP: 0000000000bafaad R08: 000000000000000b R09: 00007f748df82be0
[Tue Jul  6 14:49:17 2021] R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000005
[Tue Jul  6 14:49:17 2021] R13: 00007ffc7e07ffe1 R14: 000000000000000b R15: 00007ffc7e07f160
[Tue Jul  6 14:49:17 2021] ---[ end trace 81f64eb2e9ceb4de ]---

There was no other message afterwards. This means the retry is working and "fixed" the problem for me. I'll keep monitoring this and will try to catch the -EIO from other areas, like the original btrfs_finish_ordered_io.
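
For reference, the kind of instrumentation I have in mind for those other call sites is roughly the following. This is only an untested sketch: the BTRFS_WARN_EIO name is made up (it is not in the tree), btrfs_err_rl() and WARN_ON_ONCE() are the existing helpers, and the call site shown is just a hypothetical example.

/*
 * Untested sketch, not actual kernel code: a helper dropped next to each
 * "ret = -EIO;" assignment (or right after a call that can return it),
 * so the first hit prints a rate-limited message naming the function and
 * line plus a one-shot backtrace.
 */
#define BTRFS_WARN_EIO(fs_info, ret)                                    \
        do {                                                            \
                if ((ret) == -EIO) {                                    \
                        btrfs_err_rl((fs_info),                         \
                                     "unexpected EIO at %s:%d",         \
                                     __func__, __LINE__);               \
                        WARN_ON_ONCE(1);                                \
                }                                                       \
        } while (0)

/* Hypothetical call site, e.g. where the csums are inserted during
 * ordered IO completion:
 *
 *        ret = btrfs_csum_file_blocks(trans, csum_root, sums);
 *        BTRFS_WARN_EIO(fs_info, ret);
 */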