From: Martin Raiber <martin@urbackup.org>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: IO failure without other (device) error
Date: Thu, 1 Jul 2021 17:25:34 +0000
Message-ID: <0102017a631abd46-c29f6d05-e5b2-44b1-a945-53f43026154f-000000@eu-west-1.amazonses.com>
In-Reply-To: <4e6c3598-92b4-30d6-3df8-6b70badbd893@gmx.com>

On 01.07.2021 03:40 Qu Wenruo wrote:
>
>
> On 2021/7/1 2:40 AM, Martin Raiber wrote:
>> On 18.06.2021 18:18 Martin Raiber wrote:
>>> On 10.05.2021 00:14 Martin Raiber wrote:
>>>> I get this (rare) issue where btrfs reports an IO error in run_delayed_refs or finish_ordered_io with no underlying device errors being reported. This is with 5.10.26 but with a few patches like the pcpu ENOMEM fix or work-arounds for btrfs ENOSPC issues:
>>>>
>>>> [1885197.101981] systemd-sysv-generator[2324776]: SysV service '/etc/init.d/exim4' lacks a native systemd unit file. Automatically generating a unit file for compatibility. Please update package to include a native systemd unit file, in order to make it more safe and robust.
>>>> [2260628.156893] BTRFS: error (device dm-0) in btrfs_finish_ordered_io:2736: errno=-5 IO failure
>>>> [2260628.156980] BTRFS info (device dm-0): forced readonly
>>>>
>>>> This issue has now occurred on two different machines (twice on one of them). Both have ECC RAM. One is bare metal (where dm-0 is on an NVMe) and one is a VM (where dm-0 is a ceph volume).
>>> Just got it again (5.10.43). So I guess the question is how can I trace where this error comes from... The error message points at btrfs_csum_file_blocks but nothing beyond that. Grep for EIO and put a WARN_ON at each location?
>>>
>> I added WARN_ONs at the -EIO sites and hit one. It points at read_extent_buffer_pages (this time), this part before unlock_exit:
>
> Well, this is quite different from your initial report.
>
> Your initial report is EIO in btrfs_finish_ordered_io(), which happens
> after all data is written back to disk.
>
> But in this particular case, it happens before we submit the data to disk.
>
> In this case, we search the csum tree first, to find the csum for the range
> we want to read, before submitting the read bio.
>
> Thus they are on completely different paths.
Yes, it fails to read the csum because read_extent_buffer_pages returns -EIO. I made the (I think reasonable) assumption that there is only one issue in btrfs where -EIO happens without an actual IO error on the underlying device. The original issue has line numbers that point at btrfs_csum_file_blocks, which calls btrfs_lookup_csum, which is in the call path of this issue. I can't confirm it's the same issue because the original report didn't have the WARN_ONs in place, so feel free to treat them as separate issues.
>
>>
>>      for (i = 0; i < num_pages; i++) {
>>          page = eb->pages[i];
>>          wait_on_page_locked(page);
>>          if (!PageUptodate(page))
>>              -->ret = -EIO;
>>      }
>>
>> Complete dmesg output. In this instance it seems to not be able to read a csum. It doesn't go read only in this case... Maybe it should?
>>
>> [Wed Jun 30 10:31:11 2021] kernel log
>
> For this particular case, btrfs first can't find the csum for the range
> being read, and just leaves the csum as all zeros and continues.
>
> Then the data read from disk will definitely cause a csum mismatch.
>
> This normally means a csum tree corruption.
>
> Can you run btrfs-check on that fs?

It didn't "find" the csum because it got an -EIO error when reading the extent where the csum is supposed to be stored. I don't think it is a csum tree corruption, because that would cause different log messages, such as transid not matching or the csum of tree nodes being wrong.

Sorry, the file is long deleted. Scrub comes back as clean and I guess the -EIO error causing the csum read failure was only transient anyway.

I'm obviously not sufficiently familiar with the btrfs, block device, and mm subsystems, but here is one guess at what could be wrong.

btrfs waits for the read of the extent buffer page to complete like this:

wait_on_page_locked(page);
if (!PageUptodate(page))
    ret = -EIO;

while in filemap.c it reads a page like this:

wait_on_page_locked(page);
if (PageUptodate(page))
    goto out;
lock_page(page);
if (!page->mapping) {
    unlock_page(page);
    put_page(page);
    goto repeat;
}
/* Someone else locked and filled the page in a very small window */
if (PageUptodate(page)) {
    unlock_page(page);
    goto out;
}

With the comment:

> /*
> * Page is not up to date and may be locked due to one of the following
> * case a: Page is being filled and the page lock is held
> * case b: Read/write error clearing the page uptodate status
> * case c: Truncation in progress (page locked)
> * case d: Reclaim in progress
> *
> * Case a, the page will be up to date when the page is unlocked.
> * There is no need to serialise on the page lock here as the page
> * is pinned so the lock gives no additional protection. Even if the
> * page is truncated, the data is still valid if PageUptodate as
> * it's a race vs truncate race.
> * Case b, the page will not be up to date
> * Case c, the page may be truncated but in itself, the data may still
> * be valid after IO completes as it's a read vs truncate race. The
> * operation must restart if the page is not uptodate on unlock but
> * otherwise serialising on page lock to stabilise the mapping gives
> * no additional guarantees to the caller as the page lock is
> * released before return.
> * Case d, similar to truncation. If reclaim holds the page lock, it
> * will be a race with remove_mapping that determines if the mapping
> * is valid on unlock but otherwise the data is valid and there is
> * no need to serialise with page lock.
> *
> * As the page lock gives no additional guarantee, we optimistically
> * wait on the page to be unlocked and check if it's up to date and
> * use the page if it is. Otherwise, the page lock is required to
> * distinguish between the different cases. The motivation is that we
> * avoid spurious serialisations and wakeups when multiple processes
> * wait on the same page for IO to complete.
> */
So maybe the extent buffer page gets, for example, reclaimed in the small window between the unlock and the PageUptodate check?

Another option is case b (read/write error), with the NVMe/dm subsystem not logging any error for some reason.

I guess I could take the page lock and check page->mapping and PageError(page) to narrow it down further?
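
A rough user-space sketch of that narrowing-down logic follows. It is not kernel code: struct mock_page, enum eb_read_result and classify_page() are hypothetical names made up for illustration, and the three fields only stand in for PageUptodate(page), PageError(page) and page->mapping. In the kernel the equivalent checks would presumably have to run under lock_page() after wait_on_page_locked() returns, mirroring the order used in filemap.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few pieces of struct page state of interest. */
struct mock_page {
    bool uptodate;        /* models PageUptodate(page) */
    bool error;           /* models PageError(page) */
    const void *mapping;  /* models page->mapping; NULL once detached */
};

/* Possible explanations for what is seen after the page is unlocked. */
enum eb_read_result {
    EB_READ_OK,          /* page is uptodate (possibly filled by someone else) */
    EB_READ_DETACHED,    /* mapping is gone: truncate or reclaim (cases c/d) */
    EB_READ_IO_ERROR,    /* error flag set: read/write error (case b) */
    EB_READ_UNEXPLAINED  /* not uptodate, no error flag, still mapped */
};

/*
 * Classify a page in the same order as the filemap.c snippet:
 * uptodate first, then the mapping, then the error flag.
 */
static enum eb_read_result classify_page(const struct mock_page *page)
{
    assert(page != NULL);

    if (page->uptodate)
        return EB_READ_OK;
    if (page->mapping == NULL)
        return EB_READ_DETACHED;
    if (page->error)
        return EB_READ_IO_ERROR;
    return EB_READ_UNEXPLAINED;
}
```

With this model, the current btrfs check would report -EIO for all three non-OK results alike. Logging which of the three was hit (for example from a WARN_ON) is what would separate a reclaim/truncate race from a real but unlogged IO error.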

