Subject: Re: IO failure without other (device) error
From: Martin Raiber
To: Qu Wenruo, linux-btrfs@vger.kernel.org
Date: Thu, 1 Jul 2021 17:25:34 +0000
Message-ID: <0102017a631abd46-c29f6d05-e5b2-44b1-a945-53f43026154f-000000@eu-west-1.amazonses.com>
In-Reply-To: <4e6c3598-92b4-30d6-3df8-6b70badbd893@gmx.com>
References: <010201795331ffe5-6933accd-b72f-45f0-be86-c2832b6fe306-000000@eu-west-1.amazonses.com> <0102017a1fead031-e0b49bda-297e-42f8-8fde-5567c5cfdec9-000000@eu-west-1.amazonses.com> <0102017a5e38aa40-fb2774c8-5be1-4022-abfa-c59fe23f46a3-000000@eu-west-1.amazonses.com> <4e6c3598-92b4-30d6-3df8-6b70badbd893@gmx.com>

On 01.07.2021 03:40 Qu Wenruo wrote:
>
>
> On 2021/7/1 2:40 AM, Martin Raiber wrote:
>> On 18.06.2021 18:18 Martin Raiber wrote:
>>> On 10.05.2021 00:14 Martin Raiber wrote:
>>>> I get this (rare) issue where btrfs reports an IO error in
>>>> run_delayed_refs or finish_ordered_io with no underlying device
>>>> errors being reported.
>>>> This is with 5.10.26, but with a few patches like the pcpu ENOMEM fix
>>>> and work-arounds for btrfs ENOSPC issues:
>>>>
>>>> [1885197.101981] systemd-sysv-generator[2324776]: SysV service '/etc/init.d/exim4' lacks a native systemd unit file. Automatically generating a unit file for compatibility. Please update package to include a native systemd unit file, in order to make it more safe and robust.
>>>> [2260628.156893] BTRFS: error (device dm-0) in btrfs_finish_ordered_io:2736: errno=-5 IO failure
>>>> [2260628.156980] BTRFS info (device dm-0): forced readonly
>>>>
>>>> This issue has occurred on two different machines now (on one of them
>>>> twice), both with ECC RAM: one bare metal (where dm-0 is on an NVMe)
>>>> and one in a VM (where dm-0 is a ceph volume).
>>> Just got it again (5.10.43). So I guess the question is how I can trace
>>> where this error comes from... The error message points at
>>> btrfs_csum_file_blocks but nothing beyond that. Grep for EIO and put a
>>> WARN_ON at each location?
>>>
>> Added the WARN_ON -EIOs. And hit it. It points at
>> read_extent_buffer_pages (this time), this part before unlock_exit:
>
> Well, this is quite different from your initial report.
>
> Your initial report is EIO in btrfs_finish_ordered_io(), which happens
> after all data is written back to disk.
>
> But in this particular case, it happens before we submit the data to disk.
>
> In this case, we search the csum tree first, to find the csum for the
> range we want to read, before submitting the read bio.
>
> Thus they are on completely different paths.

Yes, it fails to read the csum, because read_extent_buffer_pages returns
-EIO. I made the, I think, reasonable assumption that there is only one
issue in btrfs where -EIO happens without an actual IO error on the
underlying device. The original issue has line numbers that point at
btrfs_csum_file_blocks, which calls btrfs_lookup_csum, which is in the call
path of this issue. I can't confirm it's the same issue because the
original report didn't have the WARN_ONs in there, so feel free to treat
them as separate issues.

>
>>
>>      for (i = 0; i < num_pages; i++) {
>>          page = eb->pages[i];
>>          wait_on_page_locked(page);
>>          if (!PageUptodate(page))
>>              -->ret = -EIO;
>>      }
>>
>> Complete dmesg output. In this instance it seems to not be able to read
>> a csum. It doesn't go read only in this case... Maybe it should?
>>
>> [Wed Jun 30 10:31:11 2021] kernel log
>
> For this particular case, btrfs first can't find the csum for the range
> of the read, and just leaves the csum as all zeros and continues.
>
> Then the data read from disk will definitely cause a csum mismatch.
>
> This normally means a csum tree corruption.
>
> Can you run btrfs-check on that fs?

It didn't "find" the csum because it got an -EIO error reading the extent
where the csum is supposed to be stored. It is not a csum tree corruption,
because that would cause different log messages, like transids not matching
or the csums of tree nodes being wrong, I think.

Sorry, the file is long deleted. Scrub comes back clean, and I guess the
-EIO error causing the csum read failure was only transient anyway.

I'm not sufficiently familiar with the btrfs/block device/mm subsystems,
obviously, but here is one guess at what could be wrong.
read_extent_buffer_pages waits for completion of the extent buffer page
read like this:

    wait_on_page_locked(page);
    if (!PageUptodate(page))
        ret = -EIO;

while filemap.c reads a page like this:

    wait_on_page_locked(page);
    if (PageUptodate(page))
        goto out;

    lock_page(page);
    if (!page->mapping) {
        unlock_page(page);
        put_page(page);
        goto repeat;
    }

    /* Someone else locked and filled the page in a very small window */
    if (PageUptodate(page)) {
        unlock_page(page);
        goto out;
    }

with this comment:

> /*
>  * Page is not up to date and may be locked due to one of the following
>  * case a: Page is being filled and the page lock is held
>  * case b: Read/write error clearing the page uptodate status
>  * case c: Truncation in progress (page locked)
>  * case d: Reclaim in progress
>  *
>  * Case a, the page will be up to date when the page is unlocked.
>  * There is no need to serialise on the page lock here as the page
>  * is pinned so the lock gives no additional protection. Even if the
>  * page is truncated, the data is still valid if PageUptodate as
>  * it's a race vs truncate race.
>  * Case b, the page will not be up to date
>  * Case c, the page may be truncated but in itself, the data may still
>  * be valid after IO completes as it's a read vs truncate race. The
>  * operation must restart if the page is not uptodate on unlock but
>  * otherwise serialising on page lock to stabilise the mapping gives
>  * no additional guarantees to the caller as the page lock is
>  * released before return.
>  * Case d, similar to truncation. If reclaim holds the page lock, it
>  * will be a race with remove_mapping that determines if the mapping
>  * is valid on unlock but otherwise the data is valid and there is
>  * no need to serialise with page lock.
>  *
>  * As the page lock gives no additional guarantee, we optimistically
>  * wait on the page to be unlocked and check if it's up to date and
>  * use the page if it is. Otherwise, the page lock is required to
>  * distinguish between the different cases. The motivation is that we
>  * avoid spurious serialisations and wakeups when multiple processes
>  * wait on the same page for IO to complete.
>  */

So maybe the extent buffer page gets e.g. reclaimed in the small window
between the unlock and the PageUptodate check? Another option is case b
(read/write error), but the NVMe/dm subsystem doesn't log any error for
some reason. I guess I could add the lock and the checks for mapping and
PageError(page) to narrow it down further?
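To make that last point concrete, here is a rough sketch of the kind of
instrumentation I have in mind for the wait loop in
read_extent_buffer_pages (based on the 5.10-era code quoted above). The
extra lock_page()/PageError()/mapping checks, the message strings, and
the exact placement are my assumptions for a diagnostic only, not a
tested or proposed fix:

    /*
     * Sketch only: instrument the read_extent_buffer_pages wait loop
     * (fs/btrfs/extent_io.c) to tell apart the cases from the
     * filemap.c comment when the page is not uptodate.
     */
    for (i = 0; i < num_pages; i++) {
        page = eb->pages[i];
        wait_on_page_locked(page);
        if (PageUptodate(page))
            continue;

        /* Take the page lock so the state we inspect is stable. */
        lock_page(page);

        if (PageError(page))
            /* case b: a read/write error cleared the uptodate bit */
            pr_warn("btrfs: eb page %d has PageError set\n", i);
        else if (!page->mapping)
            /* case c/d: truncate or reclaim removed the mapping */
            pr_warn("btrfs: eb page %d lost its mapping\n", i);
        else if (PageUptodate(page))
            /* page became uptodate between the unlock and our check */
            pr_warn("btrfs: eb page %d uptodate after relock\n", i);

        if (!PageUptodate(page)) {
            WARN_ON(1);
            ret = -EIO;
        }
        unlock_page(page);
    }

If the PageError branch fires without anything showing up in the block
layer or dm logs, that would at least narrow it down to the page state
handling rather than an actual device error; if the mapping is gone, it
points more towards the reclaim/truncate race guessed at above.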