From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from ipmail01.adl2.internode.on.net ([150.101.137.133]:6728 "EHLO ipmail01.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725755AbeKAICk (ORCPT ); Thu, 1 Nov 2018 04:02:40 -0400 Date: Thu, 1 Nov 2018 10:02:25 +1100 From: Dave Chinner Subject: Re: [PATCH 2/2] xfs: don't preempt writeback sequence on single page wb error Message-ID: <20181031230225.GX19305@dastard> References: <20181031140155.17996-1-bfoster@redhat.com> <20181031140155.17996-3-bfoster@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20181031140155.17996-3-bfoster@redhat.com> Sender: linux-xfs-owner@vger.kernel.org List-ID: List-Id: xfs To: Brian Foster Cc: linux-xfs@vger.kernel.org On Wed, Oct 31, 2018 at 10:01:55AM -0400, Brian Foster wrote: > xfs_do_writepage() currently returns errors directly regardless of > whether it is called via ->writepages() or ->writepage(). In the > case of ->writepages(), an xfs_do_writepage() error return breaks > the current writeback sequence in write_cache_pages(). This means > that an integrity writeback (i.e., sync), for example, returns > before all associated pages have been processed. That sounds like a bug in write_cache_pages(). It sends pages one at a time to the writepage context, and it is supposed to iterate the entire range on a data sync operation. If you look at the code, is clearly stopping at the first error. IOWs, every filesystem that uses write_cache_pages() for data integrity writes is broken in the same way. And a quick look as fs specific writepages implementations indicates that ext4_writepages() and ibtrfs's extent_write_cache_pages() have the same issue. > This can be problematic in cases like unmount. If the writeback > doesn't process all delalloc pages before unmounting, we end up > reclaiming inodes with non-zero delalloc block counts. In turn, this > breaks block accounting and leaves the fs inconsistent. XFS is probably the only filesystem that leaves detectable state around and then detects and reports it.... > XFS explicitly discards delalloc blocks on such writepage failures > to avoid this problem. This isn't terribly useful if we allow an > integrity writeback to complete (and thus a filesystem to unmount) > without addressing the entire set of dirty pages on an inode. > Therefore, change ->writepage[s]() to track high level error state > in the xfs_writepage_ctx structure and return it from the higher > level operation callout rather than xfs_do_writepage(). This ensures > that write_cache_pages() does not exit prematurely when called via > ->writepages(), but both ->writepage() and ->writepages() still > ultimately return an error for the higher level operation. > > This patch introduces a subtle change in the behavior of background > writeback in the event of persistent errors. The current behavior of > returning an error preempts the background writeback. Writeback > eventually comes around again and repeats the process for a few more > pages (in practice) before it once again fails. This repeats over > and over until the entire set of dirty pages is cleaned. This > behavior results in a somewhat slower stream of "page discard" > errors in the system log and dictates that many repeated fsync calls > may be required before the entire data set is processed and mapping > error consumed. With this change in place, background writeback > executes on as many pages as necessary as if each page writeback > were successful. The pages are cleaned immediately and significantly > more page discard errors can be observed at once. Yeah, this is a good change in behaviour, but I think the implementation is wrong. write_cache_pages() needs to continue iterating the range if WB_SYNC_ALL is set even when errors occur. i.e. the error state should be maintained by write_cache_pages and returned on completion, not require the filesystem to hide errors from wcp in it's own specific writepage structure... Cheers, Dave. -- Dave Chinner david@fromorbit.com