From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1761243AbdEVTJt (ORCPT <rfc822;w@1wt.eu>);
        Mon, 22 May 2017 15:09:49 -0400
Received: from mail-qt0-f177.google.com ([209.85.216.177]:33911 "EHLO
        mail-qt0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1761227AbdEVTJk (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Mon, 22 May 2017 15:09:40 -0400
Message-ID: <1495480173.2816.21.camel@redhat.com>
Subject: Re: [PATCH v4 15/27] fs: retrofit old error reporting API onto new
 infrastructure
From: Jeff Layton <jlayton@redhat.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Date: Mon, 22 May 2017 15:09:33 -0400
In-Reply-To: <20170522175350.GF25118@quack2.suse.cz>
References: <20170509154930.29524-1-jlayton@redhat.com>
         <20170509154930.29524-16-jlayton@redhat.com>
         <20170515104246.GC16182@quack2.suse.cz>
         <1495221652.2889.13.camel@redhat.com>
         <20170522133803.GB25118@quack2.suse.cz>
         <1495461201.2816.4.camel@redhat.com>
         <20170522175350.GF25118@quack2.suse.cz>
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.22.6 (3.22.6-2.fc25) 
Mime-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, 2017-05-22 at 19:53 +0200, Jan Kara wrote:
> On Mon 22-05-17 09:53:21, Jeff Layton wrote:
> > On Mon, 2017-05-22 at 15:38 +0200, Jan Kara wrote:
> > > On Fri 19-05-17 15:20:52, Jeff Layton wrote:
> > > > On Mon, 2017-05-15 at 12:42 +0200, Jan Kara wrote:
> > > > > On Tue 09-05-17 11:49:18, Jeff Layton wrote:
> > > > > > Now that we have a better way to store and report errors that occur
> > > > > > during writeback, we need to convert the existing codebase to use it. We
> > > > > > could just adapt all of the filesystem code and related infrastructure
> > > > > > to the new API, but that's a lot of churn.
> > > > > > 
> > > > > > When it comes to setting errors in the mapping, filemap_set_wb_error is
> > > > > > a drop-in replacement for mapping_set_error. Turn that function into a
> > > > > > simple wrapper around the new one.
> > > > > > 
> > > > > > Because we want to ensure that writeback errors are always reported at
> > > > > > fsync time, inject filemap_report_wb_error calls much closer to the
> > > > > > syscall boundary, in call_fsync.
> > > > > > 
> > > > > > For fsync calls (and things like the nfsd equivalent), we either return
> > > > > > the error that the fsync operation returns, or the one returned by
> > > > > > filemap_report_wb_error. In both cases, we advance the file->f_wb_err to
> > > > > > the latest value. This allows us to provide new fsync semantics that
> > > > > > will return errors that may have occurred previously and been viewed
> > > > > > via other file descriptors.
> > > > > > 
> > > > > > The final piece of the puzzle is what to do about filemap_check_errors
> > > > > > calls that are being called directly or via filemap_* functions. Here,
> > > > > > we must take a little "creative license".
> > > > > > 
> > > > > > Since we now handle advancing the file->f_wb_err value at the generic
> > > > > > filesystem layer, we no longer need those callers to clear errors out
> > > > > > of the mapping or advance an errseq_t.
> > > > > > 
> > > > > > A lot of the existing codebase relies on being getting an error back
> > > > > > from those functions when there is a writeback problem, so we do still
> > > > > > want to have them report writeback errors somehow.
> > > > > > 
> > > > > > When reporting writeback errors, we will always report errors that have
> > > > > > occurred since a particular point in time. With the old writeback error
> > > > > > reporting, the time we used was "since it was last tested/cleared" which
> > > > > > is entirely arbitrary and potentially racy. Now, we can at least report
> > > > > > the latest error that has occurred since an arbitrary point in time
> > > > > > (represented as a sampled errseq_t value).
> > > > > > 
> > > > > > In the case where we don't have a struct file to work with, this patch
> > > > > > just has the wrappers sample the current mapping->wb_err value, and use
> > > > > > that as an arbitrary point from which to check for errors.
> > > > > 
> > > > > I think this is really dangerous and we shouldn't do this. You are quite
> > > > > likely to lose IO errors in such calls because you will ignore all errors
> > > > > that happened during previous background writeback or even for IO that
> > > > > managed to complete before we called filemap_fdatawait(). Maybe we need to
> > > > > keep the original set-clear-bit IO error reporting for these cases, until
> > > > > we can convert them to fdatawait_range_since()?
> > > > > 
> > > > > > That's probably not "correct" in all cases, particularly in the case of
> > > > > > something like filemap_fdatawait, but I'm not sure it's any worse than
> > > > > > what we already have, and this gives us a basis from which to work.
> > > > > > 
> > > > > > A lot of those callers will likely want to change to a model where they
> > > > > > sample the errseq_t much earlier (perhaps when starting a transaction),
> > > > > > store it in an appropriate place and then use that value later when
> > > > > > checking to see if an error occurred.
> > > > > > 
> > > > > > That will almost certainly take some involvement from other subsystem
> > > > > > maintainers. I'm quite open to adding new API functions to help enable
> > > > > > this if that would be helpful, but I don't really want to do that until
> > > > > > I better understand what's needed.
> > > > > > 
> > > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > > 
> > > > > ...
> > > > > 
> > > > > > diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> > > > > > index 5f7317875a67..7ce13281925f 100644
> > > > > > --- a/fs/f2fs/file.c
> > > > > > +++ b/fs/f2fs/file.c
> > > > > > @@ -187,6 +187,7 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
> > > > > >  		.nr_to_write = LONG_MAX,
> > > > > >  		.for_reclaim = 0,
> > > > > >  	};
> > > > > > +	errseq_t since = READ_ONCE(file->f_wb_err);
> > > > > >  
> > > > > >  	if (unlikely(f2fs_readonly(inode->i_sb)))
> > > > > >  		return 0;
> > > > > > @@ -265,6 +266,8 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
> > > > > >  	}
> > > > > >  
> > > > > >  	ret = wait_on_node_pages_writeback(sbi, ino);
> > > > > > +	if (ret == 0)
> > > > > > +		ret = filemap_check_wb_error(NODE_MAPPING(sbi), since);
> > > > > >  	if (ret)
> > > > > >  		goto out;
> > > > > 
> > > > > So this conversion looks wrong and actually points to a larger issue with
> > > > > the scheme. The problem is there are two mappings that come into play here
> > > > > - file_inode(file)->i_mapping which is the data mapping and
> > > > > NODE_MAPPING(sbi) which is the metadata mapping (and this is not a problem
> > > > > specific to f2fs. For example ext2 also uses this scheme where block
> > > > > devices' mapping is the metadata mapping). And we need to merge error
> > > > > information from these two mappings so for the stamping scheme to work,
> > > > > we'd need two stamps stored in struct file. One for data mapping and one
> > > > > for metadata mapping. Or maybe there's some more clever scheme but for now
> > > > > I don't see one...
> > > > > 
> > > > > 								Honza
> > > > 
> > > > In the case of something like ext2, could we instead get away with just
> > > > marking the data mapping of the inode with an error if the metadata
> > > > writeout fails?
> > > > 
> > > > Then we could just have write_inode operations call mapping_set_error on
> > > > inode->i_mapping when they're going to return an error. That should be
> > > > functionally equivalent, I'd think.
> > > > 
> > > > The catch there is that that requires a 1:1 data:metadata mapping, and
> > > > I'm not sure that that is the case (or will always be, even if it is
> > > > now).
> > > 
> > > So for ext2 / ext4 in nojournal mode this should work - we track all
> > > relevant metadata in mapping->private_list. But I cannot really comment
> > > on other filesystems like f2fs...
> > > 
> > 
> > Actually, I think that may be problematic...
> > 
> > We could end up calling ext2_write_inode with sync_mode != WB_SYNC_ALL,
> > which just dirties the buffer without starting writeback. Then, have VM
> > subsystem write back the buffer due to memory pressure and have that
> > fail. Trying to set the error in write_inode would miss that situation.
> 
> Two notes here:
> 
> 1) Inode is a bad example because there isn't 1:1 mapping between buffers
> containing inodes and mappings - one buffer contains several inodes.
> I wanted to add that for inodes specifically it does not matter as they get
> special handling but actually fsync seems to be currently unreliable for
> them - if we first wrote them in WB_SYNC_NONE mode, they will be just
> written in bdev's page cache, but following fsync(2) will do nothing as
> they will be clean. Anyway, this is unrelated problem.
> 

Yes, that's what I was trying to articulate above. I'm not sure it's
unrelated though. Moving to errseq_t based handling there based on the
blockdev mapping seems like it'd solve that. That does require an extra
errseq_t though.

(I assume that on ext2 inode writeback, bh->b_page->mapping->host points
to the bdev inode?)

> 2) For metadata like indirect blocks where you indeed have 1:1 mapping, you
> can do the error setting in ->end_io handler based on bh->b_assoc_map and
> that should do what you need, shouldn't it?
> 

That would probably work, and I think the mark_buffer_write_io_error
function that I was adding should already be doing the right thing
there.

> If I'm indeed right, then for buffers which have 1:1 mapping we are fine
> and if we find a solution for inodes, we could avoid the second errseq_t.
> 

Yeah, I'm just still not seeing a good way to track error in inode
metadata writeback without an extra errseq_t though. I don't suppose
that a buffer holding inode metadata has a list of those inodes, does
it? Then we could walk the list and flag each one with the error.
Without something like that, I think we're stuck with an extra errseq_t.

BTW, thanks for the help so far. I haven't spent much time in local fs
metadata handling up until now, so this has been very helpful.
-- 
Jeff Layton <jlayton@redhat.com>