Re: [PATCH] xfs: preserve i_mode if __xfs_set_acl() fails

From: "Ernesto A. Fernández" <ernesto.mnd.fernandez@gmail.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	"Zorro Lang" <zlang@redhat.com>,
	"Ernesto A. Fernández" <ernesto.mnd.fernandez@gmail.com>
Subject: Re: [PATCH] xfs: preserve i_mode if __xfs_set_acl() fails
Date: Tue, 15 Aug 2017 16:29:39 -0300	[thread overview]
Message-ID: <20170815192938.GA1247@debian.home> (raw)
In-Reply-To: <20170815084430.GK21024@dastard>

On Tue, Aug 15, 2017 at 06:44:30PM +1000, Dave Chinner wrote:
> On Tue, Aug 15, 2017 at 03:18:58AM -0300, Ernesto A. Fernández wrote:
> > Hi Dave, thanks for the review!
> > 
> > On Tue, Aug 15, 2017 at 12:00:18PM +1000, Dave Chinner wrote:
> > > On Mon, Aug 14, 2017 at 05:18:10AM -0300, Ernesto A. Fernández wrote:
> > > > diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> > > > index de7b9bd..c3193e1 100644
> > > > --- a/fs/xfs/libxfs/xfs_attr.c
> > > > +++ b/fs/xfs/libxfs/xfs_attr.c
> > > > @@ -395,6 +395,7 @@ xfs_attr_remove(
> > > >  	struct xfs_defer_ops	dfops;
> > > >  	xfs_fsblock_t		firstblock;
> > > >  	int			error;
> > > > +	int			noattr = 0;
> > > 
> > > bool.
> > > 
> > > And make it "bool has_attr = true" to avoid the nasty double
> > > negative of "!noattr" later on.
> > 
> > I called it noattr because the error was -ENOATTR, but you are right,
> > has_attr is better. I assume you would prefer mode_changed to be a bool
> > as well?
> > 
> > > 
> > > >  
> > > >  	XFS_STATS_INC(mp, xs_attr_remove);
> > > >  
> > > > @@ -438,7 +439,12 @@ xfs_attr_remove(
> > > >  	xfs_trans_ijoin(args.trans, dp, 0);
> > > >  
> > > >  	if (!xfs_inode_hasattr(dp)) {
> > > > -		error = -ENOATTR;
> > > > +		/*
> > > > +		 * Commit even if there is no attr to remove, because when
> > > > +		 * setting ACLs the caller may be changing the mode in the
> > > > +		 * same transaction.
> > > > +		 */
> > > > +		noattr = 1;
> > > >  	} else if (dp->i_d.di_aformat == XFS_DINODE_FMT_LOCAL) {
> > > >  		ASSERT(dp->i_afp->if_flags & XFS_IFINLINE);
> > > >  		error = xfs_attr_shortform_remove(&args);
> > > > @@ -458,7 +464,7 @@ xfs_attr_remove(
> > > >  	if (mp->m_flags & XFS_MOUNT_WSYNC)
> > > >  		xfs_trans_set_sync(args.trans);
> > > >  
> > > > -	if ((flags & ATTR_KERNOTIME) == 0)
> > > > +	if ((flags & ATTR_KERNOTIME) == 0 && !noattr)
> > > >  		xfs_trans_ichgtime(args.trans, dp, XFS_ICHGTIME_CHG);
> > > 
> > > Why?
> > 
> > Because we don't want the ctime to change when you try to delete an
> > attribute that didn't exist in the first place. At least that's how
> > it went before, so I tried to keep it the same.
> 
> Which means the logic is convoluted and no longer obvious as to what
> it does. i.e. the noattr case is running a transaction that dirties
> an inode that no modifications were made to. Alarm bells right
> there.
> 
> > > > @@ -257,16 +242,20 @@ xfs_set_acl(struct inode *inode, struct posix_acl *acl, int type)
> > > >  		return error;
> > > >  
> > > >  	if (type == ACL_TYPE_ACCESS) {
> > > > -		umode_t mode;
> > > > -
> > > > -		error = posix_acl_update_mode(inode, &mode, &acl);
> > > > -		if (error)
> > > > -			return error;
> > > > -		error = xfs_set_mode(inode, mode);
> > > > +		error = posix_acl_update_mode(inode, &inode->i_mode, &acl);
> > > >  		if (error)
> > > >  			return error;
> > > > +		inode->i_ctime = current_time(inode);
> > > > +		XFS_STATS_INC(XFS_I(inode)->i_mount, xs_ig_attrchg);
> > > > +		mode_changed = 1;
> > > >  	}
> > > 
> > > And this doesn't.
> > > 
> > > So this breaks all our rules for inode metadata updates, both for
> > > runtime access serialisation and for crash consistency.
> > 
> > I'm only making changes to the vfs inode, and xfs_set_acl() should always
> > be called with i_rwsem held for writing. I think that should be enough to
> > serialize access to the vfs inode fields.
> 
> It's not. The i_rwsem does not protect XFS inode metadata, which is
> what needs changing here. i.e. we change both the VFS inode and
> the corresponding XFS inode metadata at the same time, under the
> same XFS_ILOCK_EXCL lock, only inside a transaction context.
> 
> That's the rules, no exceptions.
> 
> > I'm sorry, I'm not sure I follow the crash consistency part. This is not
> > writing anything to disk.
> 
> xfs_set_mode() runs a transaction that is journalled to change the
> inode mode. You break in-memory consistency by modifying the VFS
> inode state directly without holding the correct locks (see
> xfs_setattr_mode()), and you break the transactional change model
> (i.e. on-disk journal consistency and on-disk metadata change
> ordering) by making metadata changes this outside transaction
> contexts and without holding the correct locks.
> 
> > My expectation was that the changes I made to
> > the vfs inode would be attached to the transaction later, when calling
> > xfs_trans_log_inode() in xfs_attr_set() or xfs_attr_remove(). Would that
> > not work as intended?
> 
> No, it wouldn't.  You have to follow the filesystem specific update
> rules to change inode metadata, not make up your own....
> 
> > I was mostly imitating how btrfs sets an ACL, so I
> > could be misunderstanding how xfs does journaling.
> 
> ... and btrfs has a completely different transaction model and
> implementation to XFS. IOWs, trying to do what works for btrfs in
> XFS code will break XFS. And vice versa.
> 
> > > >   set_acl:
> > > > -	return __xfs_set_acl(inode, acl, type);
> > > > +	error = __xfs_set_acl(inode, acl, type);
> > > > +	if (error && mode_changed) {
> > > > +		inode->i_mode = old_mode;
> > > > +		inode->i_ctime = old_ctime;
> > > > +		XFS_STATS_DEC(XFS_I(inode)->i_mount, xs_ig_attrchg);
> > > > +	}
> > > 
> > > Same here.
> > 
> > i_rwsem would still be held here. My reasoning is that, since there was
> > an error, no changes were actually committed; so there would be no need
> > to commit the restoration either, just restore the old in-memory vfs inode
> > fields. Once again, this is inspired by btrfs, so I suppose it may fail
> > here in some way I can't see.
> > 
> > (I should probably clarify I did run tests before submitting)
> 
> When you treat the code as a black box for testing, then end result
> might appear to be what you want. That doesn't mean the
> implementation is correct or that hidden transient states within the
> block box are incorrect or invalid.
> 
> > > IOWs, you can't just remove a transactional modification of inode
> > > state and replace it with in-memory VFS inode level changes. You
> > > need to do the changes in transaction context and with the correct
> > > locks held.
> > >
> > > IOWs, if you need to undo the mode changes xfs_set_mode() did, you
> > > need to call xfs_set_mode() again. And because modifications have
> > 
> > But that could fail as well, and how would we handle that?
> 
> If the xattr change was not a fatal error (which will shut down the
> filesystem), then calling xfs_set_mode() should work just fine as it
> doesn't require any new space to be allocated (XFS != BTRFS - we can
> run overwriting transactions even at ENOSPC).
> 
> And you need to run the mode undo as a separate transaction
> because....
> 
> > If this bug
> > is to be fixed the whole thing needs to be a single transaction.
> 
> .... this is simply not possible. Why? Because __xfs_set_acl() can
> run up three separate transactions internally in adding the new
> xattr. That means your unlogged vfs indoe change has already been
> logged to the journal and so we *require* a transaction to overwrite
> the new mode that has already been committed to the journal.

It's clear that I don't understand xfs, and I'm causing more trouble than
I'm helping. I should leave this issue to you guys.

> But I have to ask - why do we even need to modify the mode first?
> Why not change the ACL first, then modify the mode+timestamp? If
> setting the ACL fails, then we don't have anything to undo and all
> is good....

I intended for the mode to be committed as part of the same transaction
that sets or removes the ACL. In my mind making the changes later, as part
of a separate transaction, would have meant that a crash between the two
left the filesystem in an inconsistent state, with a new ACL but without
the corresponding mode bits.

Of course that was based on a poor understanding of xfs journaling, so now
I don't know.

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Thanks,
Ernest