* [PATCH] xfs: don't trigger fsync log force based on inode pin count
@ 2015-04-22 14:37 Brian Foster
  2015-04-22 16:15 ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Foster @ 2015-04-22 14:37 UTC (permalink / raw)
  To: xfs

The fsync() requirements for crash consistency on XFS are to flush file
data and force any in-core inode updates to the log. We currently check
whether the inode is pinned to identify whether the log needs to be
forced, since a non-zero pin count generally represents an inode that
has transactions awaiting a flush to the on-disk log.

This is not sufficient in all cases, however. Reports of xfstests test
generic/311 failures on ppc64/s390x hosts have identified failures to
fsync outstanding inode modifications due to the inode not being pinned
at the time of the fsync. This occurs because certain bmap updates can
complete by logging bmapbt buffers but without ever dirtying (and thus
pinning) the core inode. The following is a specific incarnation of this
problem:

$ mount $dev /mnt -o noatime,nobarrier
$ for i in $(seq 0 2 31); do \
        xfs_io -f -c "falloc $((i * 32768)) 32k" -c fsync /mnt/file; \
	done
$ xfs_io -c "pwrite -S 0 80k 16k" -c fsync -c "pwrite 76k 4k" -c fsync /mnt/file; \
	hexdump /mnt/file; \
	./xfstests-dev/src/godown /mnt
...
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0013000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0014000 0000 0000 0000 0000 0000 0000 0000 0000
*
00f8000
$ umount /mnt; mount ...
$ hexdump /mnt/file
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
00f8000

In short, the unwritten extent conversion for the last write is lost
despite the fact that an fsync executed before the filesystem was
shut down. Note that this is significantly more difficult to reproduce on
CONFIG_HZ=1000 kernels because the problem is masked by the pre-write
cmtime updates committing a transaction against the inode. CONFIG_HZ=100
reduces timer granularity enough to increase the odds that time updates
are skipped and allows this to reproduce within a handful of attempts.

To deal with this problem, kill the xfs_ipincount() check in
xfs_file_fsync(). Make sure to check that the dynamically allocated
ip->i_itemp object exists as previously implied by a non-zero pincount.
The ili_last_lsn check is still safe because it is updated whenever the
inode is attached to a transaction, regardless of whether the inode is
ultimately dirtied. In conjunction, the xfs_bmapi_*() code
unconditionally expects the inode locked and joined to the transaction
on entry.
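
To make the difference between the two checks concrete, here is a minimal
userspace sketch. It is not kernel code: the fsync_model_* names, the
structure layout and the flag values are hypothetical stand-ins, and only the
two conditionals mirror the before/after hunks of the patch. It models an
inode that was joined to a transaction (so a log item and last LSN exist) but
was never dirtied, and therefore never pinned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values; only the TIMESTAMP mask matters for the model. */
#define FSYNC_MODEL_ILOG_CORE      0x01u
#define FSYNC_MODEL_ILOG_TIMESTAMP 0x02u

/* Stand-in for the inode log item (ip->i_itemp in the patch). */
struct fsync_model_log_item {
	unsigned int ili_fields;   /* inode fields dirtied so far */
	uint64_t     ili_last_lsn; /* updated whenever the inode joins a transaction */
};

/* Stand-in for the in-core inode. */
struct fsync_model_inode {
	struct fsync_model_log_item *i_itemp; /* NULL until first joined */
	int pincount;                         /* non-zero only once the inode is dirtied */
};

/* Pre-patch check: the LSN is only considered when the inode is pinned. */
uint64_t fsync_model_lsn_old(const struct fsync_model_inode *ip, bool datasync)
{
	uint64_t lsn = 0;

	if (ip->pincount) {
		if (!datasync ||
		    (ip->i_itemp->ili_fields & ~FSYNC_MODEL_ILOG_TIMESTAMP))
			lsn = ip->i_itemp->ili_last_lsn;
	}
	return lsn;
}

/* Post-patch check: the LSN is considered whenever a log item exists. */
uint64_t fsync_model_lsn_new(const struct fsync_model_inode *ip, bool datasync)
{
	uint64_t lsn = 0;

	if (ip->i_itemp &&
	    (!datasync || (ip->i_itemp->ili_fields & ~FSYNC_MODEL_ILOG_TIMESTAMP)))
		lsn = ip->i_itemp->ili_last_lsn;
	return lsn;
}

/* Usage: a bmapbt-only update leaves the inode joined but unpinned. */
void fsync_model_demo(void)
{
	struct fsync_model_log_item li = { .ili_fields = 0, .ili_last_lsn = 42 };
	struct fsync_model_inode ip = { .i_itemp = &li, .pincount = 0 };

	assert(fsync_model_lsn_old(&ip, false) == 0);  /* no log force requested */
	assert(fsync_model_lsn_new(&ip, false) == 42); /* log forced up to the last LSN */
}
```

A zero return corresponds to the "if (lsn)" test in xfs_file_fsync() skipping
the log force entirely, which is the lost-update case in the reproducer.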

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Hi all,

This is the most obvious fix for the problem described above. The
tradeoff is performance when a log force turns out to be unnecessary,
because the log force unconditionally flushes the workqueue before it
determines that no force is needed. I can demonstrate this with a test
case that runs 'xfs_io -f -c fsync <file>' in a 10k iteration loop with
and without a moderate fs_mark load running in the background. The
following values are the time to complete such an fsync loop:

		4.0.0-rc1+	4.0.0-rc1+ w/patch
loop		~39s		~39s
loop+fs_mark	~41s		~1m56s

There are probably a couple different ways to handle this. We could log
the inode in the bmap cases in order to preserve the pincount check.
Another option is to add a check down in xlog_cil_push_now() to avoid
the wq task wait when the push sequence has already been pushed beyond
push_seq. I'm testing something like the latter at the moment...
thoughts?
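
The second option can be pictured with a small userspace model. This is only
a sketch of the idea, not the real xlog_cil_push_now(): the cil_model_* names,
the pushed_seq field and the flush_waits counter are all hypothetical, and
"already pushed" is assumed here to mean push_seq <= pushed_seq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the CIL state relevant to a push request. */
struct cil_model {
	uint64_t     pushed_seq;  /* highest sequence already pushed */
	unsigned int flush_waits; /* simulated waits on the push workqueue */
};

/*
 * Request a push up to push_seq. Returns true if a push was actually issued.
 * With early_check set, a request for an already-pushed sequence returns
 * before paying for the workqueue wait.
 */
bool cil_model_push_now(struct cil_model *cil, uint64_t push_seq,
			bool early_check)
{
	if (early_check && push_seq <= cil->pushed_seq)
		return false;           /* nothing to do, skip the wait */

	cil->flush_waits++;             /* stands in for waiting on the wq task */

	if (push_seq <= cil->pushed_seq)
		return false;           /* found out only after waiting */

	cil->pushed_seq = push_seq;
	return true;
}

/* Usage: the same redundant request with and without the early check. */
void cil_model_demo(void)
{
	struct cil_model cil = { .pushed_seq = 5, .flush_waits = 0 };

	assert(!cil_model_push_now(&cil, 3, false));
	assert(cil.flush_waits == 1);   /* waited for nothing */
	assert(!cil_model_push_now(&cil, 3, true));
	assert(cil.flush_waits == 1);   /* no additional wait */
}
```

In this model the redundant request costs one wait without the early check
and none with it, which is the cost the loop+fs_mark numbers above appear to
be measuring.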

Brian

 fs/xfs/xfs_file.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 3a5d305..2fe5421 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -233,15 +233,20 @@ xfs_file_fsync(
 	}
 
 	/*
-	 * All metadata updates are logged, which means that we just have
-	 * to flush the log up to the latest LSN that touched the inode.
+	 * All we need to do here is force pending inode updates into the log.
+	 * All metadata updates are logged, which means that we just have to
+	 * flush the log up to the latest LSN that touched the inode.
+	 *
+	 * Note that we cannot trigger the log force based on whether the inode
+	 * is pinned because some bmapbt updates can log bmap buffers without
+	 * having to dirty the core inode. The inode is never pinned in this
+	 * case, but ili_last_lsn is updated since the inode is always joined to
+	 * the transaction...
 	 */
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
-	if (xfs_ipincount(ip)) {
-		if (!datasync ||
-		    (ip->i_itemp->ili_fields & ~XFS_ILOG_TIMESTAMP))
-			lsn = ip->i_itemp->ili_last_lsn;
-	}
+	if (ip->i_itemp &&
+	    (!datasync || (ip->i_itemp->ili_fields & ~XFS_ILOG_TIMESTAMP)))
+		lsn = ip->i_itemp->ili_last_lsn;
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
 	if (lsn)
-- 
1.9.3

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* Re: [PATCH] xfs: don't trigger fsync log force based on inode pin count
  2015-04-22 14:37 [PATCH] xfs: don't trigger fsync log force based on inode pin count Brian Foster
@ 2015-04-22 16:15 ` Christoph Hellwig
  2015-04-22 17:13   ` Brian Foster
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2015-04-22 16:15 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Wed, Apr 22, 2015 at 10:37:46AM -0400, Brian Foster wrote:
> There are probably a couple different ways to handle this. We could log
> the inode in the bmap cases in order to preserve the pincount check.

I'd favor that.  For one, performance should be better; second, we really
need to dirty the inode anyway for v5 file systems as that's the
mechanism used to increment di_changecount.

* Re: [PATCH] xfs: don't trigger fsync log force based on inode pin count
  2015-04-22 16:15 ` Christoph Hellwig
@ 2015-04-22 17:13   ` Brian Foster
  2015-04-22 21:18     ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Foster @ 2015-04-22 17:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Apr 22, 2015 at 09:15:09AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 22, 2015 at 10:37:46AM -0400, Brian Foster wrote:
> > There are probably a couple different ways to handle this. We could log
> > the inode in the bmap cases in order to preserve the pincount check.
> 
> I'd favor that.  For one, performance should be better; second, we really
> need to dirty the inode anyway for v5 file systems as that's the
> mechanism used to increment di_changecount.
> 

Yeah, that's a good point. I noticed that in xfs_trans_log_inode() when
debugging but didn't think much about it since I reproduced on v4. I can
get performance back with the aforementioned cil push fix, but if the
path forward is behavior where the inode is going to be logged anyways,
that is a decent reason to emulate such behavior in the pre-v5 case.

Note that we have the following in xfs_bmapi_write():

        if (bma.logflags)
                xfs_trans_log_inode(tp, ip, bma.logflags);

... and some other places. I don't reproduce this particular problem on
v5, so something else might be logging the inode here. That strikes me
as not what we want with regard to the change count, however..

Brian

* Re: [PATCH] xfs: don't trigger fsync log force based on inode pin count
  2015-04-22 17:13   ` Brian Foster
@ 2015-04-22 21:18     ` Dave Chinner
  2015-04-22 22:02       ` Brian Foster
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2015-04-22 21:18 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, xfs

On Wed, Apr 22, 2015 at 01:13:23PM -0400, Brian Foster wrote:
> On Wed, Apr 22, 2015 at 09:15:09AM -0700, Christoph Hellwig wrote:
> > On Wed, Apr 22, 2015 at 10:37:46AM -0400, Brian Foster wrote:
> > > There are probably a couple different ways to handle this. We could log
> > > the inode in the bmap cases in order to preserve the pincount check.
> > 
> > I'd favor that.  For one, performance should be better; second, we really
> > need to dirty the inode anyway for v5 file systems as that's the
> > mechanism used to increment di_changecount.
> > 
> 
> Yeah, that's a good point. I noticed that in xfs_trans_log_inode() when
> debugging but didn't think much about it since I reproduced on v4. I can
> get performance back with the aforementioned cil push fix, but if the
> path forward is behavior where the inode is going to be logged anyways,
> that is a decent reason to emulate such behavior in the pre-v5 case.
> 
> Note that we have the following in xfs_bmapi_write():
> 
>         if (bma.logflags)
>                 xfs_trans_log_inode(tp, ip, bma.logflags);

Which, essentially, only contains flags when we do an extent-to-btree
conversion or vice versa, so we effectively never log the inode on
unwritten extent conversions unless the size changes.

I agree with Christoph - we should just unconditionally log the
inode in xfs_bmap_add_extent_unwritten_real() as it's a user visible
data change we need to bump di_changecount for. i.e. an NFS client can
see the unwritten data after a data write has started and changed the
timestamps/write count, but then the IO completion makes the data
visible and hence the change count needs to be bumped again...
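
A rough userspace model of this suggestion follows. It is a sketch under
stated assumptions, not the kernel implementation: the conv_model_* names and
flag values are hypothetical, and logging the inode is simplified to "pin it
and bump the change count" in one step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical log flags; the values carry no meaning outside this model. */
#define CONV_MODEL_ILOG_CORE 0x1u /* log the inode core */
#define CONV_MODEL_ILOG_FORK 0x2u /* fork format changed (extents <-> btree) */

struct conv_model_inode {
	int      pincount;    /* non-zero once the inode has been logged */
	uint64_t changecount; /* plays the role of di_changecount */
};

/* Simplified stand-in for logging the inode in a transaction. */
void conv_model_log_inode(struct conv_model_inode *ip, unsigned int flags)
{
	(void)flags;
	ip->pincount++;
	ip->changecount++;
}

/*
 * Model of an unwritten extent conversion. Without always_log_core the inode
 * is only logged when the fork format changes; with it, every conversion
 * logs the inode core.
 */
void conv_model_convert_unwritten(struct conv_model_inode *ip,
				  bool format_changed, bool always_log_core)
{
	unsigned int logflags = format_changed ? CONV_MODEL_ILOG_FORK : 0;

	if (always_log_core)
		logflags |= CONV_MODEL_ILOG_CORE;
	if (logflags)
		conv_model_log_inode(ip, logflags);
}

/* Usage: a conversion that only touches bmapbt buffers. */
void conv_model_demo(void)
{
	struct conv_model_inode a = { 0, 0 }, b = { 0, 0 };

	conv_model_convert_unwritten(&a, false, false);
	assert(a.pincount == 0 && a.changecount == 0); /* invisible to a pin check */

	conv_model_convert_unwritten(&b, false, true);
	assert(b.pincount == 1 && b.changecount == 1); /* pinned, count bumped */
}
```

With the inode always logged, a pin-count based check in fsync would see the
conversion, so the existing check could stay in place.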

> ... and some other places. I don't reproduce this particular problem on
> v5, so something else might be logging the inode here. That strikes me
> as not what we want with regard to the change count, however..

Larger inode size with v5, so it's entirely possible that v5 is not
triggering the problem on this test because the extent list is
remaining in local format and so any updates are logging the inode
directly....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH] xfs: don't trigger fsync log force based on inode pin count
  2015-04-22 21:18     ` Dave Chinner
@ 2015-04-22 22:02       ` Brian Foster
  2015-04-22 22:06         ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Foster @ 2015-04-22 22:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Thu, Apr 23, 2015 at 07:18:45AM +1000, Dave Chinner wrote:
> On Wed, Apr 22, 2015 at 01:13:23PM -0400, Brian Foster wrote:
> > On Wed, Apr 22, 2015 at 09:15:09AM -0700, Christoph Hellwig wrote:
> > > On Wed, Apr 22, 2015 at 10:37:46AM -0400, Brian Foster wrote:
> > > > There are probably a couple different ways to handle this. We could log
> > > > the inode in the bmap cases in order to preserve the pincount check.
> > > 
> > > I'd favor that.  For one, performance should be better; second, we really
> > > need to dirty the inode anyway for v5 file systems as that's the
> > > mechanism used to increment di_changecount.
> > > 
> > 
> > Yeah, that's a good point. I noticed that in xfs_trans_log_inode() when
> > debugging but didn't think much about it since I reproduced on v4. I can
> > get performance back with the aforementioned cil push fix, but if the
> > path forward is behavior where the inode is going to be logged anyways,
> > that is a decent reason to emulate such behavior in the pre-v5 case.
> > 
> > Note that we have the following in xfs_bmapi_write():
> > 
> >         if (bma.logflags)
> >                 xfs_trans_log_inode(tp, ip, bma.logflags);
> 
> Which, essentially, only contains flags when we do an extent-to-btree
> conversion or vice versa, so we effectively never log the inode on
> unwritten extent conversions unless the size changes.
> 
> I agree with Christoph - we should just unconditionally log the
> inode in xfs_bmap_add_extent_unwritten_real() as it's a user visible
> data change we need to bump di_changecount for. i.e. an NFS client can
> see the unwritten data after a data write has started and changed the
> timestamps/write count, but then the IO completion makes the data
> visible and hence the change count needs to be bumped again...
> 

Ok, that works for me. I'll give it a shot.

> > ... and some other places. I don't reproduce this particular problem on
> > v5, so something else might be logging the inode here. That strikes me
> > as not what we want with regard to the change count, however..
> 
> Larger inode size with v5, so it's entirely possible that v5 is not
> triggering the problem on this test because the extent list is
> remaining in local format and so any updates are logging the inode
> directly....
> 

That was what I thought at first but I bumped the extent count a couple
times and still couldn't reproduce. I was curious enough to track it
down and it is actually the time update again. For whatever reason, I
think the crc mechanism is throwing the timing off and just hiding the
problem again. E.g., no-op xfs_vn_time_update() and the problem
reproduces on v5 as well.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

* Re: [PATCH] xfs: don't trigger fsync log force based on inode pin count
  2015-04-22 22:02       ` Brian Foster
@ 2015-04-22 22:06         ` Christoph Hellwig
  2015-04-22 22:10           ` Brian Foster
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2015-04-22 22:06 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, xfs

On Wed, Apr 22, 2015 at 06:02:44PM -0400, Brian Foster wrote:
> That was what I thought at first but I bumped the extent count a couple
> times and still couldn't reproduce. I was curious enough to track it
> down and it is actually the time update again. For whatever reason, I
> think the crc mechanism is throwing the timing off and just hiding the
> problem again. E.g., no-op xfs_vn_time_update() and the problem
> reproduces on v5 as well.

Actually, it's the changecount again.  If MS_I_VERSION is set
the VFS will always call into ->xfs_vn_time_update.

* Re: [PATCH] xfs: don't trigger fsync log force based on inode pin count
  2015-04-22 22:06         ` Christoph Hellwig
@ 2015-04-22 22:10           ` Brian Foster
  0 siblings, 0 replies; 7+ messages in thread
From: Brian Foster @ 2015-04-22 22:10 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Apr 22, 2015 at 03:06:11PM -0700, Christoph Hellwig wrote:
> On Wed, Apr 22, 2015 at 06:02:44PM -0400, Brian Foster wrote:
> > That was what I thought at first but I bumped the extent count a couple
> > times and still couldn't reproduce. I was curious enough to track it
> > down and it is actually the time update again. For whatever reason, I
> > think the crc mechanism is throwing the timing off and just hiding the
> > problem again. E.g., no-op xfs_vn_time_update() and the problem
> > reproduces on v5 as well.
> 
> Actually, it's the changecount again.  If MS_I_VERSION is set
> the VFS will always call into ->xfs_vn_time_update.

Ah, I see. Yeah, that explains the time update then...

Brian
