* [PATCH] xfs: fix incorrect log_flushed on fsync
@ 2017-08-30 13:38 Amir Goldstein
  2017-08-30 13:46 ` Christoph Hellwig
                   ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: Amir Goldstein @ 2017-08-30 13:38 UTC (permalink / raw)
  To: Darrick J . Wong, Christoph Hellwig, Dave Chinner
  Cc: linux-xfs, linux-fsdevel, Josef Bacik, stable

When calling into _xfs_log_force{,_lsn}() with a pointer
to a log_flushed variable, log_flushed will be set to 1 if:
1. xlog_sync() is called to flush the active log buffer
AND/OR
2. xlog_wait() is called to wait on a syncing log buffer

xfs_file_fsync() checks the value of log_flushed after
_xfs_log_force_lsn() call to optimize away an explicit
PREFLUSH request to the data block device after writing
out all the file's pages to disk.

This optimization is incorrect in the following sequence of events:

 Task A                    Task B
 -------------------------------------------------------
 xfs_file_fsync()
   _xfs_log_force_lsn()
     xlog_sync()
        [submit PREFLUSH] 
                           xfs_file_fsync()
                             file_write_and_wait_range()
                               [submit WRITE X]
                               [endio  WRITE X]
                             _xfs_log_force_lsn()
                               xlog_wait()
        [endio  PREFLUSH]

Write X is not guaranteed to be on persistent storage
when the PREFLUSH request is completed, because write X was
submitted after the PREFLUSH request, but xfs_file_fsync() of
task A will be notified of log_flushed=1 and will skip the
explicit flush.

If the system crashes after fsync of task A, write X may not be
present on disk after reboot.

This bug was discovered and demonstrated using Josef Bacik's
dm-log-writes target, which can be used to record block io operations
and then replay a subset of these operations onto the target device.
The test goes something like this:
- Use fsx to execute ops on a file and record the ops on the log device
- Every now and then fsync the file, store the md5 of the file and mark
  the location in the log
- Then, for each mark, replay the log onto the device up to that mark,
  mount the fs and compare the md5 of the file to the stored value

Cc: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <jbacik@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Christoph, Dave,

It's hard to believe, but I think the reported bug has been around
since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did
not try to test old kernels.

I tripped over this alleged bug a week ago when I started testing
crash consistency with dm-log-writes:
https://marc.info/?l=linux-fsdevel&m=150350333427447&w=2
Since then, I have been trying to prove that dm-log-writes was
falsely accusing XFS, but it turned out that it wasn't.

The reproducer is an xfstest, which I am going to re-post soon
and is also available here:
https://github.com/amir73il/xfstests/commits/dm-log-writes
On a system with a spinning disk it reproduces the issue with
close to 100% probability within less than a minute.

Anyway, with Darrick going on vacation, let's expedite review
of this fix and try to merge some fix for v4.14-rc1 (although this
bug fix may be eligible for later rc's as well).

The change obviously changes behavior for the worse for workloads
of intensively parallel fsyncing tasks, but I don't see another
quick way to fix correctness without hurting this workload.
It might be worth getting feedback from file_write_and_wait_range()
on whether any dirty page writes were issued at all.

I started running a full xfstests cycle, but wanted to send this one
out early for comments. I can test other patches if needed.

Cheers,
Amir.

 fs/xfs/xfs_log.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 0053bcf..ec159c5 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3342,8 +3342,6 @@ _xfs_log_force(
 		 */
 		if (iclog->ic_state & XLOG_STATE_IOERROR)
 			return -EIO;
-		if (log_flushed)
-			*log_flushed = 1;
 	} else {
 
 no_sleep:
@@ -3447,8 +3445,6 @@ _xfs_log_force_lsn(
 
 				xlog_wait(&iclog->ic_prev->ic_write_wait,
 							&log->l_icloglock);
-				if (log_flushed)
-					*log_flushed = 1;
 				already_slept = 1;
 				goto try_again;
 			}
@@ -3482,9 +3478,6 @@ _xfs_log_force_lsn(
 			 */
 			if (iclog->ic_state & XLOG_STATE_IOERROR)
 				return -EIO;
-
-			if (log_flushed)
-				*log_flushed = 1;
 		} else {		/* just return */
 			spin_unlock(&log->l_icloglock);
 		}
-- 
2.7.4


* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-08-30 13:38 [PATCH] xfs: fix incorrect log_flushed on fsync Amir Goldstein
@ 2017-08-30 13:46 ` Christoph Hellwig
  2017-08-30 14:12   ` Amir Goldstein
  2017-08-30 17:01 ` Darrick J. Wong
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2017-08-30 13:46 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Darrick J . Wong, Christoph Hellwig, Dave Chinner, linux-xfs,
	linux-fsdevel, Josef Bacik, stable

Oops.

Yes, except for the case of calling xlog_state_release_iclog
ourselves we can't claim that the cache has been flushed.  In
fact I'm not even sure that is safe, if ic_refcnt makes
xlog_state_release_iclog exit early, so we might have to
pass additional information around there.


* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-08-30 13:46 ` Christoph Hellwig
@ 2017-08-30 14:12   ` Amir Goldstein
  2017-08-30 14:21     ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Amir Goldstein @ 2017-08-30 14:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J . Wong, Dave Chinner, linux-xfs, linux-fsdevel,
	Josef Bacik, stable

On Wed, Aug 30, 2017 at 4:46 PM, Christoph Hellwig <hch@lst.de> wrote:
> Oops.
>
> Yes, except for the case of calling xlog_state_release_iclog
> ourselves we can't claim that the cache has been flushed.  In
> fact I'm not even sure that is safe, if ic_refcnt makes
> xlog_state_release_iclog exit early, so we might have to
> pass additional information around there.

With this change, for a workload of multiple fsyncing tasks, the
majority of fsyncs won't be optimized, so the remaining code only
really optimizes the case of a single process doing seldom fsyncs.

I wonder if it's worth keeping the optimization around at all?


* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-08-30 14:12   ` Amir Goldstein
@ 2017-08-30 14:21     ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2017-08-30 14:21 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christoph Hellwig, Darrick J . Wong, Dave Chinner, linux-xfs,
	linux-fsdevel, Josef Bacik, stable

On Wed, Aug 30, 2017 at 05:12:10PM +0300, Amir Goldstein wrote:
> On Wed, Aug 30, 2017 at 4:46 PM, Christoph Hellwig <hch@lst.de> wrote:
> > Oops.
> >
> > Yes, except for the case of calling xlog_state_release_iclog
> > ourselves we can't claim that the cache has been flushed.  In
> > fact I'm not even sure that is safe, if ic_refcnt makes
> > xlog_state_release_iclog exit early, so we might have to
> > pass additional information around there.
> 
> With this change, for a workload of multiple fsyncing tasks, the
> majority of fsyncs won't be optimized, so the remaining code only
> really optimizes the case of a single process doing seldom fsyncs.
> 
> I wonder if it's worth keeping the optimization around at all?

I think it is.  And we should get back to optimizing the multi-threaded
cases soon using something like a sequence counter that checks if we
issued a log sync after finishing the data writeback.


* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-08-30 13:38 [PATCH] xfs: fix incorrect log_flushed on fsync Amir Goldstein
  2017-08-30 13:46 ` Christoph Hellwig
@ 2017-08-30 17:01 ` Darrick J. Wong
  2017-08-31 13:47 ` Christoph Hellwig
  2017-09-15 12:40 ` Amir Goldstein
  3 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2017-08-30 17:01 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christoph Hellwig, Dave Chinner, linux-xfs, linux-fsdevel,
	Josef Bacik, stable

On Wed, Aug 30, 2017 at 04:38:22PM +0300, Amir Goldstein wrote:
> When calling into _xfs_log_force{,_lsn}() with a pointer
> to a log_flushed variable, log_flushed will be set to 1 if:
> 1. xlog_sync() is called to flush the active log buffer
> AND/OR
> 2. xlog_wait() is called to wait on a syncing log buffer
> 
> xfs_file_fsync() checks the value of log_flushed after
> _xfs_log_force_lsn() call to optimize away an explicit
> PREFLUSH request to the data block device after writing
> out all the file's pages to disk.
> 
> This optimization is incorrect in the following sequence of events:
> 
>  Task A                    Task B
>  -------------------------------------------------------
>  xfs_file_fsync()
>    _xfs_log_force_lsn()
>      xlog_sync()
>         [submit PREFLUSH] 
>                            xfs_file_fsync()
>                              file_write_and_wait_range()
>                                [submit WRITE X]
>                                [endio  WRITE X]
>                              _xfs_log_force_lsn()
>                                xlog_wait()
>         [endio  PREFLUSH]
> 
> Write X is not guaranteed to be on persistent storage
> when the PREFLUSH request is completed, because write X was
> submitted after the PREFLUSH request, but xfs_file_fsync() of
> task A will be notified of log_flushed=1 and will skip the
> explicit flush.

Yikes.

> If the system crashes after fsync of task A, write X may not be
> present on disk after reboot.
> 
> This bug was discovered and demonstrated using Josef Bacik's
> dm-log-writes target, which can be used to record block io operations
> and then replay a subset of these operations onto the target device.
> The test goes something like this:
> - Use fsx to execute ops on a file and record the ops on the log device
> - Every now and then fsync the file, store the md5 of the file and mark
>   the location in the log
> - Then, for each mark, replay the log onto the device up to that mark,
>   mount the fs and compare the md5 of the file to the stored value
> 
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Josef Bacik <jbacik@fb.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>

Will test, and if I have time try to set up this dm-log-writes thing for
my own edification;

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

> ---
> 
> Christoph, Dave,
> 
> It's hard to believe, but I think the reported bug has been around
> since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did
> not try to test old kernels.
> 
> I tripped over this alleged bug a week ago when I started testing
> crash consistency with dm-log-writes:
> https://marc.info/?l=linux-fsdevel&m=150350333427447&w=2
> Since then, I have been trying to prove that dm-log-writes was
> falsely accusing XFS, but it turned out that it wasn't.
> 
> The reproducer is an xfstest, which I am going to re-post soon
> and is also available here:
> https://github.com/amir73il/xfstests/commits/dm-log-writes
> On a system with a spinning disk it reproduces the issue with
> close to 100% probability within less than a minute.
> 
> Anyway, with Darrick going on vacation, let's expedite review
> of this fix and try to merge some fix for v4.14-rc1 (although this
> bug fix may be eligible for later rc's as well).

Yes.

> The change obviously changes behavior for the worse for workloads
> of intensively parallel fsyncing tasks, but I don't see another
> quick way to fix correctness without hurting this workload.
> It might be worth getting feedback from file_write_and_wait_range()
> on whether any dirty page writes were issued at all.

Agreed, but safety first. :)

--D

> I started running a full xfstests cycle, but wanted to send this one
> out early for comments. I can test other patches if needed.
> 
> Cheers,
> Amir.
> 
>  fs/xfs/xfs_log.c | 7 -------
>  1 file changed, 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 0053bcf..ec159c5 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -3342,8 +3342,6 @@ _xfs_log_force(
>  		 */
>  		if (iclog->ic_state & XLOG_STATE_IOERROR)
>  			return -EIO;
> -		if (log_flushed)
> -			*log_flushed = 1;
>  	} else {
>  
>  no_sleep:
> @@ -3447,8 +3445,6 @@ _xfs_log_force_lsn(
>  
>  				xlog_wait(&iclog->ic_prev->ic_write_wait,
>  							&log->l_icloglock);
> -				if (log_flushed)
> -					*log_flushed = 1;
>  				already_slept = 1;
>  				goto try_again;
>  			}
> @@ -3482,9 +3478,6 @@ _xfs_log_force_lsn(
>  			 */
>  			if (iclog->ic_state & XLOG_STATE_IOERROR)
>  				return -EIO;
> -
> -			if (log_flushed)
> -				*log_flushed = 1;
>  		} else {		/* just return */
>  			spin_unlock(&log->l_icloglock);
>  		}
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-08-30 13:38 [PATCH] xfs: fix incorrect log_flushed on fsync Amir Goldstein
  2017-08-30 13:46 ` Christoph Hellwig
  2017-08-30 17:01 ` Darrick J. Wong
@ 2017-08-31 13:47 ` Christoph Hellwig
  2017-08-31 14:37   ` Amir Goldstein
  2017-09-15 12:40 ` Amir Goldstein
  3 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2017-08-31 13:47 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Darrick J . Wong, Christoph Hellwig, Dave Chinner, linux-xfs,
	linux-fsdevel, Josef Bacik, stable

I think something like the following patch (totally untested,
just an idea) should fix the issue, right?

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c4893e226fd8..555fcae9a18f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -135,7 +135,7 @@ xfs_file_fsync(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	int			error = 0;
-	int			log_flushed = 0;
+	unsigned int		flushseq;
 	xfs_lsn_t		lsn = 0;
 
 	trace_xfs_file_fsync(ip);
@@ -143,6 +143,7 @@ xfs_file_fsync(
 	error = file_write_and_wait_range(file, start, end);
 	if (error)
 		return error;
+	flushseq = READ_ONCE(mp->m_flushseq);
 
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
@@ -181,7 +182,7 @@ xfs_file_fsync(
 	}
 
 	if (lsn) {
-		error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
+		error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, NULL);
 		ip->i_itemp->ili_fsync_fields = 0;
 	}
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
@@ -193,8 +194,9 @@ xfs_file_fsync(
 	 * an already allocated file and thus do not have any metadata to
 	 * commit.
 	 */
-	if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) &&
-	    mp->m_logdev_targp == mp->m_ddev_targp)
+	if (!XFS_IS_REALTIME_INODE(ip) &&
+	    mp->m_logdev_targp == mp->m_ddev_targp &&
+	    flushseq == READ_ONCE(mp->m_flushseq))
 		xfs_blkdev_issue_flush(mp->m_ddev_targp);
 
 	return error;
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index bcb2f860e508..3c0cbb98581e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2922,6 +2922,8 @@ xlog_state_done_syncing(
 		iclog->ic_state = XLOG_STATE_DONE_SYNC;
 	}
 
+	log->l_mp->m_flushseq++;
+
 	/*
 	 * Someone could be sleeping prior to writing out the next
 	 * iclog buffer, we wake them all, one will get to do the
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index e0792d036be2..556e01a0175e 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -79,6 +79,7 @@ typedef struct xfs_mount {
 	struct percpu_counter	m_icount;	/* allocated inodes counter */
 	struct percpu_counter	m_ifree;	/* free inodes counter */
 	struct percpu_counter	m_fdblocks;	/* free block counter */
+	unsigned int		m_flushseq;
 
 	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
 	char			*m_fsname;	/* filesystem name */


* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-08-31 13:47 ` Christoph Hellwig
@ 2017-08-31 14:37   ` Amir Goldstein
  2017-08-31 16:39     ` Brian Foster
  2017-09-01  9:47     ` Christoph Hellwig
  0 siblings, 2 replies; 30+ messages in thread
From: Amir Goldstein @ 2017-08-31 14:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J . Wong, Dave Chinner, linux-xfs, linux-fsdevel,
	Josef Bacik, stable [v4.9]

On Thu, Aug 31, 2017 at 4:47 PM, Christoph Hellwig <hch@lst.de> wrote:
> I think something like the following patch (totally untested,
> just an idea) should fix the issue, right?

I think that is not enough.

>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c4893e226fd8..555fcae9a18f 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -135,7 +135,7 @@ xfs_file_fsync(
>         struct xfs_inode        *ip = XFS_I(inode);
>         struct xfs_mount        *mp = ip->i_mount;
>         int                     error = 0;
> -       int                     log_flushed = 0;
> +       unsigned int            flushseq;
>         xfs_lsn_t               lsn = 0;
>
>         trace_xfs_file_fsync(ip);
> @@ -143,6 +143,7 @@ xfs_file_fsync(
>         error = file_write_and_wait_range(file, start, end);
>         if (error)
>                 return error;
> +       flushseq = READ_ONCE(mp->m_flushseq);

Imagine that a flush was submitted and completed before
file_write_and_wait_range(), but m_flushseq was incremented after.
Maybe here READ m_flush_submitted_seq...

>
>         if (XFS_FORCED_SHUTDOWN(mp))
>                 return -EIO;
> @@ -181,7 +182,7 @@ xfs_file_fsync(
>         }
>
>         if (lsn) {
> -               error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
> +               error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, NULL);
>                 ip->i_itemp->ili_fsync_fields = 0;
>         }
>         xfs_iunlock(ip, XFS_ILOCK_SHARED);
> @@ -193,8 +194,9 @@ xfs_file_fsync(
>          * an already allocated file and thus do not have any metadata to
>          * commit.
>          */
> -       if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) &&
> -           mp->m_logdev_targp == mp->m_ddev_targp)
> +       if (!XFS_IS_REALTIME_INODE(ip) &&
> +           mp->m_logdev_targp == mp->m_ddev_targp &&
> +           flushseq == READ_ONCE(mp->m_flushseq))

... and here READ m_flush_completed_seq:
if (m_flush_completed_seq > m_flush_submitted_seq)
it is safe to skip issuing the flush.
Then probably READ_ONCE() is not enough and we need smp_rmb()?

>                 xfs_blkdev_issue_flush(mp->m_ddev_targp);
>
>         return error;
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index bcb2f860e508..3c0cbb98581e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2922,6 +2922,8 @@ xlog_state_done_syncing(
>                 iclog->ic_state = XLOG_STATE_DONE_SYNC;
>         }
>
> +       log->l_mp->m_flushseq++;

I reckon this should use WRITE_ONCE() or smp_wmb(),
and then also increment m_flush_submitted_seq *before*
issuing the flush.

If the state machine does not allow more than a single flush
to be in flight (?) then the 2 seq counters could be reduced
to a single seq counter, with (m_flushseq % 2) == 1 for submitted
and (m_flushseq % 2) == 0 for completed, and the test in fsync
would be (flushseq % 2) == (READ_ONCE(mp->m_flushseq) % 2).

... maybe?

Amir.


* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-08-31 14:37   ` Amir Goldstein
@ 2017-08-31 16:39     ` Brian Foster
  2017-08-31 19:20       ` Amir Goldstein
  2017-09-01  9:47     ` Christoph Hellwig
  1 sibling, 1 reply; 30+ messages in thread
From: Brian Foster @ 2017-08-31 16:39 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christoph Hellwig, Darrick J . Wong, Dave Chinner, linux-xfs,
	linux-fsdevel, Josef Bacik, stable [v4.9]

On Thu, Aug 31, 2017 at 05:37:06PM +0300, Amir Goldstein wrote:
> On Thu, Aug 31, 2017 at 4:47 PM, Christoph Hellwig <hch@lst.de> wrote:
> > I think something like the following patch (totally untested,
> > just an idea) should fix the issue, right?
> 
> I think that is not enough.
> 
> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index c4893e226fd8..555fcae9a18f 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -135,7 +135,7 @@ xfs_file_fsync(
> >         struct xfs_inode        *ip = XFS_I(inode);
> >         struct xfs_mount        *mp = ip->i_mount;
> >         int                     error = 0;
> > -       int                     log_flushed = 0;
> > +       unsigned int            flushseq;
> >         xfs_lsn_t               lsn = 0;
> >
> >         trace_xfs_file_fsync(ip);
> > @@ -143,6 +143,7 @@ xfs_file_fsync(
> >         error = file_write_and_wait_range(file, start, end);
> >         if (error)
> >                 return error;
> > +       flushseq = READ_ONCE(mp->m_flushseq);
> 
> Imagine that a flush was submitted and completed before
> file_write_and_wait_range(), but m_flushseq was incremented after.
> Maybe here READ m_flush_submitted_seq...
> 
> >
> >         if (XFS_FORCED_SHUTDOWN(mp))
> >                 return -EIO;
> > @@ -181,7 +182,7 @@ xfs_file_fsync(
> >         }
> >
> >         if (lsn) {
> > -               error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
> > +               error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, NULL);
> >                 ip->i_itemp->ili_fsync_fields = 0;
> >         }
> >         xfs_iunlock(ip, XFS_ILOCK_SHARED);
> > @@ -193,8 +194,9 @@ xfs_file_fsync(
> >          * an already allocated file and thus do not have any metadata to
> >          * commit.
> >          */
> > -       if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) &&
> > -           mp->m_logdev_targp == mp->m_ddev_targp)
> > +       if (!XFS_IS_REALTIME_INODE(ip) &&
> > +           mp->m_logdev_targp == mp->m_ddev_targp &&
> > +           flushseq == READ_ONCE(mp->m_flushseq))
> 
> ... and here READ m_flush_completed_seq:
> if (m_flush_completed_seq > m_flush_submitted_seq)
> it is safe to skip issuing the flush.
> Then probably READ_ONCE() is not enough and we need smp_rmb()?
> 

IIUC, basically we need to guarantee that a flush submits after
file_write_and_wait() and completes before we return. If we do
something like the above, I wonder if that means we could wait for
submit == complete if we observe that submit was bumped since it was
initially sampled above (rather than issue another flush, which
would be necessary if a submit hadn't occurred)...?

If we do end up with something like this, I think it's a bit cleaner to
stuff the counter(s) in the xfs_buftarg structure and update them from
the generic buffer submit/completion code based on XBF_FLUSH. FWIW, I
suspect we could also update said counter(s) from
xfs_blkdev_issue_flush().

Brian

> >                 xfs_blkdev_issue_flush(mp->m_ddev_targp);
> >
> >         return error;
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index bcb2f860e508..3c0cbb98581e 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -2922,6 +2922,8 @@ xlog_state_done_syncing(
> >                 iclog->ic_state = XLOG_STATE_DONE_SYNC;
> >         }
> >
> > +       log->l_mp->m_flushseq++;
> 
> I reckon this should use WRITE_ONCE() or smp_wmb(),
> and then also increment m_flush_submitted_seq *before*
> issuing the flush.
> 
> If the state machine does not allow more than a single flush
> to be in flight (?) then the 2 seq counters could be reduced
> to a single seq counter, with (m_flushseq % 2) == 1 for submitted
> and (m_flushseq % 2) == 0 for completed, and the test in fsync
> would be (flushseq % 2) == (READ_ONCE(mp->m_flushseq) % 2).
> 
> ... maybe?
> 
> Amir.


* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-08-31 16:39     ` Brian Foster
@ 2017-08-31 19:20       ` Amir Goldstein
  2017-08-31 20:10         ` Brian Foster
  2017-09-01  9:52         ` Christoph Hellwig
  0 siblings, 2 replies; 30+ messages in thread
From: Amir Goldstein @ 2017-08-31 19:20 UTC (permalink / raw)
  To: Brian Foster
  Cc: Christoph Hellwig, Darrick J . Wong, Dave Chinner, linux-xfs,
	linux-fsdevel, Josef Bacik, stable [v4.9]

On Thu, Aug 31, 2017 at 7:39 PM, Brian Foster <bfoster@redhat.com> wrote:
> On Thu, Aug 31, 2017 at 05:37:06PM +0300, Amir Goldstein wrote:
>> On Thu, Aug 31, 2017 at 4:47 PM, Christoph Hellwig <hch@lst.de> wrote:
>> > I think something like the following patch (totally untested,
>> > just an idea) should fix the issue, right?
>>
>> I think that is not enough.
>>
>> >
>> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> > index c4893e226fd8..555fcae9a18f 100644
>> > --- a/fs/xfs/xfs_file.c
>> > +++ b/fs/xfs/xfs_file.c
>> > @@ -135,7 +135,7 @@ xfs_file_fsync(
>> >         struct xfs_inode        *ip = XFS_I(inode);
>> >         struct xfs_mount        *mp = ip->i_mount;
>> >         int                     error = 0;
>> > -       int                     log_flushed = 0;
>> > +       unsigned int            flushseq;
>> >         xfs_lsn_t               lsn = 0;
>> >
>> >         trace_xfs_file_fsync(ip);
>> > @@ -143,6 +143,7 @@ xfs_file_fsync(
>> >         error = file_write_and_wait_range(file, start, end);
>> >         if (error)
>> >                 return error;
>> > +       flushseq = READ_ONCE(mp->m_flushseq);
>>
>> Imagine that a flush was submitted and completed before
>> file_write_and_wait_range(), but m_flushseq was incremented after.
>> Maybe here READ m_flush_submitted_seq...
>>
>> >
>> >         if (XFS_FORCED_SHUTDOWN(mp))
>> >                 return -EIO;
>> > @@ -181,7 +182,7 @@ xfs_file_fsync(
>> >         }
>> >
>> >         if (lsn) {
>> > -               error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
>> > +               error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, NULL);
>> >                 ip->i_itemp->ili_fsync_fields = 0;
>> >         }
>> >         xfs_iunlock(ip, XFS_ILOCK_SHARED);
>> > @@ -193,8 +194,9 @@ xfs_file_fsync(
>> >          * an already allocated file and thus do not have any metadata to
>> >          * commit.
>> >          */
>> > -       if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) &&
>> > -           mp->m_logdev_targp == mp->m_ddev_targp)
>> > +       if (!XFS_IS_REALTIME_INODE(ip) &&
>> > +           mp->m_logdev_targp == mp->m_ddev_targp &&
>> > +           flushseq == READ_ONCE(mp->m_flushseq))
>>
>> ... and here READ m_flush_completed_seq:
>> if (m_flush_completed_seq > m_flush_submitted_seq)
>> it is safe to skip issuing the flush.
>> Then probably READ_ONCE() is not enough and we need smp_rmb()?
>>
>
> IIUC, basically we need to guarantee that a flush submits after
> file_write_and_wait() and completes before we return.

Yeah, unless we check whether file_write_and_wait() submitted
anything at all.

> If we do something like the above, I wonder if that means we could
> wait for submit == complete if we observe that submit was bumped
> since it was initially sampled above (rather than issue another
> flush, which would be necessary if a submit hadn't occurred)...?
>
> If we do end up with something like this, I think it's a bit cleaner to
> stuff the counter(s) in the xfs_buftarg structure and update them from
> the generic buffer submit/completion code based on XBF_FLUSH. FWIW, I
> suspect we could also update said counter(s) from
> xfs_blkdev_issue_flush().
>

I think what you are suggesting is to optimize more cases which are
not optimized now. That is probably possible, but also more
complicated to get right, and I'm not sure the workloads that would
gain from this are important enough.

If I am not mistaken, the way to fix the current optimization is to
record the last SYNC_DONE lsn (which is sort of what Christoph
suggested) and the last WANT_SYNC|ACTIVE lsn.
After file_write_and_wait() we need to save a pre_sync_lsn, and
before returning we need to make sure that post_sync_lsn >=
pre_sync_lsn, or else issue a flush.

Amir.


* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-08-31 19:20       ` Amir Goldstein
@ 2017-08-31 20:10         ` Brian Foster
  2017-09-01  7:58           ` Amir Goldstein
  2017-09-01  9:52         ` Christoph Hellwig
  1 sibling, 1 reply; 30+ messages in thread
From: Brian Foster @ 2017-08-31 20:10 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christoph Hellwig, Darrick J . Wong, Dave Chinner, linux-xfs,
	linux-fsdevel, Josef Bacik, stable [v4.9]

On Thu, Aug 31, 2017 at 10:20:19PM +0300, Amir Goldstein wrote:
> On Thu, Aug 31, 2017 at 7:39 PM, Brian Foster <bfoster@redhat.com> wrote:
> > On Thu, Aug 31, 2017 at 05:37:06PM +0300, Amir Goldstein wrote:
> >> On Thu, Aug 31, 2017 at 4:47 PM, Christoph Hellwig <hch@lst.de> wrote:
> >> > I think something like the following patch (totally untested,
> >> > just an idea) should fix the issue, right?
> >>
> >> I think that is not enough.
> >>
> >> >
> >> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> >> > index c4893e226fd8..555fcae9a18f 100644
> >> > --- a/fs/xfs/xfs_file.c
> >> > +++ b/fs/xfs/xfs_file.c
> >> > @@ -135,7 +135,7 @@ xfs_file_fsync(
> >> >         struct xfs_inode        *ip = XFS_I(inode);
> >> >         struct xfs_mount        *mp = ip->i_mount;
> >> >         int                     error = 0;
> >> > -       int                     log_flushed = 0;
> >> > +       unsigned int            flushseq;
> >> >         xfs_lsn_t               lsn = 0;
> >> >
> >> >         trace_xfs_file_fsync(ip);
> >> > @@ -143,6 +143,7 @@ xfs_file_fsync(
> >> >         error = file_write_and_wait_range(file, start, end);
> >> >         if (error)
> >> >                 return error;
> >> > +       flushseq = READ_ONCE(mp->m_flushseq);
> >>
> >> Imagine that a flush was submitted and completed before
> >> file_write_and_wait_range(), but m_flushseq was incremented after.
> >> Maybe here READ m_flush_submitted_seq...
> >>
> >> >
> >> >         if (XFS_FORCED_SHUTDOWN(mp))
> >> >                 return -EIO;
> >> > @@ -181,7 +182,7 @@ xfs_file_fsync(
> >> >         }
> >> >
> >> >         if (lsn) {
> >> > -               error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
> >> > +               error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, NULL);
> >> >                 ip->i_itemp->ili_fsync_fields = 0;
> >> >         }
> >> >         xfs_iunlock(ip, XFS_ILOCK_SHARED);
> >> > @@ -193,8 +194,9 @@ xfs_file_fsync(
> >> >          * an already allocated file and thus do not have any metadata to
> >> >          * commit.
> >> >          */
> >> > -       if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) &&
> >> > -           mp->m_logdev_targp == mp->m_ddev_targp)
> >> > +       if (!XFS_IS_REALTIME_INODE(ip) &&
> >> > +           mp->m_logdev_targp == mp->m_ddev_targp &&
> >> > +           flushseq == READ_ONCE(mp->m_flushseq))
> >>
> >> ... and here READ m_flush_completed_seq:
> >> if (m_flush_completed_seq > m_flush_submitted_seq)
> >> it is safe to skip issuing the flush.
> >> Then probably READ_ONCE() is not enough and we need smp_rmb()?
> >>
> >
> > IIUC, basically we need to guarantee that a flush submits after
> > file_write_and_wait() and completes before we return.
> 
> Yeah, unless we check whether file_write_and_wait() submitted
> anything at all.
> 

Ok, thanks.

> > If we do something
> > like the above, I wonder if that means we could wait for the submit ==
> > complete if we observe submit was bumped since it was initially sampled
> > above (rather than issue another flush, which would be necessary if a
> > submit hadn't occurred)...?
> >
> > If we do end up with something like this, I think it's a bit cleaner to
> > stuff the counter(s) in the xfs_buftarg structure and update them from
> > the generic buffer submit/completion code based on XBF_FLUSH. FWIW, I
> > suspect we could also update said counter(s) from
> > xfs_blkdev_issue_flush().
> >
> 
> I think what you are suggesting is to optimize more cases which are
> not optimized now. That is probably possible, but also more complicated
> to get right and not sure if the workloads that gain from this are important
> enough.
> 

Not necessarily. I'm just suggesting that the code could be factored
more generically/elegantly such that the logic is easier to follow. That
may facilitate optimizing more cases, but that's a secondary benefit. In
practice, the log buffer code is the only place we actually set
XBF_FLUSH, for example.

We could toss the waiting idea for now and issue the flush unless we
know one has submitted & completed since the writeback based on the
counters. Waiting on a log buffer may already do the wait for us in most
cases anyways, we just don't know if it submitted after the writeback or
not without the submit counter.

> If I am not mistaken, the way to fix the current optimization is to record
> the last SYNC_DONE lsn (which is sort of what Christoph suggested)
> and the last WANT_SYNC|ACTIVE lsn.
> After file_write_and_wait() we need to save pre_sync_lsn, and before
> returning we need to make sure that post_sync_lsn >= pre_sync_lsn or
> issue a flush.
> 

Perhaps, but I'm not quite following what you mean by pre/post LSNs.
Note that I believe log buffers can complete out of order, if that is
relevant here. Either way, this still seems like underhanded logic
IMO...

If the requirement is a simple "issue a flush if we can't detect that
one has submitted+completed on this device since our writeback
completed" rule, why intentionally obfuscate that with internal log
buffer state such as log buffer header LSN and log state machine values?
Just track flush submission/completions as you suggested earlier and the
fsync logic is much easier to follow. Then we don't need to work
backwards from the XFS logging infrastructure just to try and verify
whether a flush has occurred in all cases. :)

Brian

> Amir.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-08-31 20:10         ` Brian Foster
@ 2017-09-01  7:58           ` Amir Goldstein
  2017-09-01 10:46             ` Brian Foster
  0 siblings, 1 reply; 30+ messages in thread
From: Amir Goldstein @ 2017-09-01  7:58 UTC (permalink / raw)
  To: Brian Foster
  Cc: Christoph Hellwig, Darrick J . Wong, Dave Chinner, linux-xfs,
	linux-fsdevel, Josef Bacik, stable [v4.9]

On Thu, Aug 31, 2017 at 11:10 PM, Brian Foster <bfoster@redhat.com> wrote:
> On Thu, Aug 31, 2017 at 10:20:19PM +0300, Amir Goldstein wrote:
>> On Thu, Aug 31, 2017 at 7:39 PM, Brian Foster <bfoster@redhat.com> wrote:
...
>> > If we do something
>> > like the above, I wonder if that means we could wait for the submit ==
>> > complete if we observe submit was bumped since it was initially sampled
>> > above (rather than issue another flush, which would be necessary if a
>> > submit hadn't occurred)...?
>> >
>> > If we do end up with something like this, I think it's a bit cleaner to
>> > stuff the counter(s) in the xfs_buftarg structure and update them from
>> > the generic buffer submit/completion code based on XBF_FLUSH. FWIW, I
>> > suspect we could also update said counter(s) from
>> > xfs_blkdev_issue_flush().
>> >
>>
>> I think what you are suggesting is to optimize more cases which are
>> not optimized now. That is probably possible, but also more complicated
>> to get right and not sure if the workloads that gain from this are important
>> enough.
>>
>
> Not necessarily. I'm just suggesting that the code could be factored
> more generically/elegantly such that the logic is easier to follow. That
> may facilitate optimizing more cases, but that's a secondary benefit. In
> practice, the log buffer code is the only place we actually set
> XBF_FLUSH, for example.
>

I guess that makes sense.
It is going to end up with more code, though, so if we are not going to
optimize more cases (i.e. subsequent fdatasync) we should consider
whether the extra code is worth it.

>
>> If I am not mistaken, the way to fix the current optimization is to record
>> the last SYNC_DONE lsn (which is sort of what Christoph suggested)
>> and the last WANT_SYNC|ACTIVE lsn.
>> After file_write_and_wait() we need to save pre_sync_lsn, and before
>> returning we need to make sure that post_sync_lsn >= pre_sync_lsn or
>> issue a flush.
>>
>
> Perhaps, but I'm not quite following what you mean by pre/post LSNs.
> Note that I believe log buffers can complete out of order, if that is
> relevant here. Either way, this still seems like underhanded logic
> IMO...
>
> If the requirement is a simple "issue a flush if we can't detect that
> one has submitted+completed on this device since our writeback
> completed" rule, why intentionally obfuscate that with internal log
> buffer state such as log buffer header LSN and log state machine values?
> Just track flush submission/completions as you suggested earlier and the
> fsync logic is much easier to follow. Then we don't need to work
> backwards from the XFS logging infrastructure just to try and verify
> whether a flush has occurred in all cases. :)
>

Your argument makes a lot of sense. I'm just trying to be extra cautious
and looking for a small step solution. As Darrick wrote.. "safety first" :)

Amir.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-08-31 14:37   ` Amir Goldstein
  2017-08-31 16:39     ` Brian Foster
@ 2017-09-01  9:47     ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2017-09-01  9:47 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christoph Hellwig, Darrick J . Wong, Dave Chinner, linux-xfs,
	linux-fsdevel, Josef Bacik, stable [v4.9]

On Thu, Aug 31, 2017 at 05:37:06PM +0300, Amir Goldstein wrote:
> > index bcb2f860e508..3c0cbb98581e 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -2922,6 +2922,8 @@ xlog_state_done_syncing(
> >                 iclog->ic_state = XLOG_STATE_DONE_SYNC;
> >         }
> >
> > +       log->l_mp->m_flushseq++;
> 
> I reckon this should use WRITE_ONCE or smp_wmb()
> and then also increment m_flush_submitted_seq *before*
> issuing the flush

It's under l_icloglock, so we shouldn't need a WRITE_ONCE.
If we did we'd need an atomic_t.

> If the state machine does not allow more than a single flush
> to be in flight (?), then the 2 seq counters could be reduced
> to a single seq counter, with (m_flushseq % 2) == 1 for submitted
> and (m_flushseq % 2) == 0 for completed, and the test in fsync
> would be (flushseq % 2) == (READ_ONCE(mp->m_flushseq) % 2)

The state machine allows multiple flushes to be in flight.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-08-31 19:20       ` Amir Goldstein
  2017-08-31 20:10         ` Brian Foster
@ 2017-09-01  9:52         ` Christoph Hellwig
  2017-09-01 10:37           ` Amir Goldstein
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2017-09-01  9:52 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Brian Foster, Christoph Hellwig, Darrick J . Wong, Dave Chinner,
	linux-xfs, linux-fsdevel, Josef Bacik, stable [v4.9]

On Thu, Aug 31, 2017 at 10:20:19PM +0300, Amir Goldstein wrote:
> > IIUC, basically we need to guarantee that a flush submits after
> > file_write_and_wait() and completes before we return.
> 
> Yeh. unless we check if file_write_and_wait() submitted anything
> at all.

Even if file_write_and_wait did not submit anything, we need to
make sure a flush was submitted and completed after entering
xfs_file_fsync: for one, to deal with the case where we wrote
back data from the flusher threads or the VM, and also for
the direct I/O case.


Btw, do you have any results for your simple catch?  I wonder
how much of an issue it actually is in practice.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-09-01  9:52         ` Christoph Hellwig
@ 2017-09-01 10:37           ` Amir Goldstein
  2017-09-01 10:43             ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Amir Goldstein @ 2017-09-01 10:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Brian Foster, Darrick J . Wong, Dave Chinner, linux-xfs,
	linux-fsdevel, Josef Bacik, stable [v4.9]

On Fri, Sep 1, 2017 at 12:52 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Aug 31, 2017 at 10:20:19PM +0300, Amir Goldstein wrote:
>> > IIUC, basically we need to guarantee that a flush submits after
>> > file_write_and_wait() and completes before we return.
>>
>> Yeh. unless we check if file_write_and_wait() submitted anything
>> at all.
>
> Even if file_write_and_wait did not submit anything we need to
> make sure a flush was submitted and completed after entering
> xfs_file_fsync.  For one to deal with the case where we wrote
> back data from the flusher threads or the VM, and also for
> the direct I/O case.
>

Right.

>
> Btw, do you have any results for your simple catch?  I wonder
> how much of an issue it actually is in practice.

Well, since the bug was demonstrated using a crash simulator,
I have no idea how often one would run into this with an actual power
failure, but does it really matter how often, if we *know* that it can
happen?

The sequence of bios that I illustrated in the commit message is
what I saw regardless of the crash simulator, and that sequence
leaves a data block in the disk cache, with nothing that guarantees
a FLUSH afterwards. So for an unlucky user, that gives a window
of data loss in case of power failure, at least until the next time
a flusher thread runs. That could be seconds, no?

I am working on a variant of the flushseq approach you suggested using
log->l_last_sync_lsn, which may be simple enough and not as painful
as turning off the optimization for WANT_SYNC log buffers.

Amir.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-09-01 10:37           ` Amir Goldstein
@ 2017-09-01 10:43             ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2017-09-01 10:43 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christoph Hellwig, Brian Foster, Darrick J . Wong, Dave Chinner,
	linux-xfs, linux-fsdevel, Josef Bacik, stable [v4.9]

On Fri, Sep 01, 2017 at 01:37:28PM +0300, Amir Goldstein wrote:
> Well since the bug was demonstrated using crash simulator,
> I have no idea how often one would run into this with actual power
> failure, but does it really matter how often, if we *know* that it can
> happen?

No, the issue is real.  I wonder how much of an issue the additional
flushes are in practice.

> I am working on a variant of the flushseq you suggested using
> log->l_last_sync_lsn that may be simple enough and not as painful
> as turning off optimization for WANT_SYNC log buffer.

Ok, I'll take a look once you post it.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-09-01  7:58           ` Amir Goldstein
@ 2017-09-01 10:46             ` Brian Foster
  0 siblings, 0 replies; 30+ messages in thread
From: Brian Foster @ 2017-09-01 10:46 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christoph Hellwig, Darrick J . Wong, Dave Chinner, linux-xfs,
	linux-fsdevel, Josef Bacik, stable [v4.9]

On Fri, Sep 01, 2017 at 10:58:43AM +0300, Amir Goldstein wrote:
> On Thu, Aug 31, 2017 at 11:10 PM, Brian Foster <bfoster@redhat.com> wrote:
> > On Thu, Aug 31, 2017 at 10:20:19PM +0300, Amir Goldstein wrote:
> >> On Thu, Aug 31, 2017 at 7:39 PM, Brian Foster <bfoster@redhat.com> wrote:
> ...
> >> > If we do something
> >> > like the above, I wonder if that means we could wait for the submit ==
> >> > complete if we observe submit was bumped since it was initially sampled
> >> > above (rather than issue another flush, which would be necessary if a
> >> > submit hadn't occurred)...?
> >> >
> >> > If we do end up with something like this, I think it's a bit cleaner to
> >> > stuff the counter(s) in the xfs_buftarg structure and update them from
> >> > the generic buffer submit/completion code based on XBF_FLUSH. FWIW, I
> >> > suspect we could also update said counter(s) from
> >> > xfs_blkdev_issue_flush().
> >> >
> >>
> >> I think what you are suggesting is to optimize more cases which are
> >> not optimized now. That is probably possible, but also more complicated
> >> to get right and not sure if the workloads that gain from this are important
> >> enough.
> >>
> >
> > Not necessarily. I'm just suggesting that the code could be factored
> > more generically/elegantly such that the logic is easier to follow. That
> > may facilitate optimizing more cases, but that's a secondary benefit. In
> > practice, the log buffer code is the only place we actually set
> > XBF_FLUSH, for example.
> >
> 
> I guess that makes sense.
> It is going to end up with more code, though, so if we are not going to
> optimize more cases (i.e. subsequent fdatasync) we should consider
> whether the extra code is worth it.
> 

Incrementally more than Christoph's patch. I don't think that's an
issue.

> >
> >> If I am not mistaken, the way to fix the current optimization is to record
> >> the last SYNC_DONE lsn (which is sort of what Christoph suggested)
> >> and the last WANT_SYNC|ACTIVE lsn.
> >> After file_write_and_wait() we need to save pre_sync_lsn, and before
> >> returning we need to make sure that post_sync_lsn >= pre_sync_lsn or
> >> issue a flush.
> >>
> >
> > Perhaps, but I'm not quite following what you mean by pre/post LSNs.
> > Note that I believe log buffers can complete out of order, if that is
> > relevant here. Either way, this still seems like underhanded logic
> > IMO...
> >
> > If the requirement is a simple "issue a flush if we can't detect that
> > one has submitted+completed on this device since our writeback
> > completed" rule, why intentionally obfuscate that with internal log
> > buffer state such as log buffer header LSN and log state machine values?
> > Just track flush submission/completions as you suggested earlier and the
> > fsync logic is much easier to follow. Then we don't need to work
> > backwards from the XFS logging infrastructure just to try and verify
> > whether a flush has occurred in all cases. :)
> >
> 
> Your argument makes a lot of sense. I'm just trying to be extra cautious
> and looking for a small step solution. As Darrick wrote.. "safety first" :)
> 

Sure. FWIW, I think your patch suits that purpose as it basically
disables the shady bits of the optimization. It looks like it's already
in Darrick's for-next branch and Cc'd to stable as well.

A cleaner rework of this mechanism can come after, hopefully restoring
the full optimization safely and perhaps cleaning out the whole
log_flushed thing.

Brian

> Amir.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-08-30 13:38 [PATCH] xfs: fix incorrect log_flushed on fsync Amir Goldstein
                   ` (2 preceding siblings ...)
  2017-08-31 13:47 ` Christoph Hellwig
@ 2017-09-15 12:40 ` Amir Goldstein
  2017-09-18 17:11   ` Darrick J. Wong
  3 siblings, 1 reply; 30+ messages in thread
From: Amir Goldstein @ 2017-09-15 12:40 UTC (permalink / raw)
  To: Darrick J . Wong, Christoph Hellwig, Dave Chinner
  Cc: linux-xfs, linux-fsdevel, Josef Bacik, stable [v4.9]

On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> When calling into _xfs_log_force{,_lsn}() with a pointer
> to log_flushed variable, log_flushed will be set to 1 if:
> 1. xlog_sync() is called to flush the active log buffer
> AND/OR
> 2. xlog_wait() is called to wait on syncing log buffers
>
> xfs_file_fsync() checks the value of log_flushed after
> _xfs_log_force_lsn() call to optimize away an explicit
> PREFLUSH request to the data block device after writing
> out all the file's pages to disk.
>
> This optimization is incorrect in the following sequence of events:
>
>  Task A                    Task B
>  -------------------------------------------------------
>  xfs_file_fsync()
>    _xfs_log_force_lsn()
>      xlog_sync()
>         [submit PREFLUSH]
>                            xfs_file_fsync()
>                              file_write_and_wait_range()
>                                [submit WRITE X]
>                                [endio  WRITE X]
>                              _xfs_log_force_lsn()
>                                xlog_wait()
>         [endio  PREFLUSH]
>
> The write X is not guaranteed to be on persistent storage
> when the PREFLUSH request is completed, because write X was submitted
> after the PREFLUSH request, but xfs_file_fsync() of task A will
> be notified of log_flushed=1 and will skip the explicit flush.
>
> If the system crashes after fsync of task A, write X may not be
> present on disk after reboot.
>
> This bug was discovered and demonstrated using Josef Bacik's
> dm-log-writes target, which can be used to record block io operations
> and then replay a subset of these operations onto the target device.
> The test goes something like this:
> - Use fsx to execute ops of a file and record ops on log device
> - Every now and then fsync the file, store md5 of file and mark
>   the location in the log
> - Then replay log onto device for each mark, mount fs and compare
>   md5 of file to stored value
>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Josef Bacik <jbacik@fb.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>
> Christoph, Dave,
>
> It's hard to believe, but I think the reported bug has been around
> since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did
> not try to test old kernels.

Forgot to tag commit message with:
Fixes: f538d4da8d52 ("[XFS] write barrier support")

Maybe the tag could be added when applying to recent stables,
so distros and older downstream stables can see the tag.

The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
whether a possible data loss bug should also be disclosed in some distro forum?
I bet some users would care more about the latter than the former.
Coincidentally, both the data loss fix and the security fix address the same commit..

Cheers,
Amir.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-09-15 12:40 ` Amir Goldstein
@ 2017-09-18 17:11   ` Darrick J. Wong
  2017-09-18 18:00     ` Amir Goldstein
  0 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2017-09-18 17:11 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christoph Hellwig, Dave Chinner, linux-xfs, linux-fsdevel,
	Josef Bacik, stable [v4.9]

On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> > When calling into _xfs_log_force{,_lsn}() with a pointer
> > to log_flushed variable, log_flushed will be set to 1 if:
> > 1. xlog_sync() is called to flush the active log buffer
> > AND/OR
> > 2. xlog_wait() is called to wait on syncing log buffers
> >
> > xfs_file_fsync() checks the value of log_flushed after
> > _xfs_log_force_lsn() call to optimize away an explicit
> > PREFLUSH request to the data block device after writing
> > out all the file's pages to disk.
> >
> > This optimization is incorrect in the following sequence of events:
> >
> >  Task A                    Task B
> >  -------------------------------------------------------
> >  xfs_file_fsync()
> >    _xfs_log_force_lsn()
> >      xlog_sync()
> >         [submit PREFLUSH]
> >                            xfs_file_fsync()
> >                              file_write_and_wait_range()
> >                                [submit WRITE X]
> >                                [endio  WRITE X]
> >                              _xfs_log_force_lsn()
> >                                xlog_wait()
> >         [endio  PREFLUSH]
> >
> > The write X is not guaranteed to be on persistent storage
> > when the PREFLUSH request is completed, because write X was submitted
> > after the PREFLUSH request, but xfs_file_fsync() of task A will
> > be notified of log_flushed=1 and will skip the explicit flush.
> >
> > If the system crashes after fsync of task A, write X may not be
> > present on disk after reboot.
> >
> > This bug was discovered and demonstrated using Josef Bacik's
> > dm-log-writes target, which can be used to record block io operations
> > and then replay a subset of these operations onto the target device.
> > The test goes something like this:
> > - Use fsx to execute ops of a file and record ops on log device
> > - Every now and then fsync the file, store md5 of file and mark
> >   the location in the log
> > - Then replay log onto device for each mark, mount fs and compare
> >   md5 of file to stored value
> >
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Josef Bacik <jbacik@fb.com>
> > Cc: <stable@vger.kernel.org>
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > ---
> >
> > Christoph, Dave,
> >
> > It's hard to believe, but I think the reported bug has been around
> > since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did
> > not try to test old kernels.
> 
> Forgot to tag commit message with:
> Fixes: f538d4da8d52 ("[XFS] write barrier support")
> 
> Maybe the tag could be added when applying to recent stables,
> so distros and older downstream stables can see the tag.
> 
> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
> if possible data loss bug should also be disclosed in some distros forum?
> I bet some users would care more about the latter than the former.
> Coincidentally, both data loss and security bugs fix the same commit..

Yes, the patch ought to get sent on to stable w/ a Fixes tag.  One
would hope that the distros will pick up the stable fixes from there.
That said, it's been in the kernel for 12 years without widespread
complaints about corruption, so I'm not sure this warrants public
disclosure via CVE/Phoronix vs. just fixing it.

--D

> 
> Cheers,
> Amir.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-09-18 17:11   ` Darrick J. Wong
@ 2017-09-18 18:00     ` Amir Goldstein
  2017-09-18 18:35       ` Greg KH
  2017-09-18 21:24       ` Dave Chinner
  0 siblings, 2 replies; 30+ messages in thread
From: Amir Goldstein @ 2017-09-18 18:00 UTC (permalink / raw)
  To: Darrick J. Wong, Greg KH
  Cc: Christoph Hellwig, Dave Chinner, linux-xfs, linux-fsdevel,
	Josef Bacik, stable [v4.9]

On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
>> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> > When calling into _xfs_log_force{,_lsn}() with a pointer
>> > to log_flushed variable, log_flushed will be set to 1 if:
>> > 1. xlog_sync() is called to flush the active log buffer
>> > AND/OR
>> > 2. xlog_wait() is called to wait on syncing log buffers
>> >
>> > xfs_file_fsync() checks the value of log_flushed after
>> > _xfs_log_force_lsn() call to optimize away an explicit
>> > PREFLUSH request to the data block device after writing
>> > out all the file's pages to disk.
>> >
>> > This optimization is incorrect in the following sequence of events:
>> >
>> >  Task A                    Task B
>> >  -------------------------------------------------------
>> >  xfs_file_fsync()
>> >    _xfs_log_force_lsn()
>> >      xlog_sync()
>> >         [submit PREFLUSH]
>> >                            xfs_file_fsync()
>> >                              file_write_and_wait_range()
>> >                                [submit WRITE X]
>> >                                [endio  WRITE X]
>> >                              _xfs_log_force_lsn()
>> >                                xlog_wait()
>> >         [endio  PREFLUSH]
>> >
>> > The write X is not guaranteed to be on persistent storage
>> > when the PREFLUSH request is completed, because write X was submitted
>> > after the PREFLUSH request, but xfs_file_fsync() of task A will
>> > be notified of log_flushed=1 and will skip the explicit flush.
>> >
>> > If the system crashes after fsync of task A, write X may not be
>> > present on disk after reboot.
>> >
>> > This bug was discovered and demonstrated using Josef Bacik's
>> > dm-log-writes target, which can be used to record block io operations
>> > and then replay a subset of these operations onto the target device.
>> > The test goes something like this:
>> > - Use fsx to execute ops of a file and record ops on log device
>> > - Every now and then fsync the file, store md5 of file and mark
>> >   the location in the log
>> > - Then replay log onto device for each mark, mount fs and compare
>> >   md5 of file to stored value
>> >
>> > Cc: Christoph Hellwig <hch@lst.de>
>> > Cc: Josef Bacik <jbacik@fb.com>
>> > Cc: <stable@vger.kernel.org>
>> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>> > ---
>> >
>> > Christoph, Dave,
>> >
>> > It's hard to believe, but I think the reported bug has been around
>> > since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did
>> > not try to test old kernels.
>>
>> Forgot to tag commit message with:
>> Fixes: f538d4da8d52 ("[XFS] write barrier support")
>>
>> Maybe the tag could be added when applying to recent stables,
>> so distros and older downstream stables can see the tag.
>>
>> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
>> if possible data loss bug should also be disclosed in some distros forum?
>> I bet some users would care more about the latter than the former.
>> Coincidentally, both data loss and security bugs fix the same commit..
>
> Yes, the patch ought to get sent on to stable w/ a Fixes tag.  One
> would hope that the distros will pick up the stable fixes from there.


Greg, for your consideration, please add
 Fixes: f538d4da8d52 ("[XFS] write barrier support")
If not pushed yet.

> That said, it's been in the kernel for 12 years without widespread
> complaints about corruption, so I'm not sure this warrants public
> disclosure via CVE/Phoronix vs. just fixing it.
>

I'm not sure either.
My intuition tells me that the chances of hitting the data loss bug
given a power failure are not slim, but the chances of users knowing
about the data loss are slim.
Meaning, the chances of users complaining:
"I swear, I did fsync, lost power, and after boot the data was not there"
are slim, and the chances of developers believing the report on
hearsay are even slimmer.

Oh well. I was just wondering...

Amir.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-09-18 18:00     ` Amir Goldstein
@ 2017-09-18 18:35       ` Greg KH
  2017-09-18 19:29         ` Amir Goldstein
  2017-09-18 21:24       ` Dave Chinner
  1 sibling, 1 reply; 30+ messages in thread
From: Greg KH @ 2017-09-18 18:35 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Darrick J. Wong, Christoph Hellwig, Dave Chinner, linux-xfs,
	linux-fsdevel, Josef Bacik, stable [v4.9]

On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
> >> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> >> > When calling into _xfs_log_force{,_lsn}() with a pointer
> >> > to log_flushed variable, log_flushed will be set to 1 if:
> >> > 1. xlog_sync() is called to flush the active log buffer
> >> > AND/OR
> >> > 2. xlog_wait() is called to wait on syncing log buffers
> >> >
> >> > xfs_file_fsync() checks the value of log_flushed after
> >> > _xfs_log_force_lsn() call to optimize away an explicit
> >> > PREFLUSH request to the data block device after writing
> >> > out all the file's pages to disk.
> >> >
> >> > This optimization is incorrect in the following sequence of events:
> >> >
> >> >  Task A                    Task B
> >> >  -------------------------------------------------------
> >> >  xfs_file_fsync()
> >> >    _xfs_log_force_lsn()
> >> >      xlog_sync()
> >> >         [submit PREFLUSH]
> >> >                            xfs_file_fsync()
> >> >                              file_write_and_wait_range()
> >> >                                [submit WRITE X]
> >> >                                [endio  WRITE X]
> >> >                              _xfs_log_force_lsn()
> >> >                                xlog_wait()
> >> >         [endio  PREFLUSH]
> >> >
> >> > The write X is not guaranteed to be on persistent storage
> >> > when the PREFLUSH request is completed, because write X was submitted
> >> > after the PREFLUSH request, but xfs_file_fsync() of task A will
> >> > be notified of log_flushed=1 and will skip the explicit flush.
> >> >
> >> > If the system crashes after fsync of task A, write X may not be
> >> > present on disk after reboot.
> >> >
> >> > This bug was discovered and demonstrated using Josef Bacik's
> >> > dm-log-writes target, which can be used to record block io operations
> >> > and then replay a subset of these operations onto the target device.
> >> > The test goes something like this:
> >> > - Use fsx to execute ops of a file and record ops on log device
> >> > - Every now and then fsync the file, store md5 of file and mark
> >> >   the location in the log
> >> > - Then replay log onto device for each mark, mount fs and compare
> >> >   md5 of file to stored value
> >> >
> >> > Cc: Christoph Hellwig <hch@lst.de>
> >> > Cc: Josef Bacik <jbacik@fb.com>
> >> > Cc: <stable@vger.kernel.org>
> >> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >> > ---
> >> >
> >> > Christoph, Dave,
> >> >
> >> > It's hard to believe, but I think the reported bug has been around
> >> > since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did
> >> > not try to test old kernels.
> >>
> >> Forgot to tag commit message with:
> >> Fixes: f538d4da8d52 ("[XFS] write barrier support")
> >>
> >> Maybe the tag could be added when applying to recent stables,
> >> so distros and older downstream stables can see the tag.
> >>
> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
> >> if possible data loss bug should also be disclosed in some distros forum?
> >> I bet some users would care more about the latter than the former.
> >> Coincidentally, both data loss and security bugs fix the same commit..
> >
> > Yes, the patch ought to get sent on to stable w/ a Fixes tag.  One
> > would hope that the distros will pick up the stable fixes from there.
> 
> 
> Greg, for your consideration, please add
>  Fixes: f538d4da8d52 ("[XFS] write barrier support")
> If not pushed yet.

Add it to what?

totally confused,

greg k-h

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-09-18 18:35       ` Greg KH
@ 2017-09-18 19:29         ` Amir Goldstein
  2017-09-19  6:32           ` Greg KH
  0 siblings, 1 reply; 30+ messages in thread
From: Amir Goldstein @ 2017-09-18 19:29 UTC (permalink / raw)
  To: Greg KH
  Cc: Darrick J. Wong, Christoph Hellwig, Dave Chinner, linux-xfs,
	linux-fsdevel, Josef Bacik, stable [v4.9]

On Mon, Sep 18, 2017 at 9:35 PM, Greg KH <greg@kroah.com> wrote:
> On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
>> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong
>> <darrick.wong@oracle.com> wrote:
>> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
>> >> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> >> > When calling into _xfs_log_force{,_lsn}() with a pointer
>> >> > to log_flushed variable, log_flushed will be set to 1 if:
>> >> > 1. xlog_sync() is called to flush the active log buffer
>> >> > AND/OR
>> >> > 2. xlog_wait() is called to wait on a syncing log buffers
>> >> >
>> >> > xfs_file_fsync() checks the value of log_flushed after
>> >> > _xfs_log_force_lsn() call to optimize away an explicit
>> >> > PREFLUSH request to the data block device after writing
>> >> > out all the file's pages to disk.
>> >> >
>> >> > This optimization is incorrect in the following sequence of events:
>> >> >
>> >> >  Task A                    Task B
>> >> >  -------------------------------------------------------
>> >> >  xfs_file_fsync()
>> >> >    _xfs_log_force_lsn()
>> >> >      xlog_sync()
>> >> >         [submit PREFLUSH]
>> >> >                            xfs_file_fsync()
>> >> >                              file_write_and_wait_range()
>> >> >                                [submit WRITE X]
>> >> >                                [endio  WRITE X]
>> >> >                              _xfs_log_force_lsn()
>> >> >                                xlog_wait()
>> >> >         [endio  PREFLUSH]
>> >> >
>> >> > The write X is not guarantied to be on persistent storage
>> >> > when PREFLUSH request in completed, because write A was submitted
>> >> > after the PREFLUSH request, but xfs_file_fsync() of task A will
>> >> > be notified of log_flushed=1 and will skip explicit flush.
>> >> >
>> >> > If the system crashes after fsync of task A, write X may not be
>> >> > present on disk after reboot.
>> >> >
>> >> > This bug was discovered and demonstrated using Josef Bacik's
>> >> > dm-log-writes target, which can be used to record block io operations
>> >> > and then replay a subset of these operations onto the target device.
>> >> > The test goes something like this:
>> >> > - Use fsx to execute ops of a file and record ops on log device
>> >> > - Every now and then fsync the file, store md5 of file and mark
>> >> >   the location in the log
>> >> > - Then replay log onto device for each mark, mount fs and compare
>> >> >   md5 of file to stored value
>> >> >
>> >> > Cc: Christoph Hellwig <hch@lst.de>
>> >> > Cc: Josef Bacik <jbacik@fb.com>
>> >> > Cc: <stable@vger.kernel.org>
>> >> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>> >> > ---
>> >> >
>> >> > Christoph, Dave,
>> >> >
>> >> > It's hard to believe, but I think the reported bug has been around
>> >> > since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did
>> >> > not try to test old kernels.
>> >>
>> >> Forgot to tag commit message with:
>> >> Fixes: f538d4da8d52 ("[XFS] write barrier support")
>> >>
>> >> Maybe the tag could be added when applying to recent stables,
>> >> so distros and older downstream stables can see the tag.
>> >>
>> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
>> >> if possible data loss bug should also be disclosed in some distros forum?
>> >> I bet some users would care more about the latter than the former.
>> >> Coincidentally, both data loss and security bugs fix the same commit..
>> >
>> > Yes the the patch ought to get sent on to stable w/ fixes tag.  One
>> > would hope that the distros will pick up the stable fixes from there.
>>
>>
>> Greg, for your consideration, please add
>>  Fixes: f538d4da8d52 ("[XFS] write barrier support")
>> If not pushed yet.
>
> Add it to what?

Sorry, add that tag when applying commit 47c7d0b1950258312
to stable trees, since I missed adding the tag before it was merged
to master.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-09-18 18:00     ` Amir Goldstein
  2017-09-18 18:35       ` Greg KH
@ 2017-09-18 21:24       ` Dave Chinner
  2017-09-19  5:31         ` Amir Goldstein
  1 sibling, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2017-09-18 21:24 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Darrick J. Wong, Greg KH, Christoph Hellwig, linux-xfs,
	linux-fsdevel, Josef Bacik, stable [v4.9]

On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
> >> if possible data loss bug should also be disclosed in some distros forum?
> >> I bet some users would care more about the latter than the former.
> >> Coincidentally, both data loss and security bugs fix the same commit..
> >
> > Yes the the patch ought to get sent on to stable w/ fixes tag.  One
> > would hope that the distros will pick up the stable fixes from there.

Yup, that's the normal process for data integrity/fs corruption
bugs.

> > That said, it's been in the kernel for 12 years without widespread
> > complaints about corruption, so I'm not sure this warrants public
> > disclosure via CVE/Phoronix vs. just fixing it.
> >
> 
> I'm not sure either.
> My intuition tells me that the chances of hitting the data loss bug
> given a power failure are not slim, but the chances of users knowing
> about the data loss are slim.

The chances of hitting it are slim. Power-fail vs fsync data
integrity testing is something we do actually run as part of QE, and
have for many years; in all that time we have never tripped over this
problem, so I think the likelihood that a
user will hit it is extremely small. Compare that to the bug we
issued the CVE for - it's in code we /don't test/, the bug affects
everyone with that support compiled in (potentially millions of
systems) and it is a guaranteed local user triggered denial of
service. There's a big difference in scope and impact between these
two cases.

Further, we have to consider precedence, developer resources and
impact when deciding what we issue disclosures for. e.g. Verifying
the problem, categorising it, finding out what systems it affected,
working out the best fix for backporting, testing the fix, working
through all the disclosure processes, etc took three long days of me
doing nothing else but working on this issue.

If we start issuing disclosures for all data integrity and/or
corruption fixes, then we're going to have no time for anything
else. That is, a significant number of bugs we fix every release are
for corruption and/or data integrity issues. The process we use for
getting them back to stable and distro kernels is to tag them for
stable kernels. Adding some heavyweight disclosure process on top of
that isn't going to provide any extra value to distros or users. All
it does is add significant extra workload to the upstream
development process.

> Meaning, the chances of users complaining:
> "I swear, I did fsync, lost power, and after boot data was not there" are slim
> and the chances of developers believing the report on hear say are
> even slimmer.

A disclosure is not going to change that. All it will do is result
in users seeing ghosts (suggestion bias) and reporting things that
aren't actually filesystem data loss. e.g. "filesystem lost data on
crash" when in reality the problem is that the app doesn't use
fsync...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-09-18 21:24       ` Dave Chinner
@ 2017-09-19  5:31         ` Amir Goldstein
  2017-09-19  5:45           ` Darrick J. Wong
  2017-09-20  0:40           ` Dave Chinner
  0 siblings, 2 replies; 30+ messages in thread
From: Amir Goldstein @ 2017-09-19  5:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Greg KH, Christoph Hellwig, linux-xfs,
	linux-fsdevel, Josef Bacik, stable [v4.9]

On Tue, Sep 19, 2017 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
>> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
>> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
>> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
>> >> if possible data loss bug should also be disclosed in some distros forum?
>> >> I bet some users would care more about the latter than the former.
>> >> Coincidentally, both data loss and security bugs fix the same commit..
>> >
>> > Yes the the patch ought to get sent on to stable w/ fixes tag.  One
>> > would hope that the distros will pick up the stable fixes from there.
>
> Yup, that's the normal process for data integrity/fs corruption
> bugs.

Makes sense. I'm convinced that the normal process is sufficient for this
sort of bug fix.

>
>> > That said, it's been in the kernel for 12 years without widespread
>> > complaints about corruption, so I'm not sure this warrants public
>> > disclosure via CVE/Phoronix vs. just fixing it.
>> >
>>
>> I'm not sure either.
>> My intuition tells me that the chances of hitting the data loss bug
>> given a power failure are not slim, but the chances of users knowing
>> about the data loss are slim.
>
> The chances of hitting it are slim. Power-fail vs fsync data
> integrity testing is something we do actually run as part of QE and
> have for many years.  We've been running such testing for years and
> never tripped over this problem, so I think the likelihood that a
> user will hit it is extremely small.

This sentence makes me uneasy.
Who is "we", and what QE testing are you referring to?
Are those tests in xfstests or any other public repository?
My first reaction to the corruption was "no way, I need to check the test".
My second reaction, after checking the test, was "this must be very hard to hit".
But on closer inspection, it looks like it takes no more than running
a couple of fsyncs in parallel to reach the "at risk" state, which may persist
for seconds.
Of course the chances of users being unlucky enough to also get a power
failure during the "at risk" state are low, but I am puzzled how the power
fail tests you claim exist didn't catch this sooner.

Anyway, I'm not sure there is much more to discuss; I just wanted to see
whether there is a post-mortem lesson to be learned from this, beyond the
fact that dm-log-writes is a valuable testing tool.
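
For reference, the record-and-replay flow from the commit message can be
sketched roughly as below. This is a deliberate simplification (a single
write+fsync instead of an fsx run, one mark instead of many), and it
assumes the kernel's log-writes dm target plus the replay-log utility
from Josef Bacik's log-writes tools; device names, mount points, and the
mark name are placeholders:

```shell
#!/bin/sh
# Sketch of the dm-log-writes record/replay check. Requires root, a
# scratch block device ($DEV) and a log device ($LOGDEV).
if [ -z "$DEV" ] || [ -z "$LOGDEV" ]; then
	echo "usage: set DEV and LOGDEV to the target and log block devices"
	exit 0
fi
set -e
SIZE=$(blockdev --getsz "$DEV")

# 1. Stack the log-writes target over the scratch device and mkfs it.
dmsetup create logdev --table "0 $SIZE log-writes $DEV $LOGDEV"
mkfs.xfs -f /dev/mapper/logdev
mount /dev/mapper/logdev /mnt

# 2. Write data, fsync it, record the md5 and drop a mark in the log.
dd if=/dev/urandom of=/mnt/testfile bs=4k count=32 conv=fsync
md5sum /mnt/testfile > /tmp/md5.mark1
dmsetup message logdev 0 mark mark1
umount /mnt
dmsetup remove logdev

# 3. Replay the log up to the mark, mount the result read-only, and
#    verify the fsynced file still matches the recorded md5.
replay-log --log "$LOGDEV" --replay "$DEV" --end-mark mark1
mount -o ro "$DEV" /mnt
md5sum -c /tmp/md5.mark1
umount /mnt
```

A mismatch in the final md5sum check is the data-loss signature: data
that was fsynced before the mark is missing from the replayed image.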

Cheers,
Amir.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-09-19  5:31         ` Amir Goldstein
@ 2017-09-19  5:45           ` Darrick J. Wong
  2017-09-20  0:40           ` Dave Chinner
  1 sibling, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2017-09-19  5:45 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, Greg KH, Christoph Hellwig, linux-xfs,
	linux-fsdevel, Josef Bacik, stable [v4.9]

On Tue, Sep 19, 2017 at 08:31:37AM +0300, Amir Goldstein wrote:
> On Tue, Sep 19, 2017 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
> >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
> >> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
> >> >> if possible data loss bug should also be disclosed in some distros forum?
> >> >> I bet some users would care more about the latter than the former.
> >> >> Coincidentally, both data loss and security bugs fix the same commit..
> >> >
> >> > Yes the the patch ought to get sent on to stable w/ fixes tag.  One
> >> > would hope that the distros will pick up the stable fixes from there.
> >
> > Yup, that's the normal process for data integrity/fs corruption
> > bugs.
> 
> Makes sense. I'm convinced that the normal process is sufficient for this
> sort of bug fix.
> 
> >
> >> > That said, it's been in the kernel for 12 years without widespread
> >> > complaints about corruption, so I'm not sure this warrants public
> >> > disclosure via CVE/Phoronix vs. just fixing it.
> >> >
> >>
> >> I'm not sure either.
> >> My intuition tells me that the chances of hitting the data loss bug
> >> given a power failure are not slim, but the chances of users knowing
> >> about the data loss are slim.
> >
> > The chances of hitting it are slim. Power-fail vs fsync data
> > integrity testing is something we do actually run as part of QE and
> > have for many years.  We've been running such testing for years and
> > never tripped over this problem, so I think the likelihood that a
> > user will hit it is extremely small.
> 
> This sentence make me unease.
> Who is We and what QE testing are you referring to?
> Are those tests in xfstests or any other public repository?
> My first reaction to the corruption was "no way, I need to check the test"
> Second reaction after checking the test was "this must very very hard to hit"

/me prefers to think that we've simply gotten lucky all these years and
nobody actually managed to die before another flush would take care of
the dirty data.

But then I did just spend a week in Las Vegas. :P

> But from closer inspection, it looks like it doesn't take more than running
> a couple of fsync in parallel to get to the "at risk" state, which may persist
> for seconds.
> Of course the chances of users being that unlucky to also get a power
> failure during "at risk" state are low, but I am puzzled how power fail tests
> you claim that exists, didn't catch this sooner.
> 
> Anyway, not sure there is much more to discuss, just wanted to see
> if there is a post mortem lesson to be learned from this, beyond the fact that
> dm-log-writes is a valuable testing tool.

Agreed. :)

--D

> 
> Cheers,
> Amir.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-09-18 19:29         ` Amir Goldstein
@ 2017-09-19  6:32           ` Greg KH
  2018-06-09  4:44             ` Amir Goldstein
  0 siblings, 1 reply; 30+ messages in thread
From: Greg KH @ 2017-09-19  6:32 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Darrick J. Wong, Christoph Hellwig, Dave Chinner, linux-xfs,
	linux-fsdevel, Josef Bacik, stable [v4.9]

On Mon, Sep 18, 2017 at 10:29:25PM +0300, Amir Goldstein wrote:
> On Mon, Sep 18, 2017 at 9:35 PM, Greg KH <greg@kroah.com> wrote:
> > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
> >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong
> >> <darrick.wong@oracle.com> wrote:
> >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
> >> >> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> >> >> > When calling into _xfs_log_force{,_lsn}() with a pointer
> >> >> > to log_flushed variable, log_flushed will be set to 1 if:
> >> >> > 1. xlog_sync() is called to flush the active log buffer
> >> >> > AND/OR
> >> >> > 2. xlog_wait() is called to wait on a syncing log buffers
> >> >> >
> >> >> > xfs_file_fsync() checks the value of log_flushed after
> >> >> > _xfs_log_force_lsn() call to optimize away an explicit
> >> >> > PREFLUSH request to the data block device after writing
> >> >> > out all the file's pages to disk.
> >> >> >
> >> >> > This optimization is incorrect in the following sequence of events:
> >> >> >
> >> >> >  Task A                    Task B
> >> >> >  -------------------------------------------------------
> >> >> >  xfs_file_fsync()
> >> >> >    _xfs_log_force_lsn()
> >> >> >      xlog_sync()
> >> >> >         [submit PREFLUSH]
> >> >> >                            xfs_file_fsync()
> >> >> >                              file_write_and_wait_range()
> >> >> >                                [submit WRITE X]
> >> >> >                                [endio  WRITE X]
> >> >> >                              _xfs_log_force_lsn()
> >> >> >                                xlog_wait()
> >> >> >         [endio  PREFLUSH]
> >> >> >
> >> >> > The write X is not guarantied to be on persistent storage
> >> >> > when PREFLUSH request in completed, because write A was submitted
> >> >> > after the PREFLUSH request, but xfs_file_fsync() of task A will
> >> >> > be notified of log_flushed=1 and will skip explicit flush.
> >> >> >
> >> >> > If the system crashes after fsync of task A, write X may not be
> >> >> > present on disk after reboot.
> >> >> >
> >> >> > This bug was discovered and demonstrated using Josef Bacik's
> >> >> > dm-log-writes target, which can be used to record block io operations
> >> >> > and then replay a subset of these operations onto the target device.
> >> >> > The test goes something like this:
> >> >> > - Use fsx to execute ops of a file and record ops on log device
> >> >> > - Every now and then fsync the file, store md5 of file and mark
> >> >> >   the location in the log
> >> >> > - Then replay log onto device for each mark, mount fs and compare
> >> >> >   md5 of file to stored value
> >> >> >
> >> >> > Cc: Christoph Hellwig <hch@lst.de>
> >> >> > Cc: Josef Bacik <jbacik@fb.com>
> >> >> > Cc: <stable@vger.kernel.org>
> >> >> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >> >> > ---
> >> >> >
> >> >> > Christoph, Dave,
> >> >> >
> >> >> > It's hard to believe, but I think the reported bug has been around
> >> >> > since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did
> >> >> > not try to test old kernels.
> >> >>
> >> >> Forgot to tag commit message with:
> >> >> Fixes: f538d4da8d52 ("[XFS] write barrier support")
> >> >>
> >> >> Maybe the tag could be added when applying to recent stables,
> >> >> so distros and older downstream stables can see the tag.
> >> >>
> >> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
> >> >> if possible data loss bug should also be disclosed in some distros forum?
> >> >> I bet some users would care more about the latter than the former.
> >> >> Coincidentally, both data loss and security bugs fix the same commit..
> >> >
> >> > Yes the the patch ought to get sent on to stable w/ fixes tag.  One
> >> > would hope that the distros will pick up the stable fixes from there.
> >>
> >>
> >> Greg, for your consideration, please add
> >>  Fixes: f538d4da8d52 ("[XFS] write barrier support")
> >> If not pushed yet.
> >
> > Add it to what?
> 
> Sorry, add that tag when applying commit 47c7d0b1950258312
> to stable trees, since I missed adding the tag before it was merged
> to master.

Nah, as the tag is just needed to let me know where to backport stuff
to, I don't think it matters when I add it to the stable tree itself, so
I'll leave it as-is.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-09-19  5:31         ` Amir Goldstein
  2017-09-19  5:45           ` Darrick J. Wong
@ 2017-09-20  0:40           ` Dave Chinner
  2017-09-20  1:08             ` Vijay Chidambaram
  2017-09-20  8:59             ` Eryu Guan
  1 sibling, 2 replies; 30+ messages in thread
From: Dave Chinner @ 2017-09-20  0:40 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Darrick J. Wong, Greg KH, Christoph Hellwig, linux-xfs,
	linux-fsdevel, Josef Bacik, stable [v4.9]

On Tue, Sep 19, 2017 at 08:31:37AM +0300, Amir Goldstein wrote:
> On Tue, Sep 19, 2017 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
> >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
> >> > That said, it's been in the kernel for 12 years without widespread
> >> > complaints about corruption, so I'm not sure this warrants public
> >> > disclosure via CVE/Phoronix vs. just fixing it.
> >> >
> >>
> >> I'm not sure either.
> >> My intuition tells me that the chances of hitting the data loss bug
> >> given a power failure are not slim, but the chances of users knowing
> >> about the data loss are slim.
> >
> > The chances of hitting it are slim. Power-fail vs fsync data
> > integrity testing is something we do actually run as part of QE and
> > have for many years.  We've been running such testing for years and
> > never tripped over this problem, so I think the likelihood that a
> > user will hit it is extremely small.
> 
> This sentence make me unease.
> Who is We and what QE testing are you referring to?

I've done it in the past myself with a modified crash/xfscrash to
write patterned files (via genstream/checkstream). Unfortunately, I
lost that script when the machine used for that testing suffered a
fatal, completely unrecoverable ext3 root filesystem corruption
during a power fail cycle... :/

RH QE also runs automated power fail cycle tests - we found lots of
ext4 problems with that test rig when it was first put together, but
I don't recall seeing XFS issues reported.  Eryu would have to
confirm, but ISTR that this testing was made part of the regular
RHEL major release testing cycle...

Let's not forget all the other storage vendors and apps out there
that do their own crash/power fail testing that rely on a working
fsync. Apps like ceph, cluster, various databases, etc all have
their own data integrity testing procedures, and so if there's any
obvious or easy to hit fsync bug we would have had people reporting
it long ago.

Then there are all the research tools that have had papers written
about them, testing exactly the sort of thing that dm-log-writes is
testing. None of these indicated any sort of problem with fsync in
XFS, but we couldn't reproduce or verify the research results
because none of those fine institutions ever open sourced their
tools, despite repeated requests and promises that it would happen.

> Are those tests in xfstests or any other public repository?

crash/xfscrash is, and now dm-log-writes, but nothing else is.

> My first reaction to the corruption was "no way, I need to check the test"
> Second reaction after checking the test was "this must very very hard to hit"
> But from closer inspection, it looks like it doesn't take more than running
> a couple of fsync in parallel to get to the "at risk" state, which may persist
> for seconds.

That may be the case, but the reality is we don't have a body of
evidence to suggest this is a problem anyone is actually hitting. In
fact, we don't have any evidence it's been seen in the wild at all.

> Of course the chances of users being that unlucky to also get a power
> failure during "at risk" state are low, but I am puzzled how power fail tests
> you claim that exists, didn't catch this sooner.

Probably for the same reason app developers and users aren't
reporting fsync data loss problems.  While the bug may "look obvious
in hindsight", the fact is that there is no evidence of data loss
after fsync on XFS in the real world.  Occam's Razor suggests that
there is something masking the problem that we don't understand
yet....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-09-20  0:40           ` Dave Chinner
@ 2017-09-20  1:08             ` Vijay Chidambaram
  2017-09-20  8:59             ` Eryu Guan
  1 sibling, 0 replies; 30+ messages in thread
From: Vijay Chidambaram @ 2017-09-20  1:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Amir Goldstein, Darrick J. Wong, Greg KH, Christoph Hellwig,
	linux-xfs, linux-fsdevel, Josef Bacik, stable [v4.9]

>> Are those tests in xfstests or any other public repository?
>
> crash/xfscrash is, and now dm-log-write, but nothing else is.
>

There's also CrashMonkey (tool under active development from my
research group at UT Austin) at:
https://github.com/utsaslab/crashmonkey.

>> My first reaction to the corruption was "no way, I need to check the test"
>> Second reaction after checking the test was "this must very very hard to hit"
>> But from closer inspection, it looks like it doesn't take more than running
>> a couple of fsync in parallel to get to the "at risk" state, which may persist
>> for seconds.
>
> That may be the case, but the reality is we don't have a body of
> evidence to suggest this is a problem anyone is actually hitting. In
> fact, we don't have any evidence it's been seen in the wild at all.

Even if people did hit this bug in the wild, as Amir suggested before,
the chances of them connecting it to a crash-consistency bug in the
file system and reporting it as such are extremely low. We do have many
reports of data loss in newer file systems like btrfs, but I don't
think those are connected back to crash-consistency bugs (my suspicion
is there would be many such bugs, simply due to inadequate crash
consistency testing).

When working on CrashMonkey, we find it hard to understand what a
crash consistency bug is, even when we are the ones producing the bug!
I think there is a need for good debugging/investigative tools in this
area (something we are also working on).

Thanks,
Vijay
http://www.cs.utexas.edu/~vijay/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-09-20  0:40           ` Dave Chinner
  2017-09-20  1:08             ` Vijay Chidambaram
@ 2017-09-20  8:59             ` Eryu Guan
  1 sibling, 0 replies; 30+ messages in thread
From: Eryu Guan @ 2017-09-20  8:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Amir Goldstein, Darrick J. Wong, Greg KH, Christoph Hellwig,
	linux-xfs, linux-fsdevel, Josef Bacik, stable [v4.9]

On Wed, Sep 20, 2017 at 10:40:12AM +1000, Dave Chinner wrote:
> On Tue, Sep 19, 2017 at 08:31:37AM +0300, Amir Goldstein wrote:
> > On Tue, Sep 19, 2017 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
> > >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
> > >> > That said, it's been in the kernel for 12 years without widespread
> > >> > complaints about corruption, so I'm not sure this warrants public
> > >> > disclosure via CVE/Phoronix vs. just fixing it.
> > >> >
> > >>
> > >> I'm not sure either.
> > >> My intuition tells me that the chances of hitting the data loss bug
> > >> given a power failure are not slim, but the chances of users knowing
> > >> about the data loss are slim.
> > >
> > > The chances of hitting it are slim. Power-fail vs fsync data
> > > integrity testing is something we do actually run as part of QE and
> > > have for many years.  We've been running such testing for years and
> > > never tripped over this problem, so I think the likelihood that a
> > > user will hit it is extremely small.
> > 
> > This sentence make me unease.
> > Who is We and what QE testing are you referring to?
> 
> I've done it in the past myself with a modified crash/xfscrash to
> write patterned files (via genstream/checkstream). Unfortunately, I
> lost that script when the machine used for that testing suffered a
> fatal, completely unrecoverable ext3 root filesystem corruption
> during a power fail cycle... :/
> 
> RH QE also runs automated power fail cycle tests - we found lots of

Not fully automated yet, but semi-automated :)

> ext4 problems with that test rig when it was first put together, but
> I don't recall seeing XFS issues reported.  Eryu would have to
> confirm, but ISTR that this testing was made part of the regular
> RHEL major release testing cycle...

Generally speaking, we don't find many bugs in our power failure tests.
We filed a few ext3/4 bugs against the RHEL6 kernel, but IIRC there was
only one XFS bug filed, and we couldn't reproduce that bug in later RHEL
minor releases. It seems to me that we haven't filed any such bugs
against the RHEL7 kernel. And yes, this is part of regular RHEL release
testing only; I haven't done any power failure tests with upstream
kernels yet.

Thanks,
Eryu

> 
> Let's not forget all the other storage vendors and apps out there
> that do their own crash/power fail testing that rely on a working
> fsync. Apps like ceph, cluster, various databases, etc all have
> their own data integrity testing procedures, and so if there's any
> obvious or easy to hit fsync bug we would have had people reporting
> it long ago.
> 
> Then there's all the research tools that have had papers written
> about them testing exactly the sort of thing that dm-log writes is
> testing. None of these indicated any sort of problem with fsync in
> XFS, but we couldn't reproduce or verify the research results of the
> because none of those fine institutions ever open sourced their
> tools despite repeated requests and promises that it would happen.
> 
> > Are those tests in xfstests or any other public repository?
> 
> crash/xfscrash is, and now dm-log-write, but nothing else is.
> 
> > My first reaction to the corruption was "no way, I need to check the test"
> > Second reaction after checking the test was "this must very very hard to hit"
> > But from closer inspection, it looks like it doesn't take more than running
> > a couple of fsync in parallel to get to the "at risk" state, which may persist
> > for seconds.
> 
> That may be the case, but the reality is we don't have a body of
> evidence to suggest this is a problem anyone is actually hitting. In
> fact, we don't have any evidence it's been seen in the wild at all.
> 
> > Of course the chances of users being that unlucky to also get a power
> > failure during "at risk" state are low, but I am puzzled how power fail tests
> > you claim that exists, didn't catch this sooner.
> 
> Probably for the same reason app developers and users aren't
> reporting fsync data loss problems.  While the bug may "look obvious
> in hindsight", the fact is that there are no evidence of data loss
> after fsync on XFS in the real world.  Occam's Razor suggests that
> there is something that masks the problem that we don't understand
> yet....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2017-09-19  6:32           ` Greg KH
@ 2018-06-09  4:44             ` Amir Goldstein
  2018-06-09  7:13               ` Greg KH
  0 siblings, 1 reply; 30+ messages in thread
From: Amir Goldstein @ 2018-06-09  4:44 UTC (permalink / raw)
  To: Greg KH
  Cc: Darrick J. Wong, Christoph Hellwig, Dave Chinner, linux-xfs,
	linux-fsdevel, Josef Bacik, stable [v4.9]

On Tue, Sep 19, 2017 at 9:32 AM, Greg KH <greg@kroah.com> wrote:
> On Mon, Sep 18, 2017 at 10:29:25PM +0300, Amir Goldstein wrote:
>> On Mon, Sep 18, 2017 at 9:35 PM, Greg KH <greg@kroah.com> wrote:
>> > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
>> >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong
>> >> <darrick.wong@oracle.com> wrote:
>> >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
>> >> >> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> >> >> > When calling into _xfs_log_force{,_lsn}() with a pointer
>> >> >> > to log_flushed variable, log_flushed will be set to 1 if:
>> >> >> > 1. xlog_sync() is called to flush the active log buffer
>> >> >> > AND/OR
>> >> >> > 2. xlog_wait() is called to wait on a syncing log buffer
>> >> >> >
>> >> >> > xfs_file_fsync() checks the value of log_flushed after
>> >> >> > _xfs_log_force_lsn() call to optimize away an explicit
>> >> >> > PREFLUSH request to the data block device after writing
>> >> >> > out all the file's pages to disk.
>> >> >> >
>> >> >> > This optimization is incorrect in the following sequence of events:
>> >> >> >
>> >> >> >  Task A                    Task B
>> >> >> >  -------------------------------------------------------
>> >> >> >  xfs_file_fsync()
>> >> >> >    _xfs_log_force_lsn()
>> >> >> >      xlog_sync()
>> >> >> >         [submit PREFLUSH]
>> >> >> >                            xfs_file_fsync()
>> >> >> >                              file_write_and_wait_range()
>> >> >> >                                [submit WRITE X]
>> >> >> >                                [endio  WRITE X]
>> >> >> >                              _xfs_log_force_lsn()
>> >> >> >                                xlog_wait()
>> >> >> >         [endio  PREFLUSH]
>> >> >> >
>> >> >> > The write X is not guaranteed to be on persistent storage
>> >> >> > when the PREFLUSH request is completed, because write X was submitted
>> >> >> > after the PREFLUSH request, but xfs_file_fsync() of task A will
>> >> >> > be notified of log_flushed=1 and will skip the explicit flush.
>> >> >> >
>> >> >> > If the system crashes after fsync of task A, write X may not be
>> >> >> > present on disk after reboot.
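[Editorial aside: the ordering hazard described above can be modeled with a tiny toy simulation. This is plain Python, not XFS or block-layer code; the `Disk` class and its method names are invented for illustration. The one property it models is real: a PREFLUSH only guarantees persistence of writes the device had already received when the flush was submitted.]

```python
class Disk:
    """Toy device with a volatile write cache (illustrative, not real code)."""
    def __init__(self):
        self.cache = []          # writes received, possibly still volatile
        self.persistent = set()  # writes guaranteed to survive power loss

    def write(self, blk):
        self.cache.append(blk)

    def submit_preflush(self):
        # A PREFLUSH only covers writes already received at submission time.
        return list(self.cache)

    def preflush_endio(self, covered):
        self.persistent.update(covered)

disk = Disk()
disk.write("log-A")                # Task A's data, written before the flush
inflight = disk.submit_preflush()  # Task A: xlog_sync() submits PREFLUSH
disk.write("X")                    # Task B: WRITE X submitted and completed
disk.preflush_endio(inflight)      # PREFLUSH completes (endio)

# The second fsync caller saw log_flushed=1 and skipped its own flush,
# but only data written before the PREFLUSH submission is durable:
assert "log-A" in disk.persistent
assert "X" not in disk.persistent
```

Skipping the explicit flush is only safe for data submitted before the in-flight PREFLUSH; write X slipped in after it.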
>> >> >> >
>> >> >> > This bug was discovered and demonstrated using Josef Bacik's
>> >> >> > dm-log-writes target, which can be used to record block io operations
>> >> >> > and then replay a subset of these operations onto the target device.
>> >> >> > The test goes something like this:
>> >> >> > - Use fsx to execute ops on a file and record ops on log device
>> >> >> > - Every now and then fsync the file, store md5 of file and mark
>> >> >> >   the location in the log
>> >> >> > - Then replay log onto device for each mark, mount fs and compare
>> >> >> >   md5 of file to stored value
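[Editorial aside: the recipe above corresponds roughly to the following shell sketch. The device paths, sizes, mark names, and fsx/mkfs arguments are illustrative, and the script is a dry run that only prints the commands it would issue; `replay-log` is the tool from Josef Bacik's log-writes repository, and `dmsetup message <dev> 0 mark <name>` is the dm-log-writes mark interface.]

```shell
#!/bin/sh
# Dry-run sketch of a dm-log-writes record/replay test loop.
# run() prints each command instead of executing it, and remembers it.
CMDS=""
run() {
    echo "+ $*"
    CMDS="$CMDS $1"
}

DEV=/dev/vdb        # device under test (hypothetical path)
LOGDEV=/dev/vdc     # device that records the write log (hypothetical path)
SECTORS=20971520    # size of $DEV in 512-byte sectors (assumed)

# 1. Stack the log-writes target over the device, make an fs, run fsx on it
run dmsetup create logdev --table "0 $SECTORS log-writes $DEV $LOGDEV"
run mkfs.xfs /dev/mapper/logdev
run mount /dev/mapper/logdev /mnt
run fsx -N 1000 /mnt/testfile    # in practice, run in the background

# 2. Every now and then: fsync the file, store its md5, mark the log
for i in 1 2 3; do
    run xfs_io -c fsync /mnt/testfile
    run md5sum /mnt/testfile
    run dmsetup message logdev 0 mark "fsync$i"
done

# 3. Replay the log up to each mark, mount, compare md5 to the stored value
for i in 1 2 3; do
    run replay-log --log "$LOGDEV" --replay "$DEV" --end-mark "fsync$i"
    run mount "$DEV" /mnt
    run md5sum /mnt/testfile
    run umount /mnt
done
```

If the md5 after replaying to a mark differs from the md5 recorded at that mark's fsync, data that fsync claimed durable was lost, which is exactly how this bug was caught.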
>> >> >> >
>> >> >> > Cc: Christoph Hellwig <hch@lst.de>
>> >> >> > Cc: Josef Bacik <jbacik@fb.com>
>> >> >> > Cc: <stable@vger.kernel.org>
>> >> >> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>> >> >> > ---
>> >> >> >
>> >> >> > Christoph, Dave,
>> >> >> >
>> >> >> > It's hard to believe, but I think the reported bug has been around
>> >> >> > since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did
>> >> >> > not try to test old kernels.
>> >> >>
>> >> >> Forgot to tag commit message with:
>> >> >> Fixes: f538d4da8d52 ("[XFS] write barrier support")
>> >> >>
>> >> >> Maybe the tag could be added when applying to recent stables,
>> >> >> so distros and older downstream stables can see the tag.
>> >> >>
>> >> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
>> >> >> if a possible data loss bug should also be disclosed in some distros forum.
>> >> >> I bet some users would care more about the latter than the former.
>> >> >> Coincidentally, the data loss fix and the security fix address the same commit..
>> >> >
>> >> > Yes, the patch ought to get sent on to stable w/ a fixes tag.  One
>> >> > would hope that the distros will pick up the stable fixes from there.
>> >>
>> >>
>> >> Greg, for your consideration, please add
>> >>  Fixes: f538d4da8d52 ("[XFS] write barrier support")
>> >> If not pushed yet.
>> >
>> > Add it to what?
>>
>> Sorry, add that tag when applying commit 47c7d0b1950258312
>> to stable trees, since I missed adding the tag before it was merged
>> to master.
>
> Nah, as the tag is just needed to let me know where to backport stuff
> to, I don't think it matters when I add it to the stable tree itself, so
> I'll leave it as-is.
>

Greg,

Related or not to the above Fixes discussion, I now noticed that you never picked
up the patch for kernel 4.4.

Ben did take it to 3.2 and 3.16 BTW.

This is a very critical bug fix IMO.
Were you waiting for an ACK from the xfs maintainers, or was it just an oversight?
Or was it me who was supposed to check up on that?

Thanks,
Amir.


* Re: [PATCH] xfs: fix incorrect log_flushed on fsync
  2018-06-09  4:44             ` Amir Goldstein
@ 2018-06-09  7:13               ` Greg KH
  0 siblings, 0 replies; 30+ messages in thread
From: Greg KH @ 2018-06-09  7:13 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Darrick J. Wong, Christoph Hellwig, Dave Chinner, linux-xfs,
	linux-fsdevel, Josef Bacik, stable [v4.9]

On Sat, Jun 09, 2018 at 07:44:22AM +0300, Amir Goldstein wrote:
> On Tue, Sep 19, 2017 at 9:32 AM, Greg KH <greg@kroah.com> wrote:
> > On Mon, Sep 18, 2017 at 10:29:25PM +0300, Amir Goldstein wrote:
> >> On Mon, Sep 18, 2017 at 9:35 PM, Greg KH <greg@kroah.com> wrote:
> >> > On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
> >> >> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong
> >> >> <darrick.wong@oracle.com> wrote:
> >> >> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
> >> >> >> On Wed, Aug 30, 2017 at 4:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> >> >> >> > When calling into _xfs_log_force{,_lsn}() with a pointer
> >> >> >> > to log_flushed variable, log_flushed will be set to 1 if:
> >> >> >> > 1. xlog_sync() is called to flush the active log buffer
> >> >> >> > AND/OR
> >> >> >> > 2. xlog_wait() is called to wait on a syncing log buffer
> >> >> >> >
> >> >> >> > xfs_file_fsync() checks the value of log_flushed after
> >> >> >> > _xfs_log_force_lsn() call to optimize away an explicit
> >> >> >> > PREFLUSH request to the data block device after writing
> >> >> >> > out all the file's pages to disk.
> >> >> >> >
> >> >> >> > This optimization is incorrect in the following sequence of events:
> >> >> >> >
> >> >> >> >  Task A                    Task B
> >> >> >> >  -------------------------------------------------------
> >> >> >> >  xfs_file_fsync()
> >> >> >> >    _xfs_log_force_lsn()
> >> >> >> >      xlog_sync()
> >> >> >> >         [submit PREFLUSH]
> >> >> >> >                            xfs_file_fsync()
> >> >> >> >                              file_write_and_wait_range()
> >> >> >> >                                [submit WRITE X]
> >> >> >> >                                [endio  WRITE X]
> >> >> >> >                              _xfs_log_force_lsn()
> >> >> >> >                                xlog_wait()
> >> >> >> >         [endio  PREFLUSH]
> >> >> >> >
> >> >> >> > The write X is not guaranteed to be on persistent storage
> >> >> >> > when the PREFLUSH request is completed, because write X was submitted
> >> >> >> > after the PREFLUSH request, but xfs_file_fsync() of task A will
> >> >> >> > be notified of log_flushed=1 and will skip the explicit flush.
> >> >> >> >
> >> >> >> > If the system crashes after fsync of task A, write X may not be
> >> >> >> > present on disk after reboot.
> >> >> >> >
> >> >> >> > This bug was discovered and demonstrated using Josef Bacik's
> >> >> >> > dm-log-writes target, which can be used to record block io operations
> >> >> >> > and then replay a subset of these operations onto the target device.
> >> >> >> > The test goes something like this:
> >> >> >> > - Use fsx to execute ops on a file and record ops on log device
> >> >> >> > - Every now and then fsync the file, store md5 of file and mark
> >> >> >> >   the location in the log
> >> >> >> > - Then replay log onto device for each mark, mount fs and compare
> >> >> >> >   md5 of file to stored value
> >> >> >> >
> >> >> >> > Cc: Christoph Hellwig <hch@lst.de>
> >> >> >> > Cc: Josef Bacik <jbacik@fb.com>
> >> >> >> > Cc: <stable@vger.kernel.org>
> >> >> >> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >> >> >> > ---
> >> >> >> >
> >> >> >> > Christoph, Dave,
> >> >> >> >
> >> >> >> > It's hard to believe, but I think the reported bug has been around
> >> >> >> > since 2005 f538d4da8d52 ("[XFS] write barrier support"), but I did
> >> >> >> > not try to test old kernels.
> >> >> >>
> >> >> >> Forgot to tag commit message with:
> >> >> >> Fixes: f538d4da8d52 ("[XFS] write barrier support")
> >> >> >>
> >> >> >> Maybe the tag could be added when applying to recent stables,
> >> >> >> so distros and older downstream stables can see the tag.
> >> >> >>
> >> >> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
> >> >> >> if a possible data loss bug should also be disclosed in some distros forum.
> >> >> >> I bet some users would care more about the latter than the former.
> >> >> >> Coincidentally, the data loss fix and the security fix address the same commit..
> >> >> >
> >> >> > Yes, the patch ought to get sent on to stable w/ a fixes tag.  One
> >> >> > would hope that the distros will pick up the stable fixes from there.
> >> >>
> >> >>
> >> >> Greg, for your consideration, please add
> >> >>  Fixes: f538d4da8d52 ("[XFS] write barrier support")
> >> >> If not pushed yet.
> >> >
> >> > Add it to what?
> >>
> >> Sorry, add that tag when applying commit 47c7d0b1950258312
> >> to stable trees, since I missed adding the tag before it was merged
> >> to master.
> >
> > Nah, as the tag is just needed to let me know where to backport stuff
> > to, I don't think it matters when I add it to the stable tree itself, so
> > I'll leave it as-is.
> >
> 
> Greg,
> 
> Related or not to the above Fixes discussion, I now noticed that you never picked
> up the patch for kernel 4.4.
> 
> Ben did take it to 3.2 and 3.16 BTW.
> 
> This is a very critical bug fix IMO.
> Were you waiting for an ACK from the xfs maintainers, or was it just an oversight?
> Or was it me who was supposed to check up on that?

It looks to be an oversight, sorry, now queued up.

greg k-h


end of thread, other threads:[~2018-06-09  7:13 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-30 13:38 [PATCH] xfs: fix incorrect log_flushed on fsync Amir Goldstein
2017-08-30 13:46 ` Christoph Hellwig
2017-08-30 14:12   ` Amir Goldstein
2017-08-30 14:21     ` Christoph Hellwig
2017-08-30 17:01 ` Darrick J. Wong
2017-08-31 13:47 ` Christoph Hellwig
2017-08-31 14:37   ` Amir Goldstein
2017-08-31 16:39     ` Brian Foster
2017-08-31 19:20       ` Amir Goldstein
2017-08-31 20:10         ` Brian Foster
2017-09-01  7:58           ` Amir Goldstein
2017-09-01 10:46             ` Brian Foster
2017-09-01  9:52         ` Christoph Hellwig
2017-09-01 10:37           ` Amir Goldstein
2017-09-01 10:43             ` Christoph Hellwig
2017-09-01  9:47     ` Christoph Hellwig
2017-09-15 12:40 ` Amir Goldstein
2017-09-18 17:11   ` Darrick J. Wong
2017-09-18 18:00     ` Amir Goldstein
2017-09-18 18:35       ` Greg KH
2017-09-18 19:29         ` Amir Goldstein
2017-09-19  6:32           ` Greg KH
2018-06-09  4:44             ` Amir Goldstein
2018-06-09  7:13               ` Greg KH
2017-09-18 21:24       ` Dave Chinner
2017-09-19  5:31         ` Amir Goldstein
2017-09-19  5:45           ` Darrick J. Wong
2017-09-20  0:40           ` Dave Chinner
2017-09-20  1:08             ` Vijay Chidambaram
2017-09-20  8:59             ` Eryu Guan
