[PATCH] xfs: truncate_setsize should be outside transactions

* [PATCH] xfs: truncate_setsize should be outside transactions
@ 2014-05-01 22:39 Dave Chinner
  2014-05-02  4:54 ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2014-05-01 22:39 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

truncate_setsize() removes pages from the page cache, and hence
requires page locks to be held. It is not valid to lock a page cache
page inside a transaction context as we can hold page locks when we
reserve space for a transaction. If we do, then we expose an ABBA
deadlock between log space reservation and page locks.

That is, both the write path and writeback lock a page, then start a
transaction for block allocation, which means they can block waiting
for a log reservation with the page lock held. If we hold a log
reservation and then do something that locks a page (e.g.
truncate_setsize in xfs_setattr_size) then that page lock can block
on the page locked and waiting for a log reservation. If the
transaction that is waiting for the page lock is the only active
transaction in the system that can free log space via a commit,
then writeback will never make progress and so log space will never
free up.

This issue with xfs_setattr_size() was introduced back in 2010 by
commit fa9b227 ("xfs: new truncate sequence") which moved the page
cache truncate from outside the transaction context (what was
xfs_itruncate_data()) to inside the transaction context as a call to
truncate_setsize().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iops.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ef1ca01..84db577 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -808,22 +808,25 @@ xfs_setattr_size(
 	 */
 	inode_dio_wait(inode);

+	/*
+	 * Do all the page cache truncate work outside the transaction
+	 * context as the "lock" order is page lock->log space reservation.
+	 * i.e. locking pages inside the transaction can ABBA deadlock with
+	 * writeback.
+	 */
 	error = -block_truncate_page(inode->i_mapping, newsize, xfs_get_blocks);
 	if (error)
 		return error;
+	truncate_setsize(inode, newsize);

 	tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_SIZE);
 	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_itruncate, 0, 0);
 	if (error)
 		goto out_trans_cancel;

-	truncate_setsize(inode, newsize);
-
 	commit_flags = XFS_TRANS_RELEASE_LOG_RES;
 	lock_flags |= XFS_ILOCK_EXCL;
-
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-
 	xfs_trans_ijoin(tp, ip, 0);

 	/*
-- 
1.9.0

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* Re: [PATCH] xfs: truncate_setsize should be outside transactions
  2014-05-01 22:39 [PATCH] xfs: truncate_setsize should be outside transactions Dave Chinner
@ 2014-05-02  4:54 ` Christoph Hellwig
  2014-05-02  5:00   ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2014-05-02  4:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, May 02, 2014 at 08:39:39AM +1000, Dave Chinner wrote:
> This issue with xfs_setattr_size() was introduced back in 2010 by
> commit fa9b227 ("xfs: new truncate sequence") which moved the page
> cache truncate from outside the transaction context (what was
> xfs_itruncate_data()) to inside the transaction context as a call to
> truncate_setsize().

And it was moved because we should only call truncate_setsize once
the truncate can't fail any more.  So to move it out of transaction
context it needs to move after the commit of the transaction(s).

* Re: [PATCH] xfs: truncate_setsize should be outside transactions
  2014-05-02  4:54 ` Christoph Hellwig
@ 2014-05-02  5:00   ` Christoph Hellwig
  2014-05-02  6:47     ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2014-05-02  5:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, May 01, 2014 at 09:54:43PM -0700, Christoph Hellwig wrote:
> On Fri, May 02, 2014 at 08:39:39AM +1000, Dave Chinner wrote:
> > This issue with xfs_setattr_size() was introduced back in 2010 by
> > commit fa9b227 ("xfs: new truncate sequence") which moved the page
> > cache truncate from outside the transaction context (what was
> > xfs_itruncate_data()) to inside the transaction context as a call to
> > truncate_setsize().
> 
> And it was moved because we should only call truncate_setsize once
> the truncate can't fail any more.  So to move it out of transaction
> context it needs to move after the commit of the transaction(s).

Actually that's only true for the i_size update.  So I guess
we need to call truncate_pagecache where you put the truncate_setsize
now, and then update i_size later, together with the updates of the
XFS di_size.

* Re: [PATCH] xfs: truncate_setsize should be outside transactions
  2014-05-02  5:00   ` Christoph Hellwig
@ 2014-05-02  6:47     ` Dave Chinner
  2014-05-02  7:00       ` [PATCH V2] " Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2014-05-02  6:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Thu, May 01, 2014 at 10:00:53PM -0700, Christoph Hellwig wrote:
> On Thu, May 01, 2014 at 09:54:43PM -0700, Christoph Hellwig wrote:
> > On Fri, May 02, 2014 at 08:39:39AM +1000, Dave Chinner wrote:
> > > This issue with xfs_setattr_size() was introduced back in 2010 by
> > > commit fa9b227 ("xfs: new truncate sequence") which moved the page
> > > cache truncate from outside the transaction context (what was
> > > xfs_itruncate_data()) to inside the transaction context as a call to
> > > truncate_setsize().
> > 
> > And it was moved because we should only call truncate_setsize once
> > the truncate can't fail any more.  So to move it out of transaction
> > context it needs to move after the commit of the transaction(s).
> 
> Actually that's only true for the i_size update.  So I guess
> we need to call truncate_pagecache where you put the truncate_setsize
> now, and then update i_size later, together with the updates of the
> XFS di_size.

OK, that seems reasonable. I'll add a comment to ensure that we
don't break it in future ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* [PATCH V2] xfs: truncate_setsize should be outside transactions
  2014-05-02  6:47     ` Dave Chinner
@ 2014-05-02  7:00       ` Dave Chinner
  2014-05-02 10:08         ` Christoph Hellwig
  2014-05-02 12:50         ` [PATCH V2] " Brian Foster
  0 siblings, 2 replies; 12+ messages in thread
From: Dave Chinner @ 2014-05-02  7:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

From: Dave Chinner <dchinner@redhat.com>

truncate_setsize() removes pages from the page cache, and hence
requires page locks to be held. It is not valid to lock a page cache
page inside a transaction context as we can hold page locks when we
reserve space for a transaction. If we do, then we expose an ABBA
deadlock between log space reservation and page locks.

That is, both the write path and writeback lock a page, then start a
transaction for block allocation, which means they can block waiting
for a log reservation with the page lock held. If we hold a log
reservation and then do something that locks a page (e.g.
truncate_setsize in xfs_setattr_size) then that page lock can block
on the page locked and waiting for a log reservation. If the
transaction that is waiting for the page lock is the only active
transaction in the system that can free log space via a commit,
then writeback will never make progress and so log space will never
free up.

This issue with xfs_setattr_size() was introduced back in 2010 by
commit fa9b227 ("xfs: new truncate sequence") which moved the page
cache truncate from outside the transaction context (what was
xfs_itruncate_data()) to inside the transaction context as a call to
truncate_setsize().

The reason truncate_setsize() was located in this place was that we
can't change the file size until after we are in the transaction
context and the operation will either succeed or shut down the
filesystem on failure. Hence we have to split truncate_setsize()
back into a pagecache operation that occurs before the transaction
context, and an i_size_write() call that happens within the
transaction context.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iops.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ef1ca01..ab2dc47 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -808,22 +808,27 @@ xfs_setattr_size(
 	 */
 	inode_dio_wait(inode);

+	/*
+	 * Do all the page cache truncate work outside the transaction
+	 * context as the "lock" order is page lock->log space reservation.
+	 * i.e. locking pages inside the transaction can ABBA deadlock with
+	 * writeback. We have to do the inode size update inside the
+	 * transaction, however, as xfs_trans_reserve() can fail with ENOMEM
+	 * and we can't make user visible changes on non-fatal errors.
+	 */
 	error = -block_truncate_page(inode->i_mapping, newsize, xfs_get_blocks);
 	if (error)
 		return error;
+	truncate_pagecache(inode, newsize);

 	tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_SIZE);
 	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_itruncate, 0, 0);
 	if (error)
 		goto out_trans_cancel;

-	truncate_setsize(inode, newsize);
-
 	commit_flags = XFS_TRANS_RELEASE_LOG_RES;
 	lock_flags |= XFS_ILOCK_EXCL;
-
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-
 	xfs_trans_ijoin(tp, ip, 0);

 	/*
@@ -856,6 +861,7 @@ xfs_setattr_size(
 	 * they get written to.
 	 */
 	ip->i_d.di_size = newsize;
+	i_size_write(inode, newsize);
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);

 	if (newsize <= oldsize) {

* Re: [PATCH V2] xfs: truncate_setsize should be outside transactions
  2014-05-02  7:00       ` [PATCH V2] " Dave Chinner
@ 2014-05-02 10:08         ` Christoph Hellwig
  2014-05-02 23:23           ` Dave Chinner
  2014-05-02 12:50         ` [PATCH V2] " Brian Foster
  1 sibling, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2014-05-02 10:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, May 02, 2014 at 05:00:54PM +1000, Dave Chinner wrote:
> The reason truncate_setsize() was located in this place was that we
> can't change the file size until after we are in the transaction
> context and the operation will either succeed or shut down the
> filesystem on failure. Hence we have to split truncate_setsize()
> back into a pagecache operation that occurs before the transaction
> context, and an i_size_write() call that happens within the
> transaction context.

Following up on my earlier correction: the comment next to
truncate_pagecache claims that the file size must have been updated
beforehand, but I can't see a reason for that.

This version looks fine to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH V2] xfs: truncate_setsize should be outside transactions
  2014-05-02  7:00       ` [PATCH V2] " Dave Chinner
  2014-05-02 10:08         ` Christoph Hellwig
@ 2014-05-02 12:50         ` Brian Foster
  1 sibling, 0 replies; 12+ messages in thread
From: Brian Foster @ 2014-05-02 12:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Fri, May 02, 2014 at 05:00:54PM +1000, Dave Chinner wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> truncate_setsize() removes pages from the page cache, and hence
> requires page locks to be held. It is not valid to lock a page cache
> page inside a transaction context as we can hold page locks when we
> reserve space for a transaction. If we do, then we expose an ABBA
> deadlock between log space reservation and page locks.
> 
> That is, both the write path and writeback lock a page, then start a
> transaction for block allocation, which means they can block waiting
> for a log reservation with the page lock held. If we hold a log
> reservation and then do something that locks a page (e.g.
> truncate_setsize in xfs_setattr_size) then that page lock can block
> on the page locked and waiting for a log reservation. If the
> transaction that is waiting for the page lock is the only active
> transaction in the system that can free log space via a commit,
> then writeback will never make progress and so log space will never
> free up.
> 
> This issue with xfs_setattr_size() was introduced back in 2010 by
> commit fa9b227 ("xfs: new truncate sequence") which moved the page
> cache truncate from outside the transaction context (what was
> xfs_itruncate_data()) to inside the transaction context as a call to
> truncate_setsize().
> 
> The reason truncate_setsize() was located in this place was that we
> can't change the file size until after we are in the transaction
> context and the operation will either succeed or shut down the
> filesystem on failure. Hence we have to split truncate_setsize()
> back into a pagecache operation that occurs before the transaction
> context, and an i_size_write() call that happens within the
> transaction context.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

The manifestation of this that we have seen has writeback blocked on log
reservation, and a thread sitting on this:

 #0 [ffff88022e153b18] __schedule at ffffffff815f137d
 #1 [ffff88022e153b80] io_schedule at ffffffff815f1bdd
 #2 [ffff88022e153b98] sleep_on_page at ffffffff811410be
 #3 [ffff88022e153ba8] __wait_on_bit at ffffffff815ef940
 #4 [ffff88022e153be8] wait_on_page_bit at ffffffff81140e46
 #5 [ffff88022e153c38] truncate_inode_pages_range at ffffffff81150d03
 #6 [ffff88022e153d88] truncate_pagecache at ffffffff81151027
 #7 [ffff88022e153db0] truncate_setsize at ffffffff81151059
 #8 [ffff88022e153dc0] xfs_setattr_size at ffffffffa01f3594 [xfs]
 #9 [ffff88022e153e10] xfs_vn_setattr at ffffffffa01f37e0 [xfs]
#10 [ffff88022e153e30] notify_change at ffffffff811cc349
#11 [ffff88022e153e78] do_truncate at ffffffff811adb43
#12 [ffff88022e153ef0] vfs_truncate at ffffffff811adcf1
#13 [ffff88022e153f28] do_sys_truncate at ffffffff811add9c
#14 [ffff88022e153f70] sys_truncate at ffffffff811adf3e
#15 [ffff88022e153f80] system_call_fastpath at ffffffff815fc819

That wait_on_page_bit() call maps to this bit of code:

0xffffffff81150cef <truncate_inode_pages_range+879>:    mov    %rdx,%rdi
0xffffffff81150cf2 <truncate_inode_pages_range+882>:    mov    $0xd,%esi
0xffffffff81150cf7 <truncate_inode_pages_range+887>:    mov    %rdx,-0x130(%rbp)
0xffffffff81150cfe <truncate_inode_pages_range+894>:    callq  0xffffffff81140dc0 <wait_on_page_bit>

So this thread has basically come in, reserved log space, attempted a
truncate and is sitting blocked on writeback. xfs_vm_writepage() has the
page, set the writeback bit and attempted a log reservation for the file
size update transaction. Therefore, no progress can be made.

Ordered properly, the truncate should either wait on writeback without
holding the log space hostage or grab the page lock before writeback is
set, allowing either path to proceed once the page is acquired.
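For readers following along, the ordering problem can be sketched as a toy userspace model in C. This is not kernel code and it is not how lockdep is implemented: the resource classes, the `order_checker` structure and the `writeback_path()`/`truncate_path()` helpers are all hypothetical names invented for illustration. Each path records which resource it acquires while holding another, and the checker reports an inversion when the opposite nesting has already been seen.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical resource classes named after the two "locks" in the thread. */
enum res_class { RES_PAGE_LOCK, RES_LOG_SPACE, RES_NCLASSES };

struct order_checker {
	/* seen[a][b] is true once some path acquired b while holding a. */
	bool seen[RES_NCLASSES][RES_NCLASSES];
};

/* Bitmask of the classes one code path currently holds. */
struct path {
	unsigned int held;
};

static void checker_init(struct order_checker *oc)
{
	memset(oc, 0, sizeof(*oc));
}

/* Acquire class c; return true if this nesting inverts a recorded order. */
static bool path_acquire(struct order_checker *oc, struct path *p,
			 enum res_class c)
{
	bool inversion = false;

	assert(!(p->held & (1u << c)));
	for (int h = 0; h < RES_NCLASSES; h++) {
		if (!(p->held & (1u << h)))
			continue;
		oc->seen[h][c] = true;
		if (oc->seen[c][h])
			inversion = true;
	}
	p->held |= 1u << c;
	return inversion;
}

static void path_release(struct path *p, enum res_class c)
{
	p->held &= ~(1u << c);
}

/* Model of writeback: page first, then a log reservation while holding it. */
static bool writeback_path(struct order_checker *oc)
{
	struct path p = { 0 };
	bool bad = false;

	bad |= path_acquire(oc, &p, RES_PAGE_LOCK);
	bad |= path_acquire(oc, &p, RES_LOG_SPACE);
	path_release(&p, RES_LOG_SPACE);
	path_release(&p, RES_PAGE_LOCK);
	return bad;
}

/* Model of truncate, with the page cache work inside or outside the transaction. */
static bool truncate_path(struct order_checker *oc, bool pages_inside_trans)
{
	struct path p = { 0 };
	bool bad = false;

	if (pages_inside_trans) {
		bad |= path_acquire(oc, &p, RES_LOG_SPACE);
		bad |= path_acquire(oc, &p, RES_PAGE_LOCK);
		path_release(&p, RES_PAGE_LOCK);
		path_release(&p, RES_LOG_SPACE);
	} else {
		bad |= path_acquire(oc, &p, RES_PAGE_LOCK);
		path_release(&p, RES_PAGE_LOCK);
		bad |= path_acquire(oc, &p, RES_LOG_SPACE);
		path_release(&p, RES_LOG_SPACE);
	}
	return bad;
}
```

Calling `writeback_path()` and then `truncate_path(&oc, true)` on the same checker reports an inversion, while `truncate_path(&oc, false)` does not, because the page is released before the reservation is taken and the two are never nested.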

Makes sense, thanks for tracking this down...

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_iops.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index ef1ca01..ab2dc47 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -808,22 +808,27 @@ xfs_setattr_size(
>  	 */
>  	inode_dio_wait(inode);
>  
> +	/*
> +	 * Do all the page cache truncate work outside the transaction
> +	 * context as the "lock" order is page lock->log space reservation.
> +	 * i.e. locking pages inside the transaction can ABBA deadlock with
> +	 * writeback. We have to do the inode size update inside the
> +	 * transaction, however, as xfs_trans_reserve() can fail with ENOMEM
> +	 * and we can't make user visible changes on non-fatal errors.
> +	 */
>  	error = -block_truncate_page(inode->i_mapping, newsize, xfs_get_blocks);
>  	if (error)
>  		return error;
> +	truncate_pagecache(inode, newsize);
>  
>  	tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_SIZE);
>  	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_itruncate, 0, 0);
>  	if (error)
>  		goto out_trans_cancel;
>  
> -	truncate_setsize(inode, newsize);
> -
>  	commit_flags = XFS_TRANS_RELEASE_LOG_RES;
>  	lock_flags |= XFS_ILOCK_EXCL;
> -
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -
>  	xfs_trans_ijoin(tp, ip, 0);
>  
>  	/*
> @@ -856,6 +861,7 @@ xfs_setattr_size(
>  	 * they get written to.
>  	 */
>  	ip->i_d.di_size = newsize;
> +	i_size_write(inode, newsize);
>  	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
>  
>  	if (newsize <= oldsize) {

* Re: [PATCH V2] xfs: truncate_setsize should be outside transactions
  2014-05-02 10:08         ` Christoph Hellwig
@ 2014-05-02 23:23           ` Dave Chinner
  2014-05-03 15:16             ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2014-05-02 23:23 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, May 02, 2014 at 03:08:02AM -0700, Christoph Hellwig wrote:
> On Fri, May 02, 2014 at 05:00:54PM +1000, Dave Chinner wrote:
> > The reason truncate_setsize() was located in this place was that we
> > can't change the file size until after we are in the transaction
> > context and the operation will either succeed or shut down the
> > filesystem on failure. Hence we have to split truncate_setsize()
> > back into a pagecache operation that occurs before the transaction
> > context, and an i_size_write() call that happens within the
> > transaction context.
> 
> Following up on my earlier correction: the comment next to
> truncate_pagecache claims that the file size must have been updated
> beforehand, but I can't see a reason for that.

Oh, I can, and that reminds me of why - racing with mmap page
faults, which aren't serialised against truncate except by an
indirect combination of the page locks and i_size updates. Hence if
we remove the pages before updating the inode size, then a page
fault can re-instantiate a page after the truncation beyond the new
EOF when, in fact, it should SEGV.

So, no, we can't split truncate_setsize() like this.
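The race described above can be illustrated with a deliberately simplified, single-threaded C model. It is only a sketch under stated assumptions: `toy_inode`, `toy_fault()` and the other helpers are hypothetical, sizes are counted in whole pages, page locks are left out entirely, and the "racing" fault is simply run between the two truncate steps. It shows why removing pages before the size update lets a fault re-instantiate a page beyond the new EOF.

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 8

/* Toy inode: a size in pages plus a per-page "is cached" flag. */
struct toy_inode {
	int size_pages;		/* stands in for i_size */
	bool cached[NPAGES];	/* stands in for the page cache */
};

/* Page fault: instantiate the page only if it lies inside EOF. */
static bool toy_fault(struct toy_inode *ino, int idx)
{
	if (idx < 0 || idx >= NPAGES || idx >= ino->size_pages)
		return false;	/* a real fault would raise a signal here */
	ino->cached[idx] = true;
	return true;
}

/* Drop every cached page at or beyond new_size. */
static void toy_drop_pages(struct toy_inode *ino, int new_size)
{
	assert(new_size >= 0 && new_size <= NPAGES);
	for (int i = new_size; i < NPAGES; i++)
		ino->cached[i] = false;
}

/* True if a cached page survives beyond the current EOF. */
static bool toy_stale_page_beyond_eof(const struct toy_inode *ino)
{
	for (int i = ino->size_pages; i < NPAGES; i++)
		if (ino->cached[i])
			return true;
	return false;
}

/*
 * Truncate a fully cached NPAGES-page file to new_size, with a fault on
 * fault_idx landing between the two steps. Returns true if a page is
 * left cached beyond the new EOF.
 */
static bool toy_truncate_with_racing_fault(bool size_first, int new_size,
					   int fault_idx)
{
	struct toy_inode ino = { .size_pages = NPAGES };

	for (int i = 0; i < NPAGES; i++)
		ino.cached[i] = true;

	if (size_first) {
		ino.size_pages = new_size;	/* size update first */
		toy_fault(&ino, fault_idx);	/* fault sees the new EOF */
		toy_drop_pages(&ino, new_size);
	} else {
		toy_drop_pages(&ino, new_size);	/* pages removed first */
		toy_fault(&ino, fault_idx);	/* fault sees the old EOF */
		ino.size_pages = new_size;
	}
	return toy_stale_page_beyond_eof(&ino);
}
```

With `size_first` false, a fault on page 5 during a truncate to 2 pages leaves a stale page behind; with `size_first` true, the same fault is rejected because it already sees the new size.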

As it is, we've already made a user visible data change in the truncate process
before we get to the transaction that can fail:
block_truncate_page() zeroes the tail of the page cache page. Hence
if the transaction reservation fails, we've already trashed the file
data - we may as well finish off the job and at least make it look
like the truncate succeeded from a user point of view. They then get
an ENOMEM error (only non-fatal error that can come from
xfs_trans_reserve) and try the truncate again....
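The "see ENOMEM and try again" behaviour expected of the caller could look roughly like the following userspace C sketch. Everything here is an assumption made for illustration: `retry_on_enomem()`, the `retry_op_fn` callback type and the `flaky_op` stand-in are hypothetical names, and nothing in this thread says applications actually retry this way. In real use the callback would wrap a call such as ftruncate().

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical callback type: returns 0 on success, or -1 with errno
 * set on failure, following the usual syscall wrapper convention.
 */
typedef int (*retry_op_fn)(void *arg);

/*
 * Run op up to max_tries times, retrying only while it fails with
 * ENOMEM. Any other error, or success, ends the loop immediately.
 */
static int retry_on_enomem(retry_op_fn op, void *arg, int max_tries)
{
	int ret = -1;

	assert(max_tries > 0);
	for (int i = 0; i < max_tries; i++) {
		ret = op(arg);
		if (ret == 0 || errno != ENOMEM)
			break;
	}
	return ret;
}

/* Stand-in operation: fails with ENOMEM a fixed number of times. */
struct flaky_op {
	int failures_left;
	int calls;
};

static int flaky_op_run(void *arg)
{
	struct flaky_op *f = arg;

	f->calls++;
	if (f->failures_left > 0) {
		f->failures_left--;
		errno = ENOMEM;
		return -1;
	}
	return 0;
}
```

Bounding the number of attempts is a design choice of this sketch: a caller that loops forever on ENOMEM would simply spin if memory pressure never eases.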

So I now think the first version of the patch is better than this
one..

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH V2] xfs: truncate_setsize should be outside transactions
  2014-05-02 23:23           ` Dave Chinner
@ 2014-05-03 15:16             ` Christoph Hellwig
  2014-05-04  0:06               ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2014-05-03 15:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Sat, May 03, 2014 at 09:23:39AM +1000, Dave Chinner wrote:
> before we get to the transaction that can fail:
> block_truncate_page() zeroes the tail of the page cache page. Hence
> if the transaction reservation fails, we've already trashed the file
> data - we may as well finish off the job and at least make it look
> like the truncate succeeded from a user point of view. They then get
> an ENOMEM error (only non-fatal error that can come from
> xfs_trans_reserve) and try the truncate again....

I don't think we can even get the ENOMEM.  But yeah, I guess
we want something like the old version, with comments explaining exactly
why we have this order.

* Re: [PATCH V2] xfs: truncate_setsize should be outside transactions
  2014-05-03 15:16             ` Christoph Hellwig
@ 2014-05-04  0:06               ` Dave Chinner
  2014-05-05  5:19                 ` [PATCH V3] " Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2014-05-04  0:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Sat, May 03, 2014 at 08:16:01AM -0700, Christoph Hellwig wrote:
> On Sat, May 03, 2014 at 09:23:39AM +1000, Dave Chinner wrote:
> > before we get to the transaction that can fail:
> > block_truncate_page() zeroes the tail of the page cache page. Hence
> > if the transaction reservation fails, we've already trashed the file
> > data - we may as well finish off the job and at least make it look
> > like the truncate succeeded from a user point of view. They then get
> > an ENOMEM error (only non-fatal error that can come from
> > xfs_trans_reserve) and try the truncate again....
> 
> I don't think we can even get the ENOMEM.

We can - we pass KM_MAYFAIL to xlog_ticket_alloc() from
xfs_log_reserve().

> But yeah, I guess
> we want something like the old version, with comments explaining exactly
> why we have this order.

I'll send another version of the first patch with an expanded
comment.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* [PATCH V3] xfs: truncate_setsize should be outside transactions
  2014-05-04  0:06               ` Dave Chinner
@ 2014-05-05  5:19                 ` Dave Chinner
  2014-05-06  7:52                   ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2014-05-05  5:19 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

From: Dave Chinner <dchinner@redhat.com>

truncate_setsize() removes pages from the page cache, and hence
requires page locks to be held. It is not valid to lock a page cache
page inside a transaction context as we can hold page locks when we
reserve space for a transaction. If we do, then we expose an ABBA
deadlock between log space reservation and page locks.

That is, both the write path and writeback lock a page, then start a
transaction for block allocation, which means they can block waiting
for a log reservation with the page lock held. If we hold a log
reservation and then do something that locks a page (e.g.
truncate_setsize in xfs_setattr_size) then that page lock can block
on the page locked and waiting for a log reservation. If the
transaction that is waiting for the page lock is the only active
transaction in the system that can free log space via a commit,
then writeback will never make progress and so log space will never
free up.

This issue with xfs_setattr_size() was introduced back in 2010 by
commit fa9b227 ("xfs: new truncate sequence") which moved the page
cache truncate from outside the transaction context (what was
xfs_itruncate_data()) to inside the transaction context as a call to
truncate_setsize().

The reason truncate_setsize() was located in this place was that we
shouldn't change the file size until after we are in the transaction
context and the operation will either succeed or shut down the
filesystem on failure. However, block_truncate_page() already
modifies the file contents before we enter the transaction context,
so we can't really fulfill this guarantee in any way. Hence we may as
well ensure that on success or failure, the in-memory inode and data
are truncated away and that the application cleans up the mess
appropriately.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---

V3 - revert back to original fix but expand the comment to explain
why the truncate is done this way.

 fs/xfs/xfs_iops.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ef1ca01..9ef6394 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -808,22 +808,34 @@ xfs_setattr_size(
 	 */
 	inode_dio_wait(inode);

+	/*
+	 * Do all the page cache truncate work outside the transaction context
+	 * as the "lock" order is page lock->log space reservation.  i.e.
+	 * locking pages inside the transaction can ABBA deadlock with
+	 * writeback. We have to do the VFS inode size update before we truncate
+	 * the pagecache, however, to avoid racing with page faults beyond the
+	 * new EOF, as they are not serialised against truncate operations
+	 * except by page locks and size updates.
+	 *
+	 * Hence we are in a situation where a truncate can fail with ENOMEM
+	 * from xfs_trans_reserve(), but having already truncated the in-memory
+	 * version of the file (i.e. made user visible changes). There's not
+	 * much we can do about this, except to hope that the caller sees ENOMEM
+	 * and retries the truncate operation.
+	 */
 	error = -block_truncate_page(inode->i_mapping, newsize, xfs_get_blocks);
 	if (error)
 		return error;
+	truncate_setsize(inode, newsize);

 	tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_SIZE);
 	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_itruncate, 0, 0);
 	if (error)
 		goto out_trans_cancel;

-	truncate_setsize(inode, newsize);
-
 	commit_flags = XFS_TRANS_RELEASE_LOG_RES;
 	lock_flags |= XFS_ILOCK_EXCL;
-
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-
 	xfs_trans_ijoin(tp, ip, 0);

 	/*

* Re: [PATCH V3] xfs: truncate_setsize should be outside transactions
  2014-05-05  5:19                 ` [PATCH V3] " Dave Chinner
@ 2014-05-06  7:52                   ` Christoph Hellwig
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2014-05-06  7:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Mon, May 05, 2014 at 03:19:42PM +1000, Dave Chinner wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> truncate_setsize() removes pages from the page cache, and hence
> requires page locks to be held. It is not valid to lock a page cache
> page inside a transaction context as we can hold page locks when we
> reserve space for a transaction. If we do, then we expose an ABBA
> deadlock between log space reservation and page locks.
> 
> That is, both the write path and writeback lock a page, then start a
> transaction for block allocation, which means they can block waiting
> for a log reservation with the page lock held. If we hold a log
> reservation and then do something that locks a page (e.g.
> truncate_setsize in xfs_setattr_size) then that page lock can block
> on the page locked and waiting for a log reservation. If the
> transaction that is waiting for the page lock is the only active
> transaction in the system that can free log space via a commit,
> then writeback will never make progress and so log space will never
> free up.
> 
> This issue with xfs_setattr_size() was introduced back in 2010 by
> commit fa9b227 ("xfs: new truncate sequence") which moved the page
> cache truncate from outside the transaction context (what was
> xfs_itruncate_data()) to inside the transaction context as a call to
> truncate_setsize().
> 
> The reason truncate_setsize() was located in this place was that we
> shouldn't change the file size until after we are in the transaction
> context and the operation will either succeed or shut down the
> filesystem on failure. However, block_truncate_page() already
> modifies the file contents before we enter the transaction context,
> so we can't really fulfill this guarantee in any way. Hence we may as
> well ensure that on success or failure, the in-memory inode and data
> are truncated away and that the application cleans up the mess
> appropriately.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>
