All of lore.kernel.org
 help / color / mirror / Atom feed
* transaction reservations for deleting of shared extents
@ 2017-02-20  7:29 Christoph Hellwig
  2017-02-21  1:43 ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2017-02-20  7:29 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Hi Darrick,

I got a bug report about running out of log reservations when deleting
heavily reflinked files.

We're hitting the ASSERT(tp->t_ticket->t_curr_res >= len) in
xlog_cil_insert_items and it seems we are way above the reservation.

I started auditing the issue and it seems like we don't account for
the refcountbt adjustments at all in tr_itruncate.  Do you have any
back of the napking math around for what amount of log reservations
we should need for them?  Otherwise I'll need to wade through the
code and recreate that math.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: transaction reservations for deleting of shared extents
  2017-02-20  7:29 transaction reservations for deleting of shared extents Christoph Hellwig
@ 2017-02-21  1:43 ` Darrick J. Wong
  2017-02-21  9:07   ` Christoph Hellwig
  2017-04-12 13:52   ` Christoph Hellwig
  0 siblings, 2 replies; 13+ messages in thread
From: Darrick J. Wong @ 2017-02-21  1:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Mon, Feb 20, 2017 at 08:29:45AM +0100, Christoph Hellwig wrote:
> Hi Darrick,
> 
> I got a bug report about running out of log reservations when deleting
> heavily reflinked files.
> 
> We're hitting the ASSERT(tp->t_ticket->t_curr_res >= len) in
> xlog_cil_insert_items and it seems we are way above the reservation.
> 
> I started auditing the issue and it seems like we don't account for
> the refcountbt adjustments at all in tr_itruncate.  Do you have any
> back of the napking math around for what amount of log reservations
> we should need for them?  Otherwise I'll need to wade through the
> code and recreate that math.

Hmmmm.  refcountbt updates should all be processed as deferred ops,
which means that each logical update ("increase refcount of blocks
3-300") should be getting its own transaction.

The function xfs_refcount_still_have_space tries to guess when we're
getting close to using up all the log reservation by assuming that each
refcount update will eventually use 32 bytes of the transaction
reservation, though it's hard to know precisely what the results of
formatting the log items will be.

When it thinks we're out of transaction space it'll signal a partial
completion, which (should) cause the defer_ops mechanism to log an RUD
and a new RUI, then roll the transaction and start again.  I speculate
that my guess of 32 bytes per refcountbt update is not correct. :(

Can you reproduce it easily?  IIRC xfs/140 should exercise some of this
mechanism.

--D

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: transaction reservations for deleting of shared extents
  2017-02-21  1:43 ` Darrick J. Wong
@ 2017-02-21  9:07   ` Christoph Hellwig
  2017-02-21 16:53     ` Darrick J. Wong
  2017-04-12 13:52   ` Christoph Hellwig
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2017-02-21  9:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs

On Mon, Feb 20, 2017 at 05:43:56PM -0800, Darrick J. Wong wrote:
> Hmmmm.  refcountbt updates should all be processed as deferred ops,
> which means that each logical update ("increase refcount of blocks
> 3-300") should be getting its own transaction.

Yes, but we'd still need to figure out how much to allocate for that
transaction.

> The function xfs_refcount_still_have_space tries to guess when we're
> getting close to using up all the log reservation by assuming that each
> refcount update will eventually use 32 bytes of the transaction
> reservation, though it's hard to know precisely what the results of
> formatting the log items will be.

I guess it's getting that estimate wrong.  It's also pretty weird
and different from how we reserve space for transactions everywhere
else in XFS..

> When it thinks we're out of transaction space it'll signal a partial
> completion, which (should) cause the defer_ops mechanism to log an RUD
> and a new RUI, then roll the transaction and start again.  I speculate
> that my guess of 32 bytes per refcountbt update is not correct. :(
> 
> Can you reproduce it easily?  IIRC xfs/140 should exercise some of this
> mechanism.

I personally can't reproduce it easily, but there is a QA setup that
reproduces it reliably, although it takes quite some time. I think I
can send you the reproducer, but it might require the right hardware
to hit the race, given that I can't actually reproduce it.

> 
> --D
---end quoted text---

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: transaction reservations for deleting of shared extents
  2017-02-21  9:07   ` Christoph Hellwig
@ 2017-02-21 16:53     ` Darrick J. Wong
  0 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2017-02-21 16:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, Feb 21, 2017 at 10:07:47AM +0100, Christoph Hellwig wrote:
> On Mon, Feb 20, 2017 at 05:43:56PM -0800, Darrick J. Wong wrote:
> > Hmmmm.  refcountbt updates should all be processed as deferred ops,
> > which means that each logical update ("increase refcount of blocks
> > 3-300") should be getting its own transaction.
> 
> Yes, but we'd still need to figure out how much to allocate for that
> transaction.
> 
> > The function xfs_refcount_still_have_space tries to guess when we're
> > getting close to using up all the log reservation by assuming that each
> > refcount update will eventually use 32 bytes of the transaction
> > reservation, though it's hard to know precisely what the results of
> > formatting the log items will be.
> 
> I guess it's getting that estimate wrong.  It's also pretty weird
> and different from how we reserve space for transactions everywhere
> else in XFS..

Yes, "adjust refcount up/down" is a higher level operation than the
other deferred ops, but in order to keep the operation atomic and a sane
limit on the number of RUIs being logged to the head transaction it was
necessary to do it this way.

I considered simply establishing an upper limit on the number of
refcount tree updates that one could perform in a single go -- it
survives in the form of the error-injection knob in that function that
artificially cuts off after 2 updates.  Everything seems to work just
fine with that in place, though rather more slowly when the refcountbt
has a lot of small entries.

Anyway, I could go instrument the kernel to measure nr_ops and
transaction reservation usage for the refcountbt updates to see what
tweaks might be necessary.

--D

> > When it thinks we're out of transaction space it'll signal a partial
> > completion, which (should) cause the defer_ops mechanism to log an RUD
> > and a new RUI, then roll the transaction and start again.  I speculate
> > that my guess of 32 bytes per refcountbt update is not correct. :(
> > 
> > Can you reproduce it easily?  IIRC xfs/140 should exercise some of this
> > mechanism.
> 
> I personally can't reproduce it easily, but there is a QA setup that
> reproduces it reliably, although it takes quite some time. I think I
> can send you the reproducer, but it might require the right hardware
> to hit the race, given that I can't actually reproduce it.
> 
> > 
> > --D
> ---end quoted text---
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: transaction reservations for deleting of shared extents
  2017-02-21  1:43 ` Darrick J. Wong
  2017-02-21  9:07   ` Christoph Hellwig
@ 2017-04-12 13:52   ` Christoph Hellwig
  2017-04-12 23:06     ` Darrick J. Wong
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2017-04-12 13:52 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs

I think the problem is that t_log_res just contains the log reservation
when the transaction was created.  But each item processed by
xfs_defer_finish uses up some of it, but in some cases these might
be different operations and not just more refcount updates, e.g. for
xfs_itruncate_extents which I see the issues with we mix EFI/EFD
items with refcount updates.

I still don't have a good idea how to fix this, though.  One idea
would be to prevent mixing different items, but I think being able
to mix them was one of your goals with the defer infrastructure rewrite.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: transaction reservations for deleting of shared extents
  2017-04-12 13:52   ` Christoph Hellwig
@ 2017-04-12 23:06     ` Darrick J. Wong
  2017-04-13  3:52       ` Darrick J. Wong
  2017-04-13 10:33       ` Christoph Hellwig
  0 siblings, 2 replies; 13+ messages in thread
From: Darrick J. Wong @ 2017-04-12 23:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Wed, Apr 12, 2017 at 03:52:31PM +0200, Christoph Hellwig wrote:
> I think the problem is that t_log_res just contains the log reservation
> when the transaction was created.  But each item processed by
> xfs_defer_finish uses up some of it, but in some cases these might
> be different operations and not just more refcount updates, e.g. for
> xfs_itruncate_extents which I see the issues with we mix EFI/EFD
> items with refcount updates.

Hmmm... I suppose you could end up with a heavy load of deferred updates
stemming from the removal of a single extent:

1) Start with one huge extent mapped into a file.
2) Reflink every other block into another file.
3) Delete the first file.  This results in:
   a) Unmap the huge extent.
   b) Schedule removal of the rmap, if applicable.
   c) Schedule a refcount decrease for the huge extent.
   d) Perform the deferred rmap removal.  If we push blocks off the
      AGFL as part of removing rmapbt blocks, queue an EFI.
   e) Perform the deferred refcount decrease:
      For each (singly-)shared block, set the refcount=1 by deleting the
      refcount record.  Every ~150 deletions we free a refcount block
      and queue an EFI.  (If rmap, queue a deferred rmap update too.)
   f) Perform the deferred rmap removals.  If we push blocks off the
      AGFL as part of removing rmapbt blocks, queue an EFI.
   g) Free each shared block by queueing an EFI.
   h) For each EFI, free the extent.

So I think the problem you're seeing here is that just prior to (3g) we
have the most deferred items (EFIs, specifically) attached to this
transaction at any point in the whole operation.  There can be so many
EFIs that we use up the log reservation and blow the ASSERT.

One way to fix this is to unmap a smaller range in (1) so that we don't
blow up at (3g).  Unfortunately, it is hard to guess at (1) just how
many EFIs we might end up queueing, but I think reducing the amount of
file mapping we free in a given step might be the only sane solution.
One could calculate the number of blocks we can free, given the
remaining transaction reservation and assuming the worst case number of
EFIs that could get filed to unmap those blocks, and only __bunmapi that
many blocks, thereby forcing the caller to come back with a fresh
defer_ops for another try.

> I still don't have a good idea how to fix this, though.  One idea
> would be to prevent mixing different items, but I think being able
> to mix them was one of your goals with the defer infrastructure rewrite.

Yes, we have to be able to perform several different types of updates
in one defer_ops so that we can execute CoW remappings atomically.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: transaction reservations for deleting of shared extents
  2017-04-12 23:06     ` Darrick J. Wong
@ 2017-04-13  3:52       ` Darrick J. Wong
  2017-04-13 10:51         ` Christoph Hellwig
  2017-04-13 12:13         ` Christoph Hellwig
  2017-04-13 10:33       ` Christoph Hellwig
  1 sibling, 2 replies; 13+ messages in thread
From: Darrick J. Wong @ 2017-04-13  3:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

[-- Attachment #1: Type: text/plain, Size: 3435 bytes --]

On Wed, Apr 12, 2017 at 04:06:12PM -0700, Darrick J. Wong wrote:
> On Wed, Apr 12, 2017 at 03:52:31PM +0200, Christoph Hellwig wrote:
> > I think the problem is that t_log_res just contains the log reservation
> > when the transaction was created.  But each item processed by
> > xfs_defer_finish uses up some of it, but in some cases these might
> > be different operations and not just more refcount updates, e.g. for
> > xfs_itruncate_extents which I see the issues with we mix EFI/EFD
> > items with refcount updates.
> 
> Hmmm... I suppose you could end up with a heavy load of deferred updates
> stemming from the removal of a single extent:
> 
> 1) Start with one huge extent mapped into a file.
> 2) Reflink every other block into another file.
> 3) Delete the first file.  This results in:
>    a) Unmap the huge extent.
>    b) Schedule removal of the rmap, if applicable.
>    c) Schedule a refcount decrease for the huge extent.
>    d) Perform the deferred rmap removal.  If we push blocks off the
>       AGFL as part of removing rmapbt blocks, queue an EFI.
>    e) Perform the deferred refcount decrease:
>       For each (singly-)shared block, set the refcount=1 by deleting the
>       refcount record.  Every ~150 deletions we free a refcount block
>       and queue an EFI.  (If rmap, queue a deferred rmap update too.)
>    f) Perform the deferred rmap removals.  If we push blocks off the
>       AGFL as part of removing rmapbt blocks, queue an EFI.
>    g) Free each shared block by queueing an EFI.
>    h) For each EFI, free the extent.
> 
> So I think the problem you're seeing here is that just prior to (3g) we
> have the most deferred items (EFIs, specifically) attached to this
> transaction at any point in the whole operation.  There can be so many
> EFIs that we use up the log reservation and blow the ASSERT.
> 
> One way to fix this is to unmap a smaller range in (1) so that we don't
> blow up at (3g).  Unfortunately, it is hard to guess at (1) just how
> many EFIs we might end up queueing, but I think reducing the amount of
> file mapping we free in a given step might be the only sane solution.
> One could calculate the number of blocks we can free, given the
> remaining transaction reservation and assuming the worst case number of
> EFIs that could get filed to unmap those blocks, and only __bunmapi that
> many blocks, thereby forcing the caller to come back with a fresh
> defer_ops for another try.

Hmm... can you try out the attached RFC(RAP) xfstests testcase to see if
it can reproduce the problem you're seeing, and then apply the (equally
RFCRAP) patch to see if it fixes the problem?

--D

> 
> > I still don't have a good idea how to fix this, though.  One idea
> > would be to prevent mixing different items, but I think being able
> > to mix them was one of your goals with the defer infrastructure rewrite.
> 
> Yes, we have to be able to perform several different types of updates
> in one defer_ops so that we can execute CoW remappings atomically.
> 
> --D
> 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: 18-xfs-reflink-test-unlink-giant-batch-of-refcounts.patch --]
[-- Type: text/x-diff, Size: 4159 bytes --]

From: Darrick J. Wong <darrick.wong@oracle.com>
Subject: [PATCH] reflink: test unlinking a huge extent with a lot of refcount adjustments

Test a regression in XFS where we blow out a transaction reservation if
we create a big file, share every other block, and delete the first
file.  There's nothing particularly fs-specific about this stress test,
so put it in generic.

Signed-off-by: Darrick J. Wong <djwong@djwong.org>
---
 tests/generic/931     |   94 +++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/931.out |    6 +++
 tests/generic/group   |    1 +
 3 files changed, 101 insertions(+)
 create mode 100755 tests/generic/931
 create mode 100644 tests/generic/931.out

diff --git a/tests/generic/931 b/tests/generic/931
new file mode 100755
index 0000000..cc093bf
--- /dev/null
+++ b/tests/generic/931
@@ -0,0 +1,94 @@
+#! /bin/bash
+# FS QA Test No. 931
+#
+# See how well we handle deleting a file with a million refcount extents.
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2017, Oracle and/or its affiliates.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+
+seq=`basename "$0"`
+seqres="$RESULT_DIR/$seq"
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1    # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+    cd /
+    rm -rf "$tmp".* $testdir/file1
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/attr
+. ./common/reflink
+
+# real QA test starts here
+_supported_os Linux
+_require_scratch_reflink
+_require_cp_reflink
+_require_test_program "punch-alternating"
+
+rm -f "$seqres.full"
+
+echo "Format and mount"
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount >> "$seqres.full" 2>&1
+
+testdir="$SCRATCH_MNT/test-$seq"
+mkdir "$testdir"
+
+# Setup for one million blocks, but we'll accept stress testing down to
+# 2^17 blocks... that should be plenty for anyone.
+fnr=20
+free_blocks=$(stat -f -c '%a' "$testdir")
+blksz=$(_get_block_size "$testdir")
+space_avail=$((free_blocks * blksz))
+calc_space() {
+	blocks_needed=$(( 2 ** (fnr + 1) ))
+	space_needed=$((blocks_needed * blksz * 5 / 4))
+}
+calc_space
+while test $space_needed -gt $space_avail; do
+	fnr=$((fnr - 1))
+	calc_space
+done
+test $fnr -lt 17 && _notrun "Insufficient space for stress test; would only create $blocks_needed extents."
+
+echo "Create a many-block file"
+echo "creating $blocks_needed blocks..." >> "$seqres.full"
+$XFS_IO_PROG -f -c "pwrite -S 0x61 -b 4194304 0 $((2 ** (fnr + 1) * blksz))" "$testdir/file1" >> "$seqres.full"
+
+echo "Reflinking file"
+_cp_reflink $testdir/file1 $testdir/file2
+
+echo "Punch file2"
+echo "Punching file2..." >> "$seqres.full"
+"$here/src/punch-alternating" "$testdir/file2" >> "$seqres.full"
+echo "...done" >> "$seqres.full"
+_scratch_cycle_mount
+
+echo "Delete file1"
+rm -rf $testdir/file1
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/931.out b/tests/generic/931.out
new file mode 100644
index 0000000..c7b724e
--- /dev/null
+++ b/tests/generic/931.out
@@ -0,0 +1,6 @@
+QA output created by 931
+Format and mount
+Create a many-block file
+Reflinking file
+Punch file2
+Delete file1
diff --git a/tests/generic/group b/tests/generic/group
index 173f6f3..455e5fb 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -432,3 +432,4 @@
 819 auto quick copy
 820 auto quick copy
 821 auto quick punch
+931 auto quick clone

[-- Attachment #3: 90-xfs-reflink-constrain-bunmapi.patch --]
[-- Type: text/x-diff, Size: 3669 bytes --]

From: Darrick J. Wong <darrick.wong@oracle.com>
Subject: [PATCH] xfs: try to avoid blowing out the transaction reservation when bunmaping a shared extent

In a pathological scenario where we are trying to bunmapi a single
extent in which every other block is shared, it's possible that trying
to unmap the entire large extent in a single transaction can generate so
many EFIs that we overflow the transaction reservation.

Therefore, use a heuristic to guess at the number of blocks we can
safely unmap from a reflink file's data fork in an single transaction.
This should prevent problems such as the log head slamming into the tail
and ASSERTs that trigger because we've exceeded the transaction
reservation.

Signed-off-by: Darrick J. Wong <djwong@djwong.org>
---
 fs/xfs/libxfs/xfs_bmap.c |   42 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 179b483..87da58d 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5417,6 +5417,28 @@ xfs_bmap_del_extent(
 }
 
 /*
+ * Guesstimate how many blocks we can unmap without running the risk of
+ * blowing out the transaction with EFIs.
+ */
+static xfs_fileoff_t
+xfs_bunmapi_can_unmap(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	int			whichfork,
+	xfs_fileoff_t		len)
+{
+	unsigned int		limit;
+
+	if (!xfs_is_reflink_inode(ip) || whichfork != XFS_DATA_FORK)
+		return len;
+
+	limit = (tp->t_log_res * 3 / 4) / 32;
+	if (len <= limit)
+		return len;
+	return limit;
+}
+
+/*
  * Unmap (remove) blocks from a file.
  * If nexts is nonzero then the number of extents to remove is limited to
  * that value.  If not all extents in the block range can be removed then
@@ -5451,6 +5473,7 @@ __xfs_bunmapi(
 	int			whichfork;	/* data or attribute fork */
 	xfs_fsblock_t		sum;
 	xfs_filblks_t		len = *rlen;	/* length to unmap in file */
+	xfs_fileoff_t		can_unmap;
 
 	trace_xfs_bunmap(ip, bno, len, flags, _RET_IP_);
 
@@ -5472,6 +5495,12 @@ __xfs_bunmapi(
 	ASSERT(len > 0);
 	ASSERT(nexts >= 0);
 
+	/*
+	 * If this is potentially reflinked data blocks, don't blow out
+	 * the reservation.
+	 */
+	can_unmap = xfs_bunmapi_can_unmap(tp, ip, whichfork, len);
+
 	if (!(ifp->if_flags & XFS_IFEXTENTS) &&
 	    (error = xfs_iread_extents(tp, ip, whichfork)))
 		return error;
@@ -5516,7 +5545,7 @@ __xfs_bunmapi(
 
 	extno = 0;
 	while (bno != (xfs_fileoff_t)-1 && bno >= start && lastx >= 0 &&
-	       (nexts == 0 || extno < nexts)) {
+	       (nexts == 0 || extno < nexts) && can_unmap > 0) {
 		/*
 		 * Is the found extent after a hole in which bno lives?
 		 * Just back up to the previous extent, if so.
@@ -5548,6 +5577,16 @@ __xfs_bunmapi(
 		}
 		if (del.br_startoff + del.br_blockcount > bno + 1)
 			del.br_blockcount = bno + 1 - del.br_startoff;
+
+		/* How much can we safely unmap? */
+		if (can_unmap < del.br_blockcount) {
+printk(KERN_ERR "%s: ino %lld pblk %d:%llu lblk %llu len %llu can_unmap %llu got=%llu:%llu:%llu\n", __func__, ip->i_ino, wasdel, del.br_startblock, del.br_startoff, del.br_blockcount, can_unmap, got.br_startblock, got.br_startoff, got.br_blockcount);
+			del.br_startoff += del.br_blockcount - can_unmap;
+			if (!wasdel)
+				del.br_startblock += del.br_blockcount - can_unmap;
+			del.br_blockcount = can_unmap;
+		}
+
 		sum = del.br_startblock + del.br_blockcount;
 		if (isrt &&
 		    (mod = do_mod(sum, mp->m_sb.sb_rextsize))) {
@@ -5724,6 +5763,7 @@ __xfs_bunmapi(
 		if (!isrt && wasdel)
 			xfs_mod_fdblocks(mp, (int64_t)del.br_blockcount, false);
 
+		can_unmap -= del.br_blockcount;
 		bno = del.br_startoff - 1;
 nodelete:
 		/*

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: transaction reservations for deleting of shared extents
  2017-04-12 23:06     ` Darrick J. Wong
  2017-04-13  3:52       ` Darrick J. Wong
@ 2017-04-13 10:33       ` Christoph Hellwig
  1 sibling, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2017-04-13 10:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs

On Wed, Apr 12, 2017 at 04:06:12PM -0700, Darrick J. Wong wrote:
> So I think the problem you're seeing here is that just prior to (3g) we
> have the most deferred items (EFIs, specifically) attached to this
> transaction at any point in the whole operation.  There can be so many
> EFIs that we use up the log reservation and blow the ASSERT.

Yes.  I think that's exactly what I'm seing, except that rmap isn't
part of the game.

> > I still don't have a good idea how to fix this, though.  One idea
> > would be to prevent mixing different items, but I think being able
> > to mix them was one of your goals with the defer infrastructure rewrite.
> 
> Yes, we have to be able to perform several different types of updates
> in one defer_ops so that we can execute CoW remappings atomically.

In one defer_ops, but I don't see anything preventing us from starting
a new chained transaction everytime we move from one type to another.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: transaction reservations for deleting of shared extents
  2017-04-13  3:52       ` Darrick J. Wong
@ 2017-04-13 10:51         ` Christoph Hellwig
  2017-04-13 12:13         ` Christoph Hellwig
  1 sibling, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2017-04-13 10:51 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs

On Wed, Apr 12, 2017 at 08:52:50PM -0700, Darrick J. Wong wrote:
> Hmm... can you try out the attached RFC(RAP) xfstests testcase to see if
> it can reproduce the problem you're seeing,

That one gives me a softlockup in xlog_grant_head_wait, so I think we're
having even more problems in this area.  Let me try with a larger log
size and deeg into the issues..

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: transaction reservations for deleting of shared extents
  2017-04-13  3:52       ` Darrick J. Wong
  2017-04-13 10:51         ` Christoph Hellwig
@ 2017-04-13 12:13         ` Christoph Hellwig
  2017-04-25  2:09           ` Darrick J. Wong
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2017-04-13 12:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs

On Wed, Apr 12, 2017 at 08:52:50PM -0700, Darrick J. Wong wrote:
> Hmm... can you try out the attached RFC(RAP) xfstests testcase to see if
> it can reproduce the problem you're seeing, and then apply the (equally
> RFCRAP) patch to see if it fixes the problem?

Yeah. limiting at the caller seems much better than second guessing
later.  Below is a version of your patch with a couple random edits.
I wonder if we could now get rid of xfs_refcount_still_have_space
or at least turn it into warnings only..

---
>From 9b25b1bdba8fb93650710444a6bab7b10f74680d Mon Sep 17 00:00:00 2001
From: "Darrick J. Wong" <darrick.wong@oracle.com>
Date: Thu, 13 Apr 2017 13:35:00 +0200
Subject: xfs: try to avoid blowing out the transaction reservation when
 bunmaping a shared extent

In a pathological scenario where we are trying to bunmapi a single
extent in which every other block is shared, it's possible that trying
to unmap the entire large extent in a single transaction can generate so
many EFIs that we overflow the transaction reservation.

Therefore, use a heuristic to guess at the number of blocks we can
safely unmap from a reflink file's data fork in an single transaction.
This should prevent problems such as the log head slamming into the tail
and ASSERTs that trigger because we've exceeded the transaction
reservation.

Signed-off-by: Darrick J. Wong <djwong@djwong.org>
[hch: random edits, all bugs are my fault]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_bmap.c     | 23 ++++++++++++++++++++++-
 fs/xfs/libxfs/xfs_refcount.c | 10 +---------
 fs/xfs/libxfs/xfs_refcount.h | 16 ++++++++++++++++
 3 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 8e94030bcb8f..04dac8954425 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5442,6 +5442,7 @@ __xfs_bunmapi(
 	int			whichfork;	/* data or attribute fork */
 	xfs_fsblock_t		sum;
 	xfs_filblks_t		len = *rlen;	/* length to unmap in file */
+	xfs_fileoff_t		max_len;
 
 	trace_xfs_bunmap(ip, bno, len, flags, _RET_IP_);
 
@@ -5463,6 +5464,16 @@ __xfs_bunmapi(
 	ASSERT(len > 0);
 	ASSERT(nexts >= 0);
 
+	/*
+	 * Guesstimate how many blocks we can unmap without running the risk of
+	 * blowing out the transaction with a mix of EFIs and reflink
+	 * adjustments.
+	 */
+	if (xfs_is_reflink_inode(ip) && whichfork == XFS_DATA_FORK)
+		max_len = min(len, xfs_refcount_max_unmap(tp->t_log_res));
+	else
+		max_len = len;
+
 	if (!(ifp->if_flags & XFS_IFEXTENTS) &&
 	    (error = xfs_iread_extents(tp, ip, whichfork)))
 		return error;
@@ -5507,7 +5518,7 @@ __xfs_bunmapi(
 
 	extno = 0;
 	while (bno != (xfs_fileoff_t)-1 && bno >= start && lastx >= 0 &&
-	       (nexts == 0 || extno < nexts)) {
+	       (nexts == 0 || extno < nexts) && max_len > 0) {
 		/*
 		 * Is the found extent after a hole in which bno lives?
 		 * Just back up to the previous extent, if so.
@@ -5539,6 +5550,15 @@ __xfs_bunmapi(
 		}
 		if (del.br_startoff + del.br_blockcount > bno + 1)
 			del.br_blockcount = bno + 1 - del.br_startoff;
+
+		/* How much can we safely unmap? */
+		if (max_len < del.br_blockcount) {
+			del.br_startoff += del.br_blockcount - max_len;
+			if (!wasdel)
+				del.br_startblock += del.br_blockcount - max_len;
+			del.br_blockcount = max_len;
+		}
+
 		sum = del.br_startblock + del.br_blockcount;
 		if (isrt &&
 		    (mod = do_mod(sum, mp->m_sb.sb_rextsize))) {
@@ -5715,6 +5735,7 @@ __xfs_bunmapi(
 		if (!isrt && wasdel)
 			xfs_mod_fdblocks(mp, (int64_t)del.br_blockcount, false);
 
+		max_len -= del.br_blockcount;
 		bno = del.br_startoff - 1;
 nodelete:
 		/*
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index b177ef33cd4c..29dcde381505 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -784,14 +784,6 @@ xfs_refcount_merge_extents(
 }
 
 /*
- * While we're adjusting the refcounts records of an extent, we have
- * to keep an eye on the number of extents we're dirtying -- run too
- * many in a single transaction and we'll exceed the transaction's
- * reservation and crash the fs.  Each record adds 12 bytes to the
- * log (plus any key updates) so we'll conservatively assume 24 bytes
- * per record.  We must also leave space for btree splits on both ends
- * of the range and space for the CUD and a new CUI.
- *
  * XXX: This is a pretty hand-wavy estimate.  The penalty for guessing
  * true incorrectly is a shutdown FS; the penalty for guessing false
  * incorrectly is more transaction rolls than might be necessary.
@@ -822,7 +814,7 @@ xfs_refcount_still_have_space(
 	else if (overhead > cur->bc_tp->t_log_res)
 		return false;
 	return  cur->bc_tp->t_log_res - overhead >
-		cur->bc_private.a.priv.refc.nr_ops * 32;
+		cur->bc_private.a.priv.refc.nr_ops * XFS_REFCOUNT_ITEM_OVERHEAD;
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index 098dc668ab2c..eafb9d1f3b37 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -67,4 +67,20 @@ extern int xfs_refcount_free_cow_extent(struct xfs_mount *mp,
 extern int xfs_refcount_recover_cow_leftovers(struct xfs_mount *mp,
 		xfs_agnumber_t agno);
 
+/*
+ * While we're adjusting the refcounts records of an extent, we have
+ * to keep an eye on the number of extents we're dirtying -- run too
+ * many in a single transaction and we'll exceed the transaction's
+ * reservation and crash the fs.  Each record adds 12 bytes to the
+ * log (plus any key updates) so we'll conservatively assume 32 bytes
+ * per record.  We must also leave space for btree splits on both ends
+ * of the range and space for the CUD and a new CUI.
+ */
+#define XFS_REFCOUNT_ITEM_OVERHEAD	32
+
+static inline xfs_fileoff_t xfs_refcount_max_unmap(int log_res)
+{
+	return (log_res * 3 / 4) / XFS_REFCOUNT_ITEM_OVERHEAD;
+}
+
 #endif	/* __XFS_REFCOUNT_H__ */
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: transaction reservations for deleting of shared extents
  2017-04-13 12:13         ` Christoph Hellwig
@ 2017-04-25  2:09           ` Darrick J. Wong
  2017-06-03  7:13             ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2017-04-25  2:09 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Thu, Apr 13, 2017 at 02:13:25PM +0200, Christoph Hellwig wrote:
> On Wed, Apr 12, 2017 at 08:52:50PM -0700, Darrick J. Wong wrote:
> > Hmm... can you try out the attached RFC(RAP) xfstests testcase to see if
> > it can reproduce the problem you're seeing, and then apply the (equally
> > RFCRAP) patch to see if it fixes the problem?
> 
> Yeah. limiting at the caller seems much better than second guessing
> later.  Below is a version of your patch with a couple random edits.
> I wonder if we could now get rid of xfs_refcount_still_have_space
> or at least turn it into warnings only..

We also have to plumb in the code necessary to recover a block unmap log
intent item of arbitrary length by splitting the unmap operation into a
series of __xfs_bunmapi calls.  That seems to fix the asserts and other
weird log burps... will send an RFC patch shortly.

--D

> 
> ---
> From 9b25b1bdba8fb93650710444a6bab7b10f74680d Mon Sep 17 00:00:00 2001
> From: "Darrick J. Wong" <darrick.wong@oracle.com>
> Date: Thu, 13 Apr 2017 13:35:00 +0200
> Subject: xfs: try to avoid blowing out the transaction reservation when
>  bunmaping a shared extent
> 
> In a pathological scenario where we are trying to bunmapi a single
> extent in which every other block is shared, it's possible that trying
> to unmap the entire large extent in a single transaction can generate so
> many EFIs that we overflow the transaction reservation.
> 
> Therefore, use a heuristic to guess at the number of blocks we can
> safely unmap from a reflink file's data fork in an single transaction.
> This should prevent problems such as the log head slamming into the tail
> and ASSERTs that trigger because we've exceeded the transaction
> reservation.
> 
> Signed-off-by: Darrick J. Wong <djwong@djwong.org>
> [hch: random edits, all bugs are my fault]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/libxfs/xfs_bmap.c     | 23 ++++++++++++++++++++++-
>  fs/xfs/libxfs/xfs_refcount.c | 10 +---------
>  fs/xfs/libxfs/xfs_refcount.h | 16 ++++++++++++++++
>  3 files changed, 39 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 8e94030bcb8f..04dac8954425 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -5442,6 +5442,7 @@ __xfs_bunmapi(
>  	int			whichfork;	/* data or attribute fork */
>  	xfs_fsblock_t		sum;
>  	xfs_filblks_t		len = *rlen;	/* length to unmap in file */
> +	xfs_fileoff_t		max_len;
>  
>  	trace_xfs_bunmap(ip, bno, len, flags, _RET_IP_);
>  
> @@ -5463,6 +5464,16 @@ __xfs_bunmapi(
>  	ASSERT(len > 0);
>  	ASSERT(nexts >= 0);
>  
> +	/*
> +	 * Guesstimate how many blocks we can unmap without running the risk of
> +	 * blowing out the transaction with a mix of EFIs and reflink
> +	 * adjustments.
> +	 */
> +	if (xfs_is_reflink_inode(ip) && whichfork == XFS_DATA_FORK)
> +		max_len = min(len, xfs_refcount_max_unmap(tp->t_log_res));
> +	else
> +		max_len = len;
> +
>  	if (!(ifp->if_flags & XFS_IFEXTENTS) &&
>  	    (error = xfs_iread_extents(tp, ip, whichfork)))
>  		return error;
> @@ -5507,7 +5518,7 @@ __xfs_bunmapi(
>  
>  	extno = 0;
>  	while (bno != (xfs_fileoff_t)-1 && bno >= start && lastx >= 0 &&
> -	       (nexts == 0 || extno < nexts)) {
> +	       (nexts == 0 || extno < nexts) && max_len > 0) {
>  		/*
>  		 * Is the found extent after a hole in which bno lives?
>  		 * Just back up to the previous extent, if so.
> @@ -5539,6 +5550,15 @@ __xfs_bunmapi(
>  		}
>  		if (del.br_startoff + del.br_blockcount > bno + 1)
>  			del.br_blockcount = bno + 1 - del.br_startoff;
> +
> +		/* How much can we safely unmap? */
> +		if (max_len < del.br_blockcount) {
> +			del.br_startoff += del.br_blockcount - max_len;
> +			if (!wasdel)
> +				del.br_startblock += del.br_blockcount - max_len;
> +			del.br_blockcount = max_len;
> +		}
> +
>  		sum = del.br_startblock + del.br_blockcount;
>  		if (isrt &&
>  		    (mod = do_mod(sum, mp->m_sb.sb_rextsize))) {
> @@ -5715,6 +5735,7 @@ __xfs_bunmapi(
>  		if (!isrt && wasdel)
>  			xfs_mod_fdblocks(mp, (int64_t)del.br_blockcount, false);
>  
> +		max_len -= del.br_blockcount;
>  		bno = del.br_startoff - 1;
>  nodelete:
>  		/*
> diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> index b177ef33cd4c..29dcde381505 100644
> --- a/fs/xfs/libxfs/xfs_refcount.c
> +++ b/fs/xfs/libxfs/xfs_refcount.c
> @@ -784,14 +784,6 @@ xfs_refcount_merge_extents(
>  }
>  
>  /*
> - * While we're adjusting the refcounts records of an extent, we have
> - * to keep an eye on the number of extents we're dirtying -- run too
> - * many in a single transaction and we'll exceed the transaction's
> - * reservation and crash the fs.  Each record adds 12 bytes to the
> - * log (plus any key updates) so we'll conservatively assume 24 bytes
> - * per record.  We must also leave space for btree splits on both ends
> - * of the range and space for the CUD and a new CUI.
> - *
>   * XXX: This is a pretty hand-wavy estimate.  The penalty for guessing
>   * true incorrectly is a shutdown FS; the penalty for guessing false
>   * incorrectly is more transaction rolls than might be necessary.
> @@ -822,7 +814,7 @@ xfs_refcount_still_have_space(
>  	else if (overhead > cur->bc_tp->t_log_res)
>  		return false;
>  	return  cur->bc_tp->t_log_res - overhead >
> -		cur->bc_private.a.priv.refc.nr_ops * 32;
> +		cur->bc_private.a.priv.refc.nr_ops * XFS_REFCOUNT_ITEM_OVERHEAD;
>  }
>  
>  /*
> diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
> index 098dc668ab2c..eafb9d1f3b37 100644
> --- a/fs/xfs/libxfs/xfs_refcount.h
> +++ b/fs/xfs/libxfs/xfs_refcount.h
> @@ -67,4 +67,20 @@ extern int xfs_refcount_free_cow_extent(struct xfs_mount *mp,
>  extern int xfs_refcount_recover_cow_leftovers(struct xfs_mount *mp,
>  		xfs_agnumber_t agno);
>  
> +/*
> + * While we're adjusting the refcounts records of an extent, we have
> + * to keep an eye on the number of extents we're dirtying -- run too
> + * many in a single transaction and we'll exceed the transaction's
> + * reservation and crash the fs.  Each record adds 12 bytes to the
> + * log (plus any key updates) so we'll conservatively assume 32 bytes
> + * per record.  We must also leave space for btree splits on both ends
> + * of the range and space for the CUD and a new CUI.
> + */
> +#define XFS_REFCOUNT_ITEM_OVERHEAD	32
> +
> +static inline xfs_fileoff_t xfs_refcount_max_unmap(int log_res)
> +{
> +	return (log_res * 3 / 4) / XFS_REFCOUNT_ITEM_OVERHEAD;
> +}
> +
>  #endif	/* __XFS_REFCOUNT_H__ */
> -- 
> 2.11.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: transaction reservations for deleting of shared extents
  2017-04-25  2:09           ` Darrick J. Wong
@ 2017-06-03  7:13             ` Christoph Hellwig
  2017-06-03 17:01               ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2017-06-03  7:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs

While looking at stable updates it seems like we didn't make
it anywhere with this series.  Are you planning to do further work
or should I pick it up?

On Mon, Apr 24, 2017 at 07:09:06PM -0700, Darrick J. Wong wrote:
> On Thu, Apr 13, 2017 at 02:13:25PM +0200, Christoph Hellwig wrote:
> > On Wed, Apr 12, 2017 at 08:52:50PM -0700, Darrick J. Wong wrote:
> > > Hmm... can you try out the attached RFC(RAP) xfstests testcase to see if
> > > it can reproduce the problem you're seeing, and then apply the (equally
> > > RFCRAP) patch to see if it fixes the problem?
> > 
> > Yeah. limiting at the caller seems much better than second guessing
> > later.  Below is a version of your patch with a couple random edits.
> > I wonder if we could now get rid of xfs_refcount_still_have_space
> > or at least turn it into warnings only..
> 
> We also have to plumb in the code necessary to recover a block unmap log
> intent item of arbitrary length by splitting the unmap operation into a
> series of __xfs_bunmapi calls.  That seems to fix the asserts and other
> weird log burps... will send an RFC patch shortly.
> 
> --D
> 
> > 
> > ---
> > From 9b25b1bdba8fb93650710444a6bab7b10f74680d Mon Sep 17 00:00:00 2001
> > From: "Darrick J. Wong" <darrick.wong@oracle.com>
> > Date: Thu, 13 Apr 2017 13:35:00 +0200
> > Subject: xfs: try to avoid blowing out the transaction reservation when
> >  bunmaping a shared extent
> > 
> > In a pathological scenario where we are trying to bunmapi a single
> > extent in which every other block is shared, it's possible that trying
> > to unmap the entire large extent in a single transaction can generate so
> > many EFIs that we overflow the transaction reservation.
> > 
> > Therefore, use a heuristic to guess at the number of blocks we can
> > safely unmap from a reflink file's data fork in an single transaction.
> > This should prevent problems such as the log head slamming into the tail
> > and ASSERTs that trigger because we've exceeded the transaction
> > reservation.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@djwong.org>
> > [hch: random edits, all bugs are my fault]
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c     | 23 ++++++++++++++++++++++-
> >  fs/xfs/libxfs/xfs_refcount.c | 10 +---------
> >  fs/xfs/libxfs/xfs_refcount.h | 16 ++++++++++++++++
> >  3 files changed, 39 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 8e94030bcb8f..04dac8954425 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -5442,6 +5442,7 @@ __xfs_bunmapi(
> >  	int			whichfork;	/* data or attribute fork */
> >  	xfs_fsblock_t		sum;
> >  	xfs_filblks_t		len = *rlen;	/* length to unmap in file */
> > +	xfs_fileoff_t		max_len;
> >  
> >  	trace_xfs_bunmap(ip, bno, len, flags, _RET_IP_);
> >  
> > @@ -5463,6 +5464,16 @@ __xfs_bunmapi(
> >  	ASSERT(len > 0);
> >  	ASSERT(nexts >= 0);
> >  
> > +	/*
> > +	 * Guesstimate how many blocks we can unmap without running the risk of
> > +	 * blowing out the transaction with a mix of EFIs and reflink
> > +	 * adjustments.
> > +	 */
> > +	if (xfs_is_reflink_inode(ip) && whichfork == XFS_DATA_FORK)
> > +		max_len = min(len, xfs_refcount_max_unmap(tp->t_log_res));
> > +	else
> > +		max_len = len;
> > +
> >  	if (!(ifp->if_flags & XFS_IFEXTENTS) &&
> >  	    (error = xfs_iread_extents(tp, ip, whichfork)))
> >  		return error;
> > @@ -5507,7 +5518,7 @@ __xfs_bunmapi(
> >  
> >  	extno = 0;
> >  	while (bno != (xfs_fileoff_t)-1 && bno >= start && lastx >= 0 &&
> > -	       (nexts == 0 || extno < nexts)) {
> > +	       (nexts == 0 || extno < nexts) && max_len > 0) {
> >  		/*
> >  		 * Is the found extent after a hole in which bno lives?
> >  		 * Just back up to the previous extent, if so.
> > @@ -5539,6 +5550,15 @@ __xfs_bunmapi(
> >  		}
> >  		if (del.br_startoff + del.br_blockcount > bno + 1)
> >  			del.br_blockcount = bno + 1 - del.br_startoff;
> > +
> > +		/* How much can we safely unmap? */
> > +		if (max_len < del.br_blockcount) {
> > +			del.br_startoff += del.br_blockcount - max_len;
> > +			if (!wasdel)
> > +				del.br_startblock += del.br_blockcount - max_len;
> > +			del.br_blockcount = max_len;
> > +		}
> > +
> >  		sum = del.br_startblock + del.br_blockcount;
> >  		if (isrt &&
> >  		    (mod = do_mod(sum, mp->m_sb.sb_rextsize))) {
> > @@ -5715,6 +5735,7 @@ __xfs_bunmapi(
> >  		if (!isrt && wasdel)
> >  			xfs_mod_fdblocks(mp, (int64_t)del.br_blockcount, false);
> >  
> > +		max_len -= del.br_blockcount;
> >  		bno = del.br_startoff - 1;
> >  nodelete:
> >  		/*
> > diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> > index b177ef33cd4c..29dcde381505 100644
> > --- a/fs/xfs/libxfs/xfs_refcount.c
> > +++ b/fs/xfs/libxfs/xfs_refcount.c
> > @@ -784,14 +784,6 @@ xfs_refcount_merge_extents(
> >  }
> >  
> >  /*
> > - * While we're adjusting the refcounts records of an extent, we have
> > - * to keep an eye on the number of extents we're dirtying -- run too
> > - * many in a single transaction and we'll exceed the transaction's
> > - * reservation and crash the fs.  Each record adds 12 bytes to the
> > - * log (plus any key updates) so we'll conservatively assume 24 bytes
> > - * per record.  We must also leave space for btree splits on both ends
> > - * of the range and space for the CUD and a new CUI.
> > - *
> >   * XXX: This is a pretty hand-wavy estimate.  The penalty for guessing
> >   * true incorrectly is a shutdown FS; the penalty for guessing false
> >   * incorrectly is more transaction rolls than might be necessary.
> > @@ -822,7 +814,7 @@ xfs_refcount_still_have_space(
> >  	else if (overhead > cur->bc_tp->t_log_res)
> >  		return false;
> >  	return  cur->bc_tp->t_log_res - overhead >
> > -		cur->bc_private.a.priv.refc.nr_ops * 32;
> > +		cur->bc_private.a.priv.refc.nr_ops * XFS_REFCOUNT_ITEM_OVERHEAD;
> >  }
> >  
> >  /*
> > diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
> > index 098dc668ab2c..eafb9d1f3b37 100644
> > --- a/fs/xfs/libxfs/xfs_refcount.h
> > +++ b/fs/xfs/libxfs/xfs_refcount.h
> > @@ -67,4 +67,20 @@ extern int xfs_refcount_free_cow_extent(struct xfs_mount *mp,
> >  extern int xfs_refcount_recover_cow_leftovers(struct xfs_mount *mp,
> >  		xfs_agnumber_t agno);
> >  
> > +/*
> > + * While we're adjusting the refcounts records of an extent, we have
> > + * to keep an eye on the number of extents we're dirtying -- run too
> > + * many in a single transaction and we'll exceed the transaction's
> > + * reservation and crash the fs.  Each record adds 12 bytes to the
> > + * log (plus any key updates) so we'll conservatively assume 32 bytes
> > + * per record.  We must also leave space for btree splits on both ends
> > + * of the range and space for the CUD and a new CUI.
> > + */
> > +#define XFS_REFCOUNT_ITEM_OVERHEAD	32
> > +
> > +static inline xfs_fileoff_t xfs_refcount_max_unmap(int log_res)
> > +{
> > +	return (log_res * 3 / 4) / XFS_REFCOUNT_ITEM_OVERHEAD;
> > +}
> > +
> >  #endif	/* __XFS_REFCOUNT_H__ */
> > -- 
> > 2.11.0
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
---end quoted text---

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: transaction reservations for deleting of shared extents
  2017-06-03  7:13             ` Christoph Hellwig
@ 2017-06-03 17:01               ` Darrick J. Wong
  0 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2017-06-03 17:01 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Sat, Jun 03, 2017 at 09:13:07AM +0200, Christoph Hellwig wrote:
> While looking at stable updates it seems like we didn't make
> it anywhere with this series.  Are you planning to do further work
> or should I pick it up?

Hmm, I had a version of your patch that also fixed log recovery to
handle a bunmap item whose length is longer than what bunmapi is willing
to handle.  It's been languishing in my development branch forever, and
I can't seem to find any evidence that I ever sent it out(???).

So I just reran the generic/931 test that I sent out earlier in this
thread, and it still seems to fix the problem, if somewhat clunkily.
I guess I'll (re?)send the patch shortly.

--D

> 
> On Mon, Apr 24, 2017 at 07:09:06PM -0700, Darrick J. Wong wrote:
> > On Thu, Apr 13, 2017 at 02:13:25PM +0200, Christoph Hellwig wrote:
> > > On Wed, Apr 12, 2017 at 08:52:50PM -0700, Darrick J. Wong wrote:
> > > > Hmm... can you try out the attached RFC(RAP) xfstests testcase to see if
> > > > it can reproduce the problem you're seeing, and then apply the (equally
> > > > RFCRAP) patch to see if it fixes the problem?
> > > 
> > > Yeah. limiting at the caller seems much better than second guessing
> > > later.  Below is a version of your patch with a couple random edits.
> > > I wonder if we could now get rid of xfs_refcount_still_have_space
> > > or at least turn it into warnings only..
> > 
> > We also have to plumb in the code necessary to recover a block unmap log
> > intent item of arbitrary length by splitting the unmap operation into a
> > series of __xfs_bunmapi calls.  That seems to fix the asserts and other
> > weird log burps... will send an RFC patch shortly.
> > 
> > --D
> > 
> > > 
> > > ---
> > > From 9b25b1bdba8fb93650710444a6bab7b10f74680d Mon Sep 17 00:00:00 2001
> > > From: "Darrick J. Wong" <darrick.wong@oracle.com>
> > > Date: Thu, 13 Apr 2017 13:35:00 +0200
> > > Subject: xfs: try to avoid blowing out the transaction reservation when
> > >  bunmaping a shared extent
> > > 
> > > In a pathological scenario where we are trying to bunmapi a single
> > > extent in which every other block is shared, it's possible that trying
> > > to unmap the entire large extent in a single transaction can generate so
> > > many EFIs that we overflow the transaction reservation.
> > > 
> > > Therefore, use a heuristic to guess at the number of blocks we can
> > > safely unmap from a reflink file's data fork in an single transaction.
> > > This should prevent problems such as the log head slamming into the tail
> > > and ASSERTs that trigger because we've exceeded the transaction
> > > reservation.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@djwong.org>
> > > [hch: random edits, all bugs are my fault]
> > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > ---
> > >  fs/xfs/libxfs/xfs_bmap.c     | 23 ++++++++++++++++++++++-
> > >  fs/xfs/libxfs/xfs_refcount.c | 10 +---------
> > >  fs/xfs/libxfs/xfs_refcount.h | 16 ++++++++++++++++
> > >  3 files changed, 39 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > index 8e94030bcb8f..04dac8954425 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > @@ -5442,6 +5442,7 @@ __xfs_bunmapi(
> > >  	int			whichfork;	/* data or attribute fork */
> > >  	xfs_fsblock_t		sum;
> > >  	xfs_filblks_t		len = *rlen;	/* length to unmap in file */
> > > +	xfs_fileoff_t		max_len;
> > >  
> > >  	trace_xfs_bunmap(ip, bno, len, flags, _RET_IP_);
> > >  
> > > @@ -5463,6 +5464,16 @@ __xfs_bunmapi(
> > >  	ASSERT(len > 0);
> > >  	ASSERT(nexts >= 0);
> > >  
> > > +	/*
> > > +	 * Guesstimate how many blocks we can unmap without running the risk of
> > > +	 * blowing out the transaction with a mix of EFIs and reflink
> > > +	 * adjustments.
> > > +	 */
> > > +	if (xfs_is_reflink_inode(ip) && whichfork == XFS_DATA_FORK)
> > > +		max_len = min(len, xfs_refcount_max_unmap(tp->t_log_res));
> > > +	else
> > > +		max_len = len;
> > > +
> > >  	if (!(ifp->if_flags & XFS_IFEXTENTS) &&
> > >  	    (error = xfs_iread_extents(tp, ip, whichfork)))
> > >  		return error;
> > > @@ -5507,7 +5518,7 @@ __xfs_bunmapi(
> > >  
> > >  	extno = 0;
> > >  	while (bno != (xfs_fileoff_t)-1 && bno >= start && lastx >= 0 &&
> > > -	       (nexts == 0 || extno < nexts)) {
> > > +	       (nexts == 0 || extno < nexts) && max_len > 0) {
> > >  		/*
> > >  		 * Is the found extent after a hole in which bno lives?
> > >  		 * Just back up to the previous extent, if so.
> > > @@ -5539,6 +5550,15 @@ __xfs_bunmapi(
> > >  		}
> > >  		if (del.br_startoff + del.br_blockcount > bno + 1)
> > >  			del.br_blockcount = bno + 1 - del.br_startoff;
> > > +
> > > +		/* How much can we safely unmap? */
> > > +		if (max_len < del.br_blockcount) {
> > > +			del.br_startoff += del.br_blockcount - max_len;
> > > +			if (!wasdel)
> > > +				del.br_startblock += del.br_blockcount - max_len;
> > > +			del.br_blockcount = max_len;
> > > +		}
> > > +
> > >  		sum = del.br_startblock + del.br_blockcount;
> > >  		if (isrt &&
> > >  		    (mod = do_mod(sum, mp->m_sb.sb_rextsize))) {
> > > @@ -5715,6 +5735,7 @@ __xfs_bunmapi(
> > >  		if (!isrt && wasdel)
> > >  			xfs_mod_fdblocks(mp, (int64_t)del.br_blockcount, false);
> > >  
> > > +		max_len -= del.br_blockcount;
> > >  		bno = del.br_startoff - 1;
> > >  nodelete:
> > >  		/*
> > > diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> > > index b177ef33cd4c..29dcde381505 100644
> > > --- a/fs/xfs/libxfs/xfs_refcount.c
> > > +++ b/fs/xfs/libxfs/xfs_refcount.c
> > > @@ -784,14 +784,6 @@ xfs_refcount_merge_extents(
> > >  }
> > >  
> > >  /*
> > > - * While we're adjusting the refcounts records of an extent, we have
> > > - * to keep an eye on the number of extents we're dirtying -- run too
> > > - * many in a single transaction and we'll exceed the transaction's
> > > - * reservation and crash the fs.  Each record adds 12 bytes to the
> > > - * log (plus any key updates) so we'll conservatively assume 24 bytes
> > > - * per record.  We must also leave space for btree splits on both ends
> > > - * of the range and space for the CUD and a new CUI.
> > > - *
> > >   * XXX: This is a pretty hand-wavy estimate.  The penalty for guessing
> > >   * true incorrectly is a shutdown FS; the penalty for guessing false
> > >   * incorrectly is more transaction rolls than might be necessary.
> > > @@ -822,7 +814,7 @@ xfs_refcount_still_have_space(
> > >  	else if (overhead > cur->bc_tp->t_log_res)
> > >  		return false;
> > >  	return  cur->bc_tp->t_log_res - overhead >
> > > -		cur->bc_private.a.priv.refc.nr_ops * 32;
> > > +		cur->bc_private.a.priv.refc.nr_ops * XFS_REFCOUNT_ITEM_OVERHEAD;
> > >  }
> > >  
> > >  /*
> > > diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
> > > index 098dc668ab2c..eafb9d1f3b37 100644
> > > --- a/fs/xfs/libxfs/xfs_refcount.h
> > > +++ b/fs/xfs/libxfs/xfs_refcount.h
> > > @@ -67,4 +67,20 @@ extern int xfs_refcount_free_cow_extent(struct xfs_mount *mp,
> > >  extern int xfs_refcount_recover_cow_leftovers(struct xfs_mount *mp,
> > >  		xfs_agnumber_t agno);
> > >  
> > > +/*
> > > + * While we're adjusting the refcounts records of an extent, we have
> > > + * to keep an eye on the number of extents we're dirtying -- run too
> > > + * many in a single transaction and we'll exceed the transaction's
> > > + * reservation and crash the fs.  Each record adds 12 bytes to the
> > > + * log (plus any key updates) so we'll conservatively assume 32 bytes
> > > + * per record.  We must also leave space for btree splits on both ends
> > > + * of the range and space for the CUD and a new CUI.
> > > + */
> > > +#define XFS_REFCOUNT_ITEM_OVERHEAD	32
> > > +
> > > +static inline xfs_fileoff_t xfs_refcount_max_unmap(int log_res)
> > > +{
> > > +	return (log_res * 3 / 4) / XFS_REFCOUNT_ITEM_OVERHEAD;
> > > +}
> > > +
> > >  #endif	/* __XFS_REFCOUNT_H__ */
> > > -- 
> > > 2.11.0
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> ---end quoted text---
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2017-06-03 17:01 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-20  7:29 transaction reservations for deleting of shared extents Christoph Hellwig
2017-02-21  1:43 ` Darrick J. Wong
2017-02-21  9:07   ` Christoph Hellwig
2017-02-21 16:53     ` Darrick J. Wong
2017-04-12 13:52   ` Christoph Hellwig
2017-04-12 23:06     ` Darrick J. Wong
2017-04-13  3:52       ` Darrick J. Wong
2017-04-13 10:51         ` Christoph Hellwig
2017-04-13 12:13         ` Christoph Hellwig
2017-04-25  2:09           ` Darrick J. Wong
2017-06-03  7:13             ` Christoph Hellwig
2017-06-03 17:01               ` Darrick J. Wong
2017-04-13 10:33       ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.