All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Prevent extent btree block allocation failures
@ 2008-06-13  7:38 Lachlan McIlroy
  2008-06-13 13:44 ` Christoph Hellwig
  2008-06-13 15:57 ` Dave Chinner
  0 siblings, 2 replies; 17+ messages in thread
From: Lachlan McIlroy @ 2008-06-13  7:38 UTC (permalink / raw)
  To: xfs-dev, xfs-oss

When at ENOSPC conditions extent btree block allocations can fail and we
have no error handling to undo partial btree operations.  Prior to extent
btree operations we reserve enough disk blocks somewhere in the filesystem
to satisfy the operation but in some conditions we require the blocks to
come from specific AGs and if those AGs are full the allocation fails.

This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
for the space needed.  Since we have reserved the space these allocations
are now guaranteed to succeed.  In order to search all AGs I had to revert
a change made to xfs_alloc_vextent() that prevented a search from looking
at AGs lower than the starting AG.  This original change was made to prevent
out of order AG locking when allocating multiple extents on data writeout
but since we only allocate one extent at a time now this particular problem
can't happen.

Lachlan

--- fs/xfs/xfs_alloc.c_1.193	2008-06-03 11:28:55.000000000 +1000
+++ fs/xfs/xfs_alloc.c	2008-06-02 18:40:47.000000000 +1000
@@ -2376,19 +2376,9 @@ xfs_alloc_vextent(
 			if (args->agno == sagno &&
 			    type == XFS_ALLOCTYPE_START_BNO)
 				args->type = XFS_ALLOCTYPE_THIS_AG;
-			/*
-			* For the first allocation, we can try any AG to get
-			* space.  However, if we already have allocated a
-			* block, we don't want to try AGs whose number is below
-			* sagno. Otherwise, we may end up with out-of-order
-			* locking of AGF, which might cause deadlock.
-			*/
-			if (++(args->agno) == mp->m_sb.sb_agcount) {
-				if (args->firstblock != NULLFSBLOCK)
-					args->agno = sagno;
-				else
-					args->agno = 0;
-			}
+
+			if (++(args->agno) == mp->m_sb.sb_agcount)
+				args->agno = 0;
 			/*
 			 * Reached the starting a.g., must either be done
 			 * or switch to non-trylock mode.
--- fs/xfs/xfs_bmap.c_1.392	2008-06-03 12:20:14.000000000 +1000
+++ fs/xfs/xfs_bmap.c	2008-06-03 15:57:40.000000000 +1000
@@ -3445,16 +3452,10 @@ xfs_bmap_extents_to_btree(
 	args.tp = tp;
 	args.mp = mp;
 	args.firstblock = *firstblock;
-	if (*firstblock == NULLFSBLOCK) {
-		args.type = XFS_ALLOCTYPE_START_BNO;
+	args.fsbno = *firstblock;
+	if (*firstblock == NULLFSBLOCK)
 		args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
-	} else if (flist->xbf_low) {
-		args.type = XFS_ALLOCTYPE_START_BNO;
-		args.fsbno = *firstblock;
-	} else {
-		args.type = XFS_ALLOCTYPE_NEAR_BNO;
-		args.fsbno = *firstblock;
-	}
+	args.type = XFS_ALLOCTYPE_START_BNO;
 	args.minlen = args.maxlen = args.prod = 1;
 	args.total = args.minleft = args.alignment = args.mod = args.isfl =
 		args.minalignslop = 0;
@@ -3585,13 +3586,10 @@ xfs_bmap_local_to_extents(
 		 * Allocate a block.  We know we need only one, since the
 		 * file currently fits in an inode.
 		 */
-		if (*firstblock == NULLFSBLOCK) {
+		args.fsbno = *firstblock;
+		if (*firstblock == NULLFSBLOCK)
 			args.fsbno = XFS_INO_TO_FSB(args.mp, ip->i_ino);
-			args.type = XFS_ALLOCTYPE_START_BNO;
-		} else {
-			args.fsbno = *firstblock;
-			args.type = XFS_ALLOCTYPE_NEAR_BNO;
-		}
+		args.type = XFS_ALLOCTYPE_START_BNO;
 		args.total = total;
 		args.mod = args.minleft = args.alignment = args.wasdel =
 			args.isfl = args.minalignslop = 0;
--- fs/xfs/xfs_bmap_btree.c_1.169	2008-06-03 11:28:56.000000000 +1000
+++ fs/xfs/xfs_bmap_btree.c	2008-06-06 14:48:14.000000000 +1000
@@ -1493,11 +1493,9 @@ xfs_bmbt_split(
 	left = XFS_BUF_TO_BMBT_BLOCK(lbp);
 	args.fsbno = cur->bc_private.b.firstblock;
 	args.firstblock = args.fsbno;
-	if (args.fsbno == NULLFSBLOCK) {
+	if (args.fsbno == NULLFSBLOCK)
 		args.fsbno = lbno;
-		args.type = XFS_ALLOCTYPE_START_BNO;
-	} else
-		args.type = XFS_ALLOCTYPE_NEAR_BNO;
+	args.type = XFS_ALLOCTYPE_START_BNO;
 	args.mod = args.minleft = args.alignment = args.total = args.isfl =
 		args.userdata = args.minalignslop = 0;
 	args.minlen = args.maxlen = args.prod = 1;
@@ -2253,9 +2251,8 @@ xfs_bmbt_newroot(
 		}
 #endif
 		args.fsbno = be64_to_cpu(*pp);
-		args.type = XFS_ALLOCTYPE_START_BNO;
-	} else
-		args.type = XFS_ALLOCTYPE_NEAR_BNO;
+	}
+	args.type = XFS_ALLOCTYPE_START_BNO;
 	if ((error = xfs_alloc_vextent(&args))) {
 		XFS_BMBT_TRACE_CURSOR(cur, ERROR);
 		return error;

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Prevent extent btree block allocation failures
  2008-06-13  7:38 [PATCH] Prevent extent btree block allocation failures Lachlan McIlroy
@ 2008-06-13 13:44 ` Christoph Hellwig
  2008-06-16  3:57   ` Lachlan McIlroy
  2008-06-13 15:57 ` Dave Chinner
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2008-06-13 13:44 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: xfs-dev, xfs-oss

On Fri, Jun 13, 2008 at 05:38:12PM +1000, Lachlan McIlroy wrote:
> This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
> xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
> for the space needed.  Since we have reserved the space these allocations
> are now guaranteed to succeed.

This looks good and makes lot of sense to me.  Please also add a comment
to enum xfs_alloctype in xfs_alloc.h about the danger of the allocation
types that never go out of the AG when used inside transactions.

> In order to search all AGs I had to revert
> a change made to xfs_alloc_vextent() that prevented a search from looking
> at AGs lower than the starting AG.  This original change was made to prevent
> out of order AG locking when allocating multiple extents on data writeout
> but since we only allocate one extent at a time now this particular problem
> can't happen.

This one also makes sense, but I have a very bad gut feeling about it.
There's nothing preventing the same deadlock scenario from coming back
when people modify the highlevel data allocator again.  We really need
some sort of assert to trigger early in that case to not got a nasty
hard to trigger deadlock.

> +			if (++(args->agno) == mp->m_sb.sb_agcount)

and while we're at it this should be

			if (++args->agno == mp->m_sb.sb_agcount)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Prevent extent btree block allocation failures
  2008-06-13  7:38 [PATCH] Prevent extent btree block allocation failures Lachlan McIlroy
  2008-06-13 13:44 ` Christoph Hellwig
@ 2008-06-13 15:57 ` Dave Chinner
  2008-06-16  6:11   ` Lachlan McIlroy
  1 sibling, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2008-06-13 15:57 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: xfs-dev, xfs-oss

On Fri, Jun 13, 2008 at 05:38:12PM +1000, Lachlan McIlroy wrote:
> When at ENOSPC conditions extent btree block allocations can fail and we
> have no error handling to undo partial btree operations.  Prior to extent
> btree operations we reserve enough disk blocks somewhere in the filesystem
> to satisfy the operation but in some conditions we require the blocks to
> come from specific AGs and if those AGs are full the allocation fails.
>
> This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
> xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
> for the space needed.  Since we have reserved the space these allocations
> are now guaranteed to succeed. 

Sure, but we didn't reserve space for potential btree splits in a
second AG as a result of this. That needs to be reserved in the
transaction as well, which will blow out transaction reservations
substantially as we'll need to add another 2 full AGF btree splits to
every transaction that modifies the bmap btree.

> In order to search all AGs I had to revert
> a change made to xfs_alloc_vextent() that prevented a search from looking
> at AGs lower than the starting AG.  This original change was made to prevent
> out of order AG locking when allocating multiple extents on data writeout
> but since we only allocate one extent at a time now this particular problem
> can't happen.

You missed the fact that the AGF of modified AGs is already held
locked in the transaction, hence the locking order within the
transaction is wrong. Also, if we modify the free list in an AG
the fail an allocation (e.g. can't do an exact allocation), we'll
have multiple dirty and locked AGFs in the one allocation. Hence
we still can have locking order violations if you remove that check
and therefore deadlocks.

This is not the solution to the problem. As I suggested (back when
you first floated this idea as a fix for the problem several weeks
ago) I think the bug is that we are not taking into account the
number of blocks required for a bmbt split when selecting an AG to
allocate from. All we take into account is the blocks required for
the extent to be allocated and nothing else. If we take the blocks
for a bmbt split into account then we'll never try to allocate an
extent in an AG that we can't also allocate all the blocks for the
bmbt split in at the same time.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Prevent extent btree block allocation failures
  2008-06-13 13:44 ` Christoph Hellwig
@ 2008-06-16  3:57   ` Lachlan McIlroy
  0 siblings, 0 replies; 17+ messages in thread
From: Lachlan McIlroy @ 2008-06-16  3:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs-dev, xfs-oss

Christoph Hellwig wrote:
> On Fri, Jun 13, 2008 at 05:38:12PM +1000, Lachlan McIlroy wrote:
>> This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
>> xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
>> for the space needed.  Since we have reserved the space these allocations
>> are now guaranteed to succeed.
> 
> This looks good and makes lot of sense to me.  Please also add a comment
> to enum xfs_alloctype in xfs_alloc.h about the danger of the allocation
> types that never go out of the AG when used inside transactions.

The flags already have comments?  Are you referring specifically to the
fact we have reserved space somewhere in the filesystem but expect to find
it in a single AG?

> 
>> In order to search all AGs I had to revert
>> a change made to xfs_alloc_vextent() that prevented a search from looking
>> at AGs lower than the starting AG.  This original change was made to prevent
>> out of order AG locking when allocating multiple extents on data writeout
>> but since we only allocate one extent at a time now this particular problem
>> can't happen.
> 
> This one also makes sense, but I have a very bad gut feeling about it.
> There's nothing preventing the same deadlock scenario from coming back
> when people modify the highlevel data allocator again.  We really need
> some sort of assert to trigger early in that case to not got a nasty
> hard to trigger deadlock.

Yes I agree.  I was thinking about making the allocation of a second data
extent in the same transaction a conditional try operation to avoid the
deadlock.  If a caller to the allocator provides room to return more than
one extent then I don't think we have to return more than one - multiple
extents in one transaction is just an optimisation.

> 
>> +			if (++(args->agno) == mp->m_sb.sb_agcount)
> 
> and while we're at it this should be
> 
> 			if (++args->agno == mp->m_sb.sb_agcount)

Done, thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Prevent extent btree block allocation failures
  2008-06-13 15:57 ` Dave Chinner
@ 2008-06-16  6:11   ` Lachlan McIlroy
  2008-06-16 17:10     ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Lachlan McIlroy @ 2008-06-16  6:11 UTC (permalink / raw)
  To: Lachlan McIlroy, xfs-dev, xfs-oss

Dave Chinner wrote:
> On Fri, Jun 13, 2008 at 05:38:12PM +1000, Lachlan McIlroy wrote:
>> When at ENOSPC conditions extent btree block allocations can fail and we
>> have no error handling to undo partial btree operations.  Prior to extent
>> btree operations we reserve enough disk blocks somewhere in the filesystem
>> to satisfy the operation but in some conditions we require the blocks to
>> come from specific AGs and if those AGs are full the allocation fails.
>>
>> This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
>> xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
>> for the space needed.  Since we have reserved the space these allocations
>> are now guaranteed to succeed. 
> 
> Sure, but we didn't reserve space for potential btree splits in a
> second AG as a result of this. That needs to be reserved in the
> transaction as well, which will blow out transaction reservations
> substantially as we'll need to add another 2 full AGF btree splits to
> every transaction that modifies the bmap btree.

Right.  And most of the time we wont need the space either so it's a
real waste.

> 
>> In order to search all AGs I had to revert
>> a change made to xfs_alloc_vextent() that prevented a search from looking
>> at AGs lower than the starting AG.  This original change was made to prevent
>> out of order AG locking when allocating multiple extents on data writeout
>> but since we only allocate one extent at a time now this particular problem
>> can't happen.
> 
> You missed the fact that the AGF of modified AGs is already held
> locked in the transaction, hence the locking order within the
> transaction is wrong. Also, if we modify the free list in an AG
> the fail an allocation (e.g. can't do an exact allocation), we'll
> have multiple dirty and locked AGFs in the one allocation. Hence
> we still can have locking order violations if you remove that check
> and therefore deadlocks.

I'm well aware of that particular deadlock involving the freelist - I
hit it while testing.  If you look closely at the code that deadlock
can occur with or without the AG locking avoidance logic.  This is
because the rest of the transaction is unaware that an AG has been
locked due to a freelist operation.

> 
> This is not the solution to the problem. As I suggested (back when
> you first floated this idea as a fix for the problem several weeks
> ago) I think the bug is that we are not taking into account the
> number of blocks required for a bmbt split when selecting an AG to
> allocate from. All we take into account is the blocks required for
> the extent to be allocated and nothing else. If we take the blocks
> for a bmbt split into account then we'll never try to allocate an
> extent in an AG that we can't also allocate all the blocks for the
> bmbt split in at the same time.
> 

I considered that approach (using the minleft field in xfs_alloc_arg_t)
but it has it's problems too.  When we reserve space for the btree
operations it is done on the global filesystem counters, not a
particular AG, so there is the possibility that not one AG has sufficent
space to perform the allocation even though there is enough free space
in the whole filesystem.  Of course if we have enough space left in one
AG and the AG is locked then the space we reserved doesn't matter anymore
and it should all work.

I'm worried with this approach that we could have delayed allocations and
unwritten extents that need to be converted but we can't do it because we
don't have the space we might need (but probably don't).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Prevent extent btree block allocation failures
  2008-06-16  6:11   ` Lachlan McIlroy
@ 2008-06-16 17:10     ` Dave Chinner
  2008-06-17  1:58       ` Lachlan McIlroy
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2008-06-16 17:10 UTC (permalink / raw)
  To: lachlan; +Cc: xfs-dev, xfs-oss

On Sunday 15 June 2008 11:11 pm, Lachlan McIlroy wrote:
> Dave Chinner wrote:
> > On Fri, Jun 13, 2008 at 05:38:12PM +1000, Lachlan McIlroy wrote:
> >> When at ENOSPC conditions extent btree block allocations can fail and we
> >> have no error handling to undo partial btree operations.  Prior to extent
> >> btree operations we reserve enough disk blocks somewhere in the filesystem
> >> to satisfy the operation but in some conditions we require the blocks to
> >> come from specific AGs and if those AGs are full the allocation fails.
> >>
> >> This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
> >> xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
> >> for the space needed.  Since we have reserved the space these allocations
> >> are now guaranteed to succeed. 
> > 
> > Sure, but we didn't reserve space for potential btree splits in a
> > second AG as a result of this. That needs to be reserved in the
> > transaction as well, which will blow out transaction reservations
> > substantially as we'll need to add another 2 full AGF btree splits to
> > every transaction that modifies the bmap btree.
> 
> Right.  And most of the time we wont need the space either so it's a
> real waste.

Waste, yes, but still needed otherwise transaction overruns and log space
deadlocks could occur....

> >> In order to search all AGs I had to revert
> >> a change made to xfs_alloc_vextent() that prevented a search from looking
> >> at AGs lower than the starting AG.  This original change was made to prevent
> >> out of order AG locking when allocating multiple extents on data writeout
> >> but since we only allocate one extent at a time now this particular problem
> >> can't happen.
> > 
> > You missed the fact that the AGF of modified AGs is already held
> > locked in the transaction, hence the locking order within the
> > transaction is wrong. Also, if we modify the free list in an AG
> > the fail an allocation (e.g. can't do an exact allocation), we'll
> > have multiple dirty and locked AGFs in the one allocation. Hence
> > we still can have locking order violations if you remove that check
> > and therefore deadlocks.
> 
> I'm well aware of that particular deadlock involving the freelist - I
> hit it while testing.  If you look closely at the code that deadlock
> can occur with or without the AG locking avoidance logic.  This is
> because the rest of the transaction is unaware that an AG has been
> locked due to a freelist operation.

Yes, which is why you need to prevent freelist modifications occurring
when you can't allocate anything out of the AG.

> > This is not the solution to the problem. As I suggested (back when
> > you first floated this idea as a fix for the problem several weeks
> > ago) I think the bug is that we are not taking into account the
> > number of blocks required for a bmbt split when selecting an AG to
> > allocate from. All we take into account is the blocks required for
> > the extent to be allocated and nothing else. If we take the blocks
> > for a bmbt split into account then we'll never try to allocate an
> > extent in an AG that we can't also allocate all the blocks for the
> > bmbt split in at the same time.
> 
> I considered that approach (using the minleft field in xfs_alloc_arg_t)
> but it has it's problems too.  When we reserve space for the btree
> operations it is done on the global filesystem counters, not a
> particular AG, so there is the possibility that not one AG has sufficent
> space to perform the allocation even though there is enough free space
> in the whole filesystem.

Yes, we had that problem with the ENOSPC deadlock fixes in that we always
needed at least 4 blocks per AG available for a extent free to succeed.
Hence we have the XFS_ALLOC_SET_ASIDE() value for determining if the
filesystem is out of space, not a count of zero free blocks.

In this case, this macro can be extended to guarantee that our aggregate
block usage never goes below the threshold that would prevent each AG from
holding enough blocks for a worst case allocation to succeed.....

> Of course if we have enough space left in one 
> AG and the AG is locked then the space we reserved doesn't matter anymore
> and it should all work.

Yes.

> I'm worried with this approach that we could have delayed allocations and
> unwritten extents that need to be converted but we can't do it because we
> don't have the space we might need (but probably don't).

Delayed allocation is the issue - unwritten extent conversion failure will
simply return an error and leave the extent unwritten.

With delayed allocation, we can oversubscribe a given AG but we can
always try a different AG. If we get to the situation that we don't have
enough space in any AG then we are screwed. However, by ensuring we can
sustain a worst-case split within any AG we can avoid this situation
completely.

Cheers,

Dave.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Prevent extent btree block allocation failures
  2008-06-16 17:10     ` Dave Chinner
@ 2008-06-17  1:58       ` Lachlan McIlroy
  2008-06-17  7:39         ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Lachlan McIlroy @ 2008-06-17  1:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs-dev, xfs-oss

Dave Chinner wrote:
> On Sunday 15 June 2008 11:11 pm, Lachlan McIlroy wrote:
>> Dave Chinner wrote:
>>> On Fri, Jun 13, 2008 at 05:38:12PM +1000, Lachlan McIlroy wrote:
>>>> When at ENOSPC conditions extent btree block allocations can fail and we
>>>> have no error handling to undo partial btree operations.  Prior to extent
>>>> btree operations we reserve enough disk blocks somewhere in the filesystem
>>>> to satisfy the operation but in some conditions we require the blocks to
>>>> come from specific AGs and if those AGs are full the allocation fails.
>>>>
>>>> This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
>>>> xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
>>>> for the space needed.  Since we have reserved the space these allocations
>>>> are now guaranteed to succeed. 
>>> Sure, but we didn't reserve space for potential btree splits in a
>>> second AG as a result of this. That needs to be reserved in the
>>> transaction as well, which will blow out transaction reservations
>>> substantially as we'll need to add another 2 full AGF btree splits to
>>> every transaction that modifies the bmap btree.
>> Right.  And most of the time we wont need the space either so it's a
>> real waste.
> 
> Waste, yes, but still needed otherwise transaction overruns and log space
> deadlocks could occur....
> 
>>>> In order to search all AGs I had to revert
>>>> a change made to xfs_alloc_vextent() that prevented a search from looking
>>>> at AGs lower than the starting AG.  This original change was made to prevent
>>>> out of order AG locking when allocating multiple extents on data writeout
>>>> but since we only allocate one extent at a time now this particular problem
>>>> can't happen.
>>> You missed the fact that the AGF of modified AGs is already held
>>> locked in the transaction, hence the locking order within the
>>> transaction is wrong. Also, if we modify the free list in an AG
>>> the fail an allocation (e.g. can't do an exact allocation), we'll
>>> have multiple dirty and locked AGFs in the one allocation. Hence
>>> we still can have locking order violations if you remove that check
>>> and therefore deadlocks.
>> I'm well aware of that particular deadlock involving the freelist - I
>> hit it while testing.  If you look closely at the code that deadlock
>> can occur with or without the AG locking avoidance logic.  This is
>> because the rest of the transaction is unaware that an AG has been
>> locked due to a freelist operation.
> 
> Yes, which is why you need to prevent freelist modifications occurring
> when you can't allocate anything out of the AG.

That sounds reasonable but it isn't consistent with the deadlock I saw.
One of the threads that was deadlocked had tried to allocate a data extent
in AG3 but didn't find the space.  It had modified, and hence locked, AG3
due to modifying the freelist but since it didn't get the space it needed
it had to go on to another AG.  So before we even allocated the data extent
(and before we tried to modify the btree, etc) we had an AG locked.  The
deadlock avoidance code in xfs_alloc_vextent() didn't know this because
it only checks for a previous allocation.  It makes sense that we should
avoid modifying the freelist if there isn't enough space for the allocation
so the puzzle is why didn't the code do that?

> 
>>> This is not the solution to the problem. As I suggested (back when
>>> you first floated this idea as a fix for the problem several weeks
>>> ago) I think the bug is that we are not taking into account the
>>> number of blocks required for a bmbt split when selecting an AG to
>>> allocate from. All we take into account is the blocks required for
>>> the extent to be allocated and nothing else. If we take the blocks
>>> for a bmbt split into account then we'll never try to allocate an
>>> extent in an AG that we can't also allocate all the blocks for the
>>> bmbt split in at the same time.
>> I considered that approach (using the minleft field in xfs_alloc_arg_t)
>> but it has it's problems too.  When we reserve space for the btree
>> operations it is done on the global filesystem counters, not a
>> particular AG, so there is the possibility that not one AG has sufficent
>> space to perform the allocation even though there is enough free space
>> in the whole filesystem.
> 
> Yes, we had that problem with the ENOSPC deadlock fixes in that we always
> needed at least 4 blocks per AG available for a extent free to succeed.
> Hence we have the XFS_ALLOC_SET_ASIDE() value for determining if the
> filesystem is out of space, not a count of zero free blocks.

Those 4 blocks are for one extent free operation, right?  What if we have
multiple threads all trying to do the same thing (in the same AG)?

> 
> In this case, this macro can be extended to guarantee that our aggregate
> block usage never goes below the threshold that would prevent each AG from
> holding enough blocks for a worst case allocation to succeed.....
> 
>> Of course if we have enough space left in one 
>> AG and the AG is locked then the space we reserved doesn't matter anymore
>> and it should all work.
> 
> Yes.
> 
>> I'm worried with this approach that we could have delayed allocations and
>> unwritten extents that need to be converted but we can't do it because we
>> don't have the space we might need (but probably don't).
> 
> Delayed allocation is the issue - unwritten extent conversion failure will
> simply return an error and leave the extent unwritten.

That's still a problem though - if we can't convert unwritten extents then
we can't clean dirty pages and we wont be able to unmount the filesystem.

> 
> With delayed allocation, we can oversubscribe a given AG but we can
> always try a different AG. If we get to the situation that we don't have
> enough space in any AG then we are screwed. However, by ensuring we can
> sustain a worst-case split within any AG we can avoid this situation
> completely.
> 
> Cheers,
> 
> Dave.
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Prevent extent btree block allocation failures
  2008-06-17  1:58       ` Lachlan McIlroy
@ 2008-06-17  7:39         ` Dave Chinner
  2008-06-19  7:28           ` Lachlan McIlroy
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2008-06-17  7:39 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: Dave Chinner, xfs-dev, xfs-oss

On Tue, Jun 17, 2008 at 11:58:47AM +1000, Lachlan McIlroy wrote:
> Dave Chinner wrote:
>> On Sunday 15 June 2008 11:11 pm, Lachlan McIlroy wrote:
>>> I'm well aware of that particular deadlock involving the freelist - I
>>> hit it while testing.  If you look closely at the code that deadlock
>>> can occur with or without the AG locking avoidance logic.  This is
>>> because the rest of the transaction is unaware that an AG has been
>>> locked due to a freelist operation.
>>
>> Yes, which is why you need to prevent freelist modifications occurring
>> when you can't allocate anything out of the AG.
>
> That sounds reasonable but it isn't consistent with the deadlock I saw.
> One of the threads that was deadlocked had tried to allocate a data extent
> in AG3 but didn't find the space.  It had modified, and hence locked, AG3
> due to modifying the freelist but since it didn't get the space it needed
> it had to go on to another AG.

That sounds like an exact allocation failure - there is enough
space, a large enough extent but no free space at the exact block
required. This is exactly the case that occurred with the inode
allocation - and then allocation in the same AG failed because of
alignment that wasn't taken into account by the first exact
allocation attempt. Perhaps the minalignslop calculation in
xfs_bmap_btalloc() is incorrect...

> So before we even allocated the data extent
> (and before we tried to modify the btree, etc) we had an AG locked.  The
> deadlock avoidance code in xfs_alloc_vextent() didn't know this because
> it only checks for a previous allocation.  It makes sense that we should
> avoid modifying the freelist if there isn't enough space for the allocation
> so the puzzle is why didn't the code do that?

Good question, and I think that is one we need to answer.

>>> I considered that approach (using the minleft field in xfs_alloc_arg_t)
>>> but it has it's problems too.  When we reserve space for the btree
>>> operations it is done on the global filesystem counters, not a
>>> particular AG, so there is the possibility that not one AG has sufficent
>>> space to perform the allocation even though there is enough free space
>>> in the whole filesystem.
>>
>> Yes, we had that problem with the ENOSPC deadlock fixes in that we always
>> needed at least 4 blocks per AG available for a extent free to succeed.
>> Hence we have the XFS_ALLOC_SET_ASIDE() value for determining if the
>> filesystem is out of space, not a count of zero free blocks.
>
> Those 4 blocks are for one extent free operation, right?  What if we have
> multiple threads all trying to do the same thing (in the same AG)?

If we have empty AGF btrees we need to allocate two new root blocks,
and then we won't have to allocate any more btree blocks for the
next 100+ extent free operations...

For allocations, the first would succeed, the rest would have to
search for another AG...

>>> I'm worried with this approach that we could have delayed allocations and
>>> unwritten extents that need to be converted but we can't do it because we
>>> don't have the space we might need (but probably don't).
>>
>> Delayed allocation is the issue - unwritten extent conversion failure will
>> simply return an error and leave the extent unwritten.
>
> That's still a problem though - if we can't convert unwritten extents then
> we can't clean dirty pages and we wont be able to unmount the filesystem.

There will be errors logged and the extents will remain unwritten.
The filesystem should still be able to be unmounted.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Prevent extent btree block allocation failures
  2008-06-17  7:39         ` Dave Chinner
@ 2008-06-19  7:28           ` Lachlan McIlroy
  2008-06-20  5:21             ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Lachlan McIlroy @ 2008-06-19  7:28 UTC (permalink / raw)
  To: Lachlan McIlroy, Dave Chinner, xfs-dev, xfs-oss

Dave Chinner wrote:
> On Tue, Jun 17, 2008 at 11:58:47AM +1000, Lachlan McIlroy wrote:
>> Dave Chinner wrote:
>>> On Sunday 15 June 2008 11:11 pm, Lachlan McIlroy wrote:
>>>> I'm well aware of that particular deadlock involving the freelist - I
>>>> hit it while testing.  If you look closely at the code that deadlock
>>>> can occur with or without the AG locking avoidance logic.  This is
>>>> because the rest of the transaction is unaware that an AG has been
>>>> locked due to a freelist operation.
>>> Yes, which is why you need to prevent freelist modifications occurring
>>> when you can't allocate anything out of the AG.
>> That sounds reasonable but it isn't consistent with the deadlock I saw.
>> One of the threads that was deadlocked had tried to allocate a data extent
>> in AG3 but didn't find the space.  It had modified, and hence locked, AG3
>> due to modifying the freelist but since it didn't get the space it needed
>> it had to go on to another AG.
> 
> That sounds like an exact allocation failure - there is enough
> space, a large enough extent but no free space at the exact block
> required. This is exactly the case that occurred with the inode
> allocation - and then allocation in the same AG failed because of
> alignment that wasn't taken into account by the first exact
> allocation attempt. Perhaps the minalignslop calculation in
> xfs_bmap_btalloc() is incorrect...

Okay I'll look into that.

There's something else that looks suspicious to me - this code in
xfs_bmap_btalloc() is setting minleft to 0.  Doesn't this go against
what you were saying about setting minleft to be the space we might
need for the btree operations?

	if (args.fsbno == NULLFSBLOCK && nullfb) {
		args.fsbno = 0;
		args.type = XFS_ALLOCTYPE_FIRST_AG;
		args.total = ap->minlen;
		args.minleft = 0;
		if ((error = xfs_alloc_vextent(&args)))
			return error;
		ap->low = 1;
	}

I see it sets a lowspace indicator which filters back up and into
some btree operations.  It appears the purpose of this feature is to
allow allocations to search for space in other AGs as in this example
from xfs_bmap_extents_to_btree():

	if (*firstblock == NULLFSBLOCK) {
		args.type = XFS_ALLOCTYPE_START_BNO;
		args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
	} else if (flist->xbf_low) {
		args.type = XFS_ALLOCTYPE_START_BNO;
		args.fsbno = *firstblock;
	} else {
		args.type = XFS_ALLOCTYPE_NEAR_BNO;
		args.fsbno = *firstblock;
	}

This is sort of what I was trying to do with my patch but without the
special lowspace condition.  This lowspace feature is probably broken
because there was a similar special case in xfs_bmbt_split() that got
removed with the changes that fixed the AG out-of-order locking problem.

@@ -1569,12 +1569,11 @@
 	lbno = XFS_DADDR_TO_FSB(args.mp, XFS_BUF_ADDR(lbp));
 	left = XFS_BUF_TO_BMBT_BLOCK(lbp);
 	args.fsbno = cur->bc_private.b.firstblock;
+	args.firstblock = args.fsbno;
 	if (args.fsbno == NULLFSBLOCK) {
 		args.fsbno = lbno;
 		args.type = XFS_ALLOCTYPE_START_BNO;
-	} else if (cur->bc_private.b.flist->xbf_low)
-		args.type = XFS_ALLOCTYPE_FIRST_AG;
-	else
+	} else
 		args.type = XFS_ALLOCTYPE_NEAR_BNO;
 	args.mod = args.minleft = args.alignment = args.total = args.isfl =
 		args.userdata = args.minalignslop = 0;

This could be why we have allocations failing now.  Maybe it should
have been left in and XFS_ALLOCTYPE_FIRST_AG changed to
XFS_ALLOCTYPE_START_BNO.  But even then it could still fail because the
AG deadlock avoidance code may prevent us from searching the AG that has
the space we need.

Should we ditch this lowspace special condition (and the code in
xfs_bmap_btalloc()) and insist that all the space we need (using minleft)
should come from one AG?

> 
>> So before we even allocated the data extent
>> (and before we tried to modify the btree, etc) we had an AG locked.  The
>> deadlock avoidance code in xfs_alloc_vextent() didn't know this because
>> it only checks for a previous allocation.  It makes sense that we should
>> avoid modifying the freelist if there isn't enough space for the allocation
>> so the puzzle is why didn't the code do that?
> 
> Good question, and I think that is one we need to answer.
> 
>>>> I considered that approach (using the minleft field in xfs_alloc_arg_t)
>>>> but it has it's problems too.  When we reserve space for the btree
>>>> operations it is done on the global filesystem counters, not a
>>>> particular AG, so there is the possibility that not one AG has sufficent
>>>> space to perform the allocation even though there is enough free space
>>>> in the whole filesystem.
>>> Yes, we had that problem with the ENOSPC deadlock fixes in that we always
>>> needed at least 4 blocks per AG available for a extent free to succeed.
>>> Hence we have the XFS_ALLOC_SET_ASIDE() value for determining if the
>>> filesystem is out of space, not a count of zero free blocks.
>> Those 4 blocks are for one extent free operation, right?  What if we have
>> multiple threads all trying to do the same thing (in the same AG)?
> 
> If we have empty AGF btrees we need to allocate two new root blocks,
> and then we won't have to allocate any more btree blocks for the
> next 100+ extent free operations...
> 
> For allocations, the first would succeed, the rest would have to
> search for another AG...
> 
>>>> I'm worried with this approach that we could have delayed allocations and
>>>> unwritten extents that need to be converted but we can't do it because we
>>>> don't have the space we might need (but probably don't).
>>> Delayed allocation is the issue - unwritten extent conversion failure will
>>> simply return an error and leave the extent unwritten.
>> That's still a problem though - if we can't convert unwritten extents then
>> we can't clean dirty pages and we wont be able to unmount the filesystem.
> 
> There will be errors logged and the extents will remain unwritten.
> The filesystem should still be able to be unmounted.
> 

So returning an error from unwritten extent conversion will not re-dirty the
pages?  So we effectively throw the user's data away?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Prevent extent btree block allocation failures
  2008-06-19  7:28           ` Lachlan McIlroy
@ 2008-06-20  5:21             ` Dave Chinner
  2008-06-23  5:20               ` Dave Chinner
  2008-06-23  5:24               ` Lachlan McIlroy
  0 siblings, 2 replies; 17+ messages in thread
From: Dave Chinner @ 2008-06-20  5:21 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: xfs-dev, xfs-oss

On Thu, Jun 19, 2008 at 05:28:50PM +1000, Lachlan McIlroy wrote:
> Dave Chinner wrote:
>> On Tue, Jun 17, 2008 at 11:58:47AM +1000, Lachlan McIlroy wrote:
>>> Dave Chinner wrote:
>>>> On Sunday 15 June 2008 11:11 pm, Lachlan McIlroy wrote:
>>>>> I'm well aware of that particular deadlock involving the freelist - I
>>>>> hit it while testing.  If you look closely at the code that deadlock
>>>>> can occur with or without the AG locking avoidance logic.  This is
>>>>> because the rest of the transaction is unaware that an AG has been
>>>>> locked due to a freelist operation.
>>>> Yes, which is why you need to prevent freelist modifications occurring
>>>> when you can't allocate anything out of the AG.
>>> That sounds reasonable but it isn't consistent with the deadlock I saw.
>>> One of the threads that was deadlocked had tried to allocate a data extent
>>> in AG3 but didn't find the space.  It had modified, and hence locked, AG3
>>> due to modifying the freelist but since it didn't get the space it needed
>>> it had to go on to another AG.
>>
>> That sounds like an exact allocation failure - there is enough
>> space, a large enough extent but no free space at the exact block
>> required. This is exactly the case that occurred with the inode
>> allocation - and then allocation in the same AG failed because of
>> alignment that wasn't taken into account by the first exact
>> allocation attempt. Perhaps the minalignslop calculation in
>> xfs_bmap_btalloc() is incorrect...
>
> Okay I'll look into that.
>
> There's something else that looks suspicious to me - this code in
> xfs_bmap_btalloc() is setting minleft to 0.  Doesn't this go against
> what you were saying about setting minleft to be the space we might
> need for the btree operations?
>
> 	if (args.fsbno == NULLFSBLOCK && nullfb) {
> 		args.fsbno = 0;
> 		args.type = XFS_ALLOCTYPE_FIRST_AG;
> 		args.total = ap->minlen;
> 		args.minleft = 0;
> 		if ((error = xfs_alloc_vextent(&args)))
> 			return error;
> 		ap->low = 1;
> 	}

Hmmm - that looks suspicious. In xfs_bmapi(), when we are doing a
write and *firstblock == NULLFSBLOCK (which leads to nullfb being
set in the above code), we do:

        if (wr && *firstblock == NULLFSBLOCK) {
                if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE)
                        minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
                else
                        minleft = 1;
        } else
                minleft = 0;

If we are in btree format we set the minleft to the number of blocks needed
for a split. If we are in extent or local format, change to extent of btree
format requires one extra block.

The above code you point out definitely breaks this - we haven't done a
previous allocation so we can start from the first AG, but we sure as
hell still need minleft set to the number of blocks needed for a
format change or btree split.

> I see it sets a lowspace indicator which filters back up and into
> some btree operations.  It appears the purpose of this feature is to
> allow allocations to search for space in other AGs as in this example
> from xfs_bmap_extents_to_btree():
>
> 	if (*firstblock == NULLFSBLOCK) {
> 		args.type = XFS_ALLOCTYPE_START_BNO;
> 		args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
> 	} else if (flist->xbf_low) {
> 		args.type = XFS_ALLOCTYPE_START_BNO;
> 		args.fsbno = *firstblock;
> 	} else {
> 		args.type = XFS_ALLOCTYPE_NEAR_BNO;
> 		args.fsbno = *firstblock;
> 	}

Hmmm - the only place xbf_low is used in the extent-to-btree conversion. I
don't have access to the revision history anymore, so i can't find out what
bug the xbf_low condition was added for. It certainly looks like it is
allowing the btree block to be allocated in a different AG to data block.

> This is sort of what I was trying to do with my patch but without the
> special lowspace condition.  This lowspace feature is probably broken
> because there was a similar special case in xfs_bmbt_split() that got
> removed with the changes that fixed the AG out-of-order locking problem.
>
> @@ -1569,12 +1569,11 @@
> 	lbno = XFS_DADDR_TO_FSB(args.mp, XFS_BUF_ADDR(lbp));
> 	left = XFS_BUF_TO_BMBT_BLOCK(lbp);
> 	args.fsbno = cur->bc_private.b.firstblock;
> +	args.firstblock = args.fsbno;
> 	if (args.fsbno == NULLFSBLOCK) {
> 		args.fsbno = lbno;
> 		args.type = XFS_ALLOCTYPE_START_BNO;
> -	} else if (cur->bc_private.b.flist->xbf_low)
> -		args.type = XFS_ALLOCTYPE_FIRST_AG;
> -	else
> +	} else
> 		args.type = XFS_ALLOCTYPE_NEAR_BNO;
> 	args.mod = args.minleft = args.alignment = args.total = args.isfl =
> 		args.userdata = args.minalignslop = 0;
>
> This could be why we have allocations failing now. 

Hmmm - yes, could be. Well found, Lachlan. Was there an equivalent change
to the allocation of a new root block?

> Maybe it should
> have been left in and XFS_ALLOCTYPE_FIRST_AG changed to
> XFS_ALLOCTYPE_START_BNO.  But even then it could still fail because the
> AG deadlock avoidance code may prevent us from searching the AG that has
> the space we need.

Right. But it would definitely be more likely to find space than the current
code without re-introducing the deadlock.

> Should we ditch this lowspace special condition (and the code in
> xfs_bmap_btalloc()) and insist that all the space we need (using minleft)
> should come from one AG?

Well, we could, but I suspect that one condition that it is used it
is safe to do so. That is, the logic goes like this:

	- allocate the last extent in an AG. By definition, that has not
	  caused a AGF btree split as the trees are now empty.
	- because we haven't split any AGF btrees, we still have an unused
	  transaction reservation for full AGF btree splits.
	- seeing as we have a full reservation, we can safely allocate in a
	  different AG without overrunning a transaction reservation.

However, we still need to obey the AGF locking order.

Hmmmm - perhaps before allocating with minleft = 0 we need to
check if we can allocate the rest of the blocks from another AG and
lock both AGs in the correct order first, recheck we can allocate
from both of them after they are locked but before modifying anything
and only then proceed. If we can't find two AGs to allocate from then
we can safely ENOSPC without any problems. In this special case we'd then
be able to search the entire FS for space and hence only get an ENOSPC
if we are really at ENOSPC. Can you pick holes in this?

>>>>> I'm worried with this approach that we could have delayed allocations and
>>>>> unwritten extents that need to be converted but we can't do it because we
>>>>> don't have the space we might need (but probably don't).
>>>> Delayed allocation is the issue - unwritten extent conversion failure will
>>>> simply return an error and leave the extent unwritten.
>>> That's still a problem though - if we can't convert unwritten extents then
>>> we can't clean dirty pages and we wont be able to unmount the filesystem.
>>
>> There will be errors logged and the extents will remain unwritten.
>> The filesystem should still be able to be unmounted.
>
> So returning an error from unwritten extent conversion will not re-dirty the
> pages?  So we effectively throw the user's data away?

Yes, I think that can happen async writes. For anything that is sync the
error will be propagated back to application. Often there is nothing to
report errors back to on async writes - I'm not sure if the errors get
logged in that case, though...

I suspect that this is a holdover from before we had ENOSPC checking on
mmap writes - that could result in mmap oversubscribing space and the
data could not ever be written. In low memory conditions this could
deadlock the machine if we did not throw the pages away. We probably
need to reevisit this now that ->page_mkwrite() prevents
oversubscription from occurring....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Prevent extent btree block allocation failures
  2008-06-20  5:21             ` Dave Chinner
@ 2008-06-23  5:20               ` Dave Chinner
  2008-06-23  5:57                 ` Lachlan McIlroy
  2008-06-23  5:24               ` Lachlan McIlroy
  1 sibling, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2008-06-23  5:20 UTC (permalink / raw)
  To: Lachlan McIlroy, xfs-dev, xfs-oss

On Fri, Jun 20, 2008 at 03:21:20PM +1000, Dave Chinner wrote:
> On Thu, Jun 19, 2008 at 05:28:50PM +1000, Lachlan McIlroy wrote:
> > Dave Chinner wrote:
> >> On Tue, Jun 17, 2008 at 11:58:47AM +1000, Lachlan McIlroy wrote:
> >>> Dave Chinner wrote:
> >>>> On Sunday 15 June 2008 11:11 pm, Lachlan McIlroy wrote:
> >>>>> I'm well aware of that particular deadlock involving the freelist - I
> >>>>> hit it while testing.  If you look closely at the code that deadlock
> >>>>> can occur with or without the AG locking avoidance logic.  This is
> >>>>> because the rest of the transaction is unaware that an AG has been
> >>>>> locked due to a freelist operation.
> >>>> Yes, which is why you need to prevent freelist modifications occurring
> >>>> when you can't allocate anything out of the AG.
> >>> That sounds reasonable but it isn't consistent with the deadlock I saw.
> >>> One of the threads that was deadlocked had tried to allocate a data extent
> >>> in AG3 but didn't find the space.  It had modified, and hence locked, AG3
> >>> due to modifying the freelist but since it didn't get the space it needed
> >>> it had to go on to another AG.
> >>
> >> That sounds like an exact allocation failure - there is enough
> >> space, a large enough extent but no free space at the exact block
> >> required. This is exactly the case that occurred with the inode
> >> allocation - and then allocation in the same AG failed because of
> >> alignment that wasn't taken into account by the first exact
> >> allocation attempt. Perhaps the minalignslop calculation in
> >> xfs_bmap_btalloc() is incorrect...
> >
> > Okay I'll look into that.
> >
> > There's something else that looks suspicious to me - this code in
> > xfs_bmap_btalloc() is setting minleft to 0.  Doesn't this go against
> > what you were saying about setting minleft to be the space we might
> > need for the btree operations?
> >
> > 	if (args.fsbno == NULLFSBLOCK && nullfb) {
> > 		args.fsbno = 0;
> > 		args.type = XFS_ALLOCTYPE_FIRST_AG;
> > 		args.total = ap->minlen;
> > 		args.minleft = 0;
> > 		if ((error = xfs_alloc_vextent(&args)))
> > 			return error;
> > 		ap->low = 1;
> > 	}
> 
> Hmmm - that looks suspicious. In xfs_bmapi(), when we are doing a
> write and *firstblock == NULLFSBLOCK (which leads to nullfb being
> set in the above code), we do:
> 
>         if (wr && *firstblock == NULLFSBLOCK) {
>                 if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE)
>                         minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
>                 else
>                         minleft = 1;
>         } else
>                 minleft = 0;
> 
> If we are in btree format we set the minleft to the number of blocks needed
> for a split. If we are in extent or local format, change to extent of btree
> format requires one extra block.
> 
> The above code you point out definitely breaks this - we haven't done a
> previous allocation so we can start from the first AG, but we sure as
> hell still need minleft set to the number of blocks needed for a
> format change or btree split.

Just to point out yet another problem in this code (one that's just
been tripped over @ agami) is unwritten extent conversion.

Basically, we don't do an allocation here, so when we end up in
xfs_bmap_add_extent_unwritten_real() with a null firstblock. Hence
the cases where conversion can cause a split - case
MASK(LEFT_FILLING), MASK(RIGHT_FILLING) and 0 (convert the middle of
an extent) - we can select an AG that doesn't have enough space for
the entire split as we've ignored the number of blocks we might
need to allocate in the split (the minleft parameter) entirely.

I suspect that xfs_bmbt_split() needs to handle the null first block
case slightly differently - the minleft parameter passed to the
allocation should not be zero - it should be the number of levels
above the current level left in the tree. i.e:

	minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;

If we've already got a firstblock set, then this should have already
been taken into account (i.e. we still need to fix the low space
case where it got ignored as we were discussing).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Prevent extent btree block allocation failures
  2008-06-20  5:21             ` Dave Chinner
  2008-06-23  5:20               ` Dave Chinner
@ 2008-06-23  5:24               ` Lachlan McIlroy
  2008-06-23  6:21                 ` Dave Chinner
  1 sibling, 1 reply; 17+ messages in thread
From: Lachlan McIlroy @ 2008-06-23  5:24 UTC (permalink / raw)
  To: Lachlan McIlroy, xfs-dev, xfs-oss

Dave Chinner wrote:
> On Thu, Jun 19, 2008 at 05:28:50PM +1000, Lachlan McIlroy wrote:
>> Dave Chinner wrote:
>>> On Tue, Jun 17, 2008 at 11:58:47AM +1000, Lachlan McIlroy wrote:
>>>> Dave Chinner wrote:
>>>>> On Sunday 15 June 2008 11:11 pm, Lachlan McIlroy wrote:
>>>>>> I'm well aware of that particular deadlock involving the freelist - I
>>>>>> hit it while testing.  If you look closely at the code that deadlock
>>>>>> can occur with or without the AG locking avoidance logic.  This is
>>>>>> because the rest of the transaction is unaware that an AG has been
>>>>>> locked due to a freelist operation.
>>>>> Yes, which is why you need to prevent freelist modifications occurring
>>>>> when you can't allocate anything out of the AG.
>>>> That sounds reasonable but it isn't consistent with the deadlock I saw.
>>>> One of the threads that was deadlocked had tried to allocate a data extent
>>>> in AG3 but didn't find the space.  It had modified, and hence locked, AG3
>>>> due to modifying the freelist but since it didn't get the space it needed
>>>> it had to go on to another AG.
>>> That sounds like an exact allocation failure - there is enough
>>> space, a large enough extent but no free space at the exact block
>>> required. This is exactly the case that occurred with the inode
>>> allocation - and then allocation in the same AG failed because of
>>> alignment that wasn't taken into account by the first exact
>>> allocation attempt. Perhaps the minalignslop calculation in
>>> xfs_bmap_btalloc() is incorrect...
>> Okay I'll look into that.
>>
>> There's something else that looks suspicious to me - this code in
>> xfs_bmap_btalloc() is setting minleft to 0.  Doesn't this go against
>> what you were saying about setting minleft to be the space we might
>> need for the btree operations?
>>
>> 	if (args.fsbno == NULLFSBLOCK && nullfb) {
>> 		args.fsbno = 0;
>> 		args.type = XFS_ALLOCTYPE_FIRST_AG;
>> 		args.total = ap->minlen;
>> 		args.minleft = 0;
>> 		if ((error = xfs_alloc_vextent(&args)))
>> 			return error;
>> 		ap->low = 1;
>> 	}
> 
> Hmmm - that looks suspicious. In xfs_bmapi(), when we are doing a
> write and *firstblock == NULLFSBLOCK (which leads to nullfb being
> set in the above code), we do:
> 
>         if (wr && *firstblock == NULLFSBLOCK) {
>                 if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE)
>                         minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
>                 else
>                         minleft = 1;
>         } else
>                 minleft = 0;
> 
> If we are in btree format we set the minleft to the number of blocks needed
> for a split. If we are in extent or local format, change to extent of btree
> format requires one extra block.
> 
> The above code you point out definitely breaks this - we haven't done a
> previous allocation so we can start from the first AG, but we sure as
> hell still need minleft set to the number of blocks needed for a
> format change or btree split.
> 
>> I see it sets a lowspace indicator which filters back up and into
>> some btree operations.  It appears the purpose of this feature is to
>> allow allocations to search for space in other AGs as in this example
>> from xfs_bmap_extents_to_btree():
>>
>> 	if (*firstblock == NULLFSBLOCK) {
>> 		args.type = XFS_ALLOCTYPE_START_BNO;
>> 		args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
>> 	} else if (flist->xbf_low) {
>> 		args.type = XFS_ALLOCTYPE_START_BNO;
>> 		args.fsbno = *firstblock;
>> 	} else {
>> 		args.type = XFS_ALLOCTYPE_NEAR_BNO;
>> 		args.fsbno = *firstblock;
>> 	}
> 
> Hmmm - the only place xbf_low is used in the extent-to-btree conversion. I
> don't have access to the revision history anymore, so i can't find out what
> bug the xbf_low condition was added for. It certainly looks like it is
> allowing the btree block to be allocated in a different AG to data block.

The lowspace algorithm was added way back in the early '90s and has been
'tweaked' many times since.

> 
>> This is sort of what I was trying to do with my patch but without the
>> special lowspace condition.  This lowspace feature is probably broken
>> because there was a similar special case in xfs_bmbt_split() that got
>> removed with the changes that fixed the AG out-of-order locking problem.
>>
>> @@ -1569,12 +1569,11 @@
>> 	lbno = XFS_DADDR_TO_FSB(args.mp, XFS_BUF_ADDR(lbp));
>> 	left = XFS_BUF_TO_BMBT_BLOCK(lbp);
>> 	args.fsbno = cur->bc_private.b.firstblock;
>> +	args.firstblock = args.fsbno;
>> 	if (args.fsbno == NULLFSBLOCK) {
>> 		args.fsbno = lbno;
>> 		args.type = XFS_ALLOCTYPE_START_BNO;
>> -	} else if (cur->bc_private.b.flist->xbf_low)
>> -		args.type = XFS_ALLOCTYPE_FIRST_AG;
>> -	else
>> +	} else
>> 		args.type = XFS_ALLOCTYPE_NEAR_BNO;
>> 	args.mod = args.minleft = args.alignment = args.total = args.isfl =
>> 		args.userdata = args.minalignslop = 0;
>>
>> This could be why we have allocations failing now. 
> 
> Hmmm - yes, could be. Well found, Lachlan. Was there an equivalent change
> to the allocation of a new root block?

Yeah but it got dropped 14 years ago in a code cleanup change!  Looks like
it was by mistake too.  There used to be another special case for converting
delayed allocations that had the same semantics as this low space trick - it
used XFS_ALLOCTYPE_FIRST_AG instead - maybe to try a little harder to find
space for cases where it is too late to return an error to the user.

> 
>> Maybe it should
>> have been left in and XFS_ALLOCTYPE_FIRST_AG changed to
>> XFS_ALLOCTYPE_START_BNO.  But even then it could still fail because the
>> AG deadlock avoidance code may prevent us from searching the AG that has
>> the space we need.
> 
> Right. But it would definitely be more likely to find space than the current
> code without re-introducing the deadlock.
> 
>> Should we ditch this lowspace special condition (and the code in
>> xfs_bmap_btalloc()) and insist that all the space we need (using minleft)
>> should come from one AG?
> 
> Well, we could, but I suspect that one condition that it is used it
> is safe to do so. That is, the logic goes like this:
> 
> 	- allocate the last extent in an AG. By definition, that has not
> 	  caused a AGF btree split as the trees are now empty.
> 	- because we haven't split any AGF btrees, we still have an unused
> 	  transaction reservation for full AGF btree splits.
> 	- seeing as we have a full reservation, we can safely allocate in a
> 	  different AG without overrunning a transaction reservation.
> 
> However, we still need to obey the AGF locking order.
> 
> Hmmmm - perhaps before allocating with minleft = 0 we need to
> check if we can allocate the rest of the blocks from another AG and
> lock both AGs in the correct order first, recheck we can allocate
> from both of them after they are locked but before modifying anything
> and only then proceed. If we can't find two AGs to allocate from then
> we can safely ENOSPC without any problems. In this special case we'd then
> be able to search the entire FS for space and hence only get an ENOSPC
> if we are really at ENOSPC. Can you pick holes in this?

Sounds like it should work.  We may need to lock more than two AGs at once to
find all the space we need.  Since we can't lock AGs out of order then if we get
to the last AG and we still don't have enough space then we will need to try
again but start at an earlier AG (say AG 0 which should work).

So the code in xfs_alloc_vextent() that uses XFS_ALLOCTYPE_FIRST_AG and sets
minleft to 0 should work.  If we start at AG 0 and we've done the proper reservations
then we should eventually find all the space we need - as long as everything that
needs to allocate space obeys the lowspace algorithm and we always kick off each
search for space from the AG we last allocated from.

> 
>>>>>> I'm worried with this approach that we could have delayed allocations and
>>>>>> unwritten extents that need to be converted but we can't do it because we
>>>>>> don't have the space we might need (but probably don't).
>>>>> Delayed allocation is the issue - unwritten extent conversion failure will
>>>>> simply return an error and leave the extent unwritten.
>>>> That's still a problem though - if we can't convert unwritten extents then
>>>> we can't clean dirty pages and we wont be able to unmount the filesystem.
>>> There will be errors logged and the extents will remain unwritten.
>>> The filesystem should still be able to be unmounted.
>> So returning an error from unwritten extent conversion will not re-dirty the
>> pages?  So we effectively throw the user's data away?
> 
> Yes, I think that can happen async writes. For anything that is sync the
> error will be propagated back to application. Often there is nothing to
> report errors back to on async writes - I'm not sure if the errors get
> logged in that case, though...
> 
> I suspect that this is a holdover from before we had ENOSPC checking on
> mmap writes - that could result in mmap oversubscribing space and the
> data could not ever be written. In low memory conditions this could
> deadlock the machine if we did not throw the pages away. We probably
> need to reevisit this now that ->page_mkwrite() prevents
> oversubscription from occurring....
> 
> Cheers,
> 
> Dave.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Prevent extent btree block allocation failures
  2008-06-23  5:20               ` Dave Chinner
@ 2008-06-23  5:57                 ` Lachlan McIlroy
  2008-06-23  6:14                   ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Lachlan McIlroy @ 2008-06-23  5:57 UTC (permalink / raw)
  To: Lachlan McIlroy, xfs-dev, xfs-oss

Dave Chinner wrote:
> On Fri, Jun 20, 2008 at 03:21:20PM +1000, Dave Chinner wrote:
>> On Thu, Jun 19, 2008 at 05:28:50PM +1000, Lachlan McIlroy wrote:
>>> Dave Chinner wrote:
>>>> On Tue, Jun 17, 2008 at 11:58:47AM +1000, Lachlan McIlroy wrote:
>>>>> Dave Chinner wrote:
>>>>>> On Sunday 15 June 2008 11:11 pm, Lachlan McIlroy wrote:
>>>>>>> I'm well aware of that particular deadlock involving the freelist - I
>>>>>>> hit it while testing.  If you look closely at the code that deadlock
>>>>>>> can occur with or without the AG locking avoidance logic.  This is
>>>>>>> because the rest of the transaction is unaware that an AG has been
>>>>>>> locked due to a freelist operation.
>>>>>> Yes, which is why you need to prevent freelist modifications occurring
>>>>>> when you can't allocate anything out of the AG.
>>>>> That sounds reasonable but it isn't consistent with the deadlock I saw.
>>>>> One of the threads that was deadlocked had tried to allocate a data extent
>>>>> in AG3 but didn't find the space.  It had modified, and hence locked, AG3
>>>>> due to modifying the freelist but since it didn't get the space it needed
>>>>> it had to go on to another AG.
>>>> That sounds like an exact allocation failure - there is enough
>>>> space, a large enough extent but no free space at the exact block
>>>> required. This is exactly the case that occurred with the inode
>>>> allocation - and then allocation in the same AG failed because of
>>>> alignment that wasn't taken into account by the first exact
>>>> allocation attempt. Perhaps the minalignslop calculation in
>>>> xfs_bmap_btalloc() is incorrect...
>>> Okay I'll look into that.
>>>
>>> There's something else that looks suspicious to me - this code in
>>> xfs_bmap_btalloc() is setting minleft to 0.  Doesn't this go against
>>> what you were saying about setting minleft to be the space we might
>>> need for the btree operations?
>>>
>>> 	if (args.fsbno == NULLFSBLOCK && nullfb) {
>>> 		args.fsbno = 0;
>>> 		args.type = XFS_ALLOCTYPE_FIRST_AG;
>>> 		args.total = ap->minlen;
>>> 		args.minleft = 0;
>>> 		if ((error = xfs_alloc_vextent(&args)))
>>> 			return error;
>>> 		ap->low = 1;
>>> 	}
>> Hmmm - that looks suspicious. In xfs_bmapi(), when we are doing a
>> write and *firstblock == NULLFSBLOCK (which leads to nullfb being
>> set in the above code), we do:
>>
>>         if (wr && *firstblock == NULLFSBLOCK) {
>>                 if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE)
>>                         minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
>>                 else
>>                         minleft = 1;
>>         } else
>>                 minleft = 0;
>>
>> If we are in btree format we set the minleft to the number of blocks needed
>> for a split. If we are in extent or local format, change to extent of btree
>> format requires one extra block.
>>
>> The above code you point out definitely breaks this - we haven't done a
>> previous allocation so we can start from the first AG, but we sure as
>> hell still need minleft set to the number of blocks needed for a
>> format change or btree split.
> 
> Just to point out yet another problem in this code (one that's just
> been tripped over @ agami) is unwritten extent conversion.
> 
> Basically, we don't do an allocation here, so when we end up in
> xfs_bmap_add_extent_unwritten_real() with a null firstblock. Hence
> the cases where conversion can cause a split - case
> MASK(LEFT_FILLING), MASK(RIGHT_FILLING) and 0 (convert the middle of
> an extent) - we can select an AG that doesn't have enough space for
> the entire split as we've ignored the number of blocks we might
> need to allocate in the split (the minleft parameter) entirely.
> 
> I suspect that xfs_bmbt_split() needs to handle the null first block
> case slightly differently - the minleft parameter passed to the
> allocation should not be zero - it should be the number of levels
> above the current level left in the tree. i.e:
> 
> 	minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
> 
> If we've already got a firstblock set, then this should have already
> been taken into account (i.e. we still need to fix the low space
> case where it got ignored as we were discussing).
> 

Funny.  I tested the exact same change last week to try to fix the same
problem.  Seemed to work okay.

In the case where we convert the middle of an existing unwritten extent
we need to insert two new extents.  I might be paranoid here but I'll
assume the worst case scenario and that we'll need space for two complete
tree splits.  The first allocation for the first insert will set minleft
correctly but what about the allocations for splits during the second
insert?  We could run out of space in the chosen AG because minleft wasn't
enough.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Prevent extent btree block allocation failures
  2008-06-23  5:57                 ` Lachlan McIlroy
@ 2008-06-23  6:14                   ` Dave Chinner
  2008-06-23  6:40                     ` Lachlan McIlroy
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2008-06-23  6:14 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: xfs-dev, xfs-oss

On Mon, Jun 23, 2008 at 03:57:22PM +1000, Lachlan McIlroy wrote:
> Dave Chinner wrote:
>> On Fri, Jun 20, 2008 at 03:21:20PM +1000, Dave Chinner wrote:
>>> On Thu, Jun 19, 2008 at 05:28:50PM +1000, Lachlan McIlroy wrote:
>>>> There's something else that looks suspicious to me - this code in
>>>> xfs_bmap_btalloc() is setting minleft to 0.  Doesn't this go against
>>>> what you were saying about setting minleft to be the space we might
>>>> need for the btree operations?
>>>>
>>>> 	if (args.fsbno == NULLFSBLOCK && nullfb) {
>>>> 		args.fsbno = 0;
>>>> 		args.type = XFS_ALLOCTYPE_FIRST_AG;
>>>> 		args.total = ap->minlen;
>>>> 		args.minleft = 0;
>>>> 		if ((error = xfs_alloc_vextent(&args)))
>>>> 			return error;
>>>> 		ap->low = 1;
>>>> 	}
>>> Hmmm - that looks suspicious. In xfs_bmapi(), when we are doing a
>>> write and *firstblock == NULLFSBLOCK (which leads to nullfb being
>>> set in the above code), we do:
>>>
>>>         if (wr && *firstblock == NULLFSBLOCK) {
>>>                 if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE)
>>>                         minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
>>>                 else
>>>                         minleft = 1;
>>>         } else
>>>                 minleft = 0;
>>>
>>> If we are in btree format we set the minleft to the number of blocks needed
>>> for a split. If we are in extent or local format, change to extent of btree
>>> format requires one extra block.
>>>
>>> The above code you point out definitely breaks this - we haven't done a
>>> previous allocation so we can start from the first AG, but we sure as
>>> hell still need minleft set to the number of blocks needed for a
>>> format change or btree split.
>>
>> Just to point out yet another problem in this code (one that's just
>> been tripped over @ agami) is unwritten extent conversion.
>>
>> Basically, we don't do an allocation here, so when we end up in
>> xfs_bmap_add_extent_unwritten_real() with a null firstblock. Hence
>> the cases where conversion can cause a split - case
>> MASK(LEFT_FILLING), MASK(RIGHT_FILLING) and 0 (convert the middle of
>> an extent) - we can select an AG that doesn't have enough space for
>> the entire split as we've ignored the number of blocks we might
>> need to allocate in the split (the minleft parameter) entirely.
>>
>> I suspect that xfs_bmbt_split() needs to handle the null first block
>> case slightly differently - the minleft parameter passed to the
>> allocation should not be zero - it should be the number of levels
>> above the current level left in the tree. i.e:
>>
>> 	minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
>>
>> If we've already got a firstblock set, then this should have already
>> been taken into account (i.e. we still need to fix the low space
>> case where it got ignored as we were discussing).
>
> Funny.  I tested the exact same change last week to try to fix the same
> problem.  Seemed to work okay.

Cool. Got a patch for review?

> In the case where we convert the middle of an existing unwritten extent
> we need to insert two new extents.  I might be paranoid here but I'll
> assume the worst case scenario and that we'll need space for two complete
> tree splits.

Yes, I think so. Certainly, if you look at the block reservation in
xfs_iomap_write_unwritten():

892         resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0) << 1;

#define XFS_DIOSTRAT_SPACE_RES(mp, v)   \
        (XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK) + (v))

#define XFS_EXTENTADD_SPACE_RES(mp,w)   (XFS_BM_MAXLEVELS(mp,w) - 1)

It reserves enough blocks for 2 bmbt splits so I think this is
definitely a possibility we need to handle.

> The first allocation for the first insert will set minleft
> correctly but what about the allocations for splits during the second
> insert?  We could run out of space in the chosen AG because minleft wasn't
> enough.

Yeah, so we probably need pass a flag in the cursor to indicate
it's a double split case when doing the first allocation in
xfs_bmbt_split....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Prevent extent btree block allocation failures
  2008-06-23  5:24               ` Lachlan McIlroy
@ 2008-06-23  6:21                 ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2008-06-23  6:21 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: xfs-dev, xfs-oss

On Mon, Jun 23, 2008 at 03:24:45PM +1000, Lachlan McIlroy wrote:
> Dave Chinner wrote:
>> On Thu, Jun 19, 2008 at 05:28:50PM +1000, Lachlan McIlroy wrote:
>>> I see it sets a lowspace indicator which filters back up and into
>>> some btree operations.  It appears the purpose of this feature is to
>>> allow allocations to search for space in other AGs as in this example
>>> from xfs_bmap_extents_to_btree():
>>>
>>> 	if (*firstblock == NULLFSBLOCK) {
>>> 		args.type = XFS_ALLOCTYPE_START_BNO;
>>> 		args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
>>> 	} else if (flist->xbf_low) {
>>> 		args.type = XFS_ALLOCTYPE_START_BNO;
>>> 		args.fsbno = *firstblock;
>>> 	} else {
>>> 		args.type = XFS_ALLOCTYPE_NEAR_BNO;
>>> 		args.fsbno = *firstblock;
>>> 	}
>>
>> Hmmm - the only place xbf_low is used in the extent-to-btree conversion. I
>> don't have access to the revision history anymore, so i can't find out what
>> bug the xbf_low condition was added for. It certainly looks like it is
>> allowing the btree block to be allocated in a different AG to data block.
>
> The lowspace algorithm was added way back in the early '90s and has been
> 'tweaked' many times since.

It's good to have a complete revision history around somewhere. ;)

>>> This is sort of what I was trying to do with my patch but without the
>>> special lowspace condition.  This lowspace feature is probably broken
>>> because there was a similar special case in xfs_bmbt_split() that got
>>> removed with the changes that fixed the AG out-of-order locking problem.
>>>
>>> @@ -1569,12 +1569,11 @@
>>> 	lbno = XFS_DADDR_TO_FSB(args.mp, XFS_BUF_ADDR(lbp));
>>> 	left = XFS_BUF_TO_BMBT_BLOCK(lbp);
>>> 	args.fsbno = cur->bc_private.b.firstblock;
>>> +	args.firstblock = args.fsbno;
>>> 	if (args.fsbno == NULLFSBLOCK) {
>>> 		args.fsbno = lbno;
>>> 		args.type = XFS_ALLOCTYPE_START_BNO;
>>> -	} else if (cur->bc_private.b.flist->xbf_low)
>>> -		args.type = XFS_ALLOCTYPE_FIRST_AG;
>>> -	else
>>> +	} else
>>> 		args.type = XFS_ALLOCTYPE_NEAR_BNO;
>>> 	args.mod = args.minleft = args.alignment = args.total = args.isfl =
>>> 		args.userdata = args.minalignslop = 0;
>>>
>>> This could be why we have allocations failing now. 
>>
>> Hmmm - yes, could be. Well found, Lachlan. Was there an equivalent change
>> to the allocation of a new root block?
>
> Yeah but it got dropped 14 years ago in a code cleanup change!  Looks like
> it was by mistake too. 

Ouch.

> There used to be another special case for converting
> delayed allocations that had the same semantics as this low space trick - it
> used XFS_ALLOCTYPE_FIRST_AG instead - maybe to try a little harder to find
> space for cases where it is too late to return an error to the user.

Interesting. Any idea why that was removed? Or another accident?

[....]

>> Hmmmm - perhaps before allocating with minleft = 0 we need to
>> check if we can allocate the rest of the blocks from another AG and
>> lock both AGs in the correct order first, recheck we can allocate
>> from both of them after they are locked but before modifying anything
>> and only then proceed. If we can't find two AGs to allocate from then
>> we can safely ENOSPC without any problems. In this special case we'd then
>> be able to search the entire FS for space and hence only get an ENOSPC
>> if we are really at ENOSPC. Can you pick holes in this?
>
> Sounds like it should work.  We may need to lock more than two AGs at once to
> find all the space we need.  Since we can't lock AGs out of order then if we get
> to the last AG and we still don't have enough space then we will need to try
> again but start at an earlier AG (say AG 0 which should work).

In this case, I'd just start from XFS_ALLOCTYPE_FIRST_AG rather than
doing multiple passes and having to undo between them....

> So the code in xfs_alloc_vextent() that uses XFS_ALLOCTYPE_FIRST_AG and sets
> minleft to 0 should work.

Yes, as long as the next selected AG has the original minleft blocks
available in it.

> If we start at AG 0 and we've done the proper reservations
> then we should eventually find all the space we need - as long as everything that
> needs to allocate space obeys the lowspace algorithm and we always kick off each
> search for space from the AG we last allocated from.

Yup. Seems sane to me.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Prevent extent btree block allocation failures
  2008-06-23  6:14                   ` Dave Chinner
@ 2008-06-23  6:40                     ` Lachlan McIlroy
  2008-06-23  8:05                       ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Lachlan McIlroy @ 2008-06-23  6:40 UTC (permalink / raw)
  To: Lachlan McIlroy, xfs-dev, xfs-oss

Dave Chinner wrote:
> On Mon, Jun 23, 2008 at 03:57:22PM +1000, Lachlan McIlroy wrote:
>> Dave Chinner wrote:
>>> On Fri, Jun 20, 2008 at 03:21:20PM +1000, Dave Chinner wrote:
>>>> On Thu, Jun 19, 2008 at 05:28:50PM +1000, Lachlan McIlroy wrote:
>>>>> There's something else that looks suspicious to me - this code in
>>>>> xfs_bmap_btalloc() is setting minleft to 0.  Doesn't this go against
>>>>> what you were saying about setting minleft to be the space we might
>>>>> need for the btree operations?
>>>>>
>>>>> 	if (args.fsbno == NULLFSBLOCK && nullfb) {
>>>>> 		args.fsbno = 0;
>>>>> 		args.type = XFS_ALLOCTYPE_FIRST_AG;
>>>>> 		args.total = ap->minlen;
>>>>> 		args.minleft = 0;
>>>>> 		if ((error = xfs_alloc_vextent(&args)))
>>>>> 			return error;
>>>>> 		ap->low = 1;
>>>>> 	}
>>>> Hmmm - that looks suspicious. In xfs_bmapi(), when we are doing a
>>>> write and *firstblock == NULLFSBLOCK (which leads to nullfb being
>>>> set in the above code), we do:
>>>>
>>>>         if (wr && *firstblock == NULLFSBLOCK) {
>>>>                 if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE)
>>>>                         minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
>>>>                 else
>>>>                         minleft = 1;
>>>>         } else
>>>>                 minleft = 0;
>>>>
>>>> If we are in btree format we set the minleft to the number of blocks needed
>>>> for a split. If we are in extent or local format, change to extent of btree
>>>> format requires one extra block.
>>>>
>>>> The above code you point out definitely breaks this - we haven't done a
>>>> previous allocation so we can start from the first AG, but we sure as
>>>> hell still need minleft set to the number of blocks needed for a
>>>> format change or btree split.
>>> Just to point out yet another problem in this code (one that's just
>>> been tripped over @ agami) is unwritten extent conversion.
>>>
>>> Basically, we don't do an allocation here, so when we end up in
>>> xfs_bmap_add_extent_unwritten_real() with a null firstblock. Hence
>>> the cases where conversion can cause a split - case
>>> MASK(LEFT_FILLING), MASK(RIGHT_FILLING) and 0 (convert the middle of
>>> an extent) - we can select an AG that doesn't have enough space for
>>> the entire split as we've ignored the number of blocks we might
>>> need to allocate in the split (the minleft parameter) entirely.
>>>
>>> I suspect that xfs_bmbt_split() needs to handle the null first block
>>> case slightly differently - the minleft parameter passed to the
>>> allocation should not be zero - it should be the number of levels
>>> above the current level left in the tree. i.e:
>>>
>>> 	minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
>>>
>>> If we've already got a firstblock set, then this should have already
>>> been taken into account (i.e. we still need to fix the low space
>>> case where it got ignored as we were discussing).
>> Funny.  I tested the exact same change last week to try to fix the same
>> problem.  Seemed to work okay.
> 
> Cool. Got a patch for review?

I couldn't find the original patch that calculated minleft as above - instead
here's a variant that addresses the double insert problem by retrieving the
reservation amount from the transaction.  It could very well be overkill though.

--- fs/xfs/xfs_bmap_btree.c_1.169	2008-06-16 17:25:10.000000000 +1000
+++ fs/xfs/xfs_bmap_btree.c	2008-06-16 18:32:45.000000000 +1000
@@ -1496,9 +1496,12 @@ xfs_bmbt_split(
 	if (args.fsbno == NULLFSBLOCK) {
 		args.fsbno = lbno;
 		args.type = XFS_ALLOCTYPE_START_BNO;
-	} else
+		args.minleft = xfs_trans_get_block_res(args.tp);
+	} else {
 		args.type = XFS_ALLOCTYPE_NEAR_BNO;
-	args.mod = args.minleft = args.alignment = args.total = args.isfl =
+		args.minleft = 0;
+	}
+	args.mod = args.alignment = args.total = args.isfl =
 		args.userdata = args.minalignslop = 0;
 	args.minlen = args.maxlen = args.prod = 1;
 	args.wasdel = cur->bc_private.b.flags & XFS_BTCUR_BPRV_WASDEL;


> 
>> In the case where we convert the middle of an existing unwritten extent
>> we need to insert two new extents.  I might be paranoid here but I'll
>> assume the worst case scenario and that we'll need space for two complete
>> tree splits.
> 
> Yes, I think so. Certainly, if you look at the block reservation in
> xfs_iomap_write_unwritten():
> 
> 892         resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0) << 1;
> 
> #define XFS_DIOSTRAT_SPACE_RES(mp, v)   \
>         (XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK) + (v))
> 
> #define XFS_EXTENTADD_SPACE_RES(mp,w)   (XFS_BM_MAXLEVELS(mp,w) - 1)
> 
> It reserves enough blocks for 2 bmbt splits so I think this is
> definitely a possibility we need to handle.
> 
>> The first allocation for the first insert will set minleft
>> correctly but what about the allocations for splits during the second
>> insert?  We could run out of space in the chosen AG because minleft wasn't
>> enough.
> 
> Yeah, so we probably need pass a flag in the cursor to indicate
> it's a double split case when doing the first allocation in
> xfs_bmbt_split....
> 
> Cheers,
> 
> Dave.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Prevent extent btree block allocation failures
  2008-06-23  6:40                     ` Lachlan McIlroy
@ 2008-06-23  8:05                       ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2008-06-23  8:05 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: xfs-dev, xfs-oss

On Mon, Jun 23, 2008 at 04:40:26PM +1000, Lachlan McIlroy wrote:
> Dave Chinner wrote:
>> On Mon, Jun 23, 2008 at 03:57:22PM +1000, Lachlan McIlroy wrote:
>>> Dave Chinner wrote:
>>>> On Fri, Jun 20, 2008 at 03:21:20PM +1000, Dave Chinner wrote:
>>>>> On Thu, Jun 19, 2008 at 05:28:50PM +1000, Lachlan McIlroy wrote:
>>>>>> There's something else that looks suspicious to me - this code in
>>>>>> xfs_bmap_btalloc() is setting minleft to 0.  Doesn't this go against
>>>>>> what you were saying about setting minleft to be the space we might
>>>>>> need for the btree operations?
>>>>>>
>>>>>> 	if (args.fsbno == NULLFSBLOCK && nullfb) {
>>>>>> 		args.fsbno = 0;
>>>>>> 		args.type = XFS_ALLOCTYPE_FIRST_AG;
>>>>>> 		args.total = ap->minlen;
>>>>>> 		args.minleft = 0;
>>>>>> 		if ((error = xfs_alloc_vextent(&args)))
>>>>>> 			return error;
>>>>>> 		ap->low = 1;
>>>>>> 	}
>>>>> Hmmm - that looks suspicious. In xfs_bmapi(), when we are doing a
>>>>> write and *firstblock == NULLFSBLOCK (which leads to nullfb being
>>>>> set in the above code), we do:
>>>>>
>>>>>         if (wr && *firstblock == NULLFSBLOCK) {
>>>>>                 if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE)
>>>>>                         minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
>>>>>                 else
>>>>>                         minleft = 1;
>>>>>         } else
>>>>>                 minleft = 0;
>>>>>
>>>>> If we are in btree format we set the minleft to the number of blocks needed
>>>>> for a split. If we are in extent or local format, change to extent of btree
>>>>> format requires one extra block.
>>>>>
>>>>> The above code you point out definitely breaks this - we haven't done a
>>>>> previous allocation so we can start from the first AG, but we sure as
>>>>> hell still need minleft set to the number of blocks needed for a
>>>>> format change or btree split.
>>>> Just to point out yet another problem in this code (one that's just
>>>> been tripped over @ agami) is unwritten extent conversion.
>>>>
>>>> Basically, we don't do an allocation here, so when we end up in
>>>> xfs_bmap_add_extent_unwritten_real() with a null firstblock. Hence
>>>> the cases where conversion can cause a split - case
>>>> MASK(LEFT_FILLING), MASK(RIGHT_FILLING) and 0 (convert the middle of
>>>> an extent) - we can select an AG that doesn't have enough space for
>>>> the entire split as we've ignored the number of blocks we might
>>>> need to allocate in the split (the minleft parameter) entirely.
>>>>
>>>> I suspect that xfs_bmbt_split() needs to handle the null first block
>>>> case slightly differently - the minleft parameter passed to the
>>>> allocation should not be zero - it should be the number of levels
>>>> above the current level left in the tree. i.e:
>>>>
>>>> 	minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
>>>>
>>>> If we've already got a firstblock set, then this should have already
>>>> been taken into account (i.e. we still need to fix the low space
>>>> case where it got ignored as we were discussing).
>>> Funny.  I tested the exact same change last week to try to fix the same
>>> problem.  Seemed to work okay.
>>
>> Cool. Got a patch for review?
>
> I couldn't find the original patch that calculated minleft as above - instead
> here's a variant that addresses the double insert problem by retrieving the
> reservation amount from the transaction.  It could very well be overkill though.

No, that seems valid; all allocations need to pass in a reservation
for a number of blocks needed for the transaction to proceed. I did
a quick check and everything appears to be reserving only what
is necessary for a bmbt split (or two)...

> --- fs/xfs/xfs_bmap_btree.c_1.169	2008-06-16 17:25:10.000000000 +1000
> +++ fs/xfs/xfs_bmap_btree.c	2008-06-16 18:32:45.000000000 +1000
> @@ -1496,9 +1496,12 @@ xfs_bmbt_split(
> 	if (args.fsbno == NULLFSBLOCK) {
> 		args.fsbno = lbno;
> 		args.type = XFS_ALLOCTYPE_START_BNO;
> -	} else
> +		args.minleft = xfs_trans_get_block_res(args.tp);
> +	} else {

Might be worth a comment ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2008-06-23  8:04 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-06-13  7:38 [PATCH] Prevent extent btree block allocation failures Lachlan McIlroy
2008-06-13 13:44 ` Christoph Hellwig
2008-06-16  3:57   ` Lachlan McIlroy
2008-06-13 15:57 ` Dave Chinner
2008-06-16  6:11   ` Lachlan McIlroy
2008-06-16 17:10     ` Dave Chinner
2008-06-17  1:58       ` Lachlan McIlroy
2008-06-17  7:39         ` Dave Chinner
2008-06-19  7:28           ` Lachlan McIlroy
2008-06-20  5:21             ` Dave Chinner
2008-06-23  5:20               ` Dave Chinner
2008-06-23  5:57                 ` Lachlan McIlroy
2008-06-23  6:14                   ` Dave Chinner
2008-06-23  6:40                     ` Lachlan McIlroy
2008-06-23  8:05                       ` Dave Chinner
2008-06-23  5:24               ` Lachlan McIlroy
2008-06-23  6:21                 ` Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.