All of lore.kernel.org
 help / color / mirror / Atom feed
* tr_ifree transaction log reservation calculation
@ 2017-11-21 15:05 Brian Foster
       [not found] ` <0a9b47ba-a41e-3b96-981f-f04f9e2ab11c@hpe.com>
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2017-11-21 15:05 UTC (permalink / raw)
  To: linux-xfs; +Cc: david

Hi all,

I'm looking into a bug that appears to reflect the fairly rare
xfs_inactive_ifree() transaction overruns that have been reported in the
past. To summarize, the cause of the overrun appears to be that the
pre-inode-chunk-free agfl fixup ends up freeing a single agfl block that
leads to multiple [cnt|bno]bt joins that repopulate the agfl and cause
several more iterations in xfs_alloc_fix_freelist(). These extra
iterations combine with a couple other conditions that ultimately result
in consuming most or all of the anticipated cntbt log reservation before
we actually get to freeing the inode chunk:

- left+right contiguous blocks that require 2 cntbt record removals and
  an insert with a new length key
- the overrun is the first transaction in a CIL ctx and thus consumes
  the CIL ticket reservation
- the transaction spans a log buffer and thus requires additional space
  for split region headers

Note that I don't believe the above are problems, but rather this
suggests how we probably get away with the higher level problem in most
cases where this additional "worst case" reservation goes unused.

I ended up looking at tr_ifree while investigating some options to
resolve this problem and am slightly confused by the reservation
calculation. In particular, we do this for the inobt portion of the
operation (i.e., "the inode btree: max depth * blocksize"):

                xfs_calc_buf_res(2 + mp->m_ialloc_blks +   
                                 mp->m_in_maxlevels, 0) +  

... where it looks to me that we only incorporate the overhead of the
inobt buffers rather than the buffer content themselves. Is this
expected/appropriate, or should we be passing something like
XFS_FSB_TO_B(mp, 1) there rather than 0? As it is, while this is not
related to the allocation btrees, it does happen to add enough
reservation to the transaction to avoid the overrun. Then again, it
might not technically be appropriate to add that reservation since the
inobt and free space btree updates are separated by a transaction roll.

That aside, the next best option for dealing with this situation seems
to be to limit the number of agfl fixups that can occur per-transaction.
Yet another option might be to roll the transaction in certain
situations (i.e., let the deferred op handling give us a new transaction
if an agfl fixup resulted in a split/join), but I suspect that could get
more involved as we may want to keep the agf locked across that
sequence, etc.

Thoughts? Can anyone shed some light on this reservation? Thanks!

Brian

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tr_ifree transaction log reservation calculation
       [not found] ` <0a9b47ba-a41e-3b96-981f-f04f9e2ab11c@hpe.com>
@ 2017-11-21 17:35   ` Brian Foster
  2017-11-22  2:26     ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2017-11-21 17:35 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: linux-xfs

cc linux-xfs

On Tue, Nov 21, 2017 at 10:51:10AM -0600, Mark Tinguely wrote:
> On 11/21/2017 09:05 AM, Brian Foster wrote:
> > I ended up looking at tr_ifree while investigating some options to
> > resolve this problem and am slightly confused by the reservation
> > calculation. In particular, we do this for the inobt portion of the
> > operation (i.e., "the inode btree: max depth * blocksize"):
> > 
> >                  xfs_calc_buf_res(2 + mp->m_ialloc_blks +
> >                                   mp->m_in_maxlevels, 0) +
> 

Hi Mark,

> I reviewed the Jeff Lui refactoring of the log space calculation years
> ago....
> 
> There is some combination of entries  only difference that I see is that the
> original calculation used the generic btree calculation for the inode btree
> and the generic btree calculation accounts for the by-block and by-count, so
> they are times two:
> 
> /*
>  * In freeing an inode we can modify:
>  *    the inode being freed: inode size
>  *    the super block free inode counter: sector size
>  *    the agi hash list and counters: sector size
>  *    the inode btree entry: block size
>  *    the on disk inode before ours in the agi hash list: inode cluster size
>  *    the inode btree: max depth * blocksize
>  *    the allocation btrees: 2 trees * (max depth - 1) * block size
>  */
> STATIC uint
> xfs_calc_ifree_reservation(
>         struct xfs_mount        *mp)
> {
>         return XFS_DQUOT_LOGRES(mp) +
>                 mp->m_sb.sb_inodesize +
>                 mp->m_sb.sb_sectsize +
>                 mp->m_sb.sb_sectsize +
>                 XFS_FSB_TO_B(mp, 1) +
>                 MAX((__uint16_t)XFS_FSB_TO_B(mp, 1),
>                     XFS_INODE_CLUSTER_SIZE(mp)) +
>                 128 * 5 +
>                 XFS_ALLOCFREE_LOG_RES(mp, 1) +
>                 128 * (2 + XFS_IALLOC_BLOCKS(mp) + mp->m_in_maxlevels +
>                        XFS_ALLOCFREE_LOG_COUNT(mp, 1));
> }
> ...

Indeed, this looks fairly similar to the line cited above from the
current code (modulo the xfs_calc_buf_res() refactoring). I'm curious
why we only consider the buffer overhead (i.e., 128 bytes *
->m_in_maxlevels) of the ->m_in_maxlevels value rather than the full
buffer size.

> /*
>  * Per-extent log reservation for the allocation btree changes
>  * involved in freeing or allocating an extent.
>  * 2 trees * (2 blocks/level * max depth - 1) * block size
>  */
> #define XFS_ALLOCFREE_LOG_RES(mp,nx) \
>         ((nx) * (2 * XFS_FSB_TO_B((mp), 2 * XFS_AG_MAXLEVELS(mp) - 1)))
> #define XFS_ALLOCFREE_LOG_COUNT(mp,nx) \
>         ((nx) * (2 * (2 * XFS_AG_MAXLEVELS(mp) - 1)))
> 
> Doesn't the blocks for the btrees get allocated in:
> 
> 	xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
>                                  XFS_FSB_TO_B(mp, 1)) +

Yeah, I believe that covers the log reservation for the allocation
btrees. I'm wondering if/where we'd account for a join of the inobt due
to freeing an inode chunk and removing the associated inobt record. I
suppose the above could also cover the inobt since the transaction is
rolled, but then aren't we assuming that m_in_maxlevels ==
m_ag_maxlevels..?

Brian

> 
> And the rest of the calculation counts the log chunks that can get modified?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tr_ifree transaction log reservation calculation
  2017-11-21 17:35   ` Brian Foster
@ 2017-11-22  2:26     ` Dave Chinner
  2017-11-22 12:21       ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2017-11-22  2:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: Mark Tinguely, linux-xfs

On Tue, Nov 21, 2017 at 12:35:57PM -0500, Brian Foster wrote:
> cc linux-xfs
> 
> On Tue, Nov 21, 2017 at 10:51:10AM -0600, Mark Tinguely wrote:
> > STATIC uint
> > xfs_calc_ifree_reservation(
> >         struct xfs_mount        *mp)
> > {
> >         return XFS_DQUOT_LOGRES(mp) +
> >                 mp->m_sb.sb_inodesize +
> >                 mp->m_sb.sb_sectsize +
> >                 mp->m_sb.sb_sectsize +
> >                 XFS_FSB_TO_B(mp, 1) +
> >                 MAX((__uint16_t)XFS_FSB_TO_B(mp, 1),
> >                     XFS_INODE_CLUSTER_SIZE(mp)) +
> >                 128 * 5 +
> >                 XFS_ALLOCFREE_LOG_RES(mp, 1) +
> >                 128 * (2 + XFS_IALLOC_BLOCKS(mp) + mp->m_in_maxlevels +
> >                        XFS_ALLOCFREE_LOG_COUNT(mp, 1));
> > }
> > ...
> 
> Indeed, this looks fairly similar to the line cited above from the
> current code (modulo the xfs_calc_buf_res() refactoring).

Right, the refactoring should have been a straight conversion of the
code - and not change any of the reservations. If they were wrong
before the refactoring, they will still be wrong now.

> I'm curious
> why we only consider the buffer overhead (i.e., 128 bytes *
> ->m_in_maxlevels) of the ->m_in_maxlevels value rather than the full
> buffer size.

That just looks like a bug. Earlier on in the calculation we
take into account:

 *    the inode btree entry: block size
....
		xfs_calc_buf_res(1, XFS_FSB_TO_B(mp, 1)) +

So we've accounted for the AGI btree block the record being modified
exists in. But we don't account for tree shape changes....

.... and, as usual, knowing the history of the code tells me the
probable reason why the reservation is like this.

i.e. inodes could not be removed from disk when the original
reservation was defined.  Once allocated, they never got
removed, so an ifree transaction only ever touched the inode btree
block the chunk record was in. So, a quick dive into history from
the XFS archive with git blame:

814994f5c75f6 (Adam Sweeney     1994-09-23  * In freeing an inode we can modify:
814994f5c75f6 (Adam Sweeney     1994-09-23  *    the inode being freed: inode size
814994f5c75f6 (Adam Sweeney     1994-09-23  *    the super block free inode counter: sector size
814994f5c75f6 (Adam Sweeney     1994-09-23  *    the agi hash list and counters: sector size
814994f5c75f6 (Adam Sweeney     1994-09-23  *    the inode btree entry: block size
cd235688e046e (Adam Sweeney     1995-08-31  *    the on disk inode before ours in the agi hash list: inode cluster size
254f6311ed1b7 (Steve Lord       2003-10-06  *    the inode btree: max depth * blocksize
254f6311ed1b7 (Steve Lord       2003-10-06  *    the allocation btrees: 2 trees * (max depth - 1) * block size
814994f5c75f6 (Adam Sweeney     1994-09-23  */       
d3c15dd08b409 (Russell Cattelan 2003-04-15 #define  XFS_CALC_IFREE_LOG_RES(mp) \
814994f5c75f6 (Adam Sweeney     1994-09-23  ((mp)->m_sb.sb_inodesize + \
814994f5c75f6 (Adam Sweeney     1994-09-23   (mp)->m_sb.sb_sectsize + \
814994f5c75f6 (Adam Sweeney     1994-09-23   (mp)->m_sb.sb_sectsize + \
814994f5c75f6 (Adam Sweeney     1994-09-23   XFS_FSB_TO_B((mp), 1) + \
0139225509eb1 (Steve Lord       2001-09-11   MAX((__uint16_t)XFS_FSB_TO_B((mp), 1), XFS_INODE_CLUSTER_SIZE(mp)) + \
254f6311ed1b7 (Steve Lord       2003-10-06   (128 * 5) + \
254f6311ed1b7 (Steve Lord       2003-10-06    (128 * (2 + XFS_IALLOC_BLOCKS(mp) + XFS_IN_MAXLEVELS(mp) + \
254f6311ed1b7 (Steve Lord       2003-10-06     XFS_ALLOCFREE_LOG_COUNT(mp, 1))))
254f6311ed1b7 (Steve Lord       2003-10-06           

Without even looking I know what those 2003 changes were. For
everyone else:

254f6311ed1b Implement deletion of inode clusters in XFS.

So, yeah, this looks like a bug in the original reservation
calculation for freeing inode clusters. I suspect it got messed up
because the buffers covering XFS_IALLOC_BLOCKS (the inode chunk) get
marked stale and so their contents are not logged. That means just
the header+blf is logged and so the reservation is correct for them.
that isn't the case for the inobt blocks themselves, however.

i.e.  it should be calculating:

....
 *	the inode chunk blocks are stale, so only headers get logged
 *	the inode btree: max depth * blocksize
....
	xfs_calc_buf_res(2 + mp->m_ialloc_blks, 0) +
	xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
....

I'm still not 100% sure what the extra "2 +" is accounting for
there, though, so to be safe that migh need to be accounted as full
blocks, not log headers.

Good find, Brian!

Of course, that doesn't mean it will fix the transaction overrun
being reported because it's overrunning the freespace tree
reservations rather than the inobt reservation. It certainly won't
hurt, though. :P

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tr_ifree transaction log reservation calculation
  2017-11-22  2:26     ` Dave Chinner
@ 2017-11-22 12:21       ` Brian Foster
  2017-11-22 20:41         ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2017-11-22 12:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Mark Tinguely, linux-xfs

On Wed, Nov 22, 2017 at 01:26:45PM +1100, Dave Chinner wrote:
> On Tue, Nov 21, 2017 at 12:35:57PM -0500, Brian Foster wrote:
> > cc linux-xfs
> > 
> > On Tue, Nov 21, 2017 at 10:51:10AM -0600, Mark Tinguely wrote:
> > > STATIC uint
> > > xfs_calc_ifree_reservation(
> > >         struct xfs_mount        *mp)
> > > {
> > >         return XFS_DQUOT_LOGRES(mp) +
> > >                 mp->m_sb.sb_inodesize +
> > >                 mp->m_sb.sb_sectsize +
> > >                 mp->m_sb.sb_sectsize +
> > >                 XFS_FSB_TO_B(mp, 1) +
> > >                 MAX((__uint16_t)XFS_FSB_TO_B(mp, 1),
> > >                     XFS_INODE_CLUSTER_SIZE(mp)) +
> > >                 128 * 5 +
> > >                 XFS_ALLOCFREE_LOG_RES(mp, 1) +
> > >                 128 * (2 + XFS_IALLOC_BLOCKS(mp) + mp->m_in_maxlevels +
> > >                        XFS_ALLOCFREE_LOG_COUNT(mp, 1));
> > > }
> > > ...
> > 
> > Indeed, this looks fairly similar to the line cited above from the
> > current code (modulo the xfs_calc_buf_res() refactoring).
> 
> Right, the refactoring should have been a straight conversion of the
> code - and not change any of the reservations. If they were wrong
> before the refactoring, they will still be wrong now.
> 
> > I'm curious
> > why we only consider the buffer overhead (i.e., 128 bytes *
> > ->m_in_maxlevels) of the ->m_in_maxlevels value rather than the full
> > buffer size.
> 
> That just looks like a bug. Earlier on in the calculation we
> take into account:
> 
>  *    the inode btree entry: block size
> ....
> 		xfs_calc_buf_res(1, XFS_FSB_TO_B(mp, 1)) +
> 
> So we've accounted for the AGI btree block the record being modified
> exists in. But we don't account for tree shape changes....
> 
> .... and, as usual, knowing the history of the code tells me the
> probable reason why the reservation is like this.
> 
> i.e. inodes could not be removed from disk when the original
> reservation was defined.  Once allocated, they never got
> removed, so an ifree transaction only ever touched the inode btree
> block the chunk record was in. So, a quick dive into history from
> the XFS archive with git blame:
> 

Ah, I wasn't aware of that. That definitely helps explain the current
reservation.

> 814994f5c75f6 (Adam Sweeney     1994-09-23  * In freeing an inode we can modify:
> 814994f5c75f6 (Adam Sweeney     1994-09-23  *    the inode being freed: inode size
> 814994f5c75f6 (Adam Sweeney     1994-09-23  *    the super block free inode counter: sector size
> 814994f5c75f6 (Adam Sweeney     1994-09-23  *    the agi hash list and counters: sector size
> 814994f5c75f6 (Adam Sweeney     1994-09-23  *    the inode btree entry: block size
> cd235688e046e (Adam Sweeney     1995-08-31  *    the on disk inode before ours in the agi hash list: inode cluster size
> 254f6311ed1b7 (Steve Lord       2003-10-06  *    the inode btree: max depth * blocksize
> 254f6311ed1b7 (Steve Lord       2003-10-06  *    the allocation btrees: 2 trees * (max depth - 1) * block size
> 814994f5c75f6 (Adam Sweeney     1994-09-23  */       
> d3c15dd08b409 (Russell Cattelan 2003-04-15 #define  XFS_CALC_IFREE_LOG_RES(mp) \
> 814994f5c75f6 (Adam Sweeney     1994-09-23  ((mp)->m_sb.sb_inodesize + \
> 814994f5c75f6 (Adam Sweeney     1994-09-23   (mp)->m_sb.sb_sectsize + \
> 814994f5c75f6 (Adam Sweeney     1994-09-23   (mp)->m_sb.sb_sectsize + \
> 814994f5c75f6 (Adam Sweeney     1994-09-23   XFS_FSB_TO_B((mp), 1) + \
> 0139225509eb1 (Steve Lord       2001-09-11   MAX((__uint16_t)XFS_FSB_TO_B((mp), 1), XFS_INODE_CLUSTER_SIZE(mp)) + \
> 254f6311ed1b7 (Steve Lord       2003-10-06   (128 * 5) + \
> 254f6311ed1b7 (Steve Lord       2003-10-06    (128 * (2 + XFS_IALLOC_BLOCKS(mp) + XFS_IN_MAXLEVELS(mp) + \
> 254f6311ed1b7 (Steve Lord       2003-10-06     XFS_ALLOCFREE_LOG_COUNT(mp, 1))))
> 254f6311ed1b7 (Steve Lord       2003-10-06           
> 
> Without even looking I know what those 2003 changes were. For
> everyone else:
> 
> 254f6311ed1b Implement deletion of inode clusters in XFS.
> 
> So, yeah, this looks like a bug in the original reservation
> calculation for freeing inode clusters. I suspect it got messed up
> because the buffers covering XFS_IALLOC_BLOCKS (the inode chunk) get
> marked stale and so their contents are not logged. That means just
> the header+blf is logged and so the reservation is correct for them.
> that isn't the case for the inobt blocks themselves, however.
> 

Yep, that's what I figured for ->m_ialloc_blks.

> i.e.  it should be calculating:
> 
> ....
>  *	the inode chunk blocks are stale, so only headers get logged
>  *	the inode btree: max depth * blocksize
> ....
> 	xfs_calc_buf_res(2 + mp->m_ialloc_blks, 0) +
> 	xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> ....
> 

Sounds about right.. though I think we should also replace the old
single block reservation ("the inode btree entry: block size") as well
since that is now a subset of the m_in_maxlevels res.

> I'm still not 100% sure what the extra "2 +" is accounting for
> there, though, so to be safe that migh need to be accounted as full
> blocks, not log headers.
> 

Same.. there is still a bit of cruft in this calculation that really
isn't clear. That +2 is one part and then there's another
xfs_calc_buf_res(1, 0) right before the m_ialloc_blks line. The fact
that it's a separate line implies it is logically separate, but who
knows.

I suppose I'll keep that line as is, replace the inobt entry bit with
the full inobt calculation (as above), keep the +2 associated with the
inobt calculation and run some tests on that.

> Good find, Brian!
> 
> Of course, that doesn't mean it will fix the transaction overrun
> being reported because it's overrunning the freespace tree
> reservations rather than the inobt reservation. It certainly won't
> hurt, though. :P
> 

Right.. technically this is a separate issue but I actually think it may
end up resolving the overrun in most cases. I tested a quick hack to
change the 0 in the existing code to XFS_FSB_TO_B(mp, 1) before I
realized that the m_ialloc_blks part still needs to be separate. That
survived all my tests, but it's also larger than what we've settled on
here. I think the overrun has been within m_in_maxlevels blocks worth of
bytes in most cases. I do recall seeing occasional larger instances,
however, I just don't recall the exact numbers. I'll give this some
testing and see if we need to also bolt on some kind of agfl fixup
limit.

Thanks for the feedback!

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tr_ifree transaction log reservation calculation
  2017-11-22 12:21       ` Brian Foster
@ 2017-11-22 20:41         ` Brian Foster
  2017-11-23  0:24           ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2017-11-22 20:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Mark Tinguely, linux-xfs

On Wed, Nov 22, 2017 at 07:21:01AM -0500, Brian Foster wrote:
> On Wed, Nov 22, 2017 at 01:26:45PM +1100, Dave Chinner wrote:
> > On Tue, Nov 21, 2017 at 12:35:57PM -0500, Brian Foster wrote:
> > > cc linux-xfs
> > > 
> > > On Tue, Nov 21, 2017 at 10:51:10AM -0600, Mark Tinguely wrote:
> > > > STATIC uint
> > > > xfs_calc_ifree_reservation(
> > > >         struct xfs_mount        *mp)
> > > > {
> > > >         return XFS_DQUOT_LOGRES(mp) +
> > > >                 mp->m_sb.sb_inodesize +
> > > >                 mp->m_sb.sb_sectsize +
> > > >                 mp->m_sb.sb_sectsize +
> > > >                 XFS_FSB_TO_B(mp, 1) +
> > > >                 MAX((__uint16_t)XFS_FSB_TO_B(mp, 1),
> > > >                     XFS_INODE_CLUSTER_SIZE(mp)) +
> > > >                 128 * 5 +
> > > >                 XFS_ALLOCFREE_LOG_RES(mp, 1) +
> > > >                 128 * (2 + XFS_IALLOC_BLOCKS(mp) + mp->m_in_maxlevels +
> > > >                        XFS_ALLOCFREE_LOG_COUNT(mp, 1));
> > > > }
> > > > ...
> > > 
> > > Indeed, this looks fairly similar to the line cited above from the
> > > current code (modulo the xfs_calc_buf_res() refactoring).
> > 
> > Right, the refactoring should have been a straight conversion of the
> > code - and not change any of the reservations. If they were wrong
> > before the refactoring, they will still be wrong now.
> > 
> > > I'm curious
> > > why we only consider the buffer overhead (i.e., 128 bytes *
> > > ->m_in_maxlevels) of the ->m_in_maxlevels value rather than the full
> > > buffer size.
> > 
> > That just looks like a bug. Earlier on in the calculation we
> > take into account:
> > 
> >  *    the inode btree entry: block size
> > ....
> > 		xfs_calc_buf_res(1, XFS_FSB_TO_B(mp, 1)) +
> > 
> > So we've accounted for the AGI btree block the record being modified
> > exists in. But we don't account for tree shape changes....
> > 
> > .... and, as usual, knowing the history of the code tells me the
> > probable reason why the reservation is like this.
> > 
> > i.e. inodes could not be removed from disk when the original
> > reservation was defined.  Once allocated, they never got
> > removed, so an ifree transaction only ever touched the inode btree
> > block the chunk record was in. So, a quick dive into history from
> > the XFS archive with git blame:
> > 
> 
> Ah, I wasn't aware of that. That definitely helps explain the current
> reservation.
> 
> > 814994f5c75f6 (Adam Sweeney     1994-09-23  * In freeing an inode we can modify:
> > 814994f5c75f6 (Adam Sweeney     1994-09-23  *    the inode being freed: inode size
> > 814994f5c75f6 (Adam Sweeney     1994-09-23  *    the super block free inode counter: sector size
> > 814994f5c75f6 (Adam Sweeney     1994-09-23  *    the agi hash list and counters: sector size
> > 814994f5c75f6 (Adam Sweeney     1994-09-23  *    the inode btree entry: block size
> > cd235688e046e (Adam Sweeney     1995-08-31  *    the on disk inode before ours in the agi hash list: inode cluster size
> > 254f6311ed1b7 (Steve Lord       2003-10-06  *    the inode btree: max depth * blocksize
> > 254f6311ed1b7 (Steve Lord       2003-10-06  *    the allocation btrees: 2 trees * (max depth - 1) * block size
> > 814994f5c75f6 (Adam Sweeney     1994-09-23  */       
> > d3c15dd08b409 (Russell Cattelan 2003-04-15 #define  XFS_CALC_IFREE_LOG_RES(mp) \
> > 814994f5c75f6 (Adam Sweeney     1994-09-23  ((mp)->m_sb.sb_inodesize + \
> > 814994f5c75f6 (Adam Sweeney     1994-09-23   (mp)->m_sb.sb_sectsize + \
> > 814994f5c75f6 (Adam Sweeney     1994-09-23   (mp)->m_sb.sb_sectsize + \
> > 814994f5c75f6 (Adam Sweeney     1994-09-23   XFS_FSB_TO_B((mp), 1) + \
> > 0139225509eb1 (Steve Lord       2001-09-11   MAX((__uint16_t)XFS_FSB_TO_B((mp), 1), XFS_INODE_CLUSTER_SIZE(mp)) + \
> > 254f6311ed1b7 (Steve Lord       2003-10-06   (128 * 5) + \
> > 254f6311ed1b7 (Steve Lord       2003-10-06    (128 * (2 + XFS_IALLOC_BLOCKS(mp) + XFS_IN_MAXLEVELS(mp) + \
> > 254f6311ed1b7 (Steve Lord       2003-10-06     XFS_ALLOCFREE_LOG_COUNT(mp, 1))))
> > 254f6311ed1b7 (Steve Lord       2003-10-06           
> > 
> > Without even looking I know what those 2003 changes were. For
> > everyone else:
> > 
> > 254f6311ed1b Implement deletion of inode clusters in XFS.
> > 
> > So, yeah, this looks like a bug in the original reservation
> > calculation for freeing inode clusters. I suspect it got messed up
> > because the buffers covering XFS_IALLOC_BLOCKS (the inode chunk) get
> > marked stale and so their contents are not logged. That means just
> > the header+blf is logged and so the reservation is correct for them.
> > that isn't the case for the inobt blocks themselves, however.
> > 
> 
> Yep, that's what I figured for ->m_ialloc_blks.
> 
> > i.e.  it should be calculating:
> > 
> > ....
> >  *	the inode chunk blocks are stale, so only headers get logged
> >  *	the inode btree: max depth * blocksize
> > ....
> > 	xfs_calc_buf_res(2 + mp->m_ialloc_blks, 0) +
> > 	xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> > ....
> > 
> 
> Sounds about right.. though I think we should also replace the old
> single block reservation ("the inode btree entry: block size") as well
> since that is now a subset of the m_in_maxlevels res.
> 
> > I'm still not 100% sure what the extra "2 +" is accounting for
> > there, though, so to be safe that migh need to be accounted as full
> > blocks, not log headers.
> > 
> 
> Same.. there is still a bit of cruft in this calculation that really
> isn't clear. That +2 is one part and then there's another
> xfs_calc_buf_res(1, 0) right before the m_ialloc_blks line. The fact
> that it's a separate line implies it is logically separate, but who
> knows.
> 
> I suppose I'll keep that line as is, replace the inobt entry bit with
> the full inobt calculation (as above), keep the +2 associated with the
> inobt calculation and run some tests on that.
> 
> > Good find, Brian!
> > 
> > Of course, that doesn't mean it will fix the transaction overrun
> > being reported because it's overrunning the freespace tree
> > reservations rather than the inobt reservation. It certainly won't
> > hurt, though. :P
> > 
> 
> Right.. technically this is a separate issue but I actually think it may
> end up resolving the overrun in most cases. I tested a quick hack to
> change the 0 in the existing code to XFS_FSB_TO_B(mp, 1) before I
> realized that the m_ialloc_blks part still needs to be separate. That
> survived all my tests, but it's also larger than what we've settled on
> here. I think the overrun has been within m_in_maxlevels blocks worth of
> bytes in most cases. I do recall seeing occasional larger instances,
> however, I just don't recall the exact numbers. I'll give this some
> testing and see if we need to also bolt on some kind of agfl fixup
> limit.
> 

Quick update... I ended up still hitting an overrun with this fix in
place (actually, the original bug report shows an overrun of ~32k which
should have indicated that was going to be the case).

In any event, that had me thinking about tweaking how we manage the agfl
and subsequently brings up a couple other issues. Firstly, this is a
large fs, so we have 1TB AGs and the associated 16 entry agfls. The min
agfl allocation is 8 at the time of the overrun, which is the maximum
case for such a filesystem (4 per space btree). IIUC, as much as we need
to keep the agfl populated with a minimum of 8 blocks in this case, we
conversely also need to make sure we have 8 free slots, yes? If so, I'm
not sure that being lazy about freeing blocks is an option because I
have at least one trace where it takes 3 agfl block frees just to get
the agfl from 9 down to the required 8.

Unrelated to the overrun issue, I also notice that we use the agfl for
the rmapbt. That adds another 5 blocks in the maximum case, which means
a 1TB AG can have a min agfl requirement of 4+4+5=13 blocks allocated of
16 slots. Unless I'm missing something in the reasoning above, aren't we
kind of playing with fire in that case? E.g., what prevents us from
doing a 4-level join on one of the trees with a 13/16 agfl and running
out of space to free btree blocks? This has me wondering whether we need
some kind of larger/dynamic agfl that facilitates the additional use
(and potentially lazier management to resolve the "too many fixups"
issue). Thoughts?

Brian

> Thanks for the feedback!
> 
> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tr_ifree transaction log reservation calculation
  2017-11-22 20:41         ` Brian Foster
@ 2017-11-23  0:24           ` Dave Chinner
  2017-11-23 14:36             ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2017-11-23  0:24 UTC (permalink / raw)
  To: Brian Foster; +Cc: Mark Tinguely, linux-xfs

On Wed, Nov 22, 2017 at 03:41:26PM -0500, Brian Foster wrote:
> On Wed, Nov 22, 2017 at 07:21:01AM -0500, Brian Foster wrote:
> > On Wed, Nov 22, 2017 at 01:26:45PM +1100, Dave Chinner wrote:
> > > On Tue, Nov 21, 2017 at 12:35:57PM -0500, Brian Foster wrote:
> > > I'm still not 100% sure what the extra "2 +" is accounting for
> > > there, though, so to be safe that migh need to be accounted as full
> > > blocks, not log headers.
> > > 
> > 
> > Same.. there is still a bit of cruft in this calculation that really
> > isn't clear. That +2 is one part and then there's another
> > xfs_calc_buf_res(1, 0) right before the m_ialloc_blks line. The fact
> > that it's a separate line implies it is logically separate, but who
> > knows.
> > 
> > I suppose I'll keep that line as is, replace the inobt entry bit with
> > the full inobt calculation (as above), keep the +2 associated with the
> > inobt calculation and run some tests on that.

I think I've worked out what the +2 is supposed to be - the AGF
and AGFL when we alloc/free blocks for either the inode chunk
or inobt records. You see them in, say, xfs_calc_write_reservation()
but not in the inode allocation reservations, despite needing
to do allocation....

Ok, xfs_calc_itruncate_reservation() appears to have the same
inobt modification bug in it. But now I'm wondering: why does a
truncate transaction modify the inobt?

It doesn't, and the commitin the historic tree doesn't explain why
it was necesary then, either:

commit 5fe6abb82f76472ad4ca0d99c0c86585948060eb
Author: Glen Overby <overby@sgi.com>
Date:   Wed Mar 17 00:52:31 2004 +0000

    Add space for inode and allocation btrees to ITRUNCATE log reservation
    Add XFS_ALLOCFREE_LOG_RES to IFREE log reservation.

FWIW, there's a bunch of stuff in the other transactions
reservations that I'm finding highly questionable, too....

And thinking on this a bit further, I suspect the whole allocation
reservation is just be a fudge. We do a single alloc/free
reservation per AG per transaction, which allows for one full height
split of the tree. The problem that Brian has reported is that this
reservation only covers a /single/ freespace tree modification and
we're overrunning it just in fixing up the freelist.

So if we want to cover the worst case, we need a reservation for
each freespace tree modification, not one to cover them all. i.e.
one for each AGFL modification, and one for the actual extent being
allocated/freed. And we need one of those reservations for
*every single allocation/free that can occur in a transaction*.

This rapidly blows out to being impractical. We're not going to get
full height splits/joins on every single allocation that occurs in a
transaction, and the number of potential allocations in a
transaction can be considered essentially unbound. e.g.  how many
allocations can a single a directory modification do? Hash btree
split, data/node btree split,  new freespace block, too - especially
when considering that a each directory block could - in the worse
case - be 128 x 512 byte extents (64k dir block size on 512 byte
block size filesystem).

So, no, I don't think it's realistic to start trying to cater for
worst case reservations here. Such crazy shit just doesn't happen
often enough in the real world for us to penalise everything with
such reservations.

In which case, I think we need to consider that each major operation
(inobt, directory, finobt, etc) probably should have it's own
allocation reservation. It's extremely unlikely that all of them are
going to overrun in the same transaction, so maybe this is a good
enough safe guard without going completely over the top...

.....

> Quick update... I ended up still hitting an overrun with this fix in
> place (actually, the original bug report shows an overrun of ~32k which
> should have indicated that was going to be the case).

Yeah, no surprise there given the above.

> In any event, that had me thinking about tweaking how we manage the agfl
> and subsequently brings up a couple other issues. Firstly, this is a
> large fs, so we have 1TB AGs and the associated 16 entry agfls.

16 entries?

The AGFL can hold (on a v5 filesystems and 512 byte sectors) 117
individual freespace blocks.  It's always much larger than we need
to hold the blocks for a single extent allocation.

> The min
> agfl allocation is 8 at the time of the overrun, which is the maximum
> case for such a filesystem (4 per space btree). IIUC, as much as we need
> to keep the agfl populated with a minimum of 8 blocks in this case, we
> conversely also need to make sure we have 8 free slots, yes? If so, I'm
> not sure that being lazy about freeing blocks is an option because I
> have at least one trace where it takes 3 agfl block frees just to get
> the agfl from 9 down to the required 8.

Even the AGFL is half full, we've still got space for ~50 blocks to
be freed from the bno/cnt/rmap btrees in a single allocation/free
operation, so I just don't think we have to care about being a few
blocks off. If trying to free blocks from the AGFL results in it
growing, then just abort unless the AGFL is over half full...

> Unrelated to the overrun issue, I also notice that we use the agfl for
> the rmapbt. That adds another 5 blocks in the maximum case, which means
> a 1TB AG can have a min agfl requirement of 4+4+5=13 blocks allocated of
> 16 slots. Unless I'm missing something in the reasoning above, aren't we
> kind of playing with fire in that case?

No, because the AGFL is 117 blocks long at minimum.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tr_ifree transaction log reservation calculation
  2017-11-23  0:24           ` Dave Chinner
@ 2017-11-23 14:36             ` Brian Foster
  2017-11-23 21:54               ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2017-11-23 14:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Mark Tinguely, linux-xfs

On Thu, Nov 23, 2017 at 11:24:02AM +1100, Dave Chinner wrote:
> On Wed, Nov 22, 2017 at 03:41:26PM -0500, Brian Foster wrote:
> > On Wed, Nov 22, 2017 at 07:21:01AM -0500, Brian Foster wrote:
> > > On Wed, Nov 22, 2017 at 01:26:45PM +1100, Dave Chinner wrote:
> > > > On Tue, Nov 21, 2017 at 12:35:57PM -0500, Brian Foster wrote:
> > > > I'm still not 100% sure what the extra "2 +" is accounting for
> > > > there, though, so to be safe that migh need to be accounted as full
> > > > blocks, not log headers.
> > > > 
> > > 
> > > Same.. there is still a bit of cruft in this calculation that really
> > > isn't clear. That +2 is one part and then there's another
> > > xfs_calc_buf_res(1, 0) right before the m_ialloc_blks line. The fact
> > > that it's a separate line implies it is logically separate, but who
> > > knows.
> > > 
> > > I suppose I'll keep that line as is, replace the inobt entry bit with
> > > the full inobt calculation (as above), keep the +2 associated with the
> > > inobt calculation and run some tests on that.
> 
> I think I've worked out what the +2 is supposed to be - the AGF
> and AGFL when we alloc/free blocks for either the inode chunk
> or inobt records. You see them in, say, xfs_calc_write_reservation()
> but not in the inode allocation reservations, despite needing
> to do allocation....
> 

That sounds sane.. though it looks like in the case of
xfs_calc_write_reservation() those are accounted as sectors rather than
full blocks as in the ifree calculation. I suppose we should fix those
up in the ifree calculation to reflect the sector size and properly
document them.

Hmm, xfs_calc_write_reservation() also handles the space freeing bits a
bit more how I would expect: using the max of two calculations since a
part of the operation occurs after a transaction roll. The ifree
calculation just adds everything together (and we still overrun it :/).

> Ok, xfs_calc_itruncate_reservation() appears to have the same
> inobt modification bug in it. But now I'm wondering: why does a
> truncate transaction modify the inobt?
> 
> It doesn't, and the commitin the historic tree doesn't explain why
> it was necesary then, either:
> 
> commit 5fe6abb82f76472ad4ca0d99c0c86585948060eb
> Author: Glen Overby <overby@sgi.com>
> Date:   Wed Mar 17 00:52:31 2004 +0000
> 
>     Add space for inode and allocation btrees to ITRUNCATE log reservation
>     Add XFS_ALLOCFREE_LOG_RES to IFREE log reservation.
> 
> FWIW, there's a bunch of stuff in the other transactions
> reservations that I'm finding highly questionable, too....
> 
> And thinking on this a bit further, I suspect the whole allocation
> reservation is just be a fudge. We do a single alloc/free
> reservation per AG per transaction, which allows for one full height
> split of the tree. The problem that Brian has reported is that this
> reservation only covers a /single/ freespace tree modification and
> we're overrunning it just in fixing up the freelist.
> 
> So if we want to cover the worst case, we need a reservation for
> each freespace tree modification, not one to cover them all. i.e.
> one for each AGFL modification, and one for the actual extent being
> allocated/freed. And we need one of those reservations for
> *every single allocation/free that can occur in a transaction*.
> 
> This rapidly blows out to being impractical. We're not going to get
> full height splits/joins on every single allocation that occurs in a
> transaction, and the number of potential allocations in a
> transaction can be considered essentially unbound. e.g.  how many
> allocations can a single a directory modification do? Hash btree
> split, data/node btree split,  new freespace block, too - especially
> when considering that a each directory block could - in the worse
> case - be 128 x 512 byte extents (64k dir block size on 512 byte
> block size filesystem).
> 

The impression I get is that at the time of designing these transactions
it was simply easiest to characterize worst case of a single operation
as a full btree modification, and then either there was an assumption
made that would be sufficient to cover other multiple allocation/free
cases (since the full tree changes are relatively rare) or that was just
borne out in practice and so was never really revisited. 

> So, no, I don't think it's realistic to start trying to cater for
> worst case reservations here. Such crazy shit just doesn't happen
> often enough in the real world for us to penalise everything with
> such reservations.
> 

Agreed.

> In which case, I think we need to consider that each major operation
> (inobt, directory, finobt, etc) probably should have it's own
> allocation reservation. It's extremely unlikely that all of them are
> going to overrun in the same transaction, so maybe this is a good
> enough safe guard without going completely over the top...
> 

For the ifree example, I take this to mean we should refactor the
calculation to consider freeing of the inode chunk separate from the
inobt and finobt modifications with respect to free space tree
modifications.  In other words, we account one allocfree res for the
inode chunk, another for the inobt modification (to effectively cover
agfl fixups) and another (optional) for the finobt as opposed to one for
the whole transaction. Is that what you're suggesting?

That sounds reasonable, but at the same time I'm a bit concerned that I
think the ifree transaction should technically be refactored into a max
of the pre/post tx roll reservations similar to the write reservation
mentioned above. For example:

	MAX(inode + inobt + finobt + agi + inode chunk + etc.,
	    allocfree for inode chunk + (allocfree * nr inobts));

But that would end up making the transaction smaller than adding it all
together. Of course, that may be fine with the agfl fixup below..

> .....
> 
> > Quick update... I ended up still hitting an overrun with this fix in
> > place (actually, the original bug report shows an overrun of ~32k which
> > should have indicated that was going to be the case).
> 
> Yeah, no surprise there given the above.
> 
> > In any event, that had me thinking about tweaking how we manage the agfl
> > and subsequently brings up a couple other issues. Firstly, this is a
> > large fs, so we have 1TB AGs and the associated 16 entry agfls.
> 
> 16 entries?
> 
> The AGFL can hold (on a v5 filesystems and 512 byte sectors) 117
> individual freespace blocks.  It's always much larger than we need
> to hold the blocks for a single extent allocation.
> 

Oops, thinko on my part...

> > The min
> > agfl allocation is 8 at the time of the overrun, which is the maximum
> > case for such a filesystem (4 per space btree). IIUC, as much as we need
> > to keep the agfl populated with a minimum of 8 blocks in this case, we
> > conversely also need to make sure we have 8 free slots, yes? If so, I'm
> > not sure that being lazy about freeing blocks is an option because I
> > have at least one trace where it takes 3 agfl block frees just to get
> > the agfl from 9 down to the required 8.
> 
> Even the AGFL is half full, we've still got space for ~50 blocks to
> be freed from the bno/cnt/rmap btrees in a single allocation/free
> operation, so I just don't think we have to care about being a few
> blocks off. If trying to free blocks from the AGFL results in it
> growing, then just abort unless the AGFL is over half full...
> 

Right.. disregard my previous concern. I think this means we can most
likely resolve this by doing something like limiting to a single free
per fixup or breaking out once a block free ends up adding another to
the agfl. Thanks.

Brian

> > Unrelated to the overrun issue, I also notice that we use the agfl for
> > the rmapbt. That adds another 5 blocks in the maximum case, which means
> > a 1TB AG can have a min agfl requirement of 4+4+5=13 blocks allocated of
> > 16 slots. Unless I'm missing something in the reasoning above, aren't we
> > kind of playing with fire in that case?
> 
> No, because the AGFL is 117 blocks long at minimum.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tr_ifree transaction log reservation calculation
  2017-11-23 14:36             ` Brian Foster
@ 2017-11-23 21:54               ` Dave Chinner
  2017-11-24 14:51                 ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2017-11-23 21:54 UTC (permalink / raw)
  To: Brian Foster; +Cc: Mark Tinguely, linux-xfs

On Thu, Nov 23, 2017 at 09:36:53AM -0500, Brian Foster wrote:
> On Thu, Nov 23, 2017 at 11:24:02AM +1100, Dave Chinner wrote:
> > On Wed, Nov 22, 2017 at 03:41:26PM -0500, Brian Foster wrote:
> > > On Wed, Nov 22, 2017 at 07:21:01AM -0500, Brian Foster wrote:
> > > > On Wed, Nov 22, 2017 at 01:26:45PM +1100, Dave Chinner wrote:
> > > > > On Tue, Nov 21, 2017 at 12:35:57PM -0500, Brian Foster wrote:
> > > > > I'm still not 100% sure what the extra "2 +" is accounting for
> > > > > there, though, so to be safe that migh need to be accounted as full
> > > > > blocks, not log headers.
> > > > > 
> > > > 
> > > > Same.. there is still a bit of cruft in this calculation that really
> > > > isn't clear. That +2 is one part and then there's another
> > > > xfs_calc_buf_res(1, 0) right before the m_ialloc_blks line. The fact
> > > > that it's a separate line implies it is logically separate, but who
> > > > knows.
> > > > 
> > > > I suppose I'll keep that line as is, replace the inobt entry bit with
> > > > the full inobt calculation (as above), keep the +2 associated with the
> > > > inobt calculation and run some tests on that.
> > 
> > I think I've worked out what the +2 is supposed to be - the AGF
> > and AGFL when we alloc/free blocks for either the inode chunk
> > or inobt records. You see them in, say, xfs_calc_write_reservation()
> > but not in the inode allocation reservations, despite needing
> > to do allocation....
> > 
> 
> That sounds sane.. though it looks like in the case of
> xfs_calc_write_reservation() those are accounted as sectors rather than
> full blocks as in the ifree calculation. I suppose we should fix those
> up in the ifree calculation to reflect the sector size and properly
> document them.
> 
> Hmm, xfs_calc_write_reservation() also handles the space freeing bits a
> bit more how I would expect: using the max of two calculations since a
> part of the operation occurs after a transaction roll. The ifree
> calculation just adds everything together (and we still overrun it :/).

Well, I don't think that's going to matter - more below...

[....]

> > So if we want to cover the worst case, we need a reservation for
> > each freespace tree modification, not one to cover them all. i.e.
> > one for each AGFL modification, and one for the actual extent being
> > allocated/freed. And we need one of those reservations for
> > *every single allocation/free that can occur in a transaction*.
> > 
> > This rapidly blows out to being impractical. We're not going to get
> > full height splits/joins on every single allocation that occurs in a
> > transaction, and the number of potential allocations in a
> > transaction can be considered essentially unbound. e.g.  how many
> > allocations can a single a directory modification do? Hash btree
> > split, data/node btree split,  new freespace block, too - especially
> > when considering that a each directory block could - in the worse
> > case - be 128 x 512 byte extents (64k dir block size on 512 byte
> > block size filesystem).
> > 
> 
> The impression I get is that at the time of designing these transactions
> it was simply easiest to characterize worst case of a single operation
> as a full btree modification, and then either there was an assumption
> made that would be sufficient to cover other multiple allocation/free
> cases (since the full tree changes are relatively rare) or that was just
> borne out in practice and so was never really revisited. 

Yup, that seems to fit the pattern over time of "increase the
reservation when it runs out" fixes....

> > So, no, I don't think it's realistic to start trying to cater for
> > worst case reservations here. Such crazy shit just doesn't happen
> > often enough in the real world for us to penalise everything with
> > such reservations.
> > 
> 
> Agreed.
> 
> > In which case, I think we need to consider that each major operation
> > (inobt, directory, finobt, etc) probably should have it's own
> > allocation reservation. It's extremely unlikely that all of them are
> > going to overrun in the same transaction, so maybe this is a good
> > enough safe guard without going completely over the top...
> > 
> 
> For the ifree example, I take this to mean we should refactor the
> calculation to consider freeing of the inode chunk separate from the
> inobt and finobt modifications with respect to free space tree
> modifications.  In other words, we account one allocfree res for the
> inode chunk, another for the inobt modification (to effectively cover
> agfl fixups) and another (optional) for the finobt as opposed to one for
> the whole transaction. Is that what you're suggesting?
> 
> That sounds reasonable, but at the same time I'm a bit concerned that I
> think the ifree transaction should technically be refactored into a max
> of the pre/post tx roll reservations similar to the write reservation
> mentioned above. For example:
> 
> 	MAX(inode + inobt + finobt + agi + inode chunk + etc.,
> 	    allocfree for inode chunk + (allocfree * nr inobts));

Yeah, I think you're right in that we need to we need to look at
redefining the reservations at a higher level.

For this specific example, though, we have to take into account that
a finobt modification to mark an inode free can insert a new record
into the btree. Hence we have scope for a finobt block allocation
and tree split, so we need to have allocfree reservations in the
primary transaction.

Also, the inode btree modifications (and refcountbt, too) call
xfs_free_extent() directly - they don't use deferops free list that
is processed as an EFI/EFD op after the primary transaction has been
rolled.  i.e:

xfs_btree_delrec
  xfs_btree_free_block
    xfs_inobt_free_block
      xfs_free_extent

Hence the primary transaction reservation has to take into account
that operations that remove a record from the tree can cause and
extent to be freed. As such, the removal of an inode chunk - which
removes the record from both the inobt and finobt - can trigger
freespace tree modification directly.

IOWs, the primary ifree transaction needs alloc/free reservations
for the modifications of both the inobt and the finobt, regardless
of anything else that goes on, so.....

> But that would end up making the transaction smaller than adding it all
> together. Of course, that may be fine with the agfl fixup below..

.... I doubt it will shrink if we take into account the multiple
direct alloc/free operations in the primary transaction properly. :/

Hmmmm - I'm wondering if we need to limit the max extents in a
defer-ops EFI/EFD item to 2 so we don't try to free all the defered
extent frees in a single EFI/EFD transaction pair. i.e. force it
to roll the transaction and get a new log reservation every two
extents being freed so we can limit the reservation size defered
extent freeing requires....

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tr_ifree transaction log reservation calculation
  2017-11-23 21:54               ` Dave Chinner
@ 2017-11-24 14:51                 ` Brian Foster
  2017-11-25 23:20                   ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2017-11-24 14:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Mark Tinguely, linux-xfs

On Fri, Nov 24, 2017 at 08:54:57AM +1100, Dave Chinner wrote:
> On Thu, Nov 23, 2017 at 09:36:53AM -0500, Brian Foster wrote:
> > On Thu, Nov 23, 2017 at 11:24:02AM +1100, Dave Chinner wrote:
> > > On Wed, Nov 22, 2017 at 03:41:26PM -0500, Brian Foster wrote:
> > > > On Wed, Nov 22, 2017 at 07:21:01AM -0500, Brian Foster wrote:
> > > > > On Wed, Nov 22, 2017 at 01:26:45PM +1100, Dave Chinner wrote:
> > > > > > On Tue, Nov 21, 2017 at 12:35:57PM -0500, Brian Foster wrote:
> > > > > > I'm still not 100% sure what the extra "2 +" is accounting for
> > > > > > there, though, so to be safe that migh need to be accounted as full
> > > > > > blocks, not log headers.
> > > > > > 
> > > > > 
> > > > > Same.. there is still a bit of cruft in this calculation that really
> > > > > isn't clear. That +2 is one part and then there's another
> > > > > xfs_calc_buf_res(1, 0) right before the m_ialloc_blks line. The fact
> > > > > that it's a separate line implies it is logically separate, but who
> > > > > knows.
> > > > > 
> > > > > I suppose I'll keep that line as is, replace the inobt entry bit with
> > > > > the full inobt calculation (as above), keep the +2 associated with the
> > > > > inobt calculation and run some tests on that.
> > > 
> > > I think I've worked out what the +2 is supposed to be - the AGF
> > > and AGFL when we alloc/free blocks for either the inode chunk
> > > or inobt records. You see them in, say, xfs_calc_write_reservation()
> > > but not in the inode allocation reservations, despite needing
> > > to do allocation....
> > > 
> > 
> > That sounds sane.. though it looks like in the case of
> > xfs_calc_write_reservation() those are accounted as sectors rather than
> > full blocks as in the ifree calculation. I suppose we should fix those
> > up in the ifree calculation to reflect the sector size and properly
> > document them.
> > 
> > Hmm, xfs_calc_write_reservation() also handles the space freeing bits a
> > bit more how I would expect: using the max of two calculations since a
> > part of the operation occurs after a transaction roll. The ifree
> > calculation just adds everything together (and we still overrun it :/).
> 
> Well, I don't think that's going to matter - more below...
> 
> [....]
> 
> > > So if we want to cover the worst case, we need a reservation for
> > > each freespace tree modification, not one to cover them all. i.e.
> > > one for each AGFL modification, and one for the actual extent being
> > > allocated/freed. And we need one of those reservations for
> > > *every single allocation/free that can occur in a transaction*.
> > > 
> > > This rapidly blows out to being impractical. We're not going to get
> > > full height splits/joins on every single allocation that occurs in a
> > > transaction, and the number of potential allocations in a
> > > transaction can be considered essentially unbound. e.g.  how many
> > > allocations can a single a directory modification do? Hash btree
> > > split, data/node btree split,  new freespace block, too - especially
> > > when considering that a each directory block could - in the worse
> > > case - be 128 x 512 byte extents (64k dir block size on 512 byte
> > > block size filesystem).
> > > 
> > 
> > The impression I get is that at the time of designing these transactions
> > it was simply easiest to characterize worst case of a single operation
> > as a full btree modification, and then either there was an assumption
> > made that would be sufficient to cover other multiple allocation/free
> > cases (since the full tree changes are relatively rare) or that was just
> > borne out in practice and so was never really revisited. 
> 
> Yup, that seems to fit the pattern over time of "increase the
> reservation when it runs out" fixes....
> 
> > > So, no, I don't think it's realistic to start trying to cater for
> > > worst case reservations here. Such crazy shit just doesn't happen
> > > often enough in the real world for us to penalise everything with
> > > such reservations.
> > > 
> > 
> > Agreed.
> > 
> > > In which case, I think we need to consider that each major operation
> > > (inobt, directory, finobt, etc) probably should have it's own
> > > allocation reservation. It's extremely unlikely that all of them are
> > > going to overrun in the same transaction, so maybe this is a good
> > > enough safe guard without going completely over the top...
> > > 
> > 
> > For the ifree example, I take this to mean we should refactor the
> > calculation to consider freeing of the inode chunk separate from the
> > inobt and finobt modifications with respect to free space tree
> > modifications.  In other words, we account one allocfree res for the
> > inode chunk, another for the inobt modification (to effectively cover
> > agfl fixups) and another (optional) for the finobt as opposed to one for
> > the whole transaction. Is that what you're suggesting?
> > 
> > That sounds reasonable, but at the same time I'm a bit concerned that I
> > think the ifree transaction should technically be refactored into a max
> > of the pre/post tx roll reservations similar to the write reservation
> > mentioned above. For example:
> > 
> > 	MAX(inode + inobt + finobt + agi + inode chunk + etc.,
> > 	    allocfree for inode chunk + (allocfree * nr inobts));
> 
> Yeah, I think you're right in that we need to we need to look at
> redefining the reservations at a higher level.
> 
> For this specific example, though, we have to take into account that
> a finobt modification to mark an inode free can insert a new record
> into the btree. Hence we have scope for a finobt block allocation
> and tree split, so we need to have allocfree reservations in the
> primary transaction.
> 
> Also, the inode btree modifications (and refcountbt, too) call
> xfs_free_extent() directly - they don't use deferops free list that
> is processed as an EFI/EFD op after the primary transaction has been
> rolled.  i.e:
> 
> xfs_btree_delrec
>   xfs_btree_free_block
>     xfs_inobt_free_block
>       xfs_free_extent
> 
> Hence the primary transaction reservation has to take into account
> that operations that remove a record from the tree can cause and
> extent to be freed. As such, the removal of an inode chunk - which
> removes the record from both the inobt and finobt - can trigger
> freespace tree modification directly.
> 

Ah, because the inobt doesn't use the agfl. It looks like we try to do
this in xfs_calc_finobt_res(), but it doesn't appear to be fully correct
because the allocfree res is only accounted for transactions that "do
not already account for free space btree modifications," according to
the comment. I don't recall the exact reasoning there off the top of my
head, but if I had to guess, this was probably just another instance of
"bolt something on to the transaction that appears to follow the pattern
of other transactions and doesn't explode."

Given that, it seems like the appropriate thing to do is in fact for the
transaction to include an independent allocfree res for the inode chunk,
the inobt and finobt. That justifies the existence of the current
allocfree in the pre-roll part of the transaction and would increase the
transaction size by one more allocfree in the finobt case (the inode
chunk should technically be part of the post-roll calculation).

Actually, with addition of some proper documentation that may also
obviate the need to do the whole pre/post tx roll thing because the
first part of the transaction will obviously be larger than a single
allocfree res for the second part (amazing how a little documentation to
establish/assess intent can save us time from trying to work backwards
from the transaction :P).

Ok, I think that gives me enough to go on to try and refactor at least
the inobt tx res bits. I'll take a closer look next week, thanks for all
of the feedback.

> IOWs, the primary ifree transaction needs alloc/free reservations
> for the modifications of both the inobt and the finobt, regardless
> of anything else that goes on, so.....
> 
> > But that would end up making the transaction smaller than adding it all
> > together. Of course, that may be fine with the agfl fixup below..
> 
> .... I doubt it will shrink if we take into account the multiple
> direct alloc/free operations in the primary transaction properly. :/
> 
> Hmmmm - I'm wondering if we need to limit the max extents in a
> defer-ops EFI/EFD item to 2 so we don't try to free all the defered
> extent frees in a single EFI/EFD transaction pair. i.e. force it
> to roll the transaction and get a new log reservation every two
> extents being freed so we can limit the reservation size defered
> extent freeing requires....
> 

This sounds similar in principle to the other option I was thinking
about with regard to dealing with the agfl fixup problem (rolling the
tx). The difference is the driving logic: putting a hard limit on the
number of per-tx operations vs. rolling in a particular agfl corner
case.

The former may be more straightforward, particularly if we can avoid
burying deferred ops logic down in the agfl fixup code (or even enforce
it in the deferred ops efi/efd class handling code), but I'll have to
take a closer look at this.

Brian

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tr_ifree transaction log reservation calculation
  2017-11-24 14:51                 ` Brian Foster
@ 2017-11-25 23:20                   ` Dave Chinner
  2017-11-27 18:46                     ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2017-11-25 23:20 UTC (permalink / raw)
  To: Brian Foster; +Cc: Mark Tinguely, linux-xfs

On Fri, Nov 24, 2017 at 09:51:22AM -0500, Brian Foster wrote:
> On Fri, Nov 24, 2017 at 08:54:57AM +1100, Dave Chinner wrote:
> > On Thu, Nov 23, 2017 at 09:36:53AM -0500, Brian Foster wrote:
> > > On Thu, Nov 23, 2017 at 11:24:02AM +1100, Dave Chinner wrote:
> > > That sounds reasonable, but at the same time I'm a bit concerned that I
> > > think the ifree transaction should technically be refactored into a max
> > > of the pre/post tx roll reservations similar to the write reservation
> > > mentioned above. For example:
> > > 
> > > 	MAX(inode + inobt + finobt + agi + inode chunk + etc.,
> > > 	    allocfree for inode chunk + (allocfree * nr inobts));
> > 
> > Yeah, I think you're right in that we need to we need to look at
> > redefining the reservations at a higher level.
> > 
> > For this specific example, though, we have to take into account that
> > a finobt modification to mark an inode free can insert a new record
> > into the btree. Hence we have scope for a finobt block allocation
> > and tree split, so we need to have allocfree reservations in the
> > primary transaction.
> > 
> > Also, the inode btree modifications (and refcountbt, too) call
> > xfs_free_extent() directly - they don't use deferops free list that
> > is processed as an EFI/EFD op after the primary transaction has been
> > rolled.  i.e:
> > 
> > xfs_btree_delrec
> >   xfs_btree_free_block
> >     xfs_inobt_free_block
> >       xfs_free_extent
> > 
> > Hence the primary transaction reservation has to take into account
> > that operations that remove a record from the tree can cause and
> > extent to be freed. As such, the removal of an inode chunk - which
> > removes the record from both the inobt and finobt - can trigger
> > freespace tree modification directly.
> > 
> 
> Ah, because the inobt doesn't use the agfl. It looks like we try to do
> this in xfs_calc_finobt_res(), but it doesn't appear to be fully correct
> because the allocfree res is only accounted for transactions that "do
> not already account for free space btree modifications," according to
> the comment. I don't recall the exact reasoning there off the top of my
> head, but if I had to guess, this was probably just another instance of
> "bolt something on to the transaction that appears to follow the pattern
> of other transactions and doesn't explode."
> 
> Given that, it seems like the appropriate thing to do is in fact for the
> transaction to include an independent allocfree res for the inode chunk,
> the inobt and finobt. That justifies the existence of the current
> allocfree in the pre-roll part of the transaction and would increase the
> transaction size by one more allocfree in the finobt case (the inode
> chunk should technically be part of the post-roll calculation).
> 
> Actually, with addition of some proper documentation that may also
> obviate the need to do the whole pre/post tx roll thing because the
> first part of the transaction will obviously be larger than a single
> allocfree res for the second part (amazing how a little documentation to
> establish/assess intent can save us time from trying to work backwards
> from the transaction :P).
> 
> Ok, I think that gives me enough to go on to try and refactor at least
> the inobt tx res bits. I'll take a closer look next week, thanks for all
> of the feedback.

Actually, I wrote a patch as I was working all this out :p

Not complete, or anything like that, but I've attached it below
so you've got some idea of what I was thinking....

> > IOWs, the primary ifree transaction needs alloc/free reservations
> > for the modifications of both the inobt and the finobt, regardless
> > of anything else that goes on, so.....
> > 
> > > But that would end up making the transaction smaller than adding it all
> > > together. Of course, that may be fine with the agfl fixup below..
> > 
> > .... I doubt it will shrink if we take into account the multiple
> > direct alloc/free operations in the primary transaction properly. :/
> > 
> > Hmmmm - I'm wondering if we need to limit the max extents in a
> > defer-ops EFI/EFD item to 2 so we don't try to free all the defered
> > extent frees in a single EFI/EFD transaction pair. i.e. force it
> > to roll the transaction and get a new log reservation every two
> > extents being freed so we can limit the reservation size defered
> > extent freeing requires....
> 
> This sounds similar in principle to the other option I was thinking
> about with regard to dealing with the agfl fixup problem (rolling the
> tx). The difference is the driving logic: putting a hard limit on the
> number of per-tx operations vs. rolling in a particular agfl corner
> case.

Rolling the transaction at the AGFL level is problematic. The
transaction scope is defined way further up the call stack and we
can't roll transactions while we hold random logged objects locked
that need to stay locked across the entire allocation process.

Hence I don't think we can roll the transaction at the point where
we are modifying the AGFL simply because we don't have the context
available to ensure it can be done in a transactionally atomic and
deadlock free manner.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


xfs: inode transaciton reservations are whacky

From: Dave Chinner <dchinner@redhat.com>

Start hacking cleanups into the code, see if we can untangle the
mess.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c | 128 +++++++++++++++++++++++++----------------
 1 file changed, 77 insertions(+), 51 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 6bd916bd35e2..60d25d353fc0 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -131,6 +131,40 @@ xfs_calc_inode_res(
 		 2 * XFS_BMBT_BLOCK_LEN(mp));
 }
 
+/*
+ * Allocation/freeing of a [f]inobt record requires:
+ *
+ * the inode btree: max depth * block size
+ * the allocation btrees: 2 trees * (max depth - 1) * block size
+ *
+ * This assumes that AGI/AGF/SB modifications are accounted for by the caller.
+ * Modification is assumed as the first block in the tree depth reservation.
+ */
+static uint
+xfs_calc_inobt_alloc_res(
+	struct xfs_mount	*mp)
+{
+	return xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
+	       xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
+				XFS_FSB_TO_B(mp, 1));
+}
+
+/*
+ * Modification of a [f]inobt record requires:
+ *
+ * the inode btree: max depth * block size
+ * the inode btree entry: block size
+ *
+ * This assumes that AGI/AGF/SB modifications are accounted for by the caller.
+ */
+static uint
+xfs_calc_inobt_modify_res(
+	struct xfs_mount	*mp)
+{
+	return xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
+	       (uint)XFS_FSB_TO_B(mp, 1);
+}
+
 /*
  * The free inode btree is a conditional feature and the log reservation
  * requirements differ slightly from that of the traditional inode allocation
@@ -146,30 +180,19 @@ xfs_calc_inode_res(
  * 'alloc' param indicates to include the reservation for free space btree
  * modifications on behalf of finobt modifications. This is required only for
  * transactions that do not already account for free space btree modifications.
- *
- * the free inode btree: max depth * block size
- * the allocation btrees: 2 trees * (max depth - 1) * block size
- * the free inode btree entry: block size
  */
 STATIC uint
 xfs_calc_finobt_res(
 	struct xfs_mount	*mp,
-	int			alloc,
-	int			modify)
+	bool			alloc)
 {
-	uint res;
-
 	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
 		return 0;
 
-	res = xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1));
+	/* alloc implies modification */
 	if (alloc)
-		res += xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-					XFS_FSB_TO_B(mp, 1));
-	if (modify)
-		res += (uint)XFS_FSB_TO_B(mp, 1);
-
-	return res;
+		return xfs_calc_inobt_alloc_res(mp);
+	return xfs_calc_inobt_modify_res(mp);
 }
 
 /*
@@ -232,8 +255,6 @@ xfs_calc_write_reservation(
  *    the super block to reflect the freed blocks: sector size
  *    worst case split in allocation btrees per extent assuming 4 extents:
  *		4 exts * 2 trees * (2 * max depth - 1) * block size
- *    the inode btree: max depth * blocksize
- *    the allocation btrees: 2 trees * (max depth - 1) * block size
  */
 STATIC uint
 xfs_calc_itruncate_reservation(
@@ -245,12 +266,7 @@ xfs_calc_itruncate_reservation(
 				      XFS_FSB_TO_B(mp, 1))),
 		    (xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
 		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4),
-				      XFS_FSB_TO_B(mp, 1)) +
-		    xfs_calc_buf_res(5, 0) +
-		    xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-				     XFS_FSB_TO_B(mp, 1)) +
-		    xfs_calc_buf_res(2 + mp->m_ialloc_blks +
-				     mp->m_in_maxlevels, 0)));
+				      XFS_FSB_TO_B(mp, 1))));
 }
 
 /*
@@ -320,13 +336,13 @@ xfs_calc_link_reservation(
 /*
  * For adding an inode to unlinked list we can modify:
  *    the agi hash list: sector size
- *    the unlinked inode: inode size
+ *    the on-disk unlinked inode: inode cluster size
  */
 STATIC uint
 xfs_calc_iunlink_add_reservation(xfs_mount_t *mp)
 {
 	return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
-		xfs_calc_inode_res(mp, 1);
+	       max_t(uint, XFS_FSB_TO_B(mp, 1), mp->m_inode_cluster_size);
 }
 
 /*
@@ -367,24 +383,26 @@ xfs_calc_remove_reservation(
  *    the new inode: inode size
  *    the inode btree entry: block size
  *    the superblock for the nlink flag: sector size
+ *    the AGI for inode counters: sector size
  *    the directory btree: (max depth + v2) * dir block size
  *    the directory inode's bmap btree: (max depth + v2) * block size
- *    the finobt (record modification and allocation btrees)
+ *    the finobt (record modification/removal and allocation btrees)
+ *    the AGF, AGFL for finobt allocation: 2 * sector size
  */
 STATIC uint
 xfs_calc_create_resv_modify(
 	struct xfs_mount	*mp)
 {
 	return xfs_calc_inode_res(mp, 2) +
-		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
-		(uint)XFS_FSB_TO_B(mp, 1) +
+		xfs_calc_buf_res(4, mp->m_sb.sb_sectsize) +
 		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_finobt_res(mp, 1, 1);
+		xfs_calc_inobt_modify_res(mp) +
+		xfs_calc_finobt_res(mp, true);
 }
 
 /*
  * For create we can allocate some inodes giving:
- *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
+ *    the agi, agf, agfl of the ag getting the new inodes: 3 * sectorsize
  *    the superblock for the nlink flag: sector size
  *    the inode blocks allocated: mp->m_ialloc_blks * blocksize
  *    the inode btree: max depth * blocksize
@@ -394,12 +412,9 @@ STATIC uint
 xfs_calc_create_resv_alloc(
 	struct xfs_mount	*mp)
 {
-	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
-		mp->m_sb.sb_sectsize +
+	return xfs_calc_buf_res(4, mp->m_sb.sb_sectsize) +
 		xfs_calc_buf_res(mp->m_ialloc_blks, XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-				 XFS_FSB_TO_B(mp, 1));
+		xfs_calc_inobt_alloc_res(mp);
 }
 
 STATIC uint
@@ -413,7 +428,7 @@ __xfs_calc_create_reservation(
 
 /*
  * For icreate we can allocate some inodes giving:
- *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
+ *    the agi, agf, agfl of the ag getting the new inodes: 3 * sectorsize
  *    the superblock for the nlink flag: sector size
  *    the inode btree: max depth * blocksize
  *    the allocation btrees: 2 trees * (max depth - 1) * block size
@@ -423,12 +438,9 @@ STATIC uint
 xfs_calc_icreate_resv_alloc(
 	struct xfs_mount	*mp)
 {
-	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
-		mp->m_sb.sb_sectsize +
-		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-				 XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_finobt_res(mp, 0, 0);
+	return xfs_calc_buf_res(4, mp->m_sb.sb_sectsize) +
+		xfs_calc_inobt_alloc_res(mp) +
+		xfs_calc_finobt_res(mp, false);
 }
 
 STATIC uint
@@ -493,26 +505,40 @@ xfs_calc_symlink_reservation(
  *    the super block free inode counter: sector size
  *    the agi hash list and counters: sector size
  *    the inode btree entry: block size
+ *    the on disk inode to null di_next_unlinked: inode cluster size
  *    the on disk inode before ours in the agi hash list: inode cluster size
+ *
+ * If we also have to free an inode chunk or recrod
  *    the inode btree: max depth * blocksize
+ *	Note: this also includes inode btree entry block!
+ *    the AGF and AGFL: 2 sectors
+ *    the allocation btrees: 2 trees * (max depth - 1) * block size
+ *    the log headers to record the inode chunk buffers as freed
+ *
+ * If we also have to modify the finbot - record may need allocation:
+ *    the finobt: max depth * block size
  *    the allocation btrees: 2 trees * (max depth - 1) * block size
- *    the finobt (record insertion, removal or modification)
+ *
  */
 STATIC uint
 xfs_calc_ifree_reservation(
 	struct xfs_mount	*mp)
 {
 	return XFS_DQUOT_LOGRES(mp) +
+		/* inode being freed */
 		xfs_calc_inode_res(mp, 1) +
-		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
-		xfs_calc_buf_res(1, XFS_FSB_TO_B(mp, 1)) +
+		/* sb, AGF, AGFL */
+		xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
+		/* AGI + inode cluster */
 		xfs_calc_iunlink_remove_reservation(mp) +
-		xfs_calc_buf_res(1, 0) +
-		xfs_calc_buf_res(2 + mp->m_ialloc_blks +
-				 mp->m_in_maxlevels, 0) +
-		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-				 XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_finobt_res(mp, 0, 1);
+		/* inode cluster on disk before ours */
+		xfs_calc_iunlink_remove_reservation(mp) +
+		/* inode btree record freeing */
+		xfs_calc_inobt_alloc_res(mp) +
+		/* stale inode chunk buffers (blf & log hdrs only) */
+		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
+		/* free inode btree */
+		xfs_calc_finobt_res(mp, true);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: tr_ifree transaction log reservation calculation
  2017-11-25 23:20                   ` Dave Chinner
@ 2017-11-27 18:46                     ` Brian Foster
  2017-11-27 21:29                       ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2017-11-27 18:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Mark Tinguely, linux-xfs

On Sun, Nov 26, 2017 at 10:20:10AM +1100, Dave Chinner wrote:
> On Fri, Nov 24, 2017 at 09:51:22AM -0500, Brian Foster wrote:
> > On Fri, Nov 24, 2017 at 08:54:57AM +1100, Dave Chinner wrote:
> > > On Thu, Nov 23, 2017 at 09:36:53AM -0500, Brian Foster wrote:
> > > > On Thu, Nov 23, 2017 at 11:24:02AM +1100, Dave Chinner wrote:
> > > > That sounds reasonable, but at the same time I'm a bit concerned that I
> > > > think the ifree transaction should technically be refactored into a max
> > > > of the pre/post tx roll reservations similar to the write reservation
> > > > mentioned above. For example:
> > > > 
> > > > 	MAX(inode + inobt + finobt + agi + inode chunk + etc.,
> > > > 	    allocfree for inode chunk + (allocfree * nr inobts));
> > > 
> > > Yeah, I think you're right in that we need to we need to look at
> > > redefining the reservations at a higher level.
> > > 
> > > For this specific example, though, we have to take into account that
> > > a finobt modification to mark an inode free can insert a new record
> > > into the btree. Hence we have scope for a finobt block allocation
> > > and tree split, so we need to have allocfree reservations in the
> > > primary transaction.
> > > 
> > > Also, the inode btree modifications (and refcountbt, too) call
> > > xfs_free_extent() directly - they don't use deferops free list that
> > > is processed as an EFI/EFD op after the primary transaction has been
> > > rolled.  i.e:
> > > 
> > > xfs_btree_delrec
> > >   xfs_btree_free_block
> > >     xfs_inobt_free_block
> > >       xfs_free_extent
> > > 
> > > Hence the primary transaction reservation has to take into account
> > > that operations that remove a record from the tree can cause and
> > > extent to be freed. As such, the removal of an inode chunk - which
> > > removes the record from both the inobt and finobt - can trigger
> > > freespace tree modification directly.
> > > 
> > 
> > Ah, because the inobt doesn't use the agfl. It looks like we try to do
> > this in xfs_calc_finobt_res(), but it doesn't appear to be fully correct
> > because the allocfree res is only accounted for transactions that "do
> > not already account for free space btree modifications," according to
> > the comment. I don't recall the exact reasoning there off the top of my
> > head, but if I had to guess, this was probably just another instance of
> > "bolt something on to the transaction that appears to follow the pattern
> > of other transactions and doesn't explode."
> > 
> > Given that, it seems like the appropriate thing to do is in fact for the
> > transaction to include an independent allocfree res for the inode chunk,
> > the inobt and finobt. That justifies the existence of the current
> > allocfree in the pre-roll part of the transaction and would increase the
> > transaction size by one more allocfree in the finobt case (the inode
> > chunk should technically be part of the post-roll calculation).
> > 
> > Actually, with addition of some proper documentation that may also
> > obviate the need to do the whole pre/post tx roll thing because the
> > first part of the transaction will obviously be larger than a single
> > allocfree res for the second part (amazing how a little documentation to
> > establish/assess intent can save us time from trying to work backwards
> > from the transaction :P).
> > 
> > Ok, I think that gives me enough to go on to try and refactor at least
> > the inobt tx res bits. I'll take a closer look next week, thanks for all
> > of the feedback.
> 
> Actually, I wrote a patch as I was working all this out :p
> 
> Not complete, or anything like that, but I've attached it below
> so you've got some idea of what I was thinking....
> 

Ok. While I get the basic idea, this isn't quite what I was thinking
with regard to the finobt bits. The current finobt reservation applies
the allocation part of the logic only when the higher level transaction
doesn't already include the allocfree res. This kind of handwaves away
the finobt allocfree res by assuming the existing transaction
reservation would cover the blocks alloc'd/freed by the finobt.

IOW, there isn't really a clear separation between the modify and
allocation cases for the finobt as there is for the inobt. The finobt
can add/remove a record in either of those cases. The requirement to add
an additional allocfree res with the finobt operation pretty much
invalidates the finobt modify case, I think.

Also, it looks like this makes some other changes that are not
immediately clear to me, squashes in a couple of the fixes I've already
made and doesn't actually add a new allocfree res in some of the
create/alloc calculations (but rather refactors the existing allocfree
res to use the new helper). It's not clear to me if the latter is
intentional..?

In any event, I'd like to keep the previous bugfixes separate from the
refactoring so I will post what I was thinking with regard to the
refactoring bits on top of the bugfix series. If I'm missing something
or you had a different approach, I'd appreciate if we could work off the
bugfixes as a baseline..

> > > IOWs, the primary ifree transaction needs alloc/free reservations
> > > for the modifications of both the inobt and the finobt, regardless
> > > of anything else that goes on, so.....
> > > 
> > > > But that would end up making the transaction smaller than adding it all
> > > > together. Of course, that may be fine with the agfl fixup below..
> > > 
> > > .... I doubt it will shrink if we take into account the multiple
> > > direct alloc/free operations in the primary transaction properly. :/
> > > 
> > > Hmmmm - I'm wondering if we need to limit the max extents in a
> > > defer-ops EFI/EFD item to 2 so we don't try to free all the defered
> > > extent frees in a single EFI/EFD transaction pair. i.e. force it
> > > to roll the transaction and get a new log reservation every two
> > > extents being freed so we can limit the reservation size defered
> > > extent freeing requires....
> > 
> > This sounds similar in principle to the other option I was thinking
> > about with regard to dealing with the agfl fixup problem (rolling the
> > tx). The difference is the driving logic: putting a hard limit on the
> > number of per-tx operations vs. rolling in a particular agfl corner
> > case.
> 
> Rolling the transaction at the AGFL level is problematic. The
> transaction scope is defined way further up the call stack and we
> can't roll transactions while we hold random logged objects locked
> that need to stay locked across the entire allocation process.
> 
> Hence I don't think we can roll the transaction at the point where
> we are modifying the AGFL simply because we don't have the context
> available to ensure it can be done in a transactionally atomic and
> deadlock free manner.
> 

Yes, the idea wasn't to explicitly roll the transaction down in the AGFL
code. Rather, refactor the code to allow the deferred ops infrastructure
to do so (so we relog the intent as well). This would most likely
require some ugly factoring to deal with the layering.

No matter, I think a hard restriction of a couple frees per transaction
is much more straightforward and thus a better idea.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> 
> xfs: inode transaciton reservations are whacky
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Start hacking cleanups into the code, see if we can untangle the
> mess.
> 
> Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_trans_resv.c | 128 +++++++++++++++++++++++++----------------
>  1 file changed, 77 insertions(+), 51 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index 6bd916bd35e2..60d25d353fc0 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -131,6 +131,40 @@ xfs_calc_inode_res(
>  		 2 * XFS_BMBT_BLOCK_LEN(mp));
>  }
>  
> +/*
> + * Allocation/freeing of a [f]inobt record requires:
> + *
> + * the inode btree: max depth * block size
> + * the allocation btrees: 2 trees * (max depth - 1) * block size
> + *
> + * This assumes that AGI/AGF/SB modifications are accounted for by the caller.
> + * Modification is assumed as the first block in the tree depth reservation.
> + */
> +static uint
> +xfs_calc_inobt_alloc_res(
> +	struct xfs_mount	*mp)
> +{
> +	return xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> +	       xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> +				XFS_FSB_TO_B(mp, 1));
> +}
> +
> +/*
> + * Modification of a [f]inobt record requires:
> + *
> + * the inode btree: max depth * block size
> + * the inode btree entry: block size
> + *
> + * This assumes that AGI/AGF/SB modifications are accounted for by the caller.
> + */
> +static uint
> +xfs_calc_inobt_modify_res(
> +	struct xfs_mount	*mp)
> +{
> +	return xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> +	       (uint)XFS_FSB_TO_B(mp, 1);
> +}
> +
>  /*
>   * The free inode btree is a conditional feature and the log reservation
>   * requirements differ slightly from that of the traditional inode allocation
> @@ -146,30 +180,19 @@ xfs_calc_inode_res(
>   * 'alloc' param indicates to include the reservation for free space btree
>   * modifications on behalf of finobt modifications. This is required only for
>   * transactions that do not already account for free space btree modifications.
> - *
> - * the free inode btree: max depth * block size
> - * the allocation btrees: 2 trees * (max depth - 1) * block size
> - * the free inode btree entry: block size
>   */
>  STATIC uint
>  xfs_calc_finobt_res(
>  	struct xfs_mount	*mp,
> -	int			alloc,
> -	int			modify)
> +	bool			alloc)
>  {
> -	uint res;
> -
>  	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
>  		return 0;
>  
> -	res = xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1));
> +	/* alloc implies modification */
>  	if (alloc)
> -		res += xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> -					XFS_FSB_TO_B(mp, 1));
> -	if (modify)
> -		res += (uint)XFS_FSB_TO_B(mp, 1);
> -
> -	return res;
> +		return xfs_calc_inobt_alloc_res(mp);
> +	return xfs_calc_inobt_modify_res(mp);
>  }
>
>  /*
> @@ -232,8 +255,6 @@ xfs_calc_write_reservation(
>   *    the super block to reflect the freed blocks: sector size
>   *    worst case split in allocation btrees per extent assuming 4 extents:
>   *		4 exts * 2 trees * (2 * max depth - 1) * block size
> - *    the inode btree: max depth * blocksize
> - *    the allocation btrees: 2 trees * (max depth - 1) * block size
>   */
>  STATIC uint
>  xfs_calc_itruncate_reservation(
> @@ -245,12 +266,7 @@ xfs_calc_itruncate_reservation(
>  				      XFS_FSB_TO_B(mp, 1))),
>  		    (xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
>  		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4),
> -				      XFS_FSB_TO_B(mp, 1)) +
> -		    xfs_calc_buf_res(5, 0) +
> -		    xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> -				     XFS_FSB_TO_B(mp, 1)) +
> -		    xfs_calc_buf_res(2 + mp->m_ialloc_blks +
> -				     mp->m_in_maxlevels, 0)));
> +				      XFS_FSB_TO_B(mp, 1))));
>  }
>  
>  /*
> @@ -320,13 +336,13 @@ xfs_calc_link_reservation(
>  /*
>   * For adding an inode to unlinked list we can modify:
>   *    the agi hash list: sector size
> - *    the unlinked inode: inode size
> + *    the on-disk unlinked inode: inode cluster size
>   */
>  STATIC uint
>  xfs_calc_iunlink_add_reservation(xfs_mount_t *mp)
>  {
>  	return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
> -		xfs_calc_inode_res(mp, 1);
> +	       max_t(uint, XFS_FSB_TO_B(mp, 1), mp->m_inode_cluster_size);
>  }
>  
>  /*
> @@ -367,24 +383,26 @@ xfs_calc_remove_reservation(
>   *    the new inode: inode size
>   *    the inode btree entry: block size
>   *    the superblock for the nlink flag: sector size
> + *    the AGI for inode counters: sector size
>   *    the directory btree: (max depth + v2) * dir block size
>   *    the directory inode's bmap btree: (max depth + v2) * block size
> - *    the finobt (record modification and allocation btrees)
> + *    the finobt (record modification/removal and allocation btrees)
> + *    the AGF, AGFL for finobt allocation: 2 * sector size
>   */
>  STATIC uint
>  xfs_calc_create_resv_modify(
>  	struct xfs_mount	*mp)
>  {
>  	return xfs_calc_inode_res(mp, 2) +
> -		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
> -		(uint)XFS_FSB_TO_B(mp, 1) +
> +		xfs_calc_buf_res(4, mp->m_sb.sb_sectsize) +
>  		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_finobt_res(mp, 1, 1);
> +		xfs_calc_inobt_modify_res(mp) +
> +		xfs_calc_finobt_res(mp, true);
>  }
>  
>  /*
>   * For create we can allocate some inodes giving:
> - *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
> + *    the agi, agf, agfl of the ag getting the new inodes: 3 * sectorsize
>   *    the superblock for the nlink flag: sector size
>   *    the inode blocks allocated: mp->m_ialloc_blks * blocksize
>   *    the inode btree: max depth * blocksize
> @@ -394,12 +412,9 @@ STATIC uint
>  xfs_calc_create_resv_alloc(
>  	struct xfs_mount	*mp)
>  {
> -	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> -		mp->m_sb.sb_sectsize +
> +	return xfs_calc_buf_res(4, mp->m_sb.sb_sectsize) +
>  		xfs_calc_buf_res(mp->m_ialloc_blks, XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> -				 XFS_FSB_TO_B(mp, 1));
> +		xfs_calc_inobt_alloc_res(mp);
>  }
>  
>  STATIC uint
> @@ -413,7 +428,7 @@ __xfs_calc_create_reservation(
>  
>  /*
>   * For icreate we can allocate some inodes giving:
> - *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
> + *    the agi, agf, agfl of the ag getting the new inodes: 3 * sectorsize
>   *    the superblock for the nlink flag: sector size
>   *    the inode btree: max depth * blocksize
>   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> @@ -423,12 +438,9 @@ STATIC uint
>  xfs_calc_icreate_resv_alloc(
>  	struct xfs_mount	*mp)
>  {
> -	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> -		mp->m_sb.sb_sectsize +
> -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> -				 XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_finobt_res(mp, 0, 0);
> +	return xfs_calc_buf_res(4, mp->m_sb.sb_sectsize) +
> +		xfs_calc_inobt_alloc_res(mp) +
> +		xfs_calc_finobt_res(mp, false);
>  }
>  
>  STATIC uint
> @@ -493,26 +505,40 @@ xfs_calc_symlink_reservation(
>   *    the super block free inode counter: sector size
>   *    the agi hash list and counters: sector size
>   *    the inode btree entry: block size
> + *    the on disk inode to null di_next_unlinked: inode cluster size
>   *    the on disk inode before ours in the agi hash list: inode cluster size
> + *
> + * If we also have to free an inode chunk or recrod
>   *    the inode btree: max depth * blocksize
> + *	Note: this also includes inode btree entry block!
> + *    the AGF and AGFL: 2 sectors
> + *    the allocation btrees: 2 trees * (max depth - 1) * block size
> + *    the log headers to record the inode chunk buffers as freed
> + *
> + * If we also have to modify the finbot - record may need allocation:
> + *    the finobt: max depth * block size
>   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> - *    the finobt (record insertion, removal or modification)
> + *
>   */
>  STATIC uint
>  xfs_calc_ifree_reservation(
>  	struct xfs_mount	*mp)
>  {
>  	return XFS_DQUOT_LOGRES(mp) +
> +		/* inode being freed */
>  		xfs_calc_inode_res(mp, 1) +
> -		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
> -		xfs_calc_buf_res(1, XFS_FSB_TO_B(mp, 1)) +
> +		/* sb, AGF, AGFL */
> +		xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
> +		/* AGI + inode cluster */
>  		xfs_calc_iunlink_remove_reservation(mp) +
> -		xfs_calc_buf_res(1, 0) +
> -		xfs_calc_buf_res(2 + mp->m_ialloc_blks +
> -				 mp->m_in_maxlevels, 0) +
> -		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> -				 XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_finobt_res(mp, 0, 1);
> +		/* inode cluster on disk before ours */
> +		xfs_calc_iunlink_remove_reservation(mp) +
> +		/* inode btree record freeing */
> +		xfs_calc_inobt_alloc_res(mp) +
> +		/* stale inode chunk buffers (blf & log hdrs only) */
> +		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
> +		/* free inode btree */
> +		xfs_calc_finobt_res(mp, true);
>  }
>  
>  /*
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tr_ifree transaction log reservation calculation
  2017-11-27 18:46                     ` Brian Foster
@ 2017-11-27 21:29                       ` Dave Chinner
  2017-11-28 13:28                         ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2017-11-27 21:29 UTC (permalink / raw)
  To: Brian Foster; +Cc: Mark Tinguely, linux-xfs

On Mon, Nov 27, 2017 at 01:46:51PM -0500, Brian Foster wrote:
> On Sun, Nov 26, 2017 at 10:20:10AM +1100, Dave Chinner wrote:
> > On Fri, Nov 24, 2017 at 09:51:22AM -0500, Brian Foster wrote:
> > > On Fri, Nov 24, 2017 at 08:54:57AM +1100, Dave Chinner wrote:
> > > > On Thu, Nov 23, 2017 at 09:36:53AM -0500, Brian Foster wrote:
> > > > > On Thu, Nov 23, 2017 at 11:24:02AM +1100, Dave Chinner wrote:
> > > > > That sounds reasonable, but at the same time I'm a bit concerned that I
> > > > > think the ifree transaction should technically be refactored into a max
> > > > > of the pre/post tx roll reservations similar to the write reservation
> > > > > mentioned above. For example:
> > > > > 
> > > > > 	MAX(inode + inobt + finobt + agi + inode chunk + etc.,
> > > > > 	    allocfree for inode chunk + (allocfree * nr inobts));
> > > > 
> > > > Yeah, I think you're right in that we need to we need to look at
> > > > redefining the reservations at a higher level.
> > > > 
> > > > For this specific example, though, we have to take into account that
> > > > a finobt modification to mark an inode free can insert a new record
> > > > into the btree. Hence we have scope for a finobt block allocation
> > > > and tree split, so we need to have allocfree reservations in the
> > > > primary transaction.
> > > > 
> > > > Also, the inode btree modifications (and refcountbt, too) call
> > > > xfs_free_extent() directly - they don't use deferops free list that
> > > > is processed as an EFI/EFD op after the primary transaction has been
> > > > rolled.  i.e:
> > > > 
> > > > xfs_btree_delrec
> > > >   xfs_btree_free_block
> > > >     xfs_inobt_free_block
> > > >       xfs_free_extent
> > > > 
> > > > Hence the primary transaction reservation has to take into account
> > > > that operations that remove a record from the tree can cause and
> > > > extent to be freed. As such, the removal of an inode chunk - which
> > > > removes the record from both the inobt and finobt - can trigger
> > > > freespace tree modification directly.
> > > > 
> > > 
> > > Ah, because the inobt doesn't use the agfl. It looks like we try to do
> > > this in xfs_calc_finobt_res(), but it doesn't appear to be fully correct
> > > because the allocfree res is only accounted for transactions that "do
> > > not already account for free space btree modifications," according to
> > > the comment. I don't recall the exact reasoning there off the top of my
> > > head, but if I had to guess, this was probably just another instance of
> > > "bolt something on to the transaction that appears to follow the pattern
> > > of other transactions and doesn't explode."
> > > 
> > > Given that, it seems like the appropriate thing to do is in fact for the
> > > transaction to include an independent allocfree res for the inode chunk,
> > > the inobt and finobt. That justifies the existence of the current
> > > allocfree in the pre-roll part of the transaction and would increase the
> > > transaction size by one more allocfree in the finobt case (the inode
> > > chunk should technically be part of the post-roll calculation).
> > > 
> > > Actually, with addition of some proper documentation that may also
> > > obviate the need to do the whole pre/post tx roll thing because the
> > > first part of the transaction will obviously be larger than a single
> > > allocfree res for the second part (amazing how a little documentation to
> > > establish/assess intent can save us time from trying to work backwards
> > > from the transaction :P).
> > > 
> > > Ok, I think that gives me enough to go on to try and refactor at least
> > > the inobt tx res bits. I'll take a closer look next week, thanks for all
> > > of the feedback.
> > 
> > Actually, I wrote a patch as I was working all this out :p
> > 
> > Not complete, or anything like that, but I've attached it below
> > so you've got some idea of what I was thinking....
> > 
> 
> Ok. While I get the basic idea, this isn't quite what I was thinking
> with regard to the finobt bits. The current finobt reservation applies
> the allocation part of the logic only when the higher level transaction
> doesn't already include the allocfree res.

Which is wrong, as we've already discussed. If we are doing an
allocation of [f]inobt blocks, we need an allocfree reservation for
that specific operation.

> This kind of handwaves away
> the finobt allocfree res by assuming the existing transaction
> reservation would cover the blocks alloc'd/freed by the finobt.
>
> IOW, there isn't really a clear separation between the modify and
> allocation cases for the finobt as there is for the inobt. The finobt
> can add/remove a record in either of those cases. The requirement to add
> an additional allocfree res with the finobt operation pretty much
> invalidates the finobt modify case, I think.

Sure. What I did was separate allocation vs no allocation
(modification) in terms of what the tree operation requires. But
that doesn't mean we actually use the modification variant for
finobt if all operations we take reservations for could require
allocation.

It's the difference between "this is the reservation for a pure
record modification operation" vs "this is the reservation for a
record modification operation that *may* require record insertion or
deletion and hence allocation/freeing"

> Also, it looks like this makes some other changes that are not
> immediately clear to me,

Which ones?

> squashes in a couple of the fixes I've already
> made and doesn't actually add a new allocfree res in some of the
> create/alloc calculations (but rather refactors the existing allocfree
> res to use the new helper). It's not clear to me if the latter is
> intentional..?

Not sure what you are asking? I factored existing open-coded inobt
reservations to use helpers so that I didn't have to modify the
reservation in lots of places. Get it right in one place, all the
others are correct too. Indeed, most of the cases I factored already
had the allocfree reservation, so I'm not sure why we'd be adding a
new one to those reservations?

And, keep in mind that I did say "the patch is just what I wrote as
I discovered problems" before tearing into it about not being
complete a complete solution....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tr_ifree transaction log reservation calculation
  2017-11-27 21:29                       ` Dave Chinner
@ 2017-11-28 13:28                         ` Brian Foster
  2017-11-28 21:34                           ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2017-11-28 13:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Mark Tinguely, linux-xfs

On Tue, Nov 28, 2017 at 08:29:46AM +1100, Dave Chinner wrote:
> On Mon, Nov 27, 2017 at 01:46:51PM -0500, Brian Foster wrote:
> > On Sun, Nov 26, 2017 at 10:20:10AM +1100, Dave Chinner wrote:
> > > On Fri, Nov 24, 2017 at 09:51:22AM -0500, Brian Foster wrote:
> > > > On Fri, Nov 24, 2017 at 08:54:57AM +1100, Dave Chinner wrote:
> > > > > On Thu, Nov 23, 2017 at 09:36:53AM -0500, Brian Foster wrote:
> > > > > > On Thu, Nov 23, 2017 at 11:24:02AM +1100, Dave Chinner wrote:
> > > > > > That sounds reasonable, but at the same time I'm a bit concerned that I
> > > > > > think the ifree transaction should technically be refactored into a max
> > > > > > of the pre/post tx roll reservations similar to the write reservation
> > > > > > mentioned above. For example:
> > > > > > 
> > > > > > 	MAX(inode + inobt + finobt + agi + inode chunk + etc.,
> > > > > > 	    allocfree for inode chunk + (allocfree * nr inobts));
> > > > > 
> > > > > Yeah, I think you're right in that we need to we need to look at
> > > > > redefining the reservations at a higher level.
> > > > > 
> > > > > For this specific example, though, we have to take into account that
> > > > > a finobt modification to mark an inode free can insert a new record
> > > > > into the btree. Hence we have scope for a finobt block allocation
> > > > > and tree split, so we need to have allocfree reservations in the
> > > > > primary transaction.
> > > > > 
> > > > > Also, the inode btree modifications (and refcountbt, too) call
> > > > > xfs_free_extent() directly - they don't use deferops free list that
> > > > > is processed as an EFI/EFD op after the primary transaction has been
> > > > > rolled.  i.e:
> > > > > 
> > > > > xfs_btree_delrec
> > > > >   xfs_btree_free_block
> > > > >     xfs_inobt_free_block
> > > > >       xfs_free_extent
> > > > > 
> > > > > Hence the primary transaction reservation has to take into account
> > > > > that operations that remove a record from the tree can cause and
> > > > > extent to be freed. As such, the removal of an inode chunk - which
> > > > > removes the record from both the inobt and finobt - can trigger
> > > > > freespace tree modification directly.
> > > > > 
> > > > 
> > > > Ah, because the inobt doesn't use the agfl. It looks like we try to do
> > > > this in xfs_calc_finobt_res(), but it doesn't appear to be fully correct
> > > > because the allocfree res is only accounted for transactions that "do
> > > > not already account for free space btree modifications," according to
> > > > the comment. I don't recall the exact reasoning there off the top of my
> > > > head, but if I had to guess, this was probably just another instance of
> > > > "bolt something on to the transaction that appears to follow the pattern
> > > > of other transactions and doesn't explode."
> > > > 
> > > > Given that, it seems like the appropriate thing to do is in fact for the
> > > > transaction to include an independent allocfree res for the inode chunk,
> > > > the inobt and finobt. That justifies the existence of the current
> > > > allocfree in the pre-roll part of the transaction and would increase the
> > > > transaction size by one more allocfree in the finobt case (the inode
> > > > chunk should technically be part of the post-roll calculation).
> > > > 
> > > > Actually, with addition of some proper documentation that may also
> > > > obviate the need to do the whole pre/post tx roll thing because the
> > > > first part of the transaction will obviously be larger than a single
> > > > allocfree res for the second part (amazing how a little documentation to
> > > > establish/assess intent can save us time from trying to work backwards
> > > > from the transaction :P).
> > > > 
> > > > Ok, I think that gives me enough to go on to try and refactor at least
> > > > the inobt tx res bits. I'll take a closer look next week, thanks for all
> > > > of the feedback.
> > > 
> > > Actually, I wrote a patch as I was working all this out :p
> > > 
> > > Not complete, or anything like that, but I've attached it below
> > > so you've got some idea of what I was thinking....
> > > 
> > 
> > Ok. While I get the basic idea, this isn't quite what I was thinking
> > with regard to the finobt bits. The current finobt reservation applies
> > the allocation part of the logic only when the higher level transaction
> > doesn't already include the allocfree res.
> 
> Which is wrong, as we've already discussed. If we are doing an
> allocation of [f]inobt blocks, we need an allocfree reservation for
> that specific operation.
> 

Yep, just restating the obvious..

> > This kind of handwaves away
> > the finobt allocfree res by assuming the existing transaction
> > reservation would cover the blocks alloc'd/freed by the finobt.
> >
> > IOW, there isn't really a clear separation between the modify and
> > allocation cases for the finobt as there is for the inobt. The finobt
> > can add/remove a record in either of those cases. The requirement to add
> > an additional allocfree res with the finobt operation pretty much
> > invalidates the finobt modify case, I think.
> 
> Sure. What I did was separate allocation vs no allocation
> (modification) in terms of what the tree operation requires. But
> that doesn't mean we actually use the modification variant for
> finobt if all operations we take reservations for could require
> allocation.
> 
> It's the difference between "this is the reservation for a pure
> record modification operation" vs "this is the reservation for a
> record modification operation that *may* require record insertion or
> deletion and hence allocation/freeing"
> 

Ok, I just think it's cleaner if that concept is wiped away wrt to
finobt. It's confusing to me, at least, and I probably wrote the code.
:P

> > Also, it looks like this makes some other changes that are not
> > immediately clear to me,
> 
> Which ones?
> 

The truncate and iunlink bits...

> > squashes in a couple of the fixes I've already
> > made and doesn't actually add a new allocfree res in some of the
> > create/alloc calculations (but rather refactors the existing allocfree
> > res to use the new helper). It's not clear to me if the latter is
> > intentional..?
> 
> Not sure what you are asking? I factored existing open-coded inobt
> reservations to use helpers so that I didn't have to modify the
> reservation in lots of places. Get it right in one place, all the
> others are correct too. Indeed, most of the cases I factored already
> had the allocfree reservation, so I'm not sure why we'd be adding a
> new one to those reservations?
> 

Understood wrt to the helpers. What I'm asking is why
xfs_calc_create_resv_alloc(), for example, doesn't actually add another
allocfree res if the intent was to add one for all inode tree related
ops..? Using that example, my understanding is that tx should now have
an allocfree res for the inode chunk allocation and a new one for the
inobt insertion. The latter is the one that's now abstracted into the
helpers, yes..? That's the model I tried to follow for the rfc patch,
anyways.

> And, keep in mind that I did say "the patch is just what I wrote as
> I discovered problems" before tearing into it about not being
> complete a complete solution....
> 

Yeah, I didn't look too hard beyond trying to understand intent.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tr_ifree transaction log reservation calculation
  2017-11-28 13:28                         ` Brian Foster
@ 2017-11-28 21:34                           ` Dave Chinner
  2017-11-29 14:31                             ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2017-11-28 21:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: Mark Tinguely, linux-xfs

On Tue, Nov 28, 2017 at 08:28:02AM -0500, Brian Foster wrote:
> On Tue, Nov 28, 2017 at 08:29:46AM +1100, Dave Chinner wrote:
> > On Mon, Nov 27, 2017 at 01:46:51PM -0500, Brian Foster wrote:
> > > Also, it looks like this makes some other changes that are not
> > > immediately clear to me,
> > 
> > Which ones?
> > 
> 
> The truncate and iunlink bits...

The truncate reservation includes a reservation for an inobt
modification. We *never* modify the inobt inside a truncate
transaction, so even if we ignore the fact it was wrong (like all
the others we are fixing), it really shouldn't be there.

THe commit message from the archive tree doesn't help explain it,
either. It just says (paraphrasing) "fix transaction reservation".
So I removed it.

As for the iunlink bit, you mean this the fact I added the on disk
inode cluster to the reservation in
xfs_calc_iunlink_add_reservation()?

That's because we log the inode cluster in unlink when we modify the
inode unlinked list. That's missing from all the unlinked inode
manipulation reservations - some of them take into account modifying
the inode cluster of another inode in the unlinked list (e.g. ifree)
but they all fail to reserve space for logging the unlinked list
pointer in the inode being unlinked/freed.

> > > squashes in a couple of the fixes I've already
> > > made and doesn't actually add a new allocfree res in some of the
> > > create/alloc calculations (but rather refactors the existing allocfree
> > > res to use the new helper). It's not clear to me if the latter is
> > > intentional..?
> > 
> > Not sure what you are asking? I factored existing open-coded inobt
> > reservations to use helpers so that I didn't have to modify the
> > reservation in lots of places. Get it right in one place, all the
> > others are correct too. Indeed, most of the cases I factored already
> > had the allocfree reservation, so I'm not sure why we'd be adding a
> > new one to those reservations?
> > 
> 
> Understood wrt to the helpers. What I'm asking is why
> xfs_calc_create_resv_alloc(), for example, doesn't actually add another
> allocfree res if the intent was to add one for all inode tree related
> ops..?

Because the patch wasn't complete and finished and there's no
guarantee I factored it correctly. You're still reading that patch
as though it was a completely polished, finished patch, not a set of
"noticed these things while thinking about the problem" notes.

I factored the code first, never went back the create
code because we were focussed on the problems in the iunlink
reservations....

> Using that example, my understanding is that tx should now have
> an allocfree res for the inode chunk allocation and a new one for the
> inobt insertion. The latter is the one that's now abstracted into the
> helpers, yes..? That's the model I tried to follow for the rfc patch,
> anyways.

Sure, that's what needs to be done, as we've discussed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tr_ifree transaction log reservation calculation
  2017-11-28 21:34                           ` Dave Chinner
@ 2017-11-29 14:31                             ` Brian Foster
  0 siblings, 0 replies; 15+ messages in thread
From: Brian Foster @ 2017-11-29 14:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Mark Tinguely, linux-xfs

On Wed, Nov 29, 2017 at 08:34:08AM +1100, Dave Chinner wrote:
> On Tue, Nov 28, 2017 at 08:28:02AM -0500, Brian Foster wrote:
> > On Tue, Nov 28, 2017 at 08:29:46AM +1100, Dave Chinner wrote:
> > > On Mon, Nov 27, 2017 at 01:46:51PM -0500, Brian Foster wrote:
> > > > Also, it looks like this makes some other changes that are not
> > > > immediately clear to me,
> > > 
> > > Which ones?
> > > 
> > 
> > The truncate and iunlink bits...
> 
> The truncate reservation includes a reservation for an inobt
> modification. We *never* modify the inobt inside a truncate
> transaction, so even if we ignore the fact it was wrong (like all
> the others we are fixing), it really shouldn't be there.
> 
> THe commit message from the archive tree doesn't help explain it,
> either. It just says (paraphrasing) "fix transaction reservation".
> So I removed it.
> 

Ok..

> As for the iunlink bit, you mean this the fact I added the on disk
> inode cluster to the reservation in
> xfs_calc_iunlink_add_reservation()?
> 
> That's because we log the inode cluster in unlink when we modify the
> inode unlinked list. That's missing from all the unlinked inode
> manipulation reservations - some of them take into account modifying
> the inode cluster of another inode in the unlinked list (e.g. ifree)
> but they all fail to reserve space for logging the unlinked list
> pointer in the inode being unlinked/freed.
> 

This is clear to me now via the other thread. I'll incorporate both of
these fixes, thanks.

> > > > squashes in a couple of the fixes I've already
> > > > made and doesn't actually add a new allocfree res in some of the
> > > > create/alloc calculations (but rather refactors the existing allocfree
> > > > res to use the new helper). It's not clear to me if the latter is
> > > > intentional..?
> > > 
> > > Not sure what you are asking? I factored existing open-coded inobt
> > > reservations to use helpers so that I didn't have to modify the
> > > reservation in lots of places. Get it right in one place, all the
> > > others are correct too. Indeed, most of the cases I factored already
> > > had the allocfree reservation, so I'm not sure why we'd be adding a
> > > new one to those reservations?
> > > 
> > 
> > Understood wrt to the helpers. What I'm asking is why
> > xfs_calc_create_resv_alloc(), for example, doesn't actually add another
> > allocfree res if the intent was to add one for all inode tree related
> > ops..?
> 
> Because the patch wasn't complete and finished and there's no
> guarantee I factored it correctly. You're still reading that patch
> as though it was a completely polished, finished patch, not a set of
> "noticed these things while thinking about the problem" notes.
> 

Nope.. I skimmed it and was confused about a hunk that was reworked in a
way that conflicted with my understanding of the intent. It's really
that simple. I appreciate the psychological evaluation, but I think "no
that's not intentional, it's just a bug/incomplete" would have sufficed.
;)

Brian

> I factored the code first, never went back the create
> code because we were focussed on the problems in the iunlink
> reservations....
> 
> > Using that example, my understanding is that tx should now have
> > an allocfree res for the inode chunk allocation and a new one for the
> > inobt insertion. The latter is the one that's now abstracted into the
> > helpers, yes..? That's the model I tried to follow for the rfc patch,
> > anyways.
> 
> Sure, that's what needs to be done, as we've discussed.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-11-29 14:31 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-21 15:05 tr_ifree transaction log reservation calculation Brian Foster
     [not found] ` <0a9b47ba-a41e-3b96-981f-f04f9e2ab11c@hpe.com>
2017-11-21 17:35   ` Brian Foster
2017-11-22  2:26     ` Dave Chinner
2017-11-22 12:21       ` Brian Foster
2017-11-22 20:41         ` Brian Foster
2017-11-23  0:24           ` Dave Chinner
2017-11-23 14:36             ` Brian Foster
2017-11-23 21:54               ` Dave Chinner
2017-11-24 14:51                 ` Brian Foster
2017-11-25 23:20                   ` Dave Chinner
2017-11-27 18:46                     ` Brian Foster
2017-11-27 21:29                       ` Dave Chinner
2017-11-28 13:28                         ` Brian Foster
2017-11-28 21:34                           ` Dave Chinner
2017-11-29 14:31                             ` Brian Foster

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.