From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 3/5] [RFC] xfs: use percpu counters for CIL context counters
Date: Thu, 14 May 2020 12:49:21 +1000
Message-ID: <20200514024921.GJ2040@dread.disaster.area>
In-Reply-To: <20200514015055.GI2040@dread.disaster.area>

On Thu, May 14, 2020 at 11:50:55AM +1000, Dave Chinner wrote:
> On Thu, May 14, 2020 at 07:52:41AM +1000, Dave Chinner wrote:
> > On Wed, May 13, 2020 at 08:09:59AM -0400, Brian Foster wrote:
> > > On Wed, May 13, 2020 at 09:36:27AM +1000, Dave Chinner wrote:
> > > > On Tue, May 12, 2020 at 10:05:44AM -0400, Brian Foster wrote:
> > > > > Particularly as it relates to percpu functionality. Does
> > > > > the window scale with cpu count, for example? It might not matter either
> > > > 
> > > > Not really. We need a thundering herd to cause issues, and this
> > > > occurs after formatting an item so we won't get a huge thundering
> > > > herd even when lots of threads block on the xc_ctx_lock waiting for
> > > > a push to complete.
> > > > 
> > > 
> > > It would be nice to have some debug code somewhere that somehow or
> > > another asserts or warns if the CIL reservation exceeds some
> > > insane/unexpected heuristic based on the current size of the context. I
> > > don't know what that code or heuristic looks like (i.e. multiple factors
> > > of the ctx size?) so I'm obviously handwaving. Just something to think
> > > about if we can come up with a way to accomplish that opportunistically.
> > 
> > I don't think there is a reliable mechanism that can be used here.
> > At one end of the scale we have the valid case of a synchronous
> > inode modification on a log with a 256k stripe unit. So it's valid
> > to have a CIL reservation of ~550kB for a single item that consumes
> > ~700 bytes of log space.
> > 
> > OTOH, we might be freeing extents on a massively fragmented file and
> > filesystem, so we're pushing 200kB+ transactions into the CIL for
> > every rolling transaction. On a filesystem with a 512 byte log
> > sector size and no LSU, the CIL reservations are dwarfed by the
> > actual metadata being logged...
> > 
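(A rough, assumed decomposition - based on the unit reservation
roundoff rules, not measured from this workload - of where that ~550kB
goes:

	2 x 256k log stripe unit roundoff padding	~512kB
	iclog record headers + op header overhead	 the remainder
	logged item data				  ~700 bytes

i.e. nearly all of the reservation is worst-case padding and header
overhead, not item data.)
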
> > I'd suggest that looking at the ungrant trace for the CIL ticket
> > once it has committed will tell us exactly how much the reservation
> > was over-estimated, as the unused portion of the reservation will be
> > returned to the reserve grant head at this point in time.
> 
> Typical for this workload is a CIL ticket that looks like this at
> ungrant time:
> 
> t_curr_res 13408 t_unit_res 231100
> t_curr_res 9240 t_unit_res 140724
> t_curr_res 46284 t_unit_res 263964
> t_curr_res 29780 t_unit_res 190020
> t_curr_res 38044 t_unit_res 342016
> t_curr_res 21636 t_unit_res 321476
> t_curr_res 21576 t_unit_res 263964
> t_curr_res 42200 t_unit_res 411852
> t_curr_res 21636 t_unit_res 292720
> t_curr_res 62740 t_unit_res 514552
> t_curr_res 17456 t_unit_res 284504
> t_curr_res 29852 t_unit_res 411852
> t_curr_res 13384 t_unit_res 206452
> t_curr_res 70956 t_unit_res 518660
> t_curr_res 70908 t_unit_res 333800
> t_curr_res 50404 t_unit_res 518660
> t_curr_res 17480 t_unit_res 321476
> t_curr_res 33948 t_unit_res 436500
> t_curr_res 17492 t_unit_res 317368
> t_curr_res 50392 t_unit_res 489904
> t_curr_res 13360 t_unit_res 325584
> t_curr_res 66812 t_unit_res 506336
> t_curr_res 33924 t_unit_res 366664
> t_curr_res 70932 t_unit_res 551524
> t_curr_res 29852 t_unit_res 374880
> t_curr_res 25720 t_unit_res 494012
> t_curr_res 42152 t_unit_res 506336
> t_curr_res 21684 t_unit_res 543308
> t_curr_res 29840 t_unit_res 440608
> t_curr_res 46320 t_unit_res 551524
> t_curr_res 21624 t_unit_res 387204
> t_curr_res 29840 t_unit_res 522768
> 
> So we are looking at a unit reservation of up to ~550KB, and typically
> using all but a few tens of KB of it.
> 
> I'll use this as the ballpark for the lockless code.

And, yeah, just reserving a split res for everything blows it out by
a factor of 100.
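
As a back-of-the-envelope illustration (the numbers here are assumed
round figures, not measurements from this run): if a checkpoint
aggregates on the order of 10,000 transaction commits and every commit
unconditionally takes a ~4kB split reservation (iclog header plus op
header), that adds ~40MB to the CIL ticket, versus the ~200-550kB unit
reservations in the trace above - roughly two orders of magnitude of
over-reservation for a checkpoint that only crosses a handful of iclog
boundaries.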

I've worked out via experimentation and measurement that taking a
split reservation when the current transaction is near the iclog
threshold results in roughly doubling the above reservation, but I
haven't had it go anywhere near an underrun at this point....
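
A minimal sketch of what that heuristic might look like in the commit
path - purely illustrative, the helper name and the fuzz factor are
invented and this is not the actual patch:

static bool
xlog_cil_need_split_res(
	struct xlog		*log,
	struct xfs_cil_ctx	*ctx,
	int			len)
{
	int	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
	int	fuzz = iclog_space >> 3;	/* assumed slack factor */
	int	used = ctx->space_used;		/* approximate if per-cpu */

	/*
	 * Take a split reservation if adding this commit's items brings
	 * the checkpoint near or across an iclog boundary, erring on the
	 * side of reserving the extra header space.
	 */
	return (used + len + fuzz) / iclog_space != used / iclog_space;
}

Erring towards taking the split reservation like this is consistent
with the roughly 2x growth in unit reservation described above.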

I'll work with this heuristic for the moment, and spend some more
time thinking about stuff I've just learnt getting to this point.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 25+ messages
2020-05-12  9:28 [PATCH 0/5 v2] xfs: fix a couple of performance issues Dave Chinner
2020-05-12  9:28 ` [PATCH 1/5] xfs: separate read-only variables in struct xfs_mount Dave Chinner
2020-05-12 12:30   ` Brian Foster
2020-05-12 16:09     ` Darrick J. Wong
2020-05-12 21:43       ` Dave Chinner
2020-05-12 21:53     ` Dave Chinner
2020-05-12  9:28 ` [PATCH 2/5] xfs: convert m_active_trans counter to per-cpu Dave Chinner
2020-05-12 12:31   ` Brian Foster
2020-05-12  9:28 ` [PATCH 3/5] [RFC] xfs: use percpu counters for CIL context counters Dave Chinner
2020-05-12 14:05   ` Brian Foster
2020-05-12 23:36     ` Dave Chinner
2020-05-13 12:09       ` Brian Foster
2020-05-13 21:52         ` Dave Chinner
2020-05-14  1:50           ` Dave Chinner
2020-05-14  2:49             ` Dave Chinner [this message]
2020-05-14 13:43           ` Brian Foster
2020-05-12  9:28 ` [PATCH 4/5] [RFC] xfs: per-cpu CIL lists Dave Chinner
2020-05-13 17:02   ` Brian Foster
2020-05-13 23:33     ` Dave Chinner
2020-05-14 13:44       ` Brian Foster
2020-05-14 22:46         ` Dave Chinner
2020-05-15 17:26           ` Brian Foster
2020-05-18  0:30             ` Dave Chinner
2020-05-12  9:28 ` [PATCH 5/5] [RFC] xfs: make CIl busy extent lists per-cpu Dave Chinner
2020-05-12 10:25 ` [PATCH 0/5 v2] xfs: fix a couple of performance issues Dave Chinner
