* Performance regression with async inode inactivation
@ 2021-10-04 10:06 Jan Kara
  2021-10-04 18:17 ` Darrick J. Wong
  2021-10-04 21:15 ` Dave Chinner
  0 siblings, 2 replies; 12+ messages in thread
From: Jan Kara @ 2021-10-04 10:06 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong

Hello,

our performance testing grid has detected a performance regression caused
by commit ab23a77687 ("xfs: per-cpu deferred inode inactivation queues")
with reaim benchmark running 'disk' and 'disk-large' workloads. The
regression has been so far detected on two machines - marvin7 (48 cpus, 64
GB ram, SATA SSD), dobby (64 cpus, 192 GB ram, rotating disk behind
megaraid_sas controller).

The regression reports from mmtests on marvin7 (which from my experience I
find more reliable than dobby) look like:

reaim-io-disk-xfs
Hmean     disk-1       3597.12 (   0.00%)     3636.37 (   1.09%)
Hmean     disk-25    125000.00 (   0.00%)   121753.25 (  -2.60%)
Hmean     disk-49    218424.96 (   0.00%)   193421.05 * -11.45%*
Hmean     disk-73    286649.22 (   0.00%)   240131.58 * -16.23%*
Hmean     disk-97    349339.74 (   0.00%)   285854.62 * -18.17%*
Hmean     disk-121   373456.79 (   0.00%)   309199.32 * -17.21%*
Hmean     disk-145   399449.04 (   0.00%)   330547.11 * -17.25%*
Hmean     disk-169   420049.71 (   0.00%)   345132.74 * -17.84%*
Hmean     disk-193   458795.56 (   0.00%)   375243.03 * -18.21%*
Stddev    disk-1        102.32 (   0.00%)      117.93 ( -15.25%)
Stddev    disk-25      6981.08 (   0.00%)     4285.17 (  38.62%)
Stddev    disk-49      9523.63 (   0.00%)    10723.44 ( -12.60%)
Stddev    disk-73      9704.22 (   0.00%)     7946.73 (  18.11%)
Stddev    disk-97     10059.91 (   0.00%)     6111.28 (  39.25%)
Stddev    disk-121     5730.56 (   0.00%)    11714.34 (-104.42%)
Stddev    disk-145    11154.40 (   0.00%)     8129.06 (  27.12%)
Stddev    disk-169     4477.30 (   0.00%)     3558.86 (  20.51%)
Stddev    disk-193     8785.70 (   0.00%)    13258.89 ( -50.91%)

reaim-io-disk-large-xfs
Hmean     disk-1        722.72 (   0.00%)      721.85 (  -0.12%)
Hmean     disk-25     24177.95 (   0.00%)    24319.06 (   0.58%)
Hmean     disk-49     35294.12 (   0.00%)    34361.85 (  -2.64%)
Hmean     disk-73     43042.45 (   0.00%)    40896.36 *  -4.99%*
Hmean     disk-97     48403.19 (   0.00%)    46044.30 *  -4.87%*
Hmean     disk-121    52230.22 (   0.00%)    49347.47 *  -5.52%*
Hmean     disk-145    54613.94 (   0.00%)    52333.98 *  -4.17%*
Hmean     disk-169    57178.30 (   0.00%)    54745.71 *  -4.25%*
Hmean     disk-193    60230.94 (   0.00%)    57106.22 *  -5.19%*
Stddev    disk-1         18.74 (   0.00%)       30.19 ( -61.11%)
Stddev    disk-25       439.49 (   0.00%)      809.58 ( -84.21%)
Stddev    disk-49      1416.65 (   0.00%)      603.55 (  57.40%)
Stddev    disk-73       949.45 (   0.00%)      584.61 (  38.43%)
Stddev    disk-97       689.51 (   0.00%)      602.76 (  12.58%)
Stddev    disk-121      485.22 (   0.00%)      612.79 ( -26.29%)
Stddev    disk-145      147.37 (   0.00%)      442.99 (-200.60%)
Stddev    disk-169      282.25 (   0.00%)      613.42 (-117.33%)
Stddev    disk-193      970.05 (   0.00%)      572.59 (  40.97%)

Note that numbers behind dash (disk-xx) denote the number of reaim
"clients" - i.e., the number of processes reaim runs in parallel.

This reaim workload will create quite some small files, fsync them, do a
few operations on them like read, write, etc. and quickly delete them. This
happens in many processes in parallel so I can imagine the unlink workload
is rather heavy.

To reproduce the workload the easiest is probably to clone mmtests [1],
there's README.md and docs/Tutorial.md about how to run tests. Relevant
configurations are in configs/config-reaim-io-disk-large and
configs/config-reaim-io-disk, you will need to edit these files to set
appropriate test partition (will get wiped) and filesystem. Note that
mmtests do somewhat modify standard reaim benchmark so that it does not
call system("sync") after each operation (you can see modifications we
apply in [2] if you're interested).
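
For reference, we invoke it roughly like this (writing this from memory, so
please double-check the exact config variable names against your mmtests
checkout):

  # in configs/config-reaim-io-disk (or -large): point at a scratch device
  export TESTDISK_PARTITION=/dev/sdXN   # will be wiped
  export TESTDISK_FILESYSTEM=xfs

  $ ./run-mmtests.sh --config configs/config-reaim-io-disk <run-name>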

I can probably find some time to better understand what's going on on these
machines later this week but if you have some suggestions what to look for,
you're welcome.

								Honza

[1] https://github.com/gormanm/mmtests
[2] https://github.com/gormanm/mmtests/blob/master/shellpack_src/src/reaim/reaim-install

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Performance regression with async inode inactivation
  2021-10-04 10:06 Performance regression with async inode inactivation Jan Kara
@ 2021-10-04 18:17 ` Darrick J. Wong
  2021-10-04 21:15 ` Dave Chinner
  1 sibling, 0 replies; 12+ messages in thread
From: Darrick J. Wong @ 2021-10-04 18:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-xfs, Dave Chinner

On Mon, Oct 04, 2021 at 12:06:53PM +0200, Jan Kara wrote:
> Hello,
> 
> our performance testing grid has detected a performance regression caused
> by commit ab23a77687 ("xfs: per-cpu deferred inode inactivation queues")
> with reaim benchmark running 'disk' and 'disk-large' workloads. The
> regression has been so far detected on two machines - marvin7 (48 cpus, 64
> GB ram, SATA SSD), dobby (64 cpus, 192 GB ram, rotating disk behind
> megaraid_sas controller).

I'll think about this some more, but for now my immediate reaction is to
point out the previous discussion about aim7 reporting a similar
(percentage point) regression:

https://lore.kernel.org/lkml/20210909142355.GA6270@xsang-OptiPlex-9020/

--D

> 
> The regression reports from mmtests on marvin7 (which from my experience I
> find more reliable than dobby) look like:
> 
> reaim-io-disk-xfs
> Hmean     disk-1       3597.12 (   0.00%)     3636.37 (   1.09%)
> Hmean     disk-25    125000.00 (   0.00%)   121753.25 (  -2.60%)
> Hmean     disk-49    218424.96 (   0.00%)   193421.05 * -11.45%*
> Hmean     disk-73    286649.22 (   0.00%)   240131.58 * -16.23%*
> Hmean     disk-97    349339.74 (   0.00%)   285854.62 * -18.17%*
> Hmean     disk-121   373456.79 (   0.00%)   309199.32 * -17.21%*
> Hmean     disk-145   399449.04 (   0.00%)   330547.11 * -17.25%*
> Hmean     disk-169   420049.71 (   0.00%)   345132.74 * -17.84%*
> Hmean     disk-193   458795.56 (   0.00%)   375243.03 * -18.21%*
> Stddev    disk-1        102.32 (   0.00%)      117.93 ( -15.25%)
> Stddev    disk-25      6981.08 (   0.00%)     4285.17 (  38.62%)
> Stddev    disk-49      9523.63 (   0.00%)    10723.44 ( -12.60%)
> Stddev    disk-73      9704.22 (   0.00%)     7946.73 (  18.11%)
> Stddev    disk-97     10059.91 (   0.00%)     6111.28 (  39.25%)
> Stddev    disk-121     5730.56 (   0.00%)    11714.34 (-104.42%)
> Stddev    disk-145    11154.40 (   0.00%)     8129.06 (  27.12%)
> Stddev    disk-169     4477.30 (   0.00%)     3558.86 (  20.51%)
> Stddev    disk-193     8785.70 (   0.00%)    13258.89 ( -50.91%)
> 
> reaim-io-disk-large-xfs
> Hmean     disk-1        722.72 (   0.00%)      721.85 (  -0.12%)
> Hmean     disk-25     24177.95 (   0.00%)    24319.06 (   0.58%)
> Hmean     disk-49     35294.12 (   0.00%)    34361.85 (  -2.64%)
> Hmean     disk-73     43042.45 (   0.00%)    40896.36 *  -4.99%*
> Hmean     disk-97     48403.19 (   0.00%)    46044.30 *  -4.87%*
> Hmean     disk-121    52230.22 (   0.00%)    49347.47 *  -5.52%*
> Hmean     disk-145    54613.94 (   0.00%)    52333.98 *  -4.17%*
> Hmean     disk-169    57178.30 (   0.00%)    54745.71 *  -4.25%*
> Hmean     disk-193    60230.94 (   0.00%)    57106.22 *  -5.19%*
> Stddev    disk-1         18.74 (   0.00%)       30.19 ( -61.11%)
> Stddev    disk-25       439.49 (   0.00%)      809.58 ( -84.21%)
> Stddev    disk-49      1416.65 (   0.00%)      603.55 (  57.40%)
> Stddev    disk-73       949.45 (   0.00%)      584.61 (  38.43%)
> Stddev    disk-97       689.51 (   0.00%)      602.76 (  12.58%)
> Stddev    disk-121      485.22 (   0.00%)      612.79 ( -26.29%)
> Stddev    disk-145      147.37 (   0.00%)      442.99 (-200.60%)
> Stddev    disk-169      282.25 (   0.00%)      613.42 (-117.33%)
> Stddev    disk-193      970.05 (   0.00%)      572.59 (  40.97%)
> 
> Note that numbers behind dash (disk-xx) denote the number of reaim
> "clients" - i.e., the number of processes reaim runs in parallel.
> 
> This reaim workload will create quite some small files, fsync them, do a
> few operations on them like read, write, etc. and quickly delete them. This
> happens in many processes in parallel so I can imagine the unlink workload
> is rather heavy.
> 
> To reproduce the workload the easiest is probably to clone mmtests [1],
> there's README.md and docs/Tutorial.md about how to run tests. Relevant
> configurations are in configs/config-reaim-io-disk-large and
> configs/config-reaim-io-disk, you will need to edit these files to set
> appropriate test partition (will get wiped) and filesystem. Note that
> mmtests do somewhat modify standard reaim benchmark so that it does not
> call system("sync") after each operation (you can see modifications we
> apply in [2] if you're interested).
> 
> I can probably find some time to better understand what's going on on these
> machines later this week but if you have some suggestions what to look for,
> you're welcome.
> 
> 								Honza
> 
> [1] https://github.com/gormanm/mmtests
> [2] https://github.com/gormanm/mmtests/blob/master/shellpack_src/src/reaim/reaim-install
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


* Re: Performance regression with async inode inactivation
  2021-10-04 10:06 Performance regression with async inode inactivation Jan Kara
  2021-10-04 18:17 ` Darrick J. Wong
@ 2021-10-04 21:15 ` Dave Chinner
  2021-10-05  8:11   ` Jan Kara
  1 sibling, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2021-10-04 21:15 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-xfs, Dave Chinner, Darrick J. Wong

On Mon, Oct 04, 2021 at 12:06:53PM +0200, Jan Kara wrote:
> Hello,
> 
> our performance testing grid has detected a performance regression caused
> by commit ab23a77687 ("xfs: per-cpu deferred inode inactivation queues")
> with reaim benchmark running 'disk' and 'disk-large' workloads. The
> regression has been so far detected on two machines - marvin7 (48 cpus, 64
> GB ram, SATA SSD), dobby (64 cpus, 192 GB ram, rotating disk behind
> megaraid_sas controller).

Yup, large cpu count, single slow disk, and the cause will likely be
exclusive rwsem lock contention on a directory inode that concurrent
openat and unlink are occurring in.

Basically, that commit removed a bunch of userspace overhead in
unlinks, which means they run as fast as the unlink() call can remove
the directory entry. There is effectively nothing throttling
unlink() in XFS now except for available log space and it mostly
runs to completion without blocking. Hence the front end unlink
performance can run in much faster bursts before delayed
inactivation needs to run.

Given most of the added CPU overhead is in the rwsem spin_on_owner
path, it implies that the write lock holder is, indeed, not sleeping
with the lock held. Hence reaim is hitting a borderline contended
rwsem much harder and with different behaviour, resulting in
catastrophic breakdown of lock performance and hence unlink
performance goes backwards.

I can't see any other new sleeping lock contention in the workload
profiles - the context switch rate goes down substantially (by 35%!)
with commit ab23a77687, which also implies that the lock contention
is resulting in much longer spin and/or sleep times on the lock.

I'm not sure we can do anything about this in the filesystem. The
contended lock is a core, high level VFS lock which is the first
point of unlinkat() syscall serialisation. This is the lock that is
directly exposed to userspace concurrency, so the scalability of
this lock determines concurrency performance of the userspace
application.....

> The regression reports from mmtests on marvin7 (which from my experience I
> find more reliable than dobby) look like:
> 
> reaim-io-disk-xfs
> Hmean     disk-1       3597.12 (   0.00%)     3636.37 (   1.09%)
> Hmean     disk-25    125000.00 (   0.00%)   121753.25 (  -2.60%)
> Hmean     disk-49    218424.96 (   0.00%)   193421.05 * -11.45%*
> Hmean     disk-73    286649.22 (   0.00%)   240131.58 * -16.23%*
> Hmean     disk-97    349339.74 (   0.00%)   285854.62 * -18.17%*
> Hmean     disk-121   373456.79 (   0.00%)   309199.32 * -17.21%*
> Hmean     disk-145   399449.04 (   0.00%)   330547.11 * -17.25%*
> Hmean     disk-169   420049.71 (   0.00%)   345132.74 * -17.84%*
> Hmean     disk-193   458795.56 (   0.00%)   375243.03 * -18.21%*
> Stddev    disk-1        102.32 (   0.00%)      117.93 ( -15.25%)
> Stddev    disk-25      6981.08 (   0.00%)     4285.17 (  38.62%)
> Stddev    disk-49      9523.63 (   0.00%)    10723.44 ( -12.60%)
> Stddev    disk-73      9704.22 (   0.00%)     7946.73 (  18.11%)
> Stddev    disk-97     10059.91 (   0.00%)     6111.28 (  39.25%)
> Stddev    disk-121     5730.56 (   0.00%)    11714.34 (-104.42%)
> Stddev    disk-145    11154.40 (   0.00%)     8129.06 (  27.12%)
> Stddev    disk-169     4477.30 (   0.00%)     3558.86 (  20.51%)
> Stddev    disk-193     8785.70 (   0.00%)    13258.89 ( -50.91%)
> 
> reaim-io-disk-large-xfs
> Hmean     disk-1        722.72 (   0.00%)      721.85 (  -0.12%)
> Hmean     disk-25     24177.95 (   0.00%)    24319.06 (   0.58%)
> Hmean     disk-49     35294.12 (   0.00%)    34361.85 (  -2.64%)
> Hmean     disk-73     43042.45 (   0.00%)    40896.36 *  -4.99%*
> Hmean     disk-97     48403.19 (   0.00%)    46044.30 *  -4.87%*
> Hmean     disk-121    52230.22 (   0.00%)    49347.47 *  -5.52%*
> Hmean     disk-145    54613.94 (   0.00%)    52333.98 *  -4.17%*
> Hmean     disk-169    57178.30 (   0.00%)    54745.71 *  -4.25%*
> Hmean     disk-193    60230.94 (   0.00%)    57106.22 *  -5.19%*
> Stddev    disk-1         18.74 (   0.00%)       30.19 ( -61.11%)
> Stddev    disk-25       439.49 (   0.00%)      809.58 ( -84.21%)
> Stddev    disk-49      1416.65 (   0.00%)      603.55 (  57.40%)
> Stddev    disk-73       949.45 (   0.00%)      584.61 (  38.43%)
> Stddev    disk-97       689.51 (   0.00%)      602.76 (  12.58%)
> Stddev    disk-121      485.22 (   0.00%)      612.79 ( -26.29%)
> Stddev    disk-145      147.37 (   0.00%)      442.99 (-200.60%)
> Stddev    disk-169      282.25 (   0.00%)      613.42 (-117.33%)
> Stddev    disk-193      970.05 (   0.00%)      572.59 (  40.97%)
> 
> Note that numbers behind dash (disk-xx) denote the number of reaim
> "clients" - i.e., the number of processes reaim runs in parallel.
> 
> This reaim workload will create quite some small files, fsync them, do a
> few operations on them like read, write, etc. and quickly delete them. This
> happens in many processes in parallel so I can imagine the unlink workload
> is rather heavy.

Yup, which is why I've largely been ignoring it. reaim is a *very
old* benchmark that doesn't do anything we'd normally do to scale
out independent workloads effectively.

> To reproduce the workload the easiest is probably to clone mmtests [1],
> there's README.md and docs/Tutorial.md about how to run tests. Relevant
> configurations are in configs/config-reaim-io-disk-large and
> configs/config-reaim-io-disk, you will need to edit these files to set
> appropriate test partition (will get wiped) and filesystem. Note that
> mmtests do somewhat modify standard reaim benchmark so that it does not
> call system("sync") after each operation (you can see modifications we
> apply in [2] if you're interested).
> 
> I can probably find some time to better understand what's going on on these
> machines later this week but if you have some suggestions what to look for,
> you're welcome.

You should be able to reproduce the concurrent unlink contention
simply by running a bunch of concurrent unlinks in a directory and
watching the directory i_rwsem get absolutely hammered up in
do_unlinkat().
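
Something like this quick sketch should do it (untested, off the top
of my head - every process creates and removes its own file name, but
they all share the same parent directory):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
	/* usage: unlink-storm <nr processes> <existing dir on the test fs> */
	int procs = (argc > 2) ? atoi(argv[1]) : 48;
	const char *dir = (argc > 2) ? argv[2] : ".";
	int i, j;

	for (i = 0; i < procs; i++) {
		if (fork() == 0) {
			char name[4096];

			snprintf(name, sizeof(name), "%s/f-%d", dir, i);
			for (j = 0; j < 10000; j++) {
				int fd = open(name, O_CREAT | O_WRONLY, 0644);

				if (fd < 0)
					exit(1);
				close(fd);
				/* create and unlink both take the parent's i_rwsem */
				unlink(name);
			}
			exit(0);
		}
	}
	for (i = 0; i < procs; i++)
		wait(NULL);
	return 0;
}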

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Performance regression with async inode inactivation
  2021-10-04 21:15 ` Dave Chinner
@ 2021-10-05  8:11   ` Jan Kara
  2021-10-05 21:26     ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2021-10-05  8:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-xfs, Dave Chinner, Darrick J. Wong

On Tue 05-10-21 08:15:08, Dave Chinner wrote:
> On Mon, Oct 04, 2021 at 12:06:53PM +0200, Jan Kara wrote:
> > Hello,
> > 
> > our performance testing grid has detected a performance regression caused
> > by commit ab23a77687 ("xfs: per-cpu deferred inode inactivation queues")
> > with reaim benchmark running 'disk' and 'disk-large' workloads. The
> > regression has been so far detected on two machines - marvin7 (48 cpus, 64
> > GB ram, SATA SSD), dobby (64 cpus, 192 GB ram, rotating disk behind
> > megaraid_sas controller).
> 
> Yup, large cpu count, single slow disk, and the cause will likely be
> exclusive rwsem lock contention on a directory inode that concurrent
> openat and unlink are occurring in.
> 
> Basically, that commit removed a bunch of userspace overhead in
> unlinks, which means they run as fast as the unlink() call can remove
> the directory entry. There is effectively nothing throttling
> unlink() in XFS now except for available log space and it mostly
> runs to completion without blocking. Hence the front end unlink
> performance can run in much faster bursts before delayed
> inactivation needs to run.
> 
> Given most of the added CPU overhead is in the rwsem spin_on_owner
> path, it implies that the write lock holder is, indeed, not sleeping
> with the lock held. Hence reaim is hitting a borderline contended
> rwsem much harder and with different behaviour, resulting in
> catastrophic breakdown of lock performance and hence unlink
> performance goes backwards.
> 
> I can't see any other new sleeping lock contention in the workload
> profiles - the context switch rate goes down substantially (by 35%!)
> with commit ab23a77687, which also implies that the lock contention
> is resulting in much longer spin and/or sleep times on the lock.
> 
> I'm not sure we can do anything about this in the filesystem. The
> contended lock is a core, high level VFS lock which is the first
> point of unlinkat() syscall serialisation. This is the lock that is
> directly exposed to userspace concurrency, so the scalability of
> this lock determines concurrency performance of the userspace
> application.....

Thanks for explanation! It makes sense, except one difference I can see in
vmstat on both marvin7 and dobby which I don't understand:

Dobby:
Ops Sector Reads                     1009081.00     1009081.00
Ops Sector Writes                   11550795.00    18753764.00

Marvin7:
Ops Sector Reads                      887951.00      887951.00
Ops Sector Writes                    8248822.00    11135086.00

So after the change reaim ends up doing noticeably more writes. I had a
look at iostat comparison as well but there wasn't anything particular
standing out besides higher amount of writes on the test disk. I guess,
I'll limit the number of clients to a single number showing the regression,
enable some more detailed monitoring and see whether something interesting
pops up.

> > The regression reports from mmtests on marvin7 (which from my experience I
> > find more reliable than dobby) look like:
> > 
> > reaim-io-disk-xfs
> > Hmean     disk-1       3597.12 (   0.00%)     3636.37 (   1.09%)
> > Hmean     disk-25    125000.00 (   0.00%)   121753.25 (  -2.60%)
> > Hmean     disk-49    218424.96 (   0.00%)   193421.05 * -11.45%*
> > Hmean     disk-73    286649.22 (   0.00%)   240131.58 * -16.23%*
> > Hmean     disk-97    349339.74 (   0.00%)   285854.62 * -18.17%*
> > Hmean     disk-121   373456.79 (   0.00%)   309199.32 * -17.21%*
> > Hmean     disk-145   399449.04 (   0.00%)   330547.11 * -17.25%*
> > Hmean     disk-169   420049.71 (   0.00%)   345132.74 * -17.84%*
> > Hmean     disk-193   458795.56 (   0.00%)   375243.03 * -18.21%*
> > Stddev    disk-1        102.32 (   0.00%)      117.93 ( -15.25%)
> > Stddev    disk-25      6981.08 (   0.00%)     4285.17 (  38.62%)
> > Stddev    disk-49      9523.63 (   0.00%)    10723.44 ( -12.60%)
> > Stddev    disk-73      9704.22 (   0.00%)     7946.73 (  18.11%)
> > Stddev    disk-97     10059.91 (   0.00%)     6111.28 (  39.25%)
> > Stddev    disk-121     5730.56 (   0.00%)    11714.34 (-104.42%)
> > Stddev    disk-145    11154.40 (   0.00%)     8129.06 (  27.12%)
> > Stddev    disk-169     4477.30 (   0.00%)     3558.86 (  20.51%)
> > Stddev    disk-193     8785.70 (   0.00%)    13258.89 ( -50.91%)
> > 
> > reaim-io-disk-large-xfs
> > Hmean     disk-1        722.72 (   0.00%)      721.85 (  -0.12%)
> > Hmean     disk-25     24177.95 (   0.00%)    24319.06 (   0.58%)
> > Hmean     disk-49     35294.12 (   0.00%)    34361.85 (  -2.64%)
> > Hmean     disk-73     43042.45 (   0.00%)    40896.36 *  -4.99%*
> > Hmean     disk-97     48403.19 (   0.00%)    46044.30 *  -4.87%*
> > Hmean     disk-121    52230.22 (   0.00%)    49347.47 *  -5.52%*
> > Hmean     disk-145    54613.94 (   0.00%)    52333.98 *  -4.17%*
> > Hmean     disk-169    57178.30 (   0.00%)    54745.71 *  -4.25%*
> > Hmean     disk-193    60230.94 (   0.00%)    57106.22 *  -5.19%*
> > Stddev    disk-1         18.74 (   0.00%)       30.19 ( -61.11%)
> > Stddev    disk-25       439.49 (   0.00%)      809.58 ( -84.21%)
> > Stddev    disk-49      1416.65 (   0.00%)      603.55 (  57.40%)
> > Stddev    disk-73       949.45 (   0.00%)      584.61 (  38.43%)
> > Stddev    disk-97       689.51 (   0.00%)      602.76 (  12.58%)
> > Stddev    disk-121      485.22 (   0.00%)      612.79 ( -26.29%)
> > Stddev    disk-145      147.37 (   0.00%)      442.99 (-200.60%)
> > Stddev    disk-169      282.25 (   0.00%)      613.42 (-117.33%)
> > Stddev    disk-193      970.05 (   0.00%)      572.59 (  40.97%)
> > 
> > Note that numbers behind dash (disk-xx) denote the number of reaim
> > "clients" - i.e., the number of processes reaim runs in parallel.
> > 
> > This reaim workload will create quite some small files, fsync them, do a
> > few operations on them like read, write, etc. and quickly delete them. This
> > happens in many processes in parallel so I can imagine the unlink workload
> > is rather heavy.
> 
> Yup, which is why I've largely been ignoring it. reaim is a *very
> old* benchmark that doesn't do anything we'd normally do to scale
> out independent workloads effectively.

Yes, I agree reaim is a stupid workload by today's standards so I'm not
much worried about it as such. Just occasionally it tends to highlight
problems that are valid also for more relevant workloads (just less visible
there) so we still run it.

> > To reproduce the workload the easiest is probably to clone mmtests [1],
> > there's README.md and docs/Tutorial.md about how to run tests. Relevant
> > configurations are in configs/config-reaim-io-disk-large and
> > configs/config-reaim-io-disk, you will need to edit these files to set
> > appropriate test partition (will get wiped) and filesystem. Note that
> > mmtests do somewhat modify standard reaim benchmark so that it does not
> > call system("sync") after each operation (you can see modifications we
> > apply in [2] if you're interested).
> > 
> > I can probably find some time to better understand what's going on on these
> > machines later this week but if you have some suggestions what to look for,
> > you're welcome.
> 
> You should be able to reproduce the concurrent unlink contention
> simply by running a bunch of concurrent unlinks in a directory and
> watching the directory i_rwsem get absolutely hammered up in
> do_unlinkat().

OK, thanks for suggestion. I can try that as well.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Performance regression with async inode inactivation
  2021-10-05  8:11   ` Jan Kara
@ 2021-10-05 21:26     ` Dave Chinner
  2021-10-06 18:10       ` Jan Kara
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2021-10-05 21:26 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-xfs, Dave Chinner, Darrick J. Wong

On Tue, Oct 05, 2021 at 10:11:57AM +0200, Jan Kara wrote:
> On Tue 05-10-21 08:15:08, Dave Chinner wrote:
> > On Mon, Oct 04, 2021 at 12:06:53PM +0200, Jan Kara wrote:
> > > Hello,
> > > 
> > > our performance testing grid has detected a performance regression caused
> > > by commit ab23a77687 ("xfs: per-cpu deferred inode inactivation queues")
> > > with reaim benchmark running 'disk' and 'disk-large' workloads. The
> > > regression has been so far detected on two machines - marvin7 (48 cpus, 64
> > > GB ram, SATA SSD), dobby (64 cpus, 192 GB ram, rotating disk behind
> > > megaraid_sas controller).
> > 
> > Yup, large cpu count, single slow disk, and the cause will likely be
> > exclusive rwsem lock contention on a directory inode that concurrent
> > openat and unlink are occurring in.
> > 
> > Basically, that commit removed a bunch of userspace overhead in
> > unlinks, which means they run as fast as the unlink() call can remove
> > the directory entry. There is effectively nothing throttling
> > unlink() in XFS now except for available log space and it mostly
> > runs to completion without blocking. Hence the front end unlink
> > performance can run in much faster bursts before delayed
> > inactivation needs to run.
> > 
> > Given most of the added CPU overhead is in the rwsem spin_on_owner
> > path, it implies that the write lock holder is, indeed, not sleeping
> > with the lock held. Hence reaim is hitting a borderline contended
> > rwsem much harder and with different behaviour, resulting in
> > catastrophic breakdown of lock performance and hence unlink
> > performance goes backwards.
> > 
> > I can't see any other new sleeping lock contention in the workload
> > profiles - the context switch rate goes down substantially (by 35%!)
> > with commit ab23a77687, which also implies that the lock contention
> > is resulting in much longer spin and/or sleep times on the lock.
> > 
> > I'm not sure we can do anything about this in the filesystem. The
> > contended lock is a core, high level VFS lock which is the first
> > point of unlinkat() syscall serialisation. This is the lock that is
> > directly exposed to userspace concurrency, so the scalability of
> > this lock determines concurrency performance of the userspace
> > application.....
> 
> Thanks for explanation! It makes sense, except one difference I can see in
> vmstat on both marvin7 and dobby which I don't understand:
> 
> Dobby:
> Ops Sector Reads                     1009081.00     1009081.00
> Ops Sector Writes                   11550795.00    18753764.00
> 
> Marvin7:
> Ops Sector Reads                      887951.00      887951.00
> Ops Sector Writes                    8248822.00    11135086.00
> 
> So after the change reaim ends up doing noticeably more writes. I had a
> look at iostat comparison as well but there wasn't anything particular
> standing out besides higher amount of writes on the test disk. I guess,
> I'll limit the number of clients to a single number showing the regression,
> enable some more detailed monitoring and see whether something interesting
> pops up.

Interesting.

There weren't iostats in the original intel profiles given. I
can see a couple of vmstats that give some indications -
vmstat.io.bo went up from ~2500 to ~6000, and proc-vmstat.pgpgout
went up from ~90k to 250k.

Looking at another more recent profile, there are more IO related
stats in the output: vmstat.nr_written went up by 2.5x and
vmstat.pgpgout went up by a factor of 6 (50k -> 300k) but otherwise
everything else was fairly constant in the VM. The resident size of
the file cache is small, and vmstat.nr_dirtied went up by a small
amount but it's 4 orders of magnitude larger than nr_written.

Hmmm. That implies a *lot* of overwrite of cached files.

I wonder if we've just changed the memory pressure enough to trigger
more frequent writeback? We're delaying the inactivation (and hence
page cache invalidation) of up to 256 inodes per CPU, and the number
of cached+dirty inodes appears to have increased a small amount
(from ~3000 to ~4000). With slow disks, a small change in writeback
behaviour could cause seek-bound related performance regressions.

Also worth noting is that there's been some recent variance in reaim
numbers because of the journal FUA/flush optimisations
we've made.  Some machines report +20% from that change, some report
-20%, and there's no pattern to it. It's just another indication
that the reaim scalability and perf on these large CPU count, single
spinning disk setups is highly dependent on disk performance and
seek optimisation...

Have you run any tests on a system that isn't stupidly overpowered
for its disk subsystem? e.g. has an SSD rather than spinning rust?

> > > The regression reports from mmtests on marvin7 (which from my experience I
> > > find more reliable than dobby) look like:
> > > 
> > > reaim-io-disk-xfs
> > > Hmean     disk-1       3597.12 (   0.00%)     3636.37 (   1.09%)
> > > Hmean     disk-25    125000.00 (   0.00%)   121753.25 (  -2.60%)
> > > Hmean     disk-49    218424.96 (   0.00%)   193421.05 * -11.45%*
> > > Hmean     disk-73    286649.22 (   0.00%)   240131.58 * -16.23%*
> > > Hmean     disk-97    349339.74 (   0.00%)   285854.62 * -18.17%*
> > > Hmean     disk-121   373456.79 (   0.00%)   309199.32 * -17.21%*
> > > Hmean     disk-145   399449.04 (   0.00%)   330547.11 * -17.25%*
> > > Hmean     disk-169   420049.71 (   0.00%)   345132.74 * -17.84%*
> > > Hmean     disk-193   458795.56 (   0.00%)   375243.03 * -18.21%*
> > > Stddev    disk-1        102.32 (   0.00%)      117.93 ( -15.25%)
> > > Stddev    disk-25      6981.08 (   0.00%)     4285.17 (  38.62%)
> > > Stddev    disk-49      9523.63 (   0.00%)    10723.44 ( -12.60%)
> > > Stddev    disk-73      9704.22 (   0.00%)     7946.73 (  18.11%)
> > > Stddev    disk-97     10059.91 (   0.00%)     6111.28 (  39.25%)
> > > Stddev    disk-121     5730.56 (   0.00%)    11714.34 (-104.42%)
> > > Stddev    disk-145    11154.40 (   0.00%)     8129.06 (  27.12%)
> > > Stddev    disk-169     4477.30 (   0.00%)     3558.86 (  20.51%)
> > > Stddev    disk-193     8785.70 (   0.00%)    13258.89 ( -50.91%)
> > > 
> > > reaim-io-disk-large-xfs
> > > Hmean     disk-1        722.72 (   0.00%)      721.85 (  -0.12%)
> > > Hmean     disk-25     24177.95 (   0.00%)    24319.06 (   0.58%)
> > > Hmean     disk-49     35294.12 (   0.00%)    34361.85 (  -2.64%)
> > > Hmean     disk-73     43042.45 (   0.00%)    40896.36 *  -4.99%*
> > > Hmean     disk-97     48403.19 (   0.00%)    46044.30 *  -4.87%*
> > > Hmean     disk-121    52230.22 (   0.00%)    49347.47 *  -5.52%*
> > > Hmean     disk-145    54613.94 (   0.00%)    52333.98 *  -4.17%*
> > > Hmean     disk-169    57178.30 (   0.00%)    54745.71 *  -4.25%*
> > > Hmean     disk-193    60230.94 (   0.00%)    57106.22 *  -5.19%*
> > > Stddev    disk-1         18.74 (   0.00%)       30.19 ( -61.11%)
> > > Stddev    disk-25       439.49 (   0.00%)      809.58 ( -84.21%)
> > > Stddev    disk-49      1416.65 (   0.00%)      603.55 (  57.40%)
> > > Stddev    disk-73       949.45 (   0.00%)      584.61 (  38.43%)
> > > Stddev    disk-97       689.51 (   0.00%)      602.76 (  12.58%)
> > > Stddev    disk-121      485.22 (   0.00%)      612.79 ( -26.29%)
> > > Stddev    disk-145      147.37 (   0.00%)      442.99 (-200.60%)
> > > Stddev    disk-169      282.25 (   0.00%)      613.42 (-117.33%)
> > > Stddev    disk-193      970.05 (   0.00%)      572.59 (  40.97%)
> > > 
> > > Note that numbers behind dash (disk-xx) denote the number of reaim
> > > "clients" - i.e., the number of processes reaim runs in parallel.
> > > 
> > > This reaim workload will create quite some small files, fsync them, do a
> > > few operations on them like read, write, etc. and quickly delete them. This
> > > happens in many processes in parallel so I can imagine the unlink workload
> > > is rather heavy.
> > 
> > Yup, which is why I've largely been ignoring it. reaim is a *very
> > old* benchmark that doesn't do anything we'd normally do to scale
> > out independent workloads effectively.
> 
> Yes, I agree reaim is a stupid workload by today's standards so I'm not
> much worried about it as such. Just occasionally it tends to highlight
> problems that are valid also for more relevant workloads (just less visible
> there) so we still run it.

*nod*

It has worked well as a canary in past times, but recently I'm
finding we're getting more noise than useful signal from these
unbalanced "all CPU, no IO" test machine configs these days....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Performance regression with async inode inactivation
  2021-10-05 21:26     ` Dave Chinner
@ 2021-10-06 18:10       ` Jan Kara
  2021-10-06 21:58         ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2021-10-06 18:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-xfs, Dave Chinner, Darrick J. Wong

[-- Attachment #1: Type: text/plain, Size: 6023 bytes --]

On Wed 06-10-21 08:26:08, Dave Chinner wrote:
> On Tue, Oct 05, 2021 at 10:11:57AM +0200, Jan Kara wrote:
> > On Tue 05-10-21 08:15:08, Dave Chinner wrote:
> > > On Mon, Oct 04, 2021 at 12:06:53PM +0200, Jan Kara wrote:
> > > > Hello,
> > > > 
> > > > our performance testing grid has detected a performance regression caused
> > > > by commit ab23a77687 ("xfs: per-cpu deferred inode inactivation queues")
> > > > with reaim benchmark running 'disk' and 'disk-large' workloads. The
> > > > regression has been so far detected on two machines - marvin7 (48 cpus, 64
> > > > GB ram, SATA SSD), dobby (64 cpus, 192 GB ram, rotating disk behind
> > > > megaraid_sas controller).
> > > 
> > > Yup, large cpu count, single slow disk, and the cause will likely be
> > > exclusive rwsem lock contention on a directory inode that concurrent
> > > openat and unlink are occurring in.
> > > 
> > > Basically, that commit removed a bunch of userspace overhead in
> > > unlinks, which means they run as fast as the unlink() call can remove
> > > the directory entry. There is effectively nothing throttling
> > > unlink() in XFS now except for available log space and it mostly
> > > runs to completion without blocking. Hence the front end unlink
> > > performance can run in much faster bursts before delayed
> > > inactivation needs to run.
> > > 
> > > Given most of the added CPU overhead is in the rwsem spin_on_owner
> > > path, it implies that the write lock holder is, indeed, not sleeping
> > > with the lock held. Hence reaim is hitting a borderline contended
> > > rwsem much harder and with different behaviour, resulting in
> > > catastrophic breakdown of lock performance and hence unlink
> > > performance goes backwards.
> > > 
> > > I can't see any other new sleeping lock contention in the workload
> > > profiles - the context switch rate goes down substantially (by 35%!)
> > > with commit ab23a77687, which also implies that the lock contention
> > > is resulting in much longer spin and/or sleep times on the lock.
> > > 
> > > I'm not sure we can do anything about this in the filesystem. The
> > > contended lock is a core, high level VFS lock which is the first
> > > point of unlinkat() syscall serialisation. This is the lock that is
> > > directly exposed to userspace concurrency, so the scalability of
> > > this lock determines concurrency performance of the userspace
> > > application.....
> > 
> > Thanks for explanation! It makes sense, except one difference I can see in
> > vmstat on both marvin7 and dobby which I don't understand:
> > 
> > Dobby:
> > Ops Sector Reads                     1009081.00     1009081.00
> > Ops Sector Writes                   11550795.00    18753764.00
> > 
> > Marvin7:
> > Ops Sector Reads                      887951.00      887951.00
> > Ops Sector Writes                    8248822.00    11135086.00
> > 
> > So after the change reaim ends up doing noticeably more writes. I had a
> > look at iostat comparison as well but there wasn't anything particular
> > standing out besides higher amount of writes on the test disk. I guess,
> > I'll limit the number of clients to a single number showing the regression,
> > enable some more detailed monitoring and see whether something interesting
> > pops up.
> 
> Interesting.
> 
> There weren't iostats in the original intel profiles given. I
> can see a couple of vmstats that give some indications -
> vmstat.io.bo went up from ~2500 to ~6000, and proc-vmstat.pgpgout
> went up from ~90k to 250k.
> 
> Looking at another more recent profile, there are more IO related
> stats in the output vmstat.nr_written went up by 2.5x and
> vmstat.pgpgout went up by a factor of 6 (50k -> 300k) but otherwise
> everything else was fairly constant in the VM. The resident size of
> the file cache is small, and vmstat.nr_dirtied went up by a small
> ammount by it's 4 orders of magnitude larger than nr_written.
> 
> Hmmm. That implies a *lot* of overwrite of cached files.
> 
> I wonder if we've just changed the memory pressure enough to trigger
> more frequent writeback? We're delaying the inactivation (and hence
> page cache invalidation) of up to 256 inodes per CPU, and the number
> of cached+dirty inodes appears to have increased a small amount
> (from ~3000 to ~4000). With slow disks, a small change in writeback
> behaviour could cause seek-bound related performance regressions.
> 
> Also worth noting is that there's been some recent variance in reaim
> numbers recently because of the journal FUA/flush optimisations
> we've made.  Some machines report +20% from that change, some report
> -20%, and there's no pattern to it. It's just another indication
> that the reaim scalability and perf on these large CPU count, single
> spinning disk setups is highly dependent on disk performance and
> seek optimisation...
> 
> Have you run any tests on a system that isn't stupidly overpowered
> for its disk subsystem? e.g. has an SSD rather than spinning rust?

So marvin7 actually has an SSD. I was experimenting some more. Attached is
a simple reproducer that demonstrates the issue for me - it just creates a
16k file, fsyncs it and deletes it in a loop from a given number of
processes (I run with 48). The reproducer runs ~25% slower after commit
ab23a77687. Note that the reproducer makes each process use a different
directory so i_rwsem contention is out of the question.

From blktrace I can see that indeed after the commit we do ~25% more
writes.  Each stress-unlink process does the same amount of IO, the extra
IO comes solely from the worker threads. Also I'd note that before the
commit we were reusing blocks much more (likely inode blocks getting
reused) - before the commit we write to ~88 MB worth of distinct disk
blocks, after the commit we write to ~296 MB worth of distinct disk blocks.
That's understandable given the delayed inode freeing, but it's one thing
that could possibly have an effect. That's how far I've come today; I'll
dig more tomorrow.
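
(To build and run the attached reproducer: "gcc -O2 -o stress-unlink
stress-unlink.c -lrt" (-lrt is needed for shm_open() on older glibc),
then e.g. "./stress-unlink 48 /mnt/test1". It prints the wall-clock
time of the create/fsync/unlink phase.)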

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: stress-unlink.c --]
[-- Type: text/x-c, Size: 2897 bytes --]

#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/time.h>

#define COUNT 100
#define SIZE 16384
#define MAX_PROCS 1024

char shm_name[64];
char wbuf[SIZE];
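
/*
 * Each worker gets its own directory <base>/dir-<num>, so parent directory
 * i_rwsem contention is not a factor.  The shared memory segment acts as a
 * start gate: slot 0 releases all workers at once, and slot num+1 is set by
 * worker 'num' once it is ready to run.
 */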

void prepare(char *base, int num)
{
	char dir[128];

	sprintf(dir, "%s/dir-%d", base, num);
	if (mkdir(dir, 0700) < 0 && errno != EEXIST) {
		perror("mkdir");
		exit(1);
	}
}

void teardown(char *base, int num)
{
	char dir[128];

	sprintf(dir, "%s/dir-%d", base, num);
	rmdir(dir);
}

void run_test(char *base, int num)
{
	char name[128];
	int shmfd;
	int i;
	int *shm;
	int fd;

	sprintf(name, "%s/dir-%d/file", base, num);
	shmfd = shm_open(shm_name, O_RDWR, 0);
	if (shmfd < 0) {
		perror("shm_open");
		exit(1);
	}
	shm = mmap(NULL, sizeof(int) * MAX_PROCS, PROT_READ | PROT_WRITE,
		   MAP_SHARED, shmfd, 0);
	if (shm == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	/* Report readiness, then wait for the parent to open the start gate. */
	shm[num + 1] = 1;
	while (shm[0] == 0)
		usleep(1);

	/* The measured loop: create, write SIZE bytes, fsync, close, unlink. */
	for (i = 0; i < COUNT; i++) {
		fd = open(name, O_CREAT | O_WRONLY, 0644);
		if (fd < 0) {
			perror("open");
			exit(1);
		}
		if (write(fd, wbuf, SIZE) != SIZE) {
			perror("pwrite");
			exit(1);
		}
		fsync(fd);
		close(fd);
		unlink(name);
	}
}

int main(int argc, char **argv)
{
	int procs, i, j;
	pid_t pids[MAX_PROCS];
	int shmfd;
	int *shm;
	struct timeval start, end;
	long long ms;

	if (argc != 3) {
		fprintf(stderr, "Usage: stress-unlink <processes> <dir>\n");
		return 1;
	}
	procs = strtol(argv[1], NULL, 10);
	if (procs >= MAX_PROCS) {	/* shm[] has MAX_PROCS slots; workers use 1..procs */
		fprintf(stderr, "Too many processes!\n");
		return 1;
	}

	memset(wbuf, 0xcc, SIZE);
	sprintf(shm_name, "/stress-unlink-%u", getpid());
	shmfd = shm_open(shm_name, O_CREAT | O_RDWR, 0600);
	if (shmfd < 0) {
		perror("shm_open");
		return 1;
	}
	if (ftruncate(shmfd, sizeof(int) * MAX_PROCS) < 0) {
		perror("ftruncate shm");
		return 1;
	}
	shm = mmap(NULL, sizeof(int) * MAX_PROCS, PROT_READ | PROT_WRITE,
		   MAP_SHARED, shmfd, 0);
	if (shm == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	
	shm[0] = 0;
	for (i = 0; i < procs; i++) {
		shm[i+1] = 0;
		prepare(argv[2], i);
	}

	for (i = 0; i < procs; i++) {
		pids[i] = fork();
		if (pids[i] < 0) {
			perror("fork");
			for (j = 0; j < i; j++)
				kill(pids[j], SIGKILL);
			return 1;
		}
		if (pids[i] == 0) {
			run_test(argv[2], i);
			exit(0);
		}
	}

	/* Wait until every worker has reported in via its shm slot... */
	do {
		for (i = 0; i < procs && shm[i + 1]; i++);
	} while (i != procs);
	/* ...then start the clock and release them all at once. */
	gettimeofday(&start, NULL);
	shm[0] = 1;
	fprintf(stderr, "Processes started.\n");

	for (i = 0; i < procs; i++)
		waitpid(pids[i], NULL, 0);
	gettimeofday(&end, NULL);
	for (i = 0; i < procs; i++)
		teardown(argv[2], i);
	shm_unlink(shm_name);
	ms = (((long long)(end.tv_sec - start.tv_sec) * 1000000) +
		(end.tv_usec - start.tv_usec)) / 1000;
	printf("%lld.%03lld\n", ms/1000, ms%1000);
	
	return 0;
}


* Re: Performance regression with async inode inactivation
  2021-10-06 18:10       ` Jan Kara
@ 2021-10-06 21:58         ` Dave Chinner
  2021-10-07 12:03           ` Jan Kara
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2021-10-06 21:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-xfs, Dave Chinner, Darrick J. Wong

On Wed, Oct 06, 2021 at 08:10:01PM +0200, Jan Kara wrote:
> On Wed 06-10-21 08:26:08, Dave Chinner wrote:
> > On Tue, Oct 05, 2021 at 10:11:57AM +0200, Jan Kara wrote:
> > > On Tue 05-10-21 08:15:08, Dave Chinner wrote:
> > > > On Mon, Oct 04, 2021 at 12:06:53PM +0200, Jan Kara wrote:
> > > > > Hello,
> > > > > 
> > > > > our performance testing grid has detected a performance regression caused
> > > > > by commit ab23a77687 ("xfs: per-cpu deferred inode inactivation queues")
> > > > > with reaim benchmark running 'disk' and 'disk-large' workloads. The
> > > > > regression has been so far detected on two machines - marvin7 (48 cpus, 64
> > > > > GB ram, SATA SSD), dobby (64 cpus, 192 GB ram, rotating disk behind
> > > > > megaraid_sas controller).
> > > > 
> > > > Yup, large cpu count, single slow disk, and the cause will likely be
> > > > exclusive rwsem lock contention on a directory inode that concurrent
> > > > openat and unlink are occurring in.
> > > > 
> > > > Basically, that commit removed a bunch of userspace overhead in
> > > > unlinks, which means they run as fast as the unlink() call can remove
> > > > the directory entry. There is effectively nothing throttling
> > > > unlink() in XFS now except for available log space and it mostly
> > > > runs to completion without blocking. Hence the front end unlink
> > > > performance can run in much faster bursts before delayed
> > > > inactivation needs to run.
> > > > 
> > > > Given most of the added CPU overhead is in the rwsem spin_on_owner
> > > > path, it implies that the write lock holder is, indeed, not sleeping
> > > > with the lock held. Hence reaim is hitting a borderline contended
> > > > rwsem much harder and with different behaviour, resulting in
> > > > catastrophic breakdown of lock performance and hence unlink
> > > > performance goes backwards.
> > > > 
> > > > I can't see any other new sleeping lock contention in the workload
> > > > profiles - the context switch rate goes down substantially (by 35%!)
> > > > with commit ab23a77687, which also implies that the lock contention
> > > > is resulting in much longer spin and/or sleep times on the lock.
> > > > 
> > > > I'm not sure we can do anything about this in the filesystem. The
> > > > contended lock is a core, high level VFS lock which is the first
> > > > point of unlinkat() syscall serialisation. This is the lock that is
> > > > directly exposed to userspace concurrency, so the scalability of
> > > > this lock determines concurrency performance of the userspace
> > > > application.....
> > > 
> > > Thanks for explanation! It makes sense, except one difference I can see in
> > > vmstat on both marvin7 and dobby which I don't understand:
> > > 
> > > Dobby:
> > > Ops Sector Reads                     1009081.00     1009081.00
> > > Ops Sector Writes                   11550795.00    18753764.00
> > > 
> > > Marvin7:
> > > Ops Sector Reads                      887951.00      887951.00
> > > Ops Sector Writes                    8248822.00    11135086.00
> > > 
> > > So after the change reaim ends up doing noticeably more writes. I had a
> > > look at iostat comparison as well but there wasn't anything particular
> > > standing out besides higher amount of writes on the test disk. I guess,
> > > I'll limit the number of clients to a single number showing the regression,
> > > enable some more detailed monitoring and see whether something interesting
> > > pops up.
> > 
> > Interesting.
> > 
> > There weren't iostats in the original intel profiles given. I
> > can see a couple of vmstats that give some indications -
> > vmstat.io.bo went up from ~2500 to ~6000, and proc-vmstat.pgpgout
> > went up from ~90k to 250k.
> > 
> > Looking at another more recent profile, there are more IO related
> > stats in the output: vmstat.nr_written went up by 2.5x and
> > vmstat.pgpgout went up by a factor of 6 (50k -> 300k) but otherwise
> > everything else was fairly constant in the VM. The resident size of
> > the file cache is small, and vmstat.nr_dirtied went up by a small
> > amount but it's 4 orders of magnitude larger than nr_written.
> > 
> > Hmmm. That implies a *lot* of overwrite of cached files.
> > 
> > I wonder if we've just changed the memory pressure enough to trigger
> > more frequent writeback? We're delaying the inactivation (and hence
> > page cache invalidation) of up to 256 inodes per CPU, and the number
> > of cached+dirty inodes appears to have increased a small amount
> > (from ~3000 to ~4000). With slow disks, a small change in writeback
> > behaviour could cause seek-bound related performance regressions.
> > 
> > Also worth noting is that there's been some recent variance in reaim
> > numbers because of the journal FUA/flush optimisations
> > we've made.  Some machines report +20% from that change, some report
> > -20%, and there's no pattern to it. It's just another indication
> > that the reaim scalability and perf on these large CPU count, single
> > spinning disk setups is highly dependent on disk performance and
> > seek optimisation...
> > 
> > Have you run any tests on a system that isn't stupidly overpowered
> > for its disk subsystem? e.g. has an SSD rather than spinning rust?
> 
> So marvin7 actually has an SSD. I was experimenting some more. Attached is
> a simple reproducer that demonstrates the issue for me - it just creates a
> 16k file, fsyncs it and deletes it in a loop from a given number of
> processes (I run with 48). The reproducer runs ~25% slower after commit
> ab23a77687. Note that the reproducer makes each process use a different
> directory so i_rwsem contention is out of the question.
> 
> From blktrace I can see that indeed after the commit we do ~25% more
> writes.  Each stress-unlink process does the same amount of IO, the extra
> IO comes solely from the worker threads.

So the question is "which worker thread?".

> Also I'd note that before the
> commit we were reusing blocks much more (likely inode blocks getting
> reused) - before the commit we write to ~88 MB worth of distinct disk
> blocks, after the commit we write to ~296 MB worth of distinct disk blocks.

Hmmm, I didn't see this at first.  What's the filesystem layout
(xfs_info) and the CLI for the test that you ran? How many CPUs was
the test run on?

Running a test with 100 procs across 32p, so creating 100 dirs and
10000 files, on a 1.4TB SSD:

$ xfs_info /mnt/scratch
meta-data=/dev/mapper/fast       isize=512    agcount=67, agsize=5467072 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=366292480, imaxpct=5
         =                       sunit=64     swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=512000, version=2
         =                       sectsz=512   sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ ./stress-unlink 100 /mnt/scratch
0.301
$


The total number of inode cluster writes is:

$ pminfo -f xfs.icluster_flushcnt

xfs.icluster_flushcnt
    value 68

That's 68 * 32kB or about 2MB.

[pminfo is part of Performance Co-Pilot - https://pcp.io/]

We did this many user writes:

$ pminfo -f xfs.write xfs.write_bytes

xfs.write
    value 10000

    xfs.write_bytes
        value 163840000

which is 160MB in 10000 writes.

Metadata writeback was:

$ pminfo -f xfs.log_tail.push_ail.success 

xfs.log_tail.push_ail.success
    value 473

473 buffers, 68 of which were inode clusters, so the total is
about ~400 x 4kB + 2MB = 3.6MB.

So, really, there isn't a potential for 25% growth in these numbers
that would make any sort of difference. The only remaining source of
IO differential is log writes:

$ pminfo -f xfs.log

xfs.log.writes
    value 1055

    xfs.log.blocks
        value 28212
....

Which indicates another ~1000 IOs and ~28MB written.

And that's it. We've got a total of about 195MB written to disk for
100 concurrent runs, and it's no different before and after deferred
inactivation. Other tests that I commonly run that do lots of file
creates and unlinks along with sync writes (e.g. dbench) didn't show
any regressions up to 512 concurrent processes, either. So there's
no obvious regression on this filesystem layout....

But it made me wonder - there are two things that could influence
background inactivation here: the AG count (which determines unlink
concurrency) and log size (which determines transaction
concurrency). So I remade the filesystem with a tiny 32MB log and 2
AGs, and that changed a -lot-:

xfs.icluster_flushcnt
    value 40420

xfs.log_tail.push_ail.success
    value 46098

Yeah, we did 40,000 inode cluster writes for only 10000 inode
modifications. Basically, we wrote every inode cluster once for
every inode modification (post create, post write, post unlink, post
deferred inactivation). In a normal create/unlink loop the inode doesn't
even touch the disk - it's cancelled in the log before it gets
flushed to disk and so this is where a chunk of the difference in
disk space consumed comes from. It can be expected behaviour.

xfs.log_tail.push_ail.pinned
    value 84381

And we hit pinned items in the AIL 85,000 times instead of 0. This
triggers more log forces, but fsync is already doing that so it's
not a big deal. However:

xfs.log_tail.try_logspace
    value 80392

xfs.log_tail.sleep_logspace
     value 59938

75% of the operations had to wait on log space, which means it's
forcing the log tail to flush metadata to make space. That's where
all the inode cluster writes are coming from - the transactions are
lock stepping on log space. i.e. every time we want to modify an
inode, we have to flush a dirty inode to make space in the log.

This also means that the CIL can't perform efficient in-memory
transaction aggregation because it's always being forced out to
disk. As a result:

xfs.log.writes
    value 6504

    xfs.log.blocks
        value 65784

6x as many log writes, for 65MB of log writes. IOWs, the smaller log
reduced CIL aggregation efficiency substantially and so we wrote
twice as much to the log just because we had a small log.

So, the question needs to be asked: is this a function of a small
log, or is it the behaviour of a low AG count?

Increasing AG count back out to > CPU count resulted in the number
of inode cluster flushes dropping to ~3000, and the log writes
dropping back down to 1000 and 28MB of log writes. Performance went
way up, too (from 1.8s down to 0.35s), indicating that AG
concurrency is a factor here.

OTOH, leaving the AG count at 2 and increasing the log back out to
2GB removed all the log space waiting, all the inode cluster
flushing, and everything to do with waiting on log space. But
performance barely changed (1.8s down to 1.5s) and log bandwidth
*went up*:

xfs.log.writes
    value 6528

xfs.log.blocks
    value 135452

6,500 log writes, 130MB written to the log. IOWs, both too-small log
space and too-low AG count for the given workload concurrency will
adversely affect performance of concurrent workloads.

But we already knew that, didn't we? :/

Bumping the AG count from 2 to 16 and using a moderately sized log
(i.e. > 100MB) largely erases the bad behaviour. Indeed, agcount=16,
log size=100MB drops runtime to 0.30s and pretty much returns inode
write count and log write/bw back down to minimums on this 100
process workload.

But to point out that this isn't perfect for everything, if I run
the same test with 1000 processes on that config, it takes 6.2s,
writes 400MB to the log and flushes 153,000 inode clusters. We're
back to not having enough log space for the workload. Bumping out to
agcount=67 (2x CPU count) and max log size(2GB) results in runtime
of 3.5s (largely linear from the 100 process count), no inode
writeback and only 200MB of log throughput.

So, yeah, I suspect that if you change the mkfs parameters for the
reaim tests to have more AGs and significantly increased log space
over the defaults, the regression should disappear...
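
Something along these lines (both are standard mkfs.xfs options;
adjust the device to suit):

$ mkfs.xfs -f -d agcount=16 -l size=100m /dev/<testdev>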

Can you run these experiments with reaim on your test machines and
see if they are influenced by filesystem level concurrency
parameters like AG count and log size as these tests suggest?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Performance regression with async inode inactivation
  2021-10-06 21:58         ` Dave Chinner
@ 2021-10-07 12:03           ` Jan Kara
  2021-10-07 23:44             ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2021-10-07 12:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-xfs, Dave Chinner, Darrick J. Wong

[-- Attachment #1: Type: text/plain, Size: 16977 bytes --]

On Thu 07-10-21 08:58:51, Dave Chinner wrote:
> On Wed, Oct 06, 2021 at 08:10:01PM +0200, Jan Kara wrote:
> > On Wed 06-10-21 08:26:08, Dave Chinner wrote:
> > > On Tue, Oct 05, 2021 at 10:11:57AM +0200, Jan Kara wrote:
> > > > On Tue 05-10-21 08:15:08, Dave Chinner wrote:
> > > > > On Mon, Oct 04, 2021 at 12:06:53PM +0200, Jan Kara wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > our performance testing grid has detected a performance regression caused
> > > > > > by commit ab23a77687 ("xfs: per-cpu deferred inode inactivation queues")
> > > > > > with reaim benchmark running 'disk' and 'disk-large' workloads. The
> > > > > > regression has been so far detected on two machines - marvin7 (48 cpus, 64
> > > > > > GB ram, SATA SSD), dobby (64 cpus, 192 GB ram, rotating disk behind
> > > > > > megaraid_sas controller).
> > > > > 
> > > > > Yup, large cpu count, single slow disk, and the cause will likely be
> > > > > exclusive rwsem lock contention on a directory inode that concurrent
> > > > > openat and unlink are occurring in.
> > > > > 
> > > > > Basically, that commit removed a bunch of userspace overhead in
> > > > > unlinks, which means they run as fast as the unlink() call can remove
> > > > > the directory entry. There is effectively nothing throttling
> > > > > unlink() in XFS now except for available log space and it mostly
> > > > > runs to completion without blocking. Hence the front end unlink
> > > > > performance can run in much faster bursts before delayed
> > > > > inactivation needs to run.
> > > > > 
> > > > > Given most of the added CPU overhead is in the rwsem spin_on_owner
> > > > > path, it implies that the write lock holder is, indeed, not sleeping
> > > > > with the lock held. Hence reaim is hitting a borderline contended
> > > > > rwsem much harder and with different behaviour, resulting in
> > > > > catastrophic breakdown of lock performance and hence unlink
> > > > > performance goes backwards.
> > > > > 
> > > > > I can't see any other new sleeping lock contention in the workload
> > > > > profiles - the context switch rate goes down substantially (by 35%!)
> > > > > with commit ab23a77687, which also implies that the lock contention
> > > > > is resulting in much longer spin and/or sleep times on the lock.
> > > > > 
> > > > > I'm not sure we can do anything about this in the filesystem. The
> > > > > contended lock is a core, high level VFS lock which is the first
> > > > > point of unlinkat() syscall serialisation. This is the lock that is
> > > > > directly exposed to userspace concurrency, so the scalability of
> > > > > this lock determines concurrency performance of the userspace
> > > > > application.....
> > > > 
> > > > Thanks for explanation! It makes sense, except one difference I can see in
> > > > vmstat on both marvin7 and dobby which I don't understand:
> > > > 
> > > > Dobby:
> > > > Ops Sector Reads                     1009081.00     1009081.00
> > > > Ops Sector Writes                   11550795.00    18753764.00
> > > > 
> > > > Marvin7:
> > > > Ops Sector Reads                      887951.00      887951.00
> > > > Ops Sector Writes                    8248822.00    11135086.00
> > > > 
> > > > So after the change reaim ends up doing noticeably more writes. I had a
> > > > look at iostat comparison as well but there wasn't anything particular
> > > > standing out besides higher amount of writes on the test disk. I guess,
> > > > I'll limit the number of clients to a single number showing the regression,
> > > > enable some more detailed monitoring and see whether something interesting
> > > > pops up.
> > > 
> > > Interesting.
> > > 
> > > There weren't iostats in the original intel profiles given. I
> > > can see a couple of vmstats that give some indications -
> > > vmstat.io.bo went up from ~2500 to ~6000, and proc-vmstat.pgpgout
> > > went up from ~90k to 250k.
> > > 
> > > Looking at another more recent profile, there are more IO related
> > > stats in the output: vmstat.nr_written went up by 2.5x and
> > > vmstat.pgpgout went up by a factor of 6 (50k -> 300k) but otherwise
> > > everything else was fairly constant in the VM. The resident size of
> > > the file cache is small, and vmstat.nr_dirtied went up by a small
> > > amount but it's 4 orders of magnitude larger than nr_written.
> > > 
> > > Hmmm. That implies a *lot* of overwrite of cached files.
> > > 
> > > I wonder if we've just changed the memory pressure enough to trigger
> > > more frequent writeback? We're delaying the inactivation (and hence
> > > page cache invalidation) of up to 256 inodes per CPU, and the number
> > > of cached+dirty inodes appears to have increased a small amount
> > > (from ~3000 to ~4000). With slow disks, a small change in writeback
> > > behaviour could cause seek-bound related performance regressions.
> > > 
> > > Also worth noting is that there's been some recent variance in reaim
> > > numbers recently because of the journal FUA/flush optimisations
> > > we've made.  Some machines report +20% from that change, some report
> > > -20%, and there's no pattern to it. It's just another indication
> > > that the reaim scalability and perf on these large CPU count, single
> > > spinning disk setups is highly dependent on disk performance and
> > > seek optimisation...
> > > 
> > > Have you run any tests on a system that isn't stupidly overpowered
> > > for its disk subsystem? e.g. has an SSD rather than spinning rust?
> > 
> > So marvin7 actually has SSD. I was experimenting some more. Attached is a
> > simple reproducer that demonstrates the issue for me - it just creates 16k
> > file, fsync it, delete it in a loop from given number processes (I run with
> > 48). The reproducer runs ~25% slower after the commit ab23a77687. Note that
> > the reproducer makes each process use a different directory so i_rwsem
> > contention is out of question.
> > 
> > From blktrace I can see that indeed after the commit we do ~25% more
> > writes.  Each stress-unlink process does the same amount of IO, the extra
> > IO comes solely from the worker threads.
> 
> So the question is "which worker thread?".

Yes, I'm also wondering ;) But it's impossible to tell from block traces
(there's just "kworker/xx:x" there).

> > Also I'd note that before the
> > commit we were reusing blocks much more (likely inode blocks getting
> > reused) - before the commit we write to ~88 MB worth of distinct disk
> > blocks, after the commit we write to ~296 MB worth of distinct disk blocks.
> 
> Hmmm, I didn't see this at first.  What's the filesystem layout
> (xfs_info) and the CLI for the test that you ran? How many CPUs was
> the test run on?

The machine has 24 cores, each core has 2 SMT siblings, so 48 logical CPUs.
That's why I've run stress-unlink with 48 processes.

xfs_info is:

meta-data=/dev/sdb1              isize=512    agcount=4, agsize=29303104 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=117212416, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=57232, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

These are whatever the XFS defaults are (with somewhat dated xfsprogs on this
machine - 4.15.0); I didn't pass any special parameters to mkfs.xfs.
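
For reference, each worker process in stress-unlink essentially runs the loop
below (a minimal sketch of the attached reproducer - argument handling, error
paths and the per-process directory setup are simplified):

/*
 * One stress-unlink worker: create a 16k file in its own directory,
 * fsync it, then unlink it, in a loop.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *dir = argc > 1 ? argv[1] : ".";
	int loops = argc > 2 ? atoi(argv[2]) : 1000;
	char buf[16384];
	char path[256];
	int i, fd;

	memset(buf, 0xab, sizeof(buf));
	snprintf(path, sizeof(path), "%s/testfile", dir);

	for (i = 0; i < loops; i++) {
		fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			perror("write");
			close(fd);
			return 1;
		}
		fsync(fd);	/* sync the data before unlinking */
		close(fd);
		unlink(path);
	}
	return 0;
}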

> Running a test with 100 procs across 32p, so creating 100 dirs and
> 10000 files, on a 1.4TB SSD:
> 
> $ xfs_info /mnt/scratch
> meta-data=/dev/mapper/fast       isize=512    agcount=67, agsize=5467072 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=1
> data     =                       bsize=4096   blocks=366292480, imaxpct=5
>          =                       sunit=64     swidth=256 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=512000, version=2
>          =                       sectsz=512   sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> $ ./stress-unlink 100 /mnt/scratch
> 0.301
> $
> 
> 
> The total number of inode cluster writes is:
> 
> $ pminfo -f xfs.icluster_flushcnt
> 
> xfs.icluster_flushcnt
>     value 68
> 
> That's 68 * 32kB or about 2MB.
> 
> [pminfo is part of Performance Co-Pilot - https://pcp.io/]
> 
> We did this many user writes:
> 
> $ pminfo -f xfs.write xfs.write_bytes
> 
> xfs.write
>     value 10000
> 
>     xfs.write_bytes
>         value 163840000
> 
> which is 160MB in 10000 writes.
> 
> Metadata writeback was:
> 
> $ pminfo -f xfs.log_tail.push_ail.success 
> 
> xfs.log_tail.push_ail.success
>     value 473
> 
> 473 buffers, part of which was 68 inode clusters so the total is
> about ~400 x 4kB + 2MB = 3.6MB.
> 
> So, really, there isn't a potential for 25% growth in these numbers
> that would make any sort of difference. The only remaining source of
> IO differential is log writes:
> 
> $ pminfo -f xfs.log
> 
> xfs.log.writes
>     value 1055
> 
>     xfs.log.blocks
>         value 28212
> ....
> 
> Which indicates another ~1000 IOs and ~28MB written.
> 
> And that's it. We've got a total of about 195MB written to disk for
> 100 concurrent runs, and it's no different before and after deferred
> inactivation. Other tests that I commonly run that do lots of file
> creates and unlinks along with sync writes (e.g. dbench) didn't show
> any regressions up to 512 concurrent processes, either. So there's
> no obvious regression on this filesystem layout....
> 
> But it made me wonder - there are two things that could influence
> background inactivation here: the AG count (which determines unlink
> concurrency) and log size (which determines transaction
> concurrency). So I remade the filesystem with a tiny 32MB log and 2
> AGs, and that changed a -lot-:
> 
> xfs.icluster_flushcnt
>     value 40420
> 
> xfs.log_tail.push_ail.success
>     value 46098
> 
> Yeah, we did 40,000 inode cluster writes for only 10000 inode
> modifications. Basically, we wrote every inode cluster once for
> every inode modification (post create, post write, post unlink, post
> deferred inactivation). Normally, in create/unlink loops the inode doesn't
> even touch the disk - it's cancelled in the log before it gets
> flushed to disk and so this is where a chunk of the difference in
> disk space consumed comes from. It can be expected behaviour.
> 
> xfs.log_tail.push_ail.pinned
>     value 84381
> 
> And we hit pinned items in the AIL 85,000 times instead of 0. This
> triggers more log forces, but fsync is already doing that so it's
> not a big deal. However:
> 
> xfs.log_tail.try_logspace
>     value 80392
> 
> xfs.log_tail.sleep_logspace
>      value 59938
> 
> 75% of the operations had to wait on log space, which means it's
> forcing the log tail to flush metadata to make space. That's where
> all the inode cluster writes are coming from - the transactions are
> lock stepping on log space. i.e. every time we want to modify an
> inode, we have to flush a dirty inode to make space in the log.
> 
> This also means that the CIL can't perform efficient in-memory
> transaction aggregation because it's always being forced out to
> disk. As a result:
> 
> xfs.log.writes
>     value 6504
> 
>     xfs.log.blocks
>         value 65784
> 
> 6x as many log writes, for 65MB of log writes. IOWs, the smaller log
> reduced CIL aggregation efficiency substantially and so we wrote
> twice as much to the log just because we had a small log.
> 
> So, the question needs to be asked: is this a function of a small
> log, or is this behaviour of low AG count?
> 
> Increasing AG count back out to > CPU count resulted in the number
> of inode cluster flushes dropping to ~3000, and the log writes
> dropping back down to 1000 and 28MB of log writes. Performance went
> way up, too (from 1.8s down to 0.35s), indicating that AG
> concurrency is a factor here.
> 
> OTOH, leaving the AG count at 2 and increasing the log back out to
> 2GB removed all the log space waiting, all the inode cluster
> flushing, and everything to do with waiting on log space. But
> performance barely changed (1.8s down to 1.5s) and log bandwidth
> *went up*:
> 
> xfs.log.writes
>     value 6528
> 
> xfs.log.blocks
>     value 135452
> 
> 6,500 log writes, 130MB written to the log. IOWs, both too-small log
> space and too-low AG count for the given workload concurrency will
> adversely affect performance of concurrent workloads.
> 
> But we already knew that, didn't we? :/
> 
> Bumping the AG count from 2 to 16 and using a moderately sized log
> (i.e. > 100MB) largely erases the bad behaviour. Indeed, agcount=16,
> log size=100MB drops runtime to 0.30s and pretty much returns inode
> write count and log write/bw back down to minimums on this 100
> process workload.
> 
> But to point out that this isn't perfect for everything, if I run
> the same test with 1000 processes on that config, it takes 6.2s,
> writes 400MB to the log and flushes 153,000 inode clusters. We're
> back to not having enough log space for the workload. Bumping out to
> agcount=67 (2x CPU count) and max log size(2GB) results in runtime
> of 3.5s (largely linear from the 100 process count), no inode
> writeback and only 200MB of log throughput.
> 
> So, yeah, I suspect that if you change the mkfs parameters for the
> reaim tests to have more AGs and significantly increased log space
> over the defaults, the regression should disappear...
> 
> Can you run these experiments with reaim on your test machines and
> see if they are influenced by filesystem level concurrency
> parameters like AG count and log size as these tests suggest?

Thanks for the detailed analysis and suggestions. So here are some numbers
from my end. First, note that I've bumped the number of unlinks each process
does from 100 to 1000 so that the runtime on the test machine gets above 1s to
avoid some random noise. The relative slowdown (25%) didn't change (I had
already done this yesterday but sent the stress-unlink version without the
bumped-up loop count). Exact times to complete stress-unlink are (from 5 runs):

(default mkfs params - agcount=4, logsize=223MB)
	AVG	STDDEV
before	2.7952	0.022886
after	3.4746	0.025303

I also did runs with the AG count increased to 96 and the log size increased
to 512 MB. The results are somewhat surprising:

(agcount=96, logsize=512MB)
	AVG	STDDEV
before	3.1340	0.044168
after	3.3612	0.048412
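
(In relative terms: 'before' went from 2.7952 to 3.1340, i.e. ~12% slower with
the extra AGs and bigger log, while 'after' went from 3.4746 to 3.3612, i.e.
~3% faster.)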

So the mkfs parameter change did somewhat help the case with deferred
inactivation, but it significantly hurt the kernel before deferred
inactivation. Overall we are still far from the original performance.

I had a look at xfs stats (full stats are attached for 4 different configs
- (before / after) * (defaults / high ag count)) and logging stats are
clearly different:

defaults-before:
log 4700 157590 0 48585 71854
defaults-after:
log 10293 536643 6 48007 53762
highag-before:
log 5933 233065 0 48000 58240
highag-after:
log 6789 303249 0 48155 53795

We can see big differences in the amount of log writes as well as logged
blocks. defaults-before logged 76 MB, defaults-after logged 262 MB,
highag-before logged 113 MB, highag-after logged 148 MB. Given data writes
for this test are 750 MB (16k * 48 processes * 1000 loops), the difference
of 186 MB of log IO matches well with the observed difference in the amount
of writes in block traces.
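(The MB figures assume the second field of the 'log' line counts 512-byte
basic blocks: 157590 * 512 bytes is ~76 MiB, 536643 * 512 bytes is ~262 MiB,
and so on.)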

I'm not sure why the amount of logged blocks differs so much.  I didn't find
a big difference between the various configs in the push_ail stats.  However I
did find notable differences in various btree stats:

defaults-before:
abtb2 99580 228864 15129 15129 0 0 0 0 0 0 0 0 0 0 49616
abtc2 229903 464130 96526 96526 0 0 0 0 0 0 0 0 0 0 299055
ibt2 96099 96096 3 3 0 0 0 0 0 0 0 0 0 0 0
fibt2 96099 96096 3 3 0 0 0 0 0 0 0 0 0 0 0

defaults-after:
abtb2 95532 400144 36596 36441 0 0 0 0 0 0 0 0 0 0 1421782
abtc2 157421 667273 61135 60980 0 0 0 0 0 0 0 0 0 0 1935106
ibt2 95420 220423 51 34 0 0 0 0 0 0 0 0 0 0 1
fibt2 131505 230219 7708 7691 0 0 0 0 0 0 0 0 0 0 40470

highag-before:
abtb2 120143 240191 24047 24047 0 0 0 0 0 0 0 0 0 0 0
abtc2 288334 456240 120143 120143 0 0 0 0 0 0 0 0 0 0 24051
ibt2 96143 96096 47 47 0 0 0 0 0 0 0 0 0 0 0
fibt2 96143 96096 47 47 0 0 0 0 0 0 0 0 0 0 0

highag-after:
abtb2 96903 205361 20137 20117 0 0 0 0 0 0 0 0 0 0 101850
abtc2 211742 433347 81617 81597 0 0 0 0 0 0 0 0 0 0 274068
ibt2 96083 96035 48 47 0 0 0 0 0 0 0 0 0 0 0
fibt2 96083 96035 48 47 0 0 0 0 0 0 0 0 0 0 0

So we seem to be shuffling btrees more in the slower configurations.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: unlink-before-stats-defaults.txt --]
[-- Type: text/plain, Size: 764 bytes --]

extent_alloc 48004 192025 48004 192025
abt 0 0 0 0
blk_map 48073 96001 48001 96001 48001 288076 0
bmbt 0 0 0 0
dir 96 48048 48048 0
trans 0 384280 0
ig 48048 47952 2 96 0 0 0
log 4700 157590 0 48585 71854
push_ail 384280 0 289 8 0 101 0 18 0 1
xstrat 48000 0
rw 48000 0
attr 0 0 0 0
icluster 0 4 20
vnodes 97 0 0 0 48048 48048 48048 0
buf 768796 24 768772 138553 41272 24 0 20 17
abtb2 99580 228864 15129 15129 0 0 0 0 0 0 0 0 0 0 49616
abtc2 229903 464130 96526 96526 0 0 0 0 0 0 0 0 0 0 299055
bmbt2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ibt2 96099 96096 3 3 0 0 0 0 0 0 0 0 0 0 0
fibt2 96099 96096 3 3 0 0 0 0 0 0 0 0 0 0 0
rmapbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
refcntbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
qm 0 0 0 0 0 0 0 0 0
xpc 786432000 786432000 0
defer_relog 0
debug 0

[-- Attachment #3: unlink-after-stats-defaults.txt --]
[-- Type: text/plain, Size: 796 bytes --]

extent_alloc 48052 192409 47308 189365
abt 0 0 0 0
blk_map 48073 96001 47274 96001 47274 287349 0
bmbt 0 0 0 0
dir 96 48048 48048 0
trans 0 381555 0
ig 48048 46076 0 1972 0 1245 0
log 10293 536643 6 48007 53762
push_ail 381555 0 413 68 0 36 0 1028 0 6
xstrat 48000 0
rw 48000 0
attr 0 0 0 0
icluster 0 36 1123
vnodes 728 0 0 0 48048 48048 48048 0
buf 846870 121 846749 89673 27864 121 0 117 18
abtb2 95532 400144 36596 36441 0 0 0 0 0 0 0 0 0 0 1421782
abtc2 157421 667273 61135 60980 0 0 0 0 0 0 0 0 0 0 1935106
bmbt2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ibt2 95420 220423 51 34 0 0 0 0 0 0 0 0 0 0 1
fibt2 131505 230219 7708 7691 0 0 0 0 0 0 0 0 0 0 40470
rmapbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
refcntbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
qm 0 0 0 0 0 0 0 0 0
xpc 786432000 786432000 0
defer_relog 0
debug 0

[-- Attachment #4: unlink-before-stats-highag2.txt --]
[-- Type: text/plain, Size: 755 bytes --]

extent_alloc 48048 192377 48048 192377
abt 0 0 0 0
blk_map 48073 96001 48001 96001 48001 288076 0
bmbt 0 0 0 0
dir 96 48048 48048 0
trans 0 384677 0
ig 48048 47952 0 96 0 0 0
log 5933 233065 0 48000 58240
push_ail 384677 0 0 0 0 0 0 0 0 0
xstrat 48000 0
rw 48000 0
attr 0 0 0 0
icluster 0 0 0
vnodes 97 0 0 0 48048 48048 48048 0
buf 769219 288 768931 0 0 288 0 240 193
abtb2 120143 240191 24047 24047 0 0 0 0 0 0 0 0 0 0 0
abtc2 288334 456240 120143 120143 0 0 0 0 0 0 0 0 0 0 24051
bmbt2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ibt2 96143 96096 47 47 0 0 0 0 0 0 0 0 0 0 0
fibt2 96143 96096 47 47 0 0 0 0 0 0 0 0 0 0 0
rmapbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
refcntbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
qm 0 0 0 0 0 0 0 0 0
xpc 786432000 786432000 0
defer_relog 0
debug 0

[-- Attachment #5: unlink-after-stats-highag2.txt --]
[-- Type: text/plain, Size: 775 bytes --]

extent_alloc 48049 192385 47987 192133
abt 0 0 0 0
blk_map 48050 96001 47940 96001 47940 287992 0
bmbt 0 0 0 0
dir 48 48048 48048 0
trans 0 384389 0
ig 48048 46316 0 1732 0 1671 0
log 6789 303249 0 48155 53795
push_ail 384389 0 132 489 0 4 0 1633 0 1
xstrat 48000 0
rw 48000 0
attr 0 0 0 0
icluster 0 95 1716
vnodes 764 0 0 0 48048 48048 48048 0
buf 771055 97 770958 1339 250 97 0 97 0
abtb2 96903 205361 20137 20117 0 0 0 0 0 0 0 0 0 0 101850
abtc2 211742 433347 81617 81597 0 0 0 0 0 0 0 0 0 0 274068
bmbt2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ibt2 96083 96035 48 47 0 0 0 0 0 0 0 0 0 0 0
fibt2 96083 96035 48 47 0 0 0 0 0 0 0 0 0 0 0
rmapbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
refcntbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
qm 0 0 0 0 0 0 0 0 0
xpc 786432000 786432000 0
defer_relog 0
debug 0

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Performance regression with async inode inactivation
  2021-10-07 12:03           ` Jan Kara
@ 2021-10-07 23:44             ` Dave Chinner
  2021-10-12 13:42               ` Jan Kara
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2021-10-07 23:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-xfs, Dave Chinner, Darrick J. Wong

On Thu, Oct 07, 2021 at 02:03:57PM +0200, Jan Kara wrote:
> On Thu 07-10-21 08:58:51, Dave Chinner wrote:
> > On Wed, Oct 06, 2021 at 08:10:01PM +0200, Jan Kara wrote:
> > Hmmm, I didn't see this at first.  What's the filesystem layout
> > (xfs_info) and the CLI for the test that you ran? How many CPUs was
> > the test run on?
> 
> The machine has 24 cores, each core has 2 SMT siblings, so 48 logical CPUs.
> That's why I've run stress-unlink with 48 processes.
> 
> xfs_info is:
> 
> meta-data=/dev/sdb1              isize=512    agcount=4, agsize=29303104 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=0, rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=117212416, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=57232, version=2

OK, default log is ~230MB, fs size is ~400GB?

> > But to point out that this isn't perfect for everything, if I run
> > the same test with 1000 processes on that config, it takes 6.2s,
> > writes 400MB to the log and flushes 153,000 inode clusters. We're
> > back to not having enough log space for the workload. Bumping out to
> > agcount=67 (2x CPU count) and max log size(2GB) results in runtime
> > of 3.5s (largely linear from the 100 process count), no inode
> > writeback and only 200MB of log throughput.
> > 
> > So, yeah, I suspect that if you change the mkfs parameters for the
> > reaim tests to have more AGs and significantly increased log space
> > over the defaults, the regression should disappear...
> > 
> > Can you run these experiments with reaim on your test machines and
> > see if they are influenced by filesystem level concurrency
> > parameters like AG count and log size as these tests suggest?
> 
> Thanks for the detailed analysis and suggestions. So here are some numbers
> from my end. First, note that I've bumped number of unlinks each process
> does from 100 to 1000 so that runtime on the test machine gets above 1s to
> avoid some random noise. The relative slowdown (25%) didn't change (I did
> it already yesterday but sent stress-unlink version without bumped up loop
> count). Exact time to complete of stress-unlink are (from 5 runs):
> 
> (default mkfs params - agcount=4, logsize=223MB)
> 	AVG	STDDEV
> before	2.7952	0.022886
> after		3.4746	0.025303
> 
> I did also runs with increased AG count (to 96) and log size (to 512 MB).
> The results are somewhat surprising:
> 
> (agcount=96, logsize=512MB)
> 	AVG	STDDEV
> before	3.1340	0.044168
> after		3.3612	0.048412

You bumped the log size to 512MB, but then bumped the amount of work
by a factor of 10....

> So the change did somewhat help the case with deferred inactivation however
> it significantly hurt the kernel before deferred inactivation. Overall we
> are still far from original performance. 
> 
> I had a look at xfs stats (full stats are attached for 4 different configs
> - (before / after) * (defaults / high ag count)) and logging stats are
> clearly different:
> 
> defaults-before:
> log 4700 157590 0 48585 71854

Fits in < 75% of log space, so will not be pushing out metadata at
all during the test.

> defaults-after:
> log 10293 536643 6 48007 53762

Overwrites the log 2.5x, so limited by metadata writeback speed for
most of the test.

> highag-before:
> log 5933 233065 0 48000 58240

Yup, workload is spread across 16x more AGs and AG metadata, so I'd
expect to see such an increase in log throughput. But with a larger
log, this won't be tail-pushing...

> highag-after:
> log 6789 303249 0 48155 53795

And this is also under the tail-pushing threshold, so really the
only difference in perf here comes from writing more metadata to
the log.

> We can see big differences in the amount of log writes as well as logged
> blocks. defaults-before logged 76 MB, defaults-after logged 262 MB,
> highag-before logged 113 MB, highag-after logged 148 MB. Given data writes
> for this test are 750 MB (16k * 48 processes * 1000 loops), the difference
> of 186 MB of log IO matches well with the observed difference in the amount
> of writes in block traces.
> 
> I'm not sure why the amount of logged blocks differs so much.

fsync() interactions.

On the original code, the two unlink transactions are temporally
adjacent as both are in unlinkat() syscall context.  One is directly
run by the syscall, the other in task_run context at syscall exit
when processing the last reference of the file being dropped.

In general, that means the objects modified (inode cluster, AGI,
inode, etc) are captured by the same CIL context and so aggregate in
memory as a single change (essentially log dedupe). Then the next
fsync() from some other context runs, pushing the CIL to disk and
we only log those objects modified in unlink to the journal once.

With deferred inactivation, the two phases of unlink are temporally
decoupled. We get a bunch of inodes running the first phase in
unlink() context, but the second phase is run later in a tight loop
from workqueue context. But this means that fsync()s occur between
the two phases, and hence the objects modified in the two phases of
unlink are modified in two separate CIL contexts. Hence they get
written to the log twice.

Depending on the way things work out, deferred inactivation also
results in longer unlinked inode chains, resulting in more objects
being logged per unlink than without deferred inactivation, where the
inodes are added to the unlink chain and then immediately removed
before any others are added. Hence deferred inode inactivation will
increase the amount written to the log per unlink if the two phases
of unlink are split across journal checkpoints.

IOWs, an increase in log writes for open-write-fsync-close-unlink
workloads is not unexpected. But this workload isn't really a real
world workload in any way - we generally don't do data integrity
writes only to immediately throw the data away. :/

Remove the fsync and we do have a real world workload - temporary
files for compiles, etc. Then everything should mostly get logged
in the same CIL context because all transactions are run
asynchronously and aggregate in memory until the CIL is full and
checkpoints itself. Hence log writes shouldn't change very much at
all for such workloads.

> I didn't find
> big difference between various configs in push_ail stats.  However I did
> find notable differences in various btree stats:
> 
> defaults-before:
> abtb2 99580 228864 15129 15129 0 0 0 0 0 0 0 0 0 0 49616
> abtc2 229903 464130 96526 96526 0 0 0 0 0 0 0 0 0 0 299055
> ibt2 96099 96096 3 3 0 0 0 0 0 0 0 0 0 0 0
> fibt2 96099 96096 3 3 0 0 0 0 0 0 0 0 0 0 0
> 
> defaults-after:
> abtb2 95532 400144 36596 36441 0 0 0 0 0 0 0 0 0 0 1421782
> abtc2 157421 667273 61135 60980 0 0 0 0 0 0 0 0 0 0 1935106
> ibt2 95420 220423 51 34 0 0 0 0 0 0 0 0 0 0 1
> fibt2 131505 230219 7708 7691 0 0 0 0 0 0 0 0 0 0 40470

Yup, the change in the finobt is indicative of slight changes in
order of allocate/free of inodes. We're tracking a more free
inode records because we aren't doing purely sequential inode
allocation and freeing due to AGI contention managing the unlinked
inode chain.

IOWs, we've previously hyper-optimised create-unlink workloads to
aggressively reuse inodes, and that resulted in rapid reuse of
the unlinked inodes. That still happens, but deferred inactivation
increases the pool of free inodes that is being cycled over by this
workload. This reflects the per-ag contention occurring in this
limited fs config - we're getting larger batches of unlinked inodes
being chained on the AGs and so when they are freed we see more
inodes being added to the finobt and then reallocated from there.
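(Assuming the btree stat fields are lookup/compare/insrec/delrec/..., the
finobt numbers above bear this out: defaults-before inserted and deleted only
3 finobt records over the whole run, while defaults-after inserted ~7700 and
deleted ~7690.)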

I have plans to further decouple the unlinked AGI chain updates
between the two phases of unlink that will help address this, but
that is future work and not ready to go yet.

> highag-before:
> abtb2 120143 240191 24047 24047 0 0 0 0 0 0 0 0 0 0 0
> abtc2 288334 456240 120143 120143 0 0 0 0 0 0 0 0 0 0 24051
> ibt2 96143 96096 47 47 0 0 0 0 0 0 0 0 0 0 0
> fibt2 96143 96096 47 47 0 0 0 0 0 0 0 0 0 0 0
> 
> highag-after:
> abtb2 96903 205361 20137 20117 0 0 0 0 0 0 0 0 0 0 101850
> abtc2 211742 433347 81617 81597 0 0 0 0 0 0 0 0 0 0 274068
> ibt2 96083 96035 48 47 0 0 0 0 0 0 0 0 0 0 0
> fibt2 96083 96035 48 47 0 0 0 0 0 0 0 0 0 0 0

With this, we have no AGI contention to speak of between the two
phases of unlink, so unlinked inode chains remain very short in both
cases and so we don't see any change to finobt residency.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Performance regression with async inode inactivation
  2021-10-07 23:44             ` Dave Chinner
@ 2021-10-12 13:42               ` Jan Kara
  2021-10-12 21:23                 ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2021-10-12 13:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-xfs, Dave Chinner, Darrick J. Wong

On Fri 08-10-21 10:44:30, Dave Chinner wrote:
> On Thu, Oct 07, 2021 at 02:03:57PM +0200, Jan Kara wrote:
> > On Thu 07-10-21 08:58:51, Dave Chinner wrote:
> > > On Wed, Oct 06, 2021 at 08:10:01PM +0200, Jan Kara wrote:
> > > Hmmm, I didn't see this at first.  What's the filesystem layout
> > > (xfs_info) and the CLI for the test that you ran? How many CPUs was
> > > the test run on?
> > 
> > The machine has 24 cores, each core has 2 SMT siblings, so 48 logical CPUs.
> > That's why I've run stress-unlink with 48 processes.
> > 
> > xfs_info is:
> > 
> > meta-data=/dev/sdb1              isize=512    agcount=4, agsize=29303104 blks
> >          =                       sectsz=512   attr=2, projid32bit=1
> >          =                       crc=1        finobt=1, sparse=0, rmapbt=0
> >          =                       reflink=0
> > data     =                       bsize=4096   blocks=117212416, imaxpct=25
> >          =                       sunit=0      swidth=0 blks
> > naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> > log      =internal log           bsize=4096   blocks=57232, version=2
> 
> OK, default log is ~230MB, fs size is ~400GB?

Yep.

<snip log analysis>

> > We can see big differences in the amount of log writes as well as logged
> > blocks. defaults-before logged 76 MB, defaults-after logged 262 MB,
> > highag-before logged 113 MB, highag-after logged 148 MB. Given data writes
> > for this test are 750 MB (16k * 48 processes * 1000 loops), the difference
> > of 186 MB of log IO matches well with the observed difference in the amount
> > of writes in block traces.
> > 
> > I'm not sure why the amount of logged blocks differs so much.
> 
> fsync() interactions.
> 
> On the original code, the two unlink transactions are temporally
> adjacent as both are in unlinkat() syscall context.  One is directly
> run by the syscall, the other in task_run context at syscall exit
> when processing the last reference of the file being dropped.
> 
> In general, that means the objects modified (inode cluster, AGI,
> inode, etc) are captured by the same CIL context and so aggregate in
> memory as a single change (essentially log dedupe). Then the next
> fsync() from some other context runs, pushing the CIL to disk and
> we only log those objects modified in unlink to the journal once.
> 
> With deferred inactivation, the two phases of unlink are temporally
> decoupled. We get a bunch of inodes running the first phase in
> unlink() context, but the second phase is run later in a tight loop
> from workqueue context. But this means that fsync()s occur between
> the two phases, and hence the objects modified in the two phases of
> unlink are modified in two separate CIL contexts. Hence they get
> written to the log twice.
> 
> Depending on the way things work out, deferred inactivation also
> results in longer unlinked inode chains, resulting in more objects
> being logged per unlink than without deferred inactivation, as the
> inodes are added to the unlink chain and then immediately removed
> before any others are added. Hence deferred inode inactivation will
> increase the amount written to the log per unlink if the two phases
> of unlink are split across journal checkpoints.
> 
> IOWs, an increase in log writes for open-write-fsync-close-unlink
> workloads is not unexpected. But this workload isn't really a real
> world workload in any way - we generally don't do data integrity
> writes only to immediately throw the data away. :/
> 
> Remove the fsync and we do have a real world workload - temporary
> files for compiles, etc. Then everything should mostly get logged
> in the same CIL context because all transactions are run
> asynchronously and aggregate in memory until the CIL is full and
> checkpoints itself. Hence log writes shouldn't change very much at
> all for such workloads.

OK, that makes sense. Thanks for the explanation. So to verify your theory,
I've removed fsync(2) from the test program, so now it is a pure create,
write, unlink workload. Results of "stress-unlink 48 /mnt", now for 5000
loops of create-unlink to increase the runtime (the workload does barely
any writes so it should not matter wrt the log):

default mkfs params:
	AVG	STDDEV
before	2.0380	0.1597
after	2.7356	0.4712

agcount=96, log size 512M
	AVG	STDDEV
before	1.0610	0.0227
after	1.2508	0.0218
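
(Relative to 'before', that is roughly a third slower with the default mkfs
params and close to 20% slower with agcount=96.)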

So there is still a notable regression with async inactivation. With default
mkfs params we'd need more runs to get more reliable results (note the rather
high standard deviation), but with the high AG count the results show a pretty
stable 20% regression - so let's have a look at that.

Looking at xfs stats there are barely any differences between before &
after - 'after' writes a bit more to the log but it is ~1.5 MB over the
whole benchmark run, altogether spending some 8ms doing IO so that's not
it. Generally the workload seems to be CPU / memory bound now (it does
barely any IO). Perf shows we are heavily contending on some spinlock in
xfs_cil_commit() - I presume this is a xc_cil_lock. This actually happens
both before and after, but we seem to spend some more time there with async
inactivation. Likely this is related to work being done from worker
threads. Perf stats for comparison:

before
         51,135.08 msec cpu-clock                 #   47.894 CPUs utilized          
             4,699      context-switches          #    0.092 K/sec                  
               382      cpu-migrations            #    0.007 K/sec                  
             1,228      page-faults               #    0.024 K/sec                  
   128,884,972,351      cycles                    #    2.520 GHz                    
    38,517,767,839      instructions              #    0.30  insn per cycle         
     8,337,611,468      branches                  #  163.051 M/sec                  
        39,749,736      branch-misses             #    0.48% of all branches        
        25,225,109      cache-misses                                                

       1.067666170 seconds time elapsed

after
         65,353.43 msec cpu-clock                 #   47.894 CPUs utilized          
            43,737      context-switches          #    0.669 K/sec                  
             1,824      cpu-migrations            #    0.028 K/sec                  
             1,953      page-faults               #    0.030 K/sec                  
   155,144,150,867      cycles                    #    2.374 GHz                    
    45,280,145,337      instructions              #    0.29  insn per cycle         
    10,027,567,384      branches                  #  153.436 M/sec                  
        39,554,691      branch-misses             #    0.39% of all branches        
        30,203,567      cache-misses                                                

       1.364539400 seconds time elapsed
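
(That is roughly a 9x increase in context switches, ~20% more cache misses and
~28% longer elapsed time.)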

So we can see a huge increase in context switches, a notable increase in
cache misses, and a decrease in cycles/s, so perhaps we are bouncing cache
lines more? Anyway, I guess this is kind of expected due to the nature of
async inactivation; I just wanted to highlight that there are regressions
without fsync in the game as well.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Performance regression with async inode inactivation
  2021-10-12 13:42               ` Jan Kara
@ 2021-10-12 21:23                 ` Dave Chinner
  2021-10-13  9:51                   ` Jan Kara
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2021-10-12 21:23 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-xfs, Dave Chinner, Darrick J. Wong

On Tue, Oct 12, 2021 at 03:42:55PM +0200, Jan Kara wrote:
> On Fri 08-10-21 10:44:30, Dave Chinner wrote:
> > On Thu, Oct 07, 2021 at 02:03:57PM +0200, Jan Kara wrote:
> > > On Thu 07-10-21 08:58:51, Dave Chinner wrote:
> > > > On Wed, Oct 06, 2021 at 08:10:01PM +0200, Jan Kara wrote:
> > > > Hmmm, I didn't see this at first.  What's the filesystem layout
> > > > (xfs_info) and the CLI for the test that you ran? How many CPUs was
> > > > the test run on?
> > > 
> > > The machine has 24 cores, each core has 2 SMT siblings, so 48 logical CPUs.
> > > That's why I've run stress-unlink with 48 processes.
> > > 
> > > xfs_info is:
> > > 
> > > meta-data=/dev/sdb1              isize=512    agcount=4, agsize=29303104 blks
> > >          =                       sectsz=512   attr=2, projid32bit=1
> > >          =                       crc=1        finobt=1, sparse=0, rmapbt=0
> > >          =                       reflink=0
> > > data     =                       bsize=4096   blocks=117212416, imaxpct=25
> > >          =                       sunit=0      swidth=0 blks
> > > naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> > > log      =internal log           bsize=4096   blocks=57232, version=2
> > 
> > OK, default log is ~230MB, fs size is ~400GB?
> 
> Yep.
> 
> <snip log analysis>
> 
> > > We can see big differences in the amount of log writes as well as logged
> > > blocks. defaults-before logged 76 MB, defaults-after logged 262 MB,
> > > highag-before logged 113 MB, highag-after logged 148 MB. Given data writes
> > > for this test are 750 MB (16k * 48 processes * 1000 loops), the difference
> > > of 186 MB of log IO matches well with the observed difference in the amount
> > > of writes in block traces.
> > > 
> > > I'm not sure why the amount of logged blocks differs so much.
> > 
> > fsync() interactions.
> > 
> > On the original code, the two unlink transactions are temporally
> > adjacent as both are in unlinkat() syscall context.  One is directly
> > run by the syscall, the other in task_run context at syscall exit
> > when processing the last reference of the file being dropped.
> > 
> > In general, that means the objects modified (inode cluster, AGI,
> > inode, etc) are captured by the same CIL context and so aggregate in
> > memory as a single change (essentially log dedupe). Then the next
> > fsync() from some other context runs, pushing the CIL to disk and
> > we only log those objects modified in unlink to the journal once.
> > 
> > With deferred inactivation, the two phases of unlink are temporally
> > decoupled. We get a bunch of inodes running the first phase in
> > unlink() context, but the second phase is run later in a tight loop
> > from workqueue context. But this means that fsync()s occur between
> > the two phases, and hence the objects modified in the two phases of
> > unlink are modified in two separate CIL contexts. Hence they get
> > written to the log twice.
> > 
> > Depending on the way things work out, deferred inactivation also
> > results in longer unlinked inode chains, resulting in more objects
> > being logged per unlink than without deferred inactivation, as the
> > inodes are added to the unlink chain and then immediately removed
> > before any others are added. Hence deferred inode inactivation will
> > increase the amount written to the log per unlink if the two phases
> > of unlink are split across journal checkpoints.
> > 
> > IOWs, an increase in log writes for open-write-fsync-close-unlink
> > workloads is not unexpected. But this workload isn't really a real
> > world workload in any way - we generally don't do data integrity
> > writes only to immediately throw the data away. :/
> > 
> > Remove the fsync and we do have a real world workload - temporary
> > files for compiles, etc. Then everything should mostly get logged
> > in the same CIL context because all transactions are run
> > asynchronously and aggregate in memory until the CIL is full and
> > checkpoints itself. Hence log writes shouldn't change very much at
> > all for such workloads.
> 
> OK, that makes sense. Thanks for explanation. So to verify your theory,
> I've removed fsync(2) from the test program. So now it is pure create,
> write, unlink workload. Results of "stress-unlink 48 /mnt", now for 5000
> loops of create-unlink to increase runtime (but the workload does barely
> any writes so it should not matter wrt log):
> 
> default mkfs params:
> 	AVG	STDDEV
> before	2.0380	0.1597
> after	2.7356	0.4712
> 
> agcount=96, log size 512M
> 	AVG	STDDEV
> before	1.0610	0.0227
> after	1.2508	0.0218
> 
> So still notable regression with the async inactivation. With default mkfs
> params we'd need more runs to get more reliable results (note the rather
> high standard deviation) but with high AG count results show pretty stable
> 20% regression - so let's have a look at that.
> 
> Looking at xfs stats there are barely any differences between before &
> after - 'after' writes a bit more to the log but it is ~1.5 MB over the
> whole benchmark run, altogether spending some 8ms doing IO so that's not
> it. Generally the workload seems to be CPU / memory bound now (it does
> barely any IO). Perf shows we are heavily contending on some spinlock in
> xfs_cil_commit() - I presume this is a xc_cil_lock.

Yes, and I have patches that fix this. It got reverted before a
release because it exposed a bunch of underlying zero-day bugs in
the log code, and I haven't had time to run it through the review
cycle again even though it's pretty much unchanged from commits
26-39 in this series:

https://lore.kernel.org/linux-xfs/20210603052240.171998-1-david@fromorbit.com/

The profiles in this patch demonstrate the problem and the fix:

https://lore.kernel.org/linux-xfs/20210603052240.171998-35-david@fromorbit.com/

I did all my perf testing of inode inactivation with the CIL
scalability patches also installed, because deferred inode
inactivation only made contention on the CIL lock worse in my perf
tests. We simply can't evaluate the benefit of a change when the
system is easily driven into catastrophic lock breakdown by user
level operational concurrency.

IOWs, the CIL lock is the global limiting factor for async
transaction commit rates on large CPU count machines, and things
that remove bottlenecks in higher layers often just increase
contention on this lock and drive it into breakdown. That makes perf
go backwards, not forwards, and it's not the fault of the high level
change being made. That doesn't make the high level change wrong, it
just means we need to peel the onion further before the improvements
are fully realised.

> This actually happens
> both before and after, but we seem to spend some more time there with async
> inactivation. Likely this is related to work being done from worker
> threads. Perf stats for comparison:
> 
> before
>          51,135.08 msec cpu-clock                 #   47.894 CPUs utilized          
>              4,699      context-switches          #    0.092 K/sec                  
>                382      cpu-migrations            #    0.007 K/sec                  
>              1,228      page-faults               #    0.024 K/sec                  
>    128,884,972,351      cycles                    #    2.520 GHz                    
>     38,517,767,839      instructions              #    0.30  insn per cycle         
>      8,337,611,468      branches                  #  163.051 M/sec                  
>         39,749,736      branch-misses             #    0.48% of all branches        
>         25,225,109      cache-misses                                                
> 
>        1.067666170 seconds time elapsed
> 
> after
>          65,353.43 msec cpu-clock                 #   47.894 CPUs utilized          
>             43,737      context-switches          #    0.669 K/sec                  
>              1,824      cpu-migrations            #    0.028 K/sec                  
>              1,953      page-faults               #    0.030 K/sec                  
>    155,144,150,867      cycles                    #    2.374 GHz                    
>     45,280,145,337      instructions              #    0.29  insn per cycle         
>     10,027,567,384      branches                  #  153.436 M/sec                  
>         39,554,691      branch-misses             #    0.39% of all branches        
>         30,203,567      cache-misses                                                
> 
>        1.364539400 seconds time elapsed
> 
> So we can see huge increase in context-switches, notable increase in
> cache-misses, decrease in cycles/s so perhaps we are bouncing cache more?
> Anyway I guess this is kind of expected due to the nature of async
> inactivation, I just wanted to highlight that there are regressions without
> fsync in the game as well.

Context switches are largely noise - they are most likely just AGI
locks being bounced a bit more. It's the spinlock contention that is
the likely issue here. For example, on my 32p machine with vanilla
5.15-rc4 with a fsync-less, 5000 iteration test run:

$ sudo perf_5.9 stat ./stress-unlink 32 /mnt/scratch
Processes started.
1.290

 Performance counter stats for './stress-unlink 32 /mnt/scratch':

         16,856.61 msec task-clock                #   12.595 CPUs utilized          
            48,297      context-switches          #    0.003 M/sec                  
             4,219      cpu-migrations            #    0.250 K/sec                  
             1,373      page-faults               #    0.081 K/sec                  
    39,254,798,526      cycles                    #    2.329 GHz                    
    16,460,808,349      instructions              #    0.42  insn per cycle         
     3,475,251,228      branches                  #  206.166 M/sec                  
        12,129,889      branch-misses             #    0.35% of all branches        

       1.338312347 seconds time elapsed

       0.186554000 seconds user
      17.247176000 seconds sys

And with 5.15-rc4 + CIL scalability:

$ sudo perf_5.9 stat ./stress-unlink 32 /mnt/scratch
Processes started.
0.894

 Performance counter stats for './stress-unlink 32 /mnt/scratch':

         12,917.93 msec task-clock                #   13.805 CPUs utilized
            39,680      context-switches          #    0.003 M/sec
             2,737      cpu-migrations            #    0.212 K/sec
             1,402      page-faults               #    0.109 K/sec
    30,920,293,752      cycles                    #    2.394 GHz
    14,472,067,501      instructions              #    0.47  insn per cycle
     2,700,978,247      branches                  #  209.087 M/sec
         9,287,754      branch-misses             #    0.34% of all branches

       0.935710173 seconds time elapsed

       0.192467000 seconds user
      13.245977000 seconds sys

Runtime of the fsync-less, 5,000 iteration version drops from 1.29s
to 0.89s, IPC goes up, branches and branch-misses go down, context
switches only go down slightly, etc. IOWs, when you take away the
CIL lock contention, we get back all that perf loss and then some...

FWIW, let's really hammer it for a long while. Vanilla 5.14-rc4:

$ sudo perf_5.9 stat ./stress-unlink 1000 /mnt/scratch
Processes started.
38.881

 Performance counter stats for './stress-unlink 1000 /mnt/scratch':

        733,741.06 msec task-clock                #   16.004 CPUs utilized          
        13,131,968      context-switches          #    0.018 M/sec                  
         1,302,636      cpu-migrations            #    0.002 M/sec                  
            40,720      page-faults               #    0.055 K/sec                  
 1,195,192,185,398      cycles                    #    1.629 GHz                    
   643,382,890,656      instructions              #    0.54  insn per cycle         
   129,065,409,600      branches                  #  175.900 M/sec                  
       768,146,988      branch-misses             #    0.60% of all branches        

      45.847750477 seconds time elapsed

      11.194020000 seconds user
     758.762376000 seconds sys

And the transaction rate is pinned at 800,000/s for the entire test.
We're running at the limit of the CIL lock here.

With CIL scalability patchset:

$ sudo perf_5.9 stat ./stress-unlink 1000 /mnt/scratch
Processes started.
28.263

 Performance counter stats for './stress-unlink 1000 /mnt/scratch':

        450,547.80 msec task-clock                #   15.038 CPUs utilized          
         5,949,268      context-switches          #    0.013 M/sec                  
           868,887      cpu-migrations            #    0.002 M/sec                  
            41,570      page-faults               #    0.092 K/sec                  
   836,933,822,425      cycles                    #    1.858 GHz                    
   536,132,522,275      instructions              #    0.64  insn per cycle         
    99,264,579,180      branches                  #  220.320 M/sec                  
       506,921,132      branch-misses             #    0.51% of all branches        

      29.961492616 seconds time elapsed

       7.796758000 seconds user
     471.990545000 seconds sys


30% reduction in runtime because the transaction rate is now
running at 1.1M/s. Improvements in code execution across the board
here, so the problem clearly isn't the deferred inode inactivation.

IOWs, I'm largely not concerned about the high CPU count perf
regressions that we are seeing from log code these days - the fix is
largely ready, it's just lacking in available engineering time to get
it over the line and merged right now...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Performance regression with async inode inactivation
  2021-10-12 21:23                 ` Dave Chinner
@ 2021-10-13  9:51                   ` Jan Kara
  0 siblings, 0 replies; 12+ messages in thread
From: Jan Kara @ 2021-10-13  9:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-xfs, Dave Chinner, Darrick J. Wong

On Wed 13-10-21 08:23:39, Dave Chinner wrote:
> On Tue, Oct 12, 2021 at 03:42:55PM +0200, Jan Kara wrote:
> > On Fri 08-10-21 10:44:30, Dave Chinner wrote:
> > > Remove the fsync and we do have a real world workload - temporary
> > > files for compiles, etc. Then everything should mostly get logged
> > > in the same CIL context because all transactions are run
> > > asynchronously and aggregate in memory until the CIL is full and
> > > checkpoints itself. Hence log writes shouldn't change very much at
> > > all for such workloads.
> > 
> > OK, that makes sense. Thanks for explanation. So to verify your theory,
> > I've removed fsync(2) from the test program. So now it is pure create,
> > write, unlink workload. Results of "stress-unlink 48 /mnt", now for 5000
> > loops of create-unlink to increase runtime (but the workload does barely
> > any writes so it should not matter wrt log):
> > 
> > default mkfs params:
> > 	AVG	STDDEV
> > before	2.0380	0.1597
> > after	2.7356	0.4712
> > 
> > agcount=96, log size 512M
> > 	AVG	STDDEV
> > before	1.0610	0.0227
> > after	1.2508	0.0218
> > 
> > So still notable regression with the async inactivation. With default mkfs
> > params we'd need more runs to get more reliable results (note the rather
> > high standard deviation) but with high AG count results show pretty stable
> > 20% regression - so let's have a look at that.
> > 
> > Looking at xfs stats there are barely any differences between before &
> > after - 'after' writes a bit more to the log but it is ~1.5 MB over the
> > whole benchmark run, altogether spending some 8ms doing IO so that's not
> > it. Generally the workload seems to be CPU / memory bound now (it does
> > barely any IO). Perf shows we are heavily contending on some spinlock in
> > xfs_cil_commit() - I presume this is a xc_cil_lock.
> 
> Yes, and I have patches that fix this. It got reverted before a
> release because it exposed a bunch of underlying zero-day bugs in
> the log code, and I haven't had time to run it through the review
> cycle again even though it's pretty much unchanged from commits
> 26-39 in this series:
> 
> https://lore.kernel.org/linux-xfs/20210603052240.171998-1-david@fromorbit.com/
> 
> The profiles in this patch demonstrate the problem and the fix:
> 
> https://lore.kernel.org/linux-xfs/20210603052240.171998-35-david@fromorbit.com/
> 
> I did all my perf testing of inode inactivation with the CIL
> scalability patches also installed, because deferred inode
> inactivation only made contention on the CIL lock worse in my perf
> tests. We simply can't evaluate the benefit of a change when the
> system is easily driven into catastrophic lock breakdown by user
> level operational concurrency.
> 
> IOWs, the CIL lock is the global limiting factor for async
> transaction commit rates on large CPU count machines, and things
> that remove bottlenecks in higher layers often just increase
> contention on this lock and drive it into breakdown. That makes perf
> go backwards, not forwards, and it's not the fault of the high level
> change being made. That doesn't make the high level change wrong, it
> just means we need to peel the onion further before the improvements
> are fully realised.

OK, understood.

> > This actually happens
> > both before and after, but we seem to spend some more time there with async
> > inactivation. Likely this is related to work being done from worker
> > threads. Perf stats for comparison:
> > 
> > before
> >          51,135.08 msec cpu-clock                 #   47.894 CPUs utilized          
> >              4,699      context-switches          #    0.092 K/sec                  
> >                382      cpu-migrations            #    0.007 K/sec                  
> >              1,228      page-faults               #    0.024 K/sec                  
> >    128,884,972,351      cycles                    #    2.520 GHz                    
> >     38,517,767,839      instructions              #    0.30  insn per cycle         
> >      8,337,611,468      branches                  #  163.051 M/sec                  
> >         39,749,736      branch-misses             #    0.48% of all branches        
> >         25,225,109      cache-misses                                                
> > 
> >        1.067666170 seconds time elapsed
> > 
> > after
> >          65,353.43 msec cpu-clock                 #   47.894 CPUs utilized          
> >             43,737      context-switches          #    0.669 K/sec                  
> >              1,824      cpu-migrations            #    0.028 K/sec                  
> >              1,953      page-faults               #    0.030 K/sec                  
> >    155,144,150,867      cycles                    #    2.374 GHz                    
> >     45,280,145,337      instructions              #    0.29  insn per cycle         
> >     10,027,567,384      branches                  #  153.436 M/sec                  
> >         39,554,691      branch-misses             #    0.39% of all branches        
> >         30,203,567      cache-misses                                                
> > 
> >        1.364539400 seconds time elapsed
> > 
> > So we can see huge increase in context-switches, notable increase in
> > cache-misses, decrease in cycles/s so perhaps we are bouncing cache more?
> > Anyway I guess this is kind of expected due to the nature of async
> > inactivation, I just wanted to highlight that there are regressions without
> > fsync in the game as well.
> 
> Context switches are largely noise - they are most likely just AGI
> locks being bounced a bit more. It's the spinlock contention that is
> the likely issue here. For example, on my 32p machine with vanilla
> 5.15-rc4 with a fsync-less, 5000 iteration test run:
> 
> $ sudo perf_5.9 stat ./stress-unlink 32 /mnt/scratch
> Processes started.
> 1.290
> 
>  Performance counter stats for './stress-unlink 32 /mnt/scratch':
> 
>          16,856.61 msec task-clock                #   12.595 CPUs utilized          
>             48,297      context-switches          #    0.003 M/sec                  
>              4,219      cpu-migrations            #    0.250 K/sec                  
>              1,373      page-faults               #    0.081 K/sec                  
>     39,254,798,526      cycles                    #    2.329 GHz                    
>     16,460,808,349      instructions              #    0.42  insn per cycle         
>      3,475,251,228      branches                  #  206.166 M/sec                  
>         12,129,889      branch-misses             #    0.35% of all branches        
> 
>        1.338312347 seconds time elapsed
> 
>        0.186554000 seconds user
>       17.247176000 seconds sys
> 
> And with 5.15-rc4 + CIL scalability:
> 
> $ sudo perf_5.9 stat ./stress-unlink 32 /mnt/scratch
> Processes started.
> 0.894
> 
>  Performance counter stats for './stress-unlink 32 /mnt/scratch':
> 
>          12,917.93 msec task-clock                #   13.805 CPUs utilized
>             39,680      context-switches          #    0.003 M/sec
>              2,737      cpu-migrations            #    0.212 K/sec
>              1,402      page-faults               #    0.109 K/sec
>     30,920,293,752      cycles                    #    2.394 GHz
>     14,472,067,501      instructions              #    0.47  insn per cycle
>      2,700,978,247      branches                  #  209.087 M/sec
>          9,287,754      branch-misses             #    0.34% of all branches
> 
>        0.935710173 seconds time elapsed
> 
>        0.192467000 seconds user
>       13.245977000 seconds sys
> 
> Runtime of the fsync-less, 5,000 iteration version drops from 1.29s
> to 0.89s, IPC goes up, branches and branch-misses go down, context
> switches only go down slightly, etc. IOWs, when you take away the
> CIL lock contention, we get back all that perf loss and then some...

Nice results!
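
For reference, a minimal sketch of what a fsync-less create/unlink stress
workload of this shape might look like. This is only an assumption about
the general form of the test (worker count as the first argument, target
directory as the second, a small write and an immediate unlink per
iteration), not the actual stress-unlink source behind the numbers above:

/*
 * Sketch of a fsync-less unlink stress test: nproc workers, each
 * creating, writing and immediately unlinking ITERS small files in
 * its own subdirectory.  The rapid unlinks are what drive the inode
 * inactivation / CIL commit paths discussed in this thread.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

#define ITERS	5000
#define FSIZE	4096

static void worker(const char *base, int id)
{
	char dir[256], path[512], buf[FSIZE];
	int i, fd;

	snprintf(dir, sizeof(dir), "%s/w%d", base, id);
	mkdir(dir, 0755);
	memset(buf, 'x', sizeof(buf));

	for (i = 0; i < ITERS; i++) {
		snprintf(path, sizeof(path), "%s/f%d", dir, i);
		fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd < 0)
			exit(1);
		if (write(fd, buf, FSIZE) != FSIZE)
			exit(1);
		/* no fsync() - this is the fsync-less variant */
		close(fd);
		unlink(path);
	}
}

int main(int argc, char **argv)
{
	int nproc, i;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <nprocs> <dir>\n", argv[0]);
		return 1;
	}
	nproc = atoi(argv[1]);

	for (i = 0; i < nproc; i++) {
		if (fork() == 0) {
			worker(argv[2], i);
			_exit(0);
		}
	}
	printf("Processes started.\n");
	while (wait(NULL) > 0)
		;
	return 0;
}

Run under perf stat with e.g. 32 workers on a scratch mount to reproduce
the rough shape of the runs above (the binary name and iteration count are
made up here, not taken from the thread).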

> FWIW, let's really hammer it for a long while. Vanilla 5.15-rc4:
> 
> $ sudo perf_5.9 stat ./stress-unlink 1000 /mnt/scratch
> Processes started.
> 38.881
> 
>  Performance counter stats for './stress-unlink 1000 /mnt/scratch':
> 
>         733,741.06 msec task-clock                #   16.004 CPUs utilized          
>         13,131,968      context-switches          #    0.018 M/sec                  
>          1,302,636      cpu-migrations            #    0.002 M/sec                  
>             40,720      page-faults               #    0.055 K/sec                  
>  1,195,192,185,398      cycles                    #    1.629 GHz                    
>    643,382,890,656      instructions              #    0.54  insn per cycle         
>    129,065,409,600      branches                  #  175.900 M/sec                  
>        768,146,988      branch-misses             #    0.60% of all branches        
> 
>       45.847750477 seconds time elapsed
> 
>       11.194020000 seconds user
>      758.762376000 seconds sys
> 
> And the transaction rate is pinned at 800,000/s for the entire test.
> We're running at the limit of the CIL lock here.
> 
> With the CIL scalability patchset:
> 
> $ sudo perf_5.9 stat ./stress-unlink 1000 /mnt/scratch
> Processes started.
> 28.263
> 
>  Performance counter stats for './stress-unlink 1000 /mnt/scratch':
> 
>         450,547.80 msec task-clock                #   15.038 CPUs utilized          
>          5,949,268      context-switches          #    0.013 M/sec                  
>            868,887      cpu-migrations            #    0.002 M/sec                  
>             41,570      page-faults               #    0.092 K/sec                  
>    836,933,822,425      cycles                    #    1.858 GHz                    
>    536,132,522,275      instructions              #    0.64  insn per cycle         
>     99,264,579,180      branches                  #  220.320 M/sec                  
>        506,921,132      branch-misses             #    0.51% of all branches        
> 
>       29.961492616 seconds time elapsed
> 
>        7.796758000 seconds user
>      471.990545000 seconds sys
> 
> 
> 30% reduction in runtime because the transaction rate is now
> running at 1.1M/s. Improvements in code execution across the board
> here, so the problem clearly isn't the deferred inode inactivation.
> 
> IOWs, I'm largely not concerned about the high CPU count perf
> regressions that we are seeing from log code these days - the fix is
> largely ready, it's just lacking in available engineering time to get
> it over the line and merged right now...

OK, thanks for the explanation!
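
For completeness, a rough sketch of how the transaction rates quoted above
(~800,000/s vanilla vs ~1.1M/s with the CIL scalability patches) can be
watched while such a test runs: sample the "trans" line of
/proc/fs/xfs/stat once a second and print the delta. The sync/async/empty
layout of that line is an assumption based on the XFS runtime statistics,
not something taken from this thread:

/*
 * Print the XFS transaction commit rate once a second by sampling
 * the "trans" line (sync, async, empty counters) of
 * /proc/fs/xfs/stat and reporting the per-second delta of their sum.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static unsigned long long read_trans(void)
{
	char line[256];
	unsigned long long s = 0, a = 0, e = 0;
	FILE *f = fopen("/proc/fs/xfs/stat", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "trans %llu %llu %llu", &s, &a, &e) == 3)
			break;
	}
	fclose(f);
	return s + a + e;
}

int main(void)
{
	unsigned long long prev = read_trans(), cur;

	for (;;) {
		sleep(1);
		cur = read_trans();
		printf("%llu transactions/s\n", cur - prev);
		prev = cur;
	}
	return 0;
}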

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
