* Performance regression with async inode inactivation @ 2021-10-04 10:06 Jan Kara 2021-10-04 18:17 ` Darrick J. Wong 2021-10-04 21:15 ` Dave Chinner 0 siblings, 2 replies; 12+ messages in thread From: Jan Kara @ 2021-10-04 10:06 UTC (permalink / raw) To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong Hello, our performance testing grid has detected a performance regression caused by commit ab23a77687 ("xfs: per-cpu deferred inode inactivation queues") with reaim benchmark running 'disk' and 'disk-large' workloads. The regression has been so far detected on two machines - marvin7 (48 cpus, 64 GB ram, SATA SSD), dobby (64 cpus, 192 GB ram, rotating disk behind megaraid_sas controller). The regression reports from mmtests on marvin7 (which from my experience I find more reliable than dobby) look like: reaim-io-disk-xfs Hmean disk-1 3597.12 ( 0.00%) 3636.37 ( 1.09%) Hmean disk-25 125000.00 ( 0.00%) 121753.25 ( -2.60%) Hmean disk-49 218424.96 ( 0.00%) 193421.05 * -11.45%* Hmean disk-73 286649.22 ( 0.00%) 240131.58 * -16.23%* Hmean disk-97 349339.74 ( 0.00%) 285854.62 * -18.17%* Hmean disk-121 373456.79 ( 0.00%) 309199.32 * -17.21%* Hmean disk-145 399449.04 ( 0.00%) 330547.11 * -17.25%* Hmean disk-169 420049.71 ( 0.00%) 345132.74 * -17.84%* Hmean disk-193 458795.56 ( 0.00%) 375243.03 * -18.21%* Stddev disk-1 102.32 ( 0.00%) 117.93 ( -15.25%) Stddev disk-25 6981.08 ( 0.00%) 4285.17 ( 38.62%) Stddev disk-49 9523.63 ( 0.00%) 10723.44 ( -12.60%) Stddev disk-73 9704.22 ( 0.00%) 7946.73 ( 18.11%) Stddev disk-97 10059.91 ( 0.00%) 6111.28 ( 39.25%) Stddev disk-121 5730.56 ( 0.00%) 11714.34 (-104.42%) Stddev disk-145 11154.40 ( 0.00%) 8129.06 ( 27.12%) Stddev disk-169 4477.30 ( 0.00%) 3558.86 ( 20.51%) Stddev disk-193 8785.70 ( 0.00%) 13258.89 ( -50.91%) reaim-io-disk-large-xfs Hmean disk-1 722.72 ( 0.00%) 721.85 ( -0.12%) Hmean disk-25 24177.95 ( 0.00%) 24319.06 ( 0.58%) Hmean disk-49 35294.12 ( 0.00%) 34361.85 ( -2.64%) Hmean disk-73 43042.45 ( 0.00%) 40896.36 * -4.99%* Hmean disk-97 48403.19 ( 0.00%) 46044.30 * -4.87%* Hmean disk-121 52230.22 ( 0.00%) 49347.47 * -5.52%* Hmean disk-145 54613.94 ( 0.00%) 52333.98 * -4.17%* Hmean disk-169 57178.30 ( 0.00%) 54745.71 * -4.25%* Hmean disk-193 60230.94 ( 0.00%) 57106.22 * -5.19%* Stddev disk-1 18.74 ( 0.00%) 30.19 ( -61.11%) Stddev disk-25 439.49 ( 0.00%) 809.58 ( -84.21%) Stddev disk-49 1416.65 ( 0.00%) 603.55 ( 57.40%) Stddev disk-73 949.45 ( 0.00%) 584.61 ( 38.43%) Stddev disk-97 689.51 ( 0.00%) 602.76 ( 12.58%) Stddev disk-121 485.22 ( 0.00%) 612.79 ( -26.29%) Stddev disk-145 147.37 ( 0.00%) 442.99 (-200.60%) Stddev disk-169 282.25 ( 0.00%) 613.42 (-117.33%) Stddev disk-193 970.05 ( 0.00%) 572.59 ( 40.97%) Note that numbers behind dash (disk-xx) denote the number of reaim "clients" - i.e., the number of processes reaim runs in parallel. This reaim workload will create quite some small files, fsync them, do a few operations on them like read, write, etc. and quickly delete them. This happens in many processes in parallel so I can imagine the unlink workload is rather heavy. To reproduce the workload the easiest is probably to clone mmtests [1], there's README.md and docs/Tutorial.md about how to run tests. Relevant configurations are in configs/config-reaim-io-disk-large and configs/config-reaim-io-disk, you will need to edit these files to set appropriate test partition (will get wiped) and filesystem. Note that mmtests do somewhat modify standard reaim benchmark so that it does not call system("sync") after each operation (you can see modifications we apply in [2] if you're interested). I can probably find some time to better understand what's going on on these machines later this week but if you have some suggestions what to look for, you're welcome. Honza [1] https://github.com/gormanm/mmtests [2] https://github.com/gormanm/mmtests/blob/master/shellpack_src/src/reaim/reaim-install -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Performance regression with async inode inactivation 2021-10-04 10:06 Performance regression with async inode inactivation Jan Kara @ 2021-10-04 18:17 ` Darrick J. Wong 2021-10-04 21:15 ` Dave Chinner 1 sibling, 0 replies; 12+ messages in thread From: Darrick J. Wong @ 2021-10-04 18:17 UTC (permalink / raw) To: Jan Kara; +Cc: linux-xfs, Dave Chinner On Mon, Oct 04, 2021 at 12:06:53PM +0200, Jan Kara wrote: > Hello, > > our performance testing grid has detected a performance regression caused > by commit ab23a77687 ("xfs: per-cpu deferred inode inactivation queues") > with reaim benchmark running 'disk' and 'disk-large' workloads. The > regression has been so far detected on two machines - marvin7 (48 cpus, 64 > GB ram, SATA SSD), dobby (64 cpus, 192 GB ram, rotating disk behind > megaraid_sas controller). I'll think about this some more, but for now my immediate reaction is to point out the previous discussion about aim7 reporting a similar (percentage point) regression: https://lore.kernel.org/lkml/20210909142355.GA6270@xsang-OptiPlex-9020/ --D > > The regression reports from mmtests on marvin7 (which from my experience I > find more reliable than dobby) look like: > > reaim-io-disk-xfs > Hmean disk-1 3597.12 ( 0.00%) 3636.37 ( 1.09%) > Hmean disk-25 125000.00 ( 0.00%) 121753.25 ( -2.60%) > Hmean disk-49 218424.96 ( 0.00%) 193421.05 * -11.45%* > Hmean disk-73 286649.22 ( 0.00%) 240131.58 * -16.23%* > Hmean disk-97 349339.74 ( 0.00%) 285854.62 * -18.17%* > Hmean disk-121 373456.79 ( 0.00%) 309199.32 * -17.21%* > Hmean disk-145 399449.04 ( 0.00%) 330547.11 * -17.25%* > Hmean disk-169 420049.71 ( 0.00%) 345132.74 * -17.84%* > Hmean disk-193 458795.56 ( 0.00%) 375243.03 * -18.21%* > Stddev disk-1 102.32 ( 0.00%) 117.93 ( -15.25%) > Stddev disk-25 6981.08 ( 0.00%) 4285.17 ( 38.62%) > Stddev disk-49 9523.63 ( 0.00%) 10723.44 ( -12.60%) > Stddev disk-73 9704.22 ( 0.00%) 7946.73 ( 18.11%) > Stddev disk-97 10059.91 ( 0.00%) 6111.28 ( 39.25%) > Stddev disk-121 5730.56 ( 0.00%) 11714.34 (-104.42%) > Stddev disk-145 11154.40 ( 0.00%) 8129.06 ( 27.12%) > Stddev disk-169 4477.30 ( 0.00%) 3558.86 ( 20.51%) > Stddev disk-193 8785.70 ( 0.00%) 13258.89 ( -50.91%) > > reaim-io-disk-large-xfs > Hmean disk-1 722.72 ( 0.00%) 721.85 ( -0.12%) > Hmean disk-25 24177.95 ( 0.00%) 24319.06 ( 0.58%) > Hmean disk-49 35294.12 ( 0.00%) 34361.85 ( -2.64%) > Hmean disk-73 43042.45 ( 0.00%) 40896.36 * -4.99%* > Hmean disk-97 48403.19 ( 0.00%) 46044.30 * -4.87%* > Hmean disk-121 52230.22 ( 0.00%) 49347.47 * -5.52%* > Hmean disk-145 54613.94 ( 0.00%) 52333.98 * -4.17%* > Hmean disk-169 57178.30 ( 0.00%) 54745.71 * -4.25%* > Hmean disk-193 60230.94 ( 0.00%) 57106.22 * -5.19%* > Stddev disk-1 18.74 ( 0.00%) 30.19 ( -61.11%) > Stddev disk-25 439.49 ( 0.00%) 809.58 ( -84.21%) > Stddev disk-49 1416.65 ( 0.00%) 603.55 ( 57.40%) > Stddev disk-73 949.45 ( 0.00%) 584.61 ( 38.43%) > Stddev disk-97 689.51 ( 0.00%) 602.76 ( 12.58%) > Stddev disk-121 485.22 ( 0.00%) 612.79 ( -26.29%) > Stddev disk-145 147.37 ( 0.00%) 442.99 (-200.60%) > Stddev disk-169 282.25 ( 0.00%) 613.42 (-117.33%) > Stddev disk-193 970.05 ( 0.00%) 572.59 ( 40.97%) > > Note that numbers behind dash (disk-xx) denote the number of reaim > "clients" - i.e., the number of processes reaim runs in parallel. > > This reaim workload will create quite some small files, fsync them, do a > few operations on them like read, write, etc. and quickly delete them. This > happens in many processes in parallel so I can imagine the unlink workload > is rather heavy. > > To reproduce the workload the easiest is probably to clone mmtests [1], > there's README.md and docs/Tutorial.md about how to run tests. Relevant > configurations are in configs/config-reaim-io-disk-large and > configs/config-reaim-io-disk, you will need to edit these files to set > appropriate test partition (will get wiped) and filesystem. Note that > mmtests do somewhat modify standard reaim benchmark so that it does not > call system("sync") after each operation (you can see modifications we > apply in [2] if you're interested). > > I can probably find some time to better understand what's going on on these > machines later this week but if you have some suggestions what to look for, > you're welcome. > > Honza > > [1] https://github.com/gormanm/mmtests > [2] https://github.com/gormanm/mmtests/blob/master/shellpack_src/src/reaim/reaim-install > > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Performance regression with async inode inactivation 2021-10-04 10:06 Performance regression with async inode inactivation Jan Kara 2021-10-04 18:17 ` Darrick J. Wong @ 2021-10-04 21:15 ` Dave Chinner 2021-10-05 8:11 ` Jan Kara 1 sibling, 1 reply; 12+ messages in thread From: Dave Chinner @ 2021-10-04 21:15 UTC (permalink / raw) To: Jan Kara; +Cc: linux-xfs, Dave Chinner, Darrick J. Wong On Mon, Oct 04, 2021 at 12:06:53PM +0200, Jan Kara wrote: > Hello, > > our performance testing grid has detected a performance regression caused > by commit ab23a77687 ("xfs: per-cpu deferred inode inactivation queues") > with reaim benchmark running 'disk' and 'disk-large' workloads. The > regression has been so far detected on two machines - marvin7 (48 cpus, 64 > GB ram, SATA SSD), dobby (64 cpus, 192 GB ram, rotating disk behind > megaraid_sas controller). Yup, large cpu count, single slow disk, and the cause will likely be exclusive rwsem lock contention on a directory inode that concurrent openat and unlink are occuring in. Basically, that commit removed a bunch of userspace overhead in unlinks, when mean they run as fast as the unlink() call can remove the directory entry. There is effectively nothing throttling unlink() in XFS now except for available log space and it mostly runs to completion without blocking. Hence the front end unlink performance can run in much faster bursts before delayed inactivation needs to run. Given most of the added CPU overhead is in the rwsem spin_on_owner path, it implies that the write lock holder is, indeed, not sleeping with the lock held. Hence reaim is hitting a borderline contended rwsem much harder and with different behaviour, resulting in catastrophic breakdown of lock performance and hence unlink performance goes backwards. I can't see any other new sleeping lock contention in the workload profiles - the context switch rate goes down substantially (by 35%!) with commit ab23a77687, which also implies that the lock contention is resulting in much longer spin and/or sleep times on the lock. I'm not sure we can do anything about this in the filesystem. The contended lock is a core, high level VFS lock which is the first point of unlinkat() syscall serialisation. This is the lock that is directly exposed to userspace concurrency, so the scalability of this lock determines concurrency performance of the userspace application..... > The regression reports from mmtests on marvin7 (which from my experience I > find more reliable than dobby) look like: > > reaim-io-disk-xfs > Hmean disk-1 3597.12 ( 0.00%) 3636.37 ( 1.09%) > Hmean disk-25 125000.00 ( 0.00%) 121753.25 ( -2.60%) > Hmean disk-49 218424.96 ( 0.00%) 193421.05 * -11.45%* > Hmean disk-73 286649.22 ( 0.00%) 240131.58 * -16.23%* > Hmean disk-97 349339.74 ( 0.00%) 285854.62 * -18.17%* > Hmean disk-121 373456.79 ( 0.00%) 309199.32 * -17.21%* > Hmean disk-145 399449.04 ( 0.00%) 330547.11 * -17.25%* > Hmean disk-169 420049.71 ( 0.00%) 345132.74 * -17.84%* > Hmean disk-193 458795.56 ( 0.00%) 375243.03 * -18.21%* > Stddev disk-1 102.32 ( 0.00%) 117.93 ( -15.25%) > Stddev disk-25 6981.08 ( 0.00%) 4285.17 ( 38.62%) > Stddev disk-49 9523.63 ( 0.00%) 10723.44 ( -12.60%) > Stddev disk-73 9704.22 ( 0.00%) 7946.73 ( 18.11%) > Stddev disk-97 10059.91 ( 0.00%) 6111.28 ( 39.25%) > Stddev disk-121 5730.56 ( 0.00%) 11714.34 (-104.42%) > Stddev disk-145 11154.40 ( 0.00%) 8129.06 ( 27.12%) > Stddev disk-169 4477.30 ( 0.00%) 3558.86 ( 20.51%) > Stddev disk-193 8785.70 ( 0.00%) 13258.89 ( -50.91%) > > reaim-io-disk-large-xfs > Hmean disk-1 722.72 ( 0.00%) 721.85 ( -0.12%) > Hmean disk-25 24177.95 ( 0.00%) 24319.06 ( 0.58%) > Hmean disk-49 35294.12 ( 0.00%) 34361.85 ( -2.64%) > Hmean disk-73 43042.45 ( 0.00%) 40896.36 * -4.99%* > Hmean disk-97 48403.19 ( 0.00%) 46044.30 * -4.87%* > Hmean disk-121 52230.22 ( 0.00%) 49347.47 * -5.52%* > Hmean disk-145 54613.94 ( 0.00%) 52333.98 * -4.17%* > Hmean disk-169 57178.30 ( 0.00%) 54745.71 * -4.25%* > Hmean disk-193 60230.94 ( 0.00%) 57106.22 * -5.19%* > Stddev disk-1 18.74 ( 0.00%) 30.19 ( -61.11%) > Stddev disk-25 439.49 ( 0.00%) 809.58 ( -84.21%) > Stddev disk-49 1416.65 ( 0.00%) 603.55 ( 57.40%) > Stddev disk-73 949.45 ( 0.00%) 584.61 ( 38.43%) > Stddev disk-97 689.51 ( 0.00%) 602.76 ( 12.58%) > Stddev disk-121 485.22 ( 0.00%) 612.79 ( -26.29%) > Stddev disk-145 147.37 ( 0.00%) 442.99 (-200.60%) > Stddev disk-169 282.25 ( 0.00%) 613.42 (-117.33%) > Stddev disk-193 970.05 ( 0.00%) 572.59 ( 40.97%) > > Note that numbers behind dash (disk-xx) denote the number of reaim > "clients" - i.e., the number of processes reaim runs in parallel. > > This reaim workload will create quite some small files, fsync them, do a > few operations on them like read, write, etc. and quickly delete them. This > happens in many processes in parallel so I can imagine the unlink workload > is rather heavy. Yup, which is why I've largely been ignoring it. reaim is a *very old* benchmark that doesn't do anything we'd normally do to scale out independent workloads effectively. > To reproduce the workload the easiest is probably to clone mmtests [1], > there's README.md and docs/Tutorial.md about how to run tests. Relevant > configurations are in configs/config-reaim-io-disk-large and > configs/config-reaim-io-disk, you will need to edit these files to set > appropriate test partition (will get wiped) and filesystem. Note that > mmtests do somewhat modify standard reaim benchmark so that it does not > call system("sync") after each operation (you can see modifications we > apply in [2] if you're interested). > > I can probably find some time to better understand what's going on on these > machines later this week but if you have some suggestions what to look for, > you're welcome. You should be able to reproduce the concurrent unlink contention simply by running a bunch of concurrent unlinks in a directory and watching the directory i_rwsem get absolutely hammered up in do_unlinkat(). Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Performance regression with async inode inactivation 2021-10-04 21:15 ` Dave Chinner @ 2021-10-05 8:11 ` Jan Kara 2021-10-05 21:26 ` Dave Chinner 0 siblings, 1 reply; 12+ messages in thread From: Jan Kara @ 2021-10-05 8:11 UTC (permalink / raw) To: Dave Chinner; +Cc: Jan Kara, linux-xfs, Dave Chinner, Darrick J. Wong On Tue 05-10-21 08:15:08, Dave Chinner wrote: > On Mon, Oct 04, 2021 at 12:06:53PM +0200, Jan Kara wrote: > > Hello, > > > > our performance testing grid has detected a performance regression caused > > by commit ab23a77687 ("xfs: per-cpu deferred inode inactivation queues") > > with reaim benchmark running 'disk' and 'disk-large' workloads. The > > regression has been so far detected on two machines - marvin7 (48 cpus, 64 > > GB ram, SATA SSD), dobby (64 cpus, 192 GB ram, rotating disk behind > > megaraid_sas controller). > > Yup, large cpu count, single slow disk, and the cause will likely be > exclusive rwsem lock contention on a directory inode that concurrent > openat and unlink are occuring in. > > Basically, that commit removed a bunch of userspace overhead in > unlinks, when mean they run as fast as the unlink() call can remove > the directory entry. There is effectively nothing throttling > unlink() in XFS now except for available log space and it mostly > runs to completion without blocking. Hence the front end unlink > performance can run in much faster bursts before delayed > inactivation needs to run. > > Given most of the added CPU overhead is in the rwsem spin_on_owner > path, it implies that the write lock holder is, indeed, not sleeping > with the lock held. Hence reaim is hitting a borderline contended > rwsem much harder and with different behaviour, resulting in > catastrophic breakdown of lock performance and hence unlink > performance goes backwards. > > I can't see any other new sleeping lock contention in the workload > profiles - the context switch rate goes down substantially (by 35%!) > with commit ab23a77687, which also implies that the lock contention > is resulting in much longer spin and/or sleep times on the lock. > > I'm not sure we can do anything about this in the filesystem. The > contended lock is a core, high level VFS lock which is the first > point of unlinkat() syscall serialisation. This is the lock that is > directly exposed to userspace concurrency, so the scalability of > this lock determines concurrency performance of the userspace > application..... Thanks for explanation! It makes sense, except one difference I can see in vmstat on both marvin7 and dobby which I don't understand: Dobby: Ops Sector Reads 1009081.00 1009081.00 Ops Sector Writes 11550795.00 18753764.00 Marvin7: Ops Sector Reads 887951.00 887951.00 Ops Sector Writes 8248822.00 11135086.00 So after the change reaim ends up doing noticeably more writes. I had a look at iostat comparison as well but there wasn't anything particular standing out besides higher amount of writes on the test disk. I guess, I'll limit the number of clients to a single number showing the regression, enable some more detailed monitoring and see whether something interesting pops up. > > The regression reports from mmtests on marvin7 (which from my experience I > > find more reliable than dobby) look like: > > > > reaim-io-disk-xfs > > Hmean disk-1 3597.12 ( 0.00%) 3636.37 ( 1.09%) > > Hmean disk-25 125000.00 ( 0.00%) 121753.25 ( -2.60%) > > Hmean disk-49 218424.96 ( 0.00%) 193421.05 * -11.45%* > > Hmean disk-73 286649.22 ( 0.00%) 240131.58 * -16.23%* > > Hmean disk-97 349339.74 ( 0.00%) 285854.62 * -18.17%* > > Hmean disk-121 373456.79 ( 0.00%) 309199.32 * -17.21%* > > Hmean disk-145 399449.04 ( 0.00%) 330547.11 * -17.25%* > > Hmean disk-169 420049.71 ( 0.00%) 345132.74 * -17.84%* > > Hmean disk-193 458795.56 ( 0.00%) 375243.03 * -18.21%* > > Stddev disk-1 102.32 ( 0.00%) 117.93 ( -15.25%) > > Stddev disk-25 6981.08 ( 0.00%) 4285.17 ( 38.62%) > > Stddev disk-49 9523.63 ( 0.00%) 10723.44 ( -12.60%) > > Stddev disk-73 9704.22 ( 0.00%) 7946.73 ( 18.11%) > > Stddev disk-97 10059.91 ( 0.00%) 6111.28 ( 39.25%) > > Stddev disk-121 5730.56 ( 0.00%) 11714.34 (-104.42%) > > Stddev disk-145 11154.40 ( 0.00%) 8129.06 ( 27.12%) > > Stddev disk-169 4477.30 ( 0.00%) 3558.86 ( 20.51%) > > Stddev disk-193 8785.70 ( 0.00%) 13258.89 ( -50.91%) > > > > reaim-io-disk-large-xfs > > Hmean disk-1 722.72 ( 0.00%) 721.85 ( -0.12%) > > Hmean disk-25 24177.95 ( 0.00%) 24319.06 ( 0.58%) > > Hmean disk-49 35294.12 ( 0.00%) 34361.85 ( -2.64%) > > Hmean disk-73 43042.45 ( 0.00%) 40896.36 * -4.99%* > > Hmean disk-97 48403.19 ( 0.00%) 46044.30 * -4.87%* > > Hmean disk-121 52230.22 ( 0.00%) 49347.47 * -5.52%* > > Hmean disk-145 54613.94 ( 0.00%) 52333.98 * -4.17%* > > Hmean disk-169 57178.30 ( 0.00%) 54745.71 * -4.25%* > > Hmean disk-193 60230.94 ( 0.00%) 57106.22 * -5.19%* > > Stddev disk-1 18.74 ( 0.00%) 30.19 ( -61.11%) > > Stddev disk-25 439.49 ( 0.00%) 809.58 ( -84.21%) > > Stddev disk-49 1416.65 ( 0.00%) 603.55 ( 57.40%) > > Stddev disk-73 949.45 ( 0.00%) 584.61 ( 38.43%) > > Stddev disk-97 689.51 ( 0.00%) 602.76 ( 12.58%) > > Stddev disk-121 485.22 ( 0.00%) 612.79 ( -26.29%) > > Stddev disk-145 147.37 ( 0.00%) 442.99 (-200.60%) > > Stddev disk-169 282.25 ( 0.00%) 613.42 (-117.33%) > > Stddev disk-193 970.05 ( 0.00%) 572.59 ( 40.97%) > > > > Note that numbers behind dash (disk-xx) denote the number of reaim > > "clients" - i.e., the number of processes reaim runs in parallel. > > > > This reaim workload will create quite some small files, fsync them, do a > > few operations on them like read, write, etc. and quickly delete them. This > > happens in many processes in parallel so I can imagine the unlink workload > > is rather heavy. > > Yup, which is why I've largely been ignoring it. reaim is a *very > old* benchmark that doesn't do anything we'd normally do to scale > out independent workloads effectively. Yes, I agree reaim is a stupid workload for today's standards so I'm not much worried about it as such. Just occasionally it tends to highlight problems that are valid also for more relevant workloads (just less visible there) so we still run it. > > To reproduce the workload the easiest is probably to clone mmtests [1], > > there's README.md and docs/Tutorial.md about how to run tests. Relevant > > configurations are in configs/config-reaim-io-disk-large and > > configs/config-reaim-io-disk, you will need to edit these files to set > > appropriate test partition (will get wiped) and filesystem. Note that > > mmtests do somewhat modify standard reaim benchmark so that it does not > > call system("sync") after each operation (you can see modifications we > > apply in [2] if you're interested). > > > > I can probably find some time to better understand what's going on on these > > machines later this week but if you have some suggestions what to look for, > > you're welcome. > > You should be able to reproduce the concurrent unlink contention > simply by running a bunch of concurrent unlinks in a directory and > watching the directory i_rwsem get absolutely hammered up in > do_unlinkat(). OK, thanks for suggestion. I can try that as well. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Performance regression with async inode inactivation 2021-10-05 8:11 ` Jan Kara @ 2021-10-05 21:26 ` Dave Chinner 2021-10-06 18:10 ` Jan Kara 0 siblings, 1 reply; 12+ messages in thread From: Dave Chinner @ 2021-10-05 21:26 UTC (permalink / raw) To: Jan Kara; +Cc: linux-xfs, Dave Chinner, Darrick J. Wong On Tue, Oct 05, 2021 at 10:11:57AM +0200, Jan Kara wrote: > On Tue 05-10-21 08:15:08, Dave Chinner wrote: > > On Mon, Oct 04, 2021 at 12:06:53PM +0200, Jan Kara wrote: > > > Hello, > > > > > > our performance testing grid has detected a performance regression caused > > > by commit ab23a77687 ("xfs: per-cpu deferred inode inactivation queues") > > > with reaim benchmark running 'disk' and 'disk-large' workloads. The > > > regression has been so far detected on two machines - marvin7 (48 cpus, 64 > > > GB ram, SATA SSD), dobby (64 cpus, 192 GB ram, rotating disk behind > > > megaraid_sas controller). > > > > Yup, large cpu count, single slow disk, and the cause will likely be > > exclusive rwsem lock contention on a directory inode that concurrent > > openat and unlink are occuring in. > > > > Basically, that commit removed a bunch of userspace overhead in > > unlinks, when mean they run as fast as the unlink() call can remove > > the directory entry. There is effectively nothing throttling > > unlink() in XFS now except for available log space and it mostly > > runs to completion without blocking. Hence the front end unlink > > performance can run in much faster bursts before delayed > > inactivation needs to run. > > > > Given most of the added CPU overhead is in the rwsem spin_on_owner > > path, it implies that the write lock holder is, indeed, not sleeping > > with the lock held. Hence reaim is hitting a borderline contended > > rwsem much harder and with different behaviour, resulting in > > catastrophic breakdown of lock performance and hence unlink > > performance goes backwards. > > > > I can't see any other new sleeping lock contention in the workload > > profiles - the context switch rate goes down substantially (by 35%!) > > with commit ab23a77687, which also implies that the lock contention > > is resulting in much longer spin and/or sleep times on the lock. > > > > I'm not sure we can do anything about this in the filesystem. The > > contended lock is a core, high level VFS lock which is the first > > point of unlinkat() syscall serialisation. This is the lock that is > > directly exposed to userspace concurrency, so the scalability of > > this lock determines concurrency performance of the userspace > > application..... > > Thanks for explanation! It makes sense, except one difference I can see in > vmstat on both marvin7 and dobby which I don't understand: > > Dobby: > Ops Sector Reads 1009081.00 1009081.00 > Ops Sector Writes 11550795.00 18753764.00 > > Marvin7: > Ops Sector Reads 887951.00 887951.00 > Ops Sector Writes 8248822.00 11135086.00 > > So after the change reaim ends up doing noticeably more writes. I had a > look at iostat comparison as well but there wasn't anything particular > standing out besides higher amount of writes on the test disk. I guess, > I'll limit the number of clients to a single number showing the regression, > enable some more detailed monitoring and see whether something interesting > pops up. Interesting. There weren't iostats in the original intel profiles given. I can see a couple of vmstats that give some indications - vmstat.io.bo went up from ~2500 to ~6000, and proc-vmstat.pgpgout went up from ~90k to 250k. Looking at another more recent profile, there are more IO related stats in the output vmstat.nr_written went up by 2.5x and vmstat.pgpgout went up by a factor of 6 (50k -> 300k) but otherwise everything else was fairly constant in the VM. The resident size of the file cache is small, and vmstat.nr_dirtied went up by a small ammount by it's 4 orders of magnitude larger than nr_written. Hmmm. That implies a *lot* of overwrite of cached files. I wonder if we've just changed the memory pressure enough to trigger more frequent writeback? We're delaying the inactivation (and hence page cache invalidation) of up to 256 inodes per CPU, and the number of cached+dirty inodes appears to have increased a small amount (from ~3000 to ~4000). With slow disks, a small change in writeback behaviour could cause seek-bound related performance regressions. Also worth noting is that there's been some recent variance in reaim numbers recently because of the journal FUA/flush optimisations we've made. Some machines report +20% from that change, some report -20%, and there's no pattern to it. It's just another indication that the reaim scalability and perf on these large CPU count, single spinning disk setups is highly dependent on disk performance and seek optimisation... Have you run any tests on a system that isn't stupidly overpowered for it's disk subsystem? e.g. has an SSD rather than spinning rust? > > > The regression reports from mmtests on marvin7 (which from my experience I > > > find more reliable than dobby) look like: > > > > > > reaim-io-disk-xfs > > > Hmean disk-1 3597.12 ( 0.00%) 3636.37 ( 1.09%) > > > Hmean disk-25 125000.00 ( 0.00%) 121753.25 ( -2.60%) > > > Hmean disk-49 218424.96 ( 0.00%) 193421.05 * -11.45%* > > > Hmean disk-73 286649.22 ( 0.00%) 240131.58 * -16.23%* > > > Hmean disk-97 349339.74 ( 0.00%) 285854.62 * -18.17%* > > > Hmean disk-121 373456.79 ( 0.00%) 309199.32 * -17.21%* > > > Hmean disk-145 399449.04 ( 0.00%) 330547.11 * -17.25%* > > > Hmean disk-169 420049.71 ( 0.00%) 345132.74 * -17.84%* > > > Hmean disk-193 458795.56 ( 0.00%) 375243.03 * -18.21%* > > > Stddev disk-1 102.32 ( 0.00%) 117.93 ( -15.25%) > > > Stddev disk-25 6981.08 ( 0.00%) 4285.17 ( 38.62%) > > > Stddev disk-49 9523.63 ( 0.00%) 10723.44 ( -12.60%) > > > Stddev disk-73 9704.22 ( 0.00%) 7946.73 ( 18.11%) > > > Stddev disk-97 10059.91 ( 0.00%) 6111.28 ( 39.25%) > > > Stddev disk-121 5730.56 ( 0.00%) 11714.34 (-104.42%) > > > Stddev disk-145 11154.40 ( 0.00%) 8129.06 ( 27.12%) > > > Stddev disk-169 4477.30 ( 0.00%) 3558.86 ( 20.51%) > > > Stddev disk-193 8785.70 ( 0.00%) 13258.89 ( -50.91%) > > > > > > reaim-io-disk-large-xfs > > > Hmean disk-1 722.72 ( 0.00%) 721.85 ( -0.12%) > > > Hmean disk-25 24177.95 ( 0.00%) 24319.06 ( 0.58%) > > > Hmean disk-49 35294.12 ( 0.00%) 34361.85 ( -2.64%) > > > Hmean disk-73 43042.45 ( 0.00%) 40896.36 * -4.99%* > > > Hmean disk-97 48403.19 ( 0.00%) 46044.30 * -4.87%* > > > Hmean disk-121 52230.22 ( 0.00%) 49347.47 * -5.52%* > > > Hmean disk-145 54613.94 ( 0.00%) 52333.98 * -4.17%* > > > Hmean disk-169 57178.30 ( 0.00%) 54745.71 * -4.25%* > > > Hmean disk-193 60230.94 ( 0.00%) 57106.22 * -5.19%* > > > Stddev disk-1 18.74 ( 0.00%) 30.19 ( -61.11%) > > > Stddev disk-25 439.49 ( 0.00%) 809.58 ( -84.21%) > > > Stddev disk-49 1416.65 ( 0.00%) 603.55 ( 57.40%) > > > Stddev disk-73 949.45 ( 0.00%) 584.61 ( 38.43%) > > > Stddev disk-97 689.51 ( 0.00%) 602.76 ( 12.58%) > > > Stddev disk-121 485.22 ( 0.00%) 612.79 ( -26.29%) > > > Stddev disk-145 147.37 ( 0.00%) 442.99 (-200.60%) > > > Stddev disk-169 282.25 ( 0.00%) 613.42 (-117.33%) > > > Stddev disk-193 970.05 ( 0.00%) 572.59 ( 40.97%) > > > > > > Note that numbers behind dash (disk-xx) denote the number of reaim > > > "clients" - i.e., the number of processes reaim runs in parallel. > > > > > > This reaim workload will create quite some small files, fsync them, do a > > > few operations on them like read, write, etc. and quickly delete them. This > > > happens in many processes in parallel so I can imagine the unlink workload > > > is rather heavy. > > > > Yup, which is why I've largely been ignoring it. reaim is a *very > > old* benchmark that doesn't do anything we'd normally do to scale > > out independent workloads effectively. > > Yes, I agree reaim is a stupid workload for today's standards so I'm not > much worried about it as such. Just occasionally it tends to highlight > problems that are valid also for more relevant workloads (just less visible > there) so we still run it. *nod* It has worked well as a canary in past times, but recently I'm finding we're getting more noise than useful signal from these unbalanced "all CPU, no IO" test machine configs these days.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Performance regression with async inode inactivation 2021-10-05 21:26 ` Dave Chinner @ 2021-10-06 18:10 ` Jan Kara 2021-10-06 21:58 ` Dave Chinner 0 siblings, 1 reply; 12+ messages in thread From: Jan Kara @ 2021-10-06 18:10 UTC (permalink / raw) To: Dave Chinner; +Cc: Jan Kara, linux-xfs, Dave Chinner, Darrick J. Wong [-- Attachment #1: Type: text/plain, Size: 6023 bytes --] On Wed 06-10-21 08:26:08, Dave Chinner wrote: > On Tue, Oct 05, 2021 at 10:11:57AM +0200, Jan Kara wrote: > > On Tue 05-10-21 08:15:08, Dave Chinner wrote: > > > On Mon, Oct 04, 2021 at 12:06:53PM +0200, Jan Kara wrote: > > > > Hello, > > > > > > > > our performance testing grid has detected a performance regression caused > > > > by commit ab23a77687 ("xfs: per-cpu deferred inode inactivation queues") > > > > with reaim benchmark running 'disk' and 'disk-large' workloads. The > > > > regression has been so far detected on two machines - marvin7 (48 cpus, 64 > > > > GB ram, SATA SSD), dobby (64 cpus, 192 GB ram, rotating disk behind > > > > megaraid_sas controller). > > > > > > Yup, large cpu count, single slow disk, and the cause will likely be > > > exclusive rwsem lock contention on a directory inode that concurrent > > > openat and unlink are occuring in. > > > > > > Basically, that commit removed a bunch of userspace overhead in > > > unlinks, when mean they run as fast as the unlink() call can remove > > > the directory entry. There is effectively nothing throttling > > > unlink() in XFS now except for available log space and it mostly > > > runs to completion without blocking. Hence the front end unlink > > > performance can run in much faster bursts before delayed > > > inactivation needs to run. > > > > > > Given most of the added CPU overhead is in the rwsem spin_on_owner > > > path, it implies that the write lock holder is, indeed, not sleeping > > > with the lock held. Hence reaim is hitting a borderline contended > > > rwsem much harder and with different behaviour, resulting in > > > catastrophic breakdown of lock performance and hence unlink > > > performance goes backwards. > > > > > > I can't see any other new sleeping lock contention in the workload > > > profiles - the context switch rate goes down substantially (by 35%!) > > > with commit ab23a77687, which also implies that the lock contention > > > is resulting in much longer spin and/or sleep times on the lock. > > > > > > I'm not sure we can do anything about this in the filesystem. The > > > contended lock is a core, high level VFS lock which is the first > > > point of unlinkat() syscall serialisation. This is the lock that is > > > directly exposed to userspace concurrency, so the scalability of > > > this lock determines concurrency performance of the userspace > > > application..... > > > > Thanks for explanation! It makes sense, except one difference I can see in > > vmstat on both marvin7 and dobby which I don't understand: > > > > Dobby: > > Ops Sector Reads 1009081.00 1009081.00 > > Ops Sector Writes 11550795.00 18753764.00 > > > > Marvin7: > > Ops Sector Reads 887951.00 887951.00 > > Ops Sector Writes 8248822.00 11135086.00 > > > > So after the change reaim ends up doing noticeably more writes. I had a > > look at iostat comparison as well but there wasn't anything particular > > standing out besides higher amount of writes on the test disk. I guess, > > I'll limit the number of clients to a single number showing the regression, > > enable some more detailed monitoring and see whether something interesting > > pops up. > > Interesting. > > There weren't iostats in the original intel profiles given. I > can see a couple of vmstats that give some indications - > vmstat.io.bo went up from ~2500 to ~6000, and proc-vmstat.pgpgout > went up from ~90k to 250k. > > Looking at another more recent profile, there are more IO related > stats in the output vmstat.nr_written went up by 2.5x and > vmstat.pgpgout went up by a factor of 6 (50k -> 300k) but otherwise > everything else was fairly constant in the VM. The resident size of > the file cache is small, and vmstat.nr_dirtied went up by a small > ammount by it's 4 orders of magnitude larger than nr_written. > > Hmmm. That implies a *lot* of overwrite of cached files. > > I wonder if we've just changed the memory pressure enough to trigger > more frequent writeback? We're delaying the inactivation (and hence > page cache invalidation) of up to 256 inodes per CPU, and the number > of cached+dirty inodes appears to have increased a small amount > (from ~3000 to ~4000). With slow disks, a small change in writeback > behaviour could cause seek-bound related performance regressions. > > Also worth noting is that there's been some recent variance in reaim > numbers recently because of the journal FUA/flush optimisations > we've made. Some machines report +20% from that change, some report > -20%, and there's no pattern to it. It's just another indication > that the reaim scalability and perf on these large CPU count, single > spinning disk setups is highly dependent on disk performance and > seek optimisation... > > Have you run any tests on a system that isn't stupidly overpowered > for it's disk subsystem? e.g. has an SSD rather than spinning rust? So marvin7 actually has SSD. I was experimenting some more. Attached is a simple reproducer that demonstrates the issue for me - it just creates 16k file, fsync it, delete it in a loop from given number processes (I run with 48). The reproducer runs ~25% slower after the commit ab23a77687. Note that the reproducer makes each process use a different directory so i_rwsem contention is out of question. From blktrace I can see that indeed after the commit we do ~25% more writes. Each stress-unlink process does the same amount of IO, the extra IO comes solely from the worker threads. Also I'd note that before the commit we were reusing blocks much more (likely inode blocks getting reused) - before the commit we write to ~88 MB worth of distinct disk blocks, after the commit we write to ~296 MB worth of distinct disk blocks. That's understandable given the delayed inode freeing but it's one thing that could possibly have effect. That's how far I've come today, I'll dig more tomorrow. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR [-- Attachment #2: stress-unlink.c --] [-- Type: text/x-c, Size: 2897 bytes --] #include <stdio.h> #include <fcntl.h> #include <stdlib.h> #include <errno.h> #include <string.h> #include <signal.h> #include <unistd.h> #include <sys/wait.h> #include <sys/mman.h> #include <sys/stat.h> #include <sys/time.h> #define COUNT 100 #define SIZE 16384 #define MAX_PROCS 1024 char shm_name[64]; char wbuf[SIZE]; void prepare(char *base, int num) { char dir[128]; sprintf(dir, "%s/dir-%d", base, num); if (mkdir(dir, 0700) < 0 && errno != EEXIST) { perror("mkdir"); exit(1); } } void teardown(char *base, int num) { char dir[128]; sprintf(dir, "%s/dir-%d", base, num); rmdir(dir); } void run_test(char *base, int num) { char name[128]; int shmfd; int i; int *shm; int fd; sprintf(name, "%s/dir-%d/file", base, num); shmfd = shm_open(shm_name, O_RDWR, 0); if (shmfd < 0) { perror("shm_open"); exit(1); } shm = mmap(NULL, sizeof(int) * MAX_PROCS, PROT_READ | PROT_WRITE, MAP_SHARED, shmfd, 0); if (shm == MAP_FAILED) { perror("mmap"); exit(1); } shm[num + 1] = 1; while (shm[0] == 0) usleep(1); for (i = 0; i < COUNT; i++) { fd = open(name, O_CREAT | O_WRONLY, 0644); if (fd < 0) { perror("open"); exit(1); } if (write(fd, wbuf, SIZE) != SIZE) { perror("pwrite"); exit(1); } fsync(fd); close(fd); unlink(name); } } int main(int argc, char **argv) { int procs, i, j; pid_t pids[MAX_PROCS]; int shmfd; int *shm; struct timeval start, end; long long ms; if (argc != 3) { fprintf(stderr, "Usage: stress-unlink <processes> <dir>\n"); return 1; } procs = strtol(argv[1], NULL, 10); if (procs > MAX_PROCS) { fprintf(stderr, "Too many processes!\n"); return 1; } memset(wbuf, 0xcc, SIZE); sprintf(shm_name, "/stress-unlink-%u", getpid()); shmfd = shm_open(shm_name, O_CREAT | O_RDWR, 0600); if (shmfd < 0) { perror("shm_open"); return 1; } if (ftruncate(shmfd, sizeof(int) * MAX_PROCS) < 0) { perror("ftruncate shm"); return 1; } shm = mmap(NULL, sizeof(int) * MAX_PROCS, PROT_READ | PROT_WRITE, MAP_SHARED, shmfd, 0); if (shm == MAP_FAILED) { perror("mmap"); return 1; } shm[0] = 0; for (i = 0; i < procs; i++) { shm[i+1] = 0; prepare(argv[2], i); } for (i = 0; i < procs; i++) { pids[i] = fork(); if (pids[i] < 0) { perror("fork"); for (j = 0; j < i; j++) kill(pids[j], SIGKILL); return 1; } if (pids[i] == 0) { run_test(argv[2], i); exit(0); } } do { for (i = 0; i < procs && shm[i + 1]; i++); } while (i != procs); gettimeofday(&start, NULL); shm[0] = 1; fprintf(stderr, "Processes started.\n"); for (i = 0; i < procs; i++) waitpid(pids[i], NULL, 0); gettimeofday(&end, NULL); for (i = 0; i < procs; i++) teardown(argv[2], i); shm_unlink(shm_name); ms = (((long long)(end.tv_sec - start.tv_sec) * 1000000) + (end.tv_usec - start.tv_usec)) / 1000; printf("%lld.%03lld\n", ms/1000, ms%1000); return 0; } ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Performance regression with async inode inactivation 2021-10-06 18:10 ` Jan Kara @ 2021-10-06 21:58 ` Dave Chinner 2021-10-07 12:03 ` Jan Kara 0 siblings, 1 reply; 12+ messages in thread From: Dave Chinner @ 2021-10-06 21:58 UTC (permalink / raw) To: Jan Kara; +Cc: linux-xfs, Dave Chinner, Darrick J. Wong On Wed, Oct 06, 2021 at 08:10:01PM +0200, Jan Kara wrote: > On Wed 06-10-21 08:26:08, Dave Chinner wrote: > > On Tue, Oct 05, 2021 at 10:11:57AM +0200, Jan Kara wrote: > > > On Tue 05-10-21 08:15:08, Dave Chinner wrote: > > > > On Mon, Oct 04, 2021 at 12:06:53PM +0200, Jan Kara wrote: > > > > > Hello, > > > > > > > > > > our performance testing grid has detected a performance regression caused > > > > > by commit ab23a77687 ("xfs: per-cpu deferred inode inactivation queues") > > > > > with reaim benchmark running 'disk' and 'disk-large' workloads. The > > > > > regression has been so far detected on two machines - marvin7 (48 cpus, 64 > > > > > GB ram, SATA SSD), dobby (64 cpus, 192 GB ram, rotating disk behind > > > > > megaraid_sas controller). > > > > > > > > Yup, large cpu count, single slow disk, and the cause will likely be > > > > exclusive rwsem lock contention on a directory inode that concurrent > > > > openat and unlink are occuring in. > > > > > > > > Basically, that commit removed a bunch of userspace overhead in > > > > unlinks, when mean they run as fast as the unlink() call can remove > > > > the directory entry. There is effectively nothing throttling > > > > unlink() in XFS now except for available log space and it mostly > > > > runs to completion without blocking. Hence the front end unlink > > > > performance can run in much faster bursts before delayed > > > > inactivation needs to run. > > > > > > > > Given most of the added CPU overhead is in the rwsem spin_on_owner > > > > path, it implies that the write lock holder is, indeed, not sleeping > > > > with the lock held. Hence reaim is hitting a borderline contended > > > > rwsem much harder and with different behaviour, resulting in > > > > catastrophic breakdown of lock performance and hence unlink > > > > performance goes backwards. > > > > > > > > I can't see any other new sleeping lock contention in the workload > > > > profiles - the context switch rate goes down substantially (by 35%!) > > > > with commit ab23a77687, which also implies that the lock contention > > > > is resulting in much longer spin and/or sleep times on the lock. > > > > > > > > I'm not sure we can do anything about this in the filesystem. The > > > > contended lock is a core, high level VFS lock which is the first > > > > point of unlinkat() syscall serialisation. This is the lock that is > > > > directly exposed to userspace concurrency, so the scalability of > > > > this lock determines concurrency performance of the userspace > > > > application..... > > > > > > Thanks for explanation! It makes sense, except one difference I can see in > > > vmstat on both marvin7 and dobby which I don't understand: > > > > > > Dobby: > > > Ops Sector Reads 1009081.00 1009081.00 > > > Ops Sector Writes 11550795.00 18753764.00 > > > > > > Marvin7: > > > Ops Sector Reads 887951.00 887951.00 > > > Ops Sector Writes 8248822.00 11135086.00 > > > > > > So after the change reaim ends up doing noticeably more writes. I had a > > > look at iostat comparison as well but there wasn't anything particular > > > standing out besides higher amount of writes on the test disk. I guess, > > > I'll limit the number of clients to a single number showing the regression, > > > enable some more detailed monitoring and see whether something interesting > > > pops up. > > > > Interesting. > > > > There weren't iostats in the original intel profiles given. I > > can see a couple of vmstats that give some indications - > > vmstat.io.bo went up from ~2500 to ~6000, and proc-vmstat.pgpgout > > went up from ~90k to 250k. > > > > Looking at another more recent profile, there are more IO related > > stats in the output vmstat.nr_written went up by 2.5x and > > vmstat.pgpgout went up by a factor of 6 (50k -> 300k) but otherwise > > everything else was fairly constant in the VM. The resident size of > > the file cache is small, and vmstat.nr_dirtied went up by a small > > ammount by it's 4 orders of magnitude larger than nr_written. > > > > Hmmm. That implies a *lot* of overwrite of cached files. > > > > I wonder if we've just changed the memory pressure enough to trigger > > more frequent writeback? We're delaying the inactivation (and hence > > page cache invalidation) of up to 256 inodes per CPU, and the number > > of cached+dirty inodes appears to have increased a small amount > > (from ~3000 to ~4000). With slow disks, a small change in writeback > > behaviour could cause seek-bound related performance regressions. > > > > Also worth noting is that there's been some recent variance in reaim > > numbers recently because of the journal FUA/flush optimisations > > we've made. Some machines report +20% from that change, some report > > -20%, and there's no pattern to it. It's just another indication > > that the reaim scalability and perf on these large CPU count, single > > spinning disk setups is highly dependent on disk performance and > > seek optimisation... > > > > Have you run any tests on a system that isn't stupidly overpowered > > for it's disk subsystem? e.g. has an SSD rather than spinning rust? > > So marvin7 actually has SSD. I was experimenting some more. Attached is a > simple reproducer that demonstrates the issue for me - it just creates 16k > file, fsync it, delete it in a loop from given number processes (I run with > 48). The reproducer runs ~25% slower after the commit ab23a77687. Note that > the reproducer makes each process use a different directory so i_rwsem > contention is out of question. > > From blktrace I can see that indeed after the commit we do ~25% more > writes. Each stress-unlink process does the same amount of IO, the extra > IO comes solely from the worker threads. So the question is "which worker thread?". > Also I'd note that before the > commit we were reusing blocks much more (likely inode blocks getting > reused) - before the commit we write to ~88 MB worth of distinct disk > blocks, after the commit we write to ~296 MB worth of distinct disk blocks. Hmmm, I didn't see this at first. What's the filesystem layout (xfs_info) and the CLI for the test that you ran? How many CPUs was the test run on? Running a test with 100 procs across 32p, so creating 100 dirs and 10000 files, on a 1.4TB SSD: $ xfs_info /mnt/scratch meta-data=/dev/mapper/fast isize=512 agcount=67, agsize=5467072 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=0 = reflink=1 data = bsize=4096 blocks=366292480, imaxpct=5 = sunit=64 swidth=256 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=512000, version=2 = sectsz=512 sunit=1 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 $ ./stress-unlink 100 /mnt/scratch 0.301 $ The total number of inode cluster writes is: $ pminfo -f xfs.icluster_flushcnt xfs.icluster_flushcnt value 68 That's 68 * 32kB or about 2MB. [pminfo is part of Performance Co-Pilot - https://pcp.io/] We did this many user writes: $ pminfo -f xfs.write xfs.write_bytes xfs.write value 10000 xfs.write_bytes value 163840000 which is 160MB in 10000 writes. Metadata writeback was: $ pminfo -f xfs.log_tail.push_ail.success xfs.log_tail.push_ail.success value 473 473 buffers, part of which was 68 inode clusters so the total is about ~400 x 4kB + 2MB = 3.6MB. So, really, there isn't a potential for 25% growth in these numbers that would make any sort of difference. The only remaining source of IO differential is log writes: $ pminfo -f xfs.log xfs.log.writes value 1055 xfs.log.blocks value 28212 .... Which indicates another ~1000 IOs and ~28MB written. And that's it. We've got a total of about 195MB written to disk for 100 concurrent runs, and it's no different before and after deferred inactivation. Other tests that I commonly run that do lots of file creates and unlinks along with sync writes (e.g. dbench) didn't show any regressions up to 512 concurrent processes, either. So there's no obvious regression on this filesystem layout.... But it made me wonder - there are two things that could influence background inactivation here: the AG count (which determines unlink concurrency) and log size (which determines transaction concurrency). So I remade the filesystem with a tiny 32MB log and 2 AGs, and that changed a -lot-: xfs.icluster_flushcnt value 40420 xfs.log_tail.push_ail.success value 46098 Yeah, we did 40,000 inode cluster writes for only 10000 inode modifications. Basically, we wrote every inode cluster once for every inode modification (post create, post write, post unlink, post deferred activation). Normally create/unlink loops the inode doesn't even touch the disk - it's cancelled in the log before it gets flushed to disk and so this is where a chunk of the difference in disk space consumed comes from. It can be expected behaviour. xfs.log_tail.push_ail.pinned value 84381 And we hit pinned items in the AIL 85,000 times instead of 0. This triggers more log forces, but fsync is already doing that so it's not a big deal. However: xfs.log_tail.try_logspace value 80392 xfs.log_tail.sleep_logspace value 59938 75% of the operations had to wait on log space, which means it's forcing the log tail to flush metadata to make space. That's where all the inode cluster writes are coming from - the transactions are lock stepping on log space. i.e. every time we want to modify an inode, we have to flush a dirty inode to make space in the log. This also means that the CIL can't perform efficient in-memory transaction aggregation because it's always being forced out to disk. As a result: xfs.log.writes value 6504 xfs.log.blocks value 65784 6x as many log writes, for 65MB of log writes. IOWs, the smaller log reduced CIL aggregation efficiency substantially and so we wrote twice as much to the log just because we had a small log. So, the question needs to be asked: is this a function of a small log, or is this behaviour of low AG count? Increasing AG count back out to > CPU count resulted in the number of inode cluster flushes dropping to ~3000, and the log writes dropping back down to 1000 and 28MB of log writes. Performance went way up, too (from 1.8s down to 0.35s), indicating that AG concurrency is a factor here. OTOH, leaving the AG count at 2 and increasing the log back out to 2GB removed all the log space waiting, all the inode cluster flushing, and everything to do with waiting on log space. But performance barely changed (1.8s down to 1.5s) and log bandwidth *went up*: xfs.log.writes value 6528 xfs.log.blocks value 135452 6,500 log writes, 130MB written to the log. IOWs, both too-small log space and too-low AG count for the given workload concurrency will adversely affect performance of concurrent workloads. But we already knew that, didn't we? :/ Bumping the AG count from 2 to 16 and using a moderately sized log (i.e. > 100MB) largely erases the bad behaviour. Indeed, agcount=16, log size=100MB drops runtime to 0.30s and pretty much returns inode write count and log write/bw back down to minimums on this 100 process workload. But to point out that this isn't perfect for everything, if I run the same test with 1000 processes on that config, it takes 6.2s, writes 400MB to the log and flushes 153,000 inode clusters. We're back to not having enough log space for the workload. Bumping out to agcount=67 (2x CPU count) and max log size(2GB) results in runtime of 3.5s (largely linear from the 100 process count), no inode writeback and only 200MB of log throughput. So, yeah, I suspect that if you change the mkfs parameters for the reaim tests to have more AGs and significantly increased log space over the defaults, the regression should disappear... Can you run these experiments with reaim on your test machines and see if they are influenced by filesystem level concurrency parameters like AG count and log size as these tests suggest? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Performance regression with async inode inactivation 2021-10-06 21:58 ` Dave Chinner @ 2021-10-07 12:03 ` Jan Kara 2021-10-07 23:44 ` Dave Chinner 0 siblings, 1 reply; 12+ messages in thread From: Jan Kara @ 2021-10-07 12:03 UTC (permalink / raw) To: Dave Chinner; +Cc: Jan Kara, linux-xfs, Dave Chinner, Darrick J. Wong [-- Attachment #1: Type: text/plain, Size: 16977 bytes --] On Thu 07-10-21 08:58:51, Dave Chinner wrote: > On Wed, Oct 06, 2021 at 08:10:01PM +0200, Jan Kara wrote: > > On Wed 06-10-21 08:26:08, Dave Chinner wrote: > > > On Tue, Oct 05, 2021 at 10:11:57AM +0200, Jan Kara wrote: > > > > On Tue 05-10-21 08:15:08, Dave Chinner wrote: > > > > > On Mon, Oct 04, 2021 at 12:06:53PM +0200, Jan Kara wrote: > > > > > > Hello, > > > > > > > > > > > > our performance testing grid has detected a performance regression caused > > > > > > by commit ab23a77687 ("xfs: per-cpu deferred inode inactivation queues") > > > > > > with reaim benchmark running 'disk' and 'disk-large' workloads. The > > > > > > regression has been so far detected on two machines - marvin7 (48 cpus, 64 > > > > > > GB ram, SATA SSD), dobby (64 cpus, 192 GB ram, rotating disk behind > > > > > > megaraid_sas controller). > > > > > > > > > > Yup, large cpu count, single slow disk, and the cause will likely be > > > > > exclusive rwsem lock contention on a directory inode that concurrent > > > > > openat and unlink are occuring in. > > > > > > > > > > Basically, that commit removed a bunch of userspace overhead in > > > > > unlinks, when mean they run as fast as the unlink() call can remove > > > > > the directory entry. There is effectively nothing throttling > > > > > unlink() in XFS now except for available log space and it mostly > > > > > runs to completion without blocking. Hence the front end unlink > > > > > performance can run in much faster bursts before delayed > > > > > inactivation needs to run. > > > > > > > > > > Given most of the added CPU overhead is in the rwsem spin_on_owner > > > > > path, it implies that the write lock holder is, indeed, not sleeping > > > > > with the lock held. Hence reaim is hitting a borderline contended > > > > > rwsem much harder and with different behaviour, resulting in > > > > > catastrophic breakdown of lock performance and hence unlink > > > > > performance goes backwards. > > > > > > > > > > I can't see any other new sleeping lock contention in the workload > > > > > profiles - the context switch rate goes down substantially (by 35%!) > > > > > with commit ab23a77687, which also implies that the lock contention > > > > > is resulting in much longer spin and/or sleep times on the lock. > > > > > > > > > > I'm not sure we can do anything about this in the filesystem. The > > > > > contended lock is a core, high level VFS lock which is the first > > > > > point of unlinkat() syscall serialisation. This is the lock that is > > > > > directly exposed to userspace concurrency, so the scalability of > > > > > this lock determines concurrency performance of the userspace > > > > > application..... > > > > > > > > Thanks for explanation! It makes sense, except one difference I can see in > > > > vmstat on both marvin7 and dobby which I don't understand: > > > > > > > > Dobby: > > > > Ops Sector Reads 1009081.00 1009081.00 > > > > Ops Sector Writes 11550795.00 18753764.00 > > > > > > > > Marvin7: > > > > Ops Sector Reads 887951.00 887951.00 > > > > Ops Sector Writes 8248822.00 11135086.00 > > > > > > > > So after the change reaim ends up doing noticeably more writes. I had a > > > > look at iostat comparison as well but there wasn't anything particular > > > > standing out besides higher amount of writes on the test disk. I guess, > > > > I'll limit the number of clients to a single number showing the regression, > > > > enable some more detailed monitoring and see whether something interesting > > > > pops up. > > > > > > Interesting. > > > > > > There weren't iostats in the original intel profiles given. I > > > can see a couple of vmstats that give some indications - > > > vmstat.io.bo went up from ~2500 to ~6000, and proc-vmstat.pgpgout > > > went up from ~90k to 250k. > > > > > > Looking at another more recent profile, there are more IO related > > > stats in the output vmstat.nr_written went up by 2.5x and > > > vmstat.pgpgout went up by a factor of 6 (50k -> 300k) but otherwise > > > everything else was fairly constant in the VM. The resident size of > > > the file cache is small, and vmstat.nr_dirtied went up by a small > > > ammount by it's 4 orders of magnitude larger than nr_written. > > > > > > Hmmm. That implies a *lot* of overwrite of cached files. > > > > > > I wonder if we've just changed the memory pressure enough to trigger > > > more frequent writeback? We're delaying the inactivation (and hence > > > page cache invalidation) of up to 256 inodes per CPU, and the number > > > of cached+dirty inodes appears to have increased a small amount > > > (from ~3000 to ~4000). With slow disks, a small change in writeback > > > behaviour could cause seek-bound related performance regressions. > > > > > > Also worth noting is that there's been some recent variance in reaim > > > numbers recently because of the journal FUA/flush optimisations > > > we've made. Some machines report +20% from that change, some report > > > -20%, and there's no pattern to it. It's just another indication > > > that the reaim scalability and perf on these large CPU count, single > > > spinning disk setups is highly dependent on disk performance and > > > seek optimisation... > > > > > > Have you run any tests on a system that isn't stupidly overpowered > > > for it's disk subsystem? e.g. has an SSD rather than spinning rust? > > > > So marvin7 actually has SSD. I was experimenting some more. Attached is a > > simple reproducer that demonstrates the issue for me - it just creates 16k > > file, fsync it, delete it in a loop from given number processes (I run with > > 48). The reproducer runs ~25% slower after the commit ab23a77687. Note that > > the reproducer makes each process use a different directory so i_rwsem > > contention is out of question. > > > > From blktrace I can see that indeed after the commit we do ~25% more > > writes. Each stress-unlink process does the same amount of IO, the extra > > IO comes solely from the worker threads. > > So the question is "which worker thread?". Yes, I'm also wondering ;) But it's impossible to tell from block traces (there's just "kworker/xx:x" there). > > Also I'd note that before the > > commit we were reusing blocks much more (likely inode blocks getting > > reused) - before the commit we write to ~88 MB worth of distinct disk > > blocks, after the commit we write to ~296 MB worth of distinct disk blocks. > > Hmmm, I didn't see this at first. What's the filesystem layout > (xfs_info) and the CLI for the test that you ran? How many CPUs was > the test run on? The machine has 24 cores, each core has 2 SMT siblings, so 48 logical CPUs. That's why I've run stress-unlink with 48 processes. xfs_info is: meta-data=/dev/sdb1 isize=512 agcount=4, agsize=29303104 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=0, rmapbt=0 = reflink=0 data = bsize=4096 blocks=117212416, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=57232, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 This is whatever are XFS defaults (with somewhat dated xfsprogs on this machine - 4.15.0), I didn't pass any special parameters to mkfs.xfs. > Running a test with 100 procs across 32p, so creating 100 dirs and > 10000 files, on a 1.4TB SSD: > > $ xfs_info /mnt/scratch > meta-data=/dev/mapper/fast isize=512 agcount=67, agsize=5467072 blks > = sectsz=512 attr=2, projid32bit=1 > = crc=1 finobt=1, sparse=1, rmapbt=0 > = reflink=1 > data = bsize=4096 blocks=366292480, imaxpct=5 > = sunit=64 swidth=256 blks > naming =version 2 bsize=4096 ascii-ci=0, ftype=1 > log =internal log bsize=4096 blocks=512000, version=2 > = sectsz=512 sunit=1 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 > $ ./stress-unlink 100 /mnt/scratch > 0.301 > $ > > > The total number of inode cluster writes is: > > $ pminfo -f xfs.icluster_flushcnt > > xfs.icluster_flushcnt > value 68 > > That's 68 * 32kB or about 2MB. > > [pminfo is part of Performance Co-Pilot - https://pcp.io/] > > We did this many user writes: > > $ pminfo -f xfs.write xfs.write_bytes > > xfs.write > value 10000 > > xfs.write_bytes > value 163840000 > > which is 160MB in 10000 writes. > > Metadata writeback was: > > $ pminfo -f xfs.log_tail.push_ail.success > > xfs.log_tail.push_ail.success > value 473 > > 473 buffers, part of which was 68 inode clusters so the total is > about ~400 x 4kB + 2MB = 3.6MB. > > So, really, there isn't a potential for 25% growth in these numbers > that would make any sort of difference. The only remaining source of > IO differential is log writes: > > $ pminfo -f xfs.log > > xfs.log.writes > value 1055 > > xfs.log.blocks > value 28212 > .... > > Which indicates another ~1000 IOs and ~28MB written. > > And that's it. We've got a total of about 195MB written to disk for > 100 concurrent runs, and it's no different before and after deferred > inactivation. Other tests that I commonly run that do lots of file > creates and unlinks along with sync writes (e.g. dbench) didn't show > any regressions up to 512 concurrent processes, either. So there's > no obvious regression on this filesystem layout.... > > But it made me wonder - there are two things that could influence > background inactivation here: the AG count (which determines unlink > concurrency) and log size (which determines transaction > concurrency). So I remade the filesystem with a tiny 32MB log and 2 > AGs, and that changed a -lot-: > > xfs.icluster_flushcnt > value 40420 > > xfs.log_tail.push_ail.success > value 46098 > > Yeah, we did 40,000 inode cluster writes for only 10000 inode > modifications. Basically, we wrote every inode cluster once for > every inode modification (post create, post write, post unlink, post > deferred activation). Normally create/unlink loops the inode doesn't > even touch the disk - it's cancelled in the log before it gets > flushed to disk and so this is where a chunk of the difference in > disk space consumed comes from. It can be expected behaviour. > > xfs.log_tail.push_ail.pinned > value 84381 > > And we hit pinned items in the AIL 85,000 times instead of 0. This > triggers more log forces, but fsync is already doing that so it's > not a big deal. However: > > xfs.log_tail.try_logspace > value 80392 > > xfs.log_tail.sleep_logspace > value 59938 > > 75% of the operations had to wait on log space, which means it's > forcing the log tail to flush metadata to make space. That's where > all the inode cluster writes are coming from - the transactions are > lock stepping on log space. i.e. every time we want to modify an > inode, we have to flush a dirty inode to make space in the log. > > This also means that the CIL can't perform efficient in-memory > transaction aggregation because it's always being forced out to > disk. As a result: > > xfs.log.writes > value 6504 > > xfs.log.blocks > value 65784 > > 6x as many log writes, for 65MB of log writes. IOWs, the smaller log > reduced CIL aggregation efficiency substantially and so we wrote > twice as much to the log just because we had a small log. > > So, the question needs to be asked: is this a function of a small > log, or is this behaviour of low AG count? > > Increasing AG count back out to > CPU count resulted in the number > of inode cluster flushes dropping to ~3000, and the log writes > dropping back down to 1000 and 28MB of log writes. Performance went > way up, too (from 1.8s down to 0.35s), indicating that AG > concurrency is a factor here. > > OTOH, leaving the AG count at 2 and increasing the log back out to > 2GB removed all the log space waiting, all the inode cluster > flushing, and everything to do with waiting on log space. But > performance barely changed (1.8s down to 1.5s) and log bandwidth > *went up*: > > xfs.log.writes > value 6528 > > xfs.log.blocks > value 135452 > > 6,500 log writes, 130MB written to the log. IOWs, both too-small log > space and too-low AG count for the given workload concurrency will > adversely affect performance of concurrent workloads. > > But we already knew that, didn't we? :/ > > Bumping the AG count from 2 to 16 and using a moderately sized log > (i.e. > 100MB) largely erases the bad behaviour. Indeed, agcount=16, > log size=100MB drops runtime to 0.30s and pretty much returns inode > write count and log write/bw back down to minimums on this 100 > process workload. > > But to point out that this isn't perfect for everything, if I run > the same test with 1000 processes on that config, it takes 6.2s, > writes 400MB to the log and flushes 153,000 inode clusters. We're > back to not having enough log space for the workload. Bumping out to > agcount=67 (2x CPU count) and max log size(2GB) results in runtime > of 3.5s (largely linear from the 100 process count), no inode > writeback and only 200MB of log throughput. > > So, yeah, I suspect that if you change the mkfs parameters for the > reaim tests to have more AGs and significantly increased log space > over the defaults, the regression should disappear... > > Can you run these experiments with reaim on your test machines and > see if they are influenced by filesystem level concurrency > parameters like AG count and log size as these tests suggest? Thanks for the detailed analysis and suggestions. So here are some numbers from my end. First, note that I've bumped number of unlinks each process does from 100 to 1000 so that runtime on the test machine gets above 1s to avoid some random noise. The relative slowdown (25%) didn't change (I did it already yesterday but sent stress-unlink version without bumped up loop count). Exact time to complete of stress-unlink are (from 5 runs): (default mkfs params - agcount=4, logsize=223MB) AVG STDDEV before 2.7952 0.022886 after 3.4746 0.025303 I did also runs with increased AG count (to 96) and log size (to 512 MB). The results are somewhat surprising: (agcount=96, logsize=512MB) AVG STDDEV before 3.1340 0.044168 after 3.3612 0.048412 So the change did somewhat help the case with deferred inactivation however it significantly hurt the kernel before deferred inactivation. Overall we are still far from original performance. I had a look at xfs stats (full stats are attached for 4 different configs - (before / after) * (defaults / high ag count)) and logging stats are clearly different: defaults-before: log 4700 157590 0 48585 71854 defaults-after: log 10293 536643 6 48007 53762 highag-before: log 5933 233065 0 48000 58240 highag-after: log 6789 303249 0 48155 53795 We can see big differences in the amount of log writes as well as logged blocks. defaults-before logged 76 MB, defaults-after logged 262 MB, highag-before logged 113 MB, highag-after logged 148 MB. Given data writes for this test are 750 MB (16k * 48 processes * 1000 loops), the difference of 186 MB of log IO matches well with the observed difference in the amount of writes in block traces. I'm not sure why the amount of logged blocks differs so much. I didn't find big difference between various configs in push_ail stats. However I did find notable differences in various btree stats: defaults-before: abtb2 99580 228864 15129 15129 0 0 0 0 0 0 0 0 0 0 49616 abtc2 229903 464130 96526 96526 0 0 0 0 0 0 0 0 0 0 299055 ibt2 96099 96096 3 3 0 0 0 0 0 0 0 0 0 0 0 fibt2 96099 96096 3 3 0 0 0 0 0 0 0 0 0 0 0 defaults-after: abtb2 95532 400144 36596 36441 0 0 0 0 0 0 0 0 0 0 1421782 abtc2 157421 667273 61135 60980 0 0 0 0 0 0 0 0 0 0 1935106 ibt2 95420 220423 51 34 0 0 0 0 0 0 0 0 0 0 1 fibt2 131505 230219 7708 7691 0 0 0 0 0 0 0 0 0 0 40470 highag-before: abtb2 120143 240191 24047 24047 0 0 0 0 0 0 0 0 0 0 0 abtc2 288334 456240 120143 120143 0 0 0 0 0 0 0 0 0 0 24051 ibt2 96143 96096 47 47 0 0 0 0 0 0 0 0 0 0 0 fibt2 96143 96096 47 47 0 0 0 0 0 0 0 0 0 0 0 highag-after: abtb2 96903 205361 20137 20117 0 0 0 0 0 0 0 0 0 0 101850 abtc2 211742 433347 81617 81597 0 0 0 0 0 0 0 0 0 0 274068 ibt2 96083 96035 48 47 0 0 0 0 0 0 0 0 0 0 0 fibt2 96083 96035 48 47 0 0 0 0 0 0 0 0 0 0 0 So we seem to be shuffling btrees more in the slower configurations. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR [-- Attachment #2: unlink-before-stats-defaults.txt --] [-- Type: text/plain, Size: 764 bytes --] extent_alloc 48004 192025 48004 192025 abt 0 0 0 0 blk_map 48073 96001 48001 96001 48001 288076 0 bmbt 0 0 0 0 dir 96 48048 48048 0 trans 0 384280 0 ig 48048 47952 2 96 0 0 0 log 4700 157590 0 48585 71854 push_ail 384280 0 289 8 0 101 0 18 0 1 xstrat 48000 0 rw 48000 0 attr 0 0 0 0 icluster 0 4 20 vnodes 97 0 0 0 48048 48048 48048 0 buf 768796 24 768772 138553 41272 24 0 20 17 abtb2 99580 228864 15129 15129 0 0 0 0 0 0 0 0 0 0 49616 abtc2 229903 464130 96526 96526 0 0 0 0 0 0 0 0 0 0 299055 bmbt2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ibt2 96099 96096 3 3 0 0 0 0 0 0 0 0 0 0 0 fibt2 96099 96096 3 3 0 0 0 0 0 0 0 0 0 0 0 rmapbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 refcntbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 qm 0 0 0 0 0 0 0 0 0 xpc 786432000 786432000 0 defer_relog 0 debug 0 [-- Attachment #3: unlink-after-stats-defaults.txt --] [-- Type: text/plain, Size: 796 bytes --] extent_alloc 48052 192409 47308 189365 abt 0 0 0 0 blk_map 48073 96001 47274 96001 47274 287349 0 bmbt 0 0 0 0 dir 96 48048 48048 0 trans 0 381555 0 ig 48048 46076 0 1972 0 1245 0 log 10293 536643 6 48007 53762 push_ail 381555 0 413 68 0 36 0 1028 0 6 xstrat 48000 0 rw 48000 0 attr 0 0 0 0 icluster 0 36 1123 vnodes 728 0 0 0 48048 48048 48048 0 buf 846870 121 846749 89673 27864 121 0 117 18 abtb2 95532 400144 36596 36441 0 0 0 0 0 0 0 0 0 0 1421782 abtc2 157421 667273 61135 60980 0 0 0 0 0 0 0 0 0 0 1935106 bmbt2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ibt2 95420 220423 51 34 0 0 0 0 0 0 0 0 0 0 1 fibt2 131505 230219 7708 7691 0 0 0 0 0 0 0 0 0 0 40470 rmapbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 refcntbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 qm 0 0 0 0 0 0 0 0 0 xpc 786432000 786432000 0 defer_relog 0 debug 0 [-- Attachment #4: unlink-before-stats-highag2.txt --] [-- Type: text/plain, Size: 755 bytes --] extent_alloc 48048 192377 48048 192377 abt 0 0 0 0 blk_map 48073 96001 48001 96001 48001 288076 0 bmbt 0 0 0 0 dir 96 48048 48048 0 trans 0 384677 0 ig 48048 47952 0 96 0 0 0 log 5933 233065 0 48000 58240 push_ail 384677 0 0 0 0 0 0 0 0 0 xstrat 48000 0 rw 48000 0 attr 0 0 0 0 icluster 0 0 0 vnodes 97 0 0 0 48048 48048 48048 0 buf 769219 288 768931 0 0 288 0 240 193 abtb2 120143 240191 24047 24047 0 0 0 0 0 0 0 0 0 0 0 abtc2 288334 456240 120143 120143 0 0 0 0 0 0 0 0 0 0 24051 bmbt2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ibt2 96143 96096 47 47 0 0 0 0 0 0 0 0 0 0 0 fibt2 96143 96096 47 47 0 0 0 0 0 0 0 0 0 0 0 rmapbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 refcntbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 qm 0 0 0 0 0 0 0 0 0 xpc 786432000 786432000 0 defer_relog 0 debug 0 [-- Attachment #5: unlink-after-stats-highag2.txt --] [-- Type: text/plain, Size: 775 bytes --] extent_alloc 48049 192385 47987 192133 abt 0 0 0 0 blk_map 48050 96001 47940 96001 47940 287992 0 bmbt 0 0 0 0 dir 48 48048 48048 0 trans 0 384389 0 ig 48048 46316 0 1732 0 1671 0 log 6789 303249 0 48155 53795 push_ail 384389 0 132 489 0 4 0 1633 0 1 xstrat 48000 0 rw 48000 0 attr 0 0 0 0 icluster 0 95 1716 vnodes 764 0 0 0 48048 48048 48048 0 buf 771055 97 770958 1339 250 97 0 97 0 abtb2 96903 205361 20137 20117 0 0 0 0 0 0 0 0 0 0 101850 abtc2 211742 433347 81617 81597 0 0 0 0 0 0 0 0 0 0 274068 bmbt2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ibt2 96083 96035 48 47 0 0 0 0 0 0 0 0 0 0 0 fibt2 96083 96035 48 47 0 0 0 0 0 0 0 0 0 0 0 rmapbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 refcntbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 qm 0 0 0 0 0 0 0 0 0 xpc 786432000 786432000 0 defer_relog 0 debug 0 ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Performance regression with async inode inactivation 2021-10-07 12:03 ` Jan Kara @ 2021-10-07 23:44 ` Dave Chinner 2021-10-12 13:42 ` Jan Kara 0 siblings, 1 reply; 12+ messages in thread From: Dave Chinner @ 2021-10-07 23:44 UTC (permalink / raw) To: Jan Kara; +Cc: linux-xfs, Dave Chinner, Darrick J. Wong On Thu, Oct 07, 2021 at 02:03:57PM +0200, Jan Kara wrote: > On Thu 07-10-21 08:58:51, Dave Chinner wrote: > > On Wed, Oct 06, 2021 at 08:10:01PM +0200, Jan Kara wrote: > > Hmmm, I didn't see this at first. What's the filesystem layout > > (xfs_info) and the CLI for the test that you ran? How many CPUs was > > the test run on? > > The machine has 24 cores, each core has 2 SMT siblings, so 48 logical CPUs. > That's why I've run stress-unlink with 48 processes. > > xfs_info is: > > meta-data=/dev/sdb1 isize=512 agcount=4, agsize=29303104 blks > = sectsz=512 attr=2, projid32bit=1 > = crc=1 finobt=1, sparse=0, rmapbt=0 > = reflink=0 > data = bsize=4096 blocks=117212416, imaxpct=25 > = sunit=0 swidth=0 blks > naming =version 2 bsize=4096 ascii-ci=0, ftype=1 > log =internal log bsize=4096 blocks=57232, version=2 OK, default log is ~230MB, fs size is ~400GB? > > But to point out that this isn't perfect for everything, if I run > > the same test with 1000 processes on that config, it takes 6.2s, > > writes 400MB to the log and flushes 153,000 inode clusters. We're > > back to not having enough log space for the workload. Bumping out to > > agcount=67 (2x CPU count) and max log size(2GB) results in runtime > > of 3.5s (largely linear from the 100 process count), no inode > > writeback and only 200MB of log throughput. > > > > So, yeah, I suspect that if you change the mkfs parameters for the > > reaim tests to have more AGs and significantly increased log space > > over the defaults, the regression should disappear... > > > > Can you run these experiments with reaim on your test machines and > > see if they are influenced by filesystem level concurrency > > parameters like AG count and log size as these tests suggest? > > Thanks for the detailed analysis and suggestions. So here are some numbers > from my end. First, note that I've bumped number of unlinks each process > does from 100 to 1000 so that runtime on the test machine gets above 1s to > avoid some random noise. The relative slowdown (25%) didn't change (I did > it already yesterday but sent stress-unlink version without bumped up loop > count). Exact time to complete of stress-unlink are (from 5 runs): > > (default mkfs params - agcount=4, logsize=223MB) > AVG STDDEV > before 2.7952 0.022886 > after 3.4746 0.025303 > > I did also runs with increased AG count (to 96) and log size (to 512 MB). > The results are somewhat surprising: > > (agcount=96, logsize=512MB) > AVG STDDEV > before 3.1340 0.044168 > after 3.3612 0.048412 You bumped the log size to 512MB, but then bumped the amount of work by a factor of 10.... > So the change did somewhat help the case with deferred inactivation however > it significantly hurt the kernel before deferred inactivation. Overall we > are still far from original performance. > > I had a look at xfs stats (full stats are attached for 4 different configs > - (before / after) * (defaults / high ag count)) and logging stats are > clearly different: > > defaults-before: > log 4700 157590 0 48585 71854 Fits in < 75% of log space, so will not be pushing out metadata at all during the test. > defaults-after: > log 10293 536643 6 48007 53762 Overwrites the log 2.5x, so limited by metadata writeback speed for most of the test. > highag-before: > log 5933 233065 0 48000 58240 Yup, workload is spread across 16x more AGs and AG metadata, so I'd expect to see such an increase in log throuhgput. But with a larger log, this won't be tail-pushing... > highag-after: > log 6789 303249 0 48155 53795 And this is also under the tail-pushing threshold, so really the only difference in perf here comes from writing more metadata to the log. > We can see big differences in the amount of log writes as well as logged > blocks. defaults-before logged 76 MB, defaults-after logged 262 MB, > highag-before logged 113 MB, highag-after logged 148 MB. Given data writes > for this test are 750 MB (16k * 48 processes * 1000 loops), the difference > of 186 MB of log IO matches well with the observed difference in the amount > of writes in block traces. > > I'm not sure why the amount of logged blocks differs so much. fsync() interactions. On the original code, the two unlink transactions are temporally adjacent as both are in unlinkat() syscall context. One is directly run by the syscall, the other in task_run context at syscall exit when processing the last reference of the file being dropped. In general, that means the objects modified (inode cluster, AGI, inode, etc) are captured by the same CIL context and so aggregate in memory as a single change (essentially log dedupe). Then the next fsync() from some other context runs, pushing the CIL to disk and we only log those objects modified in unlink to the journal once. With deferred activation, the two phases of unlink are temporally decoupled. We get a bunch of inodes running the first phase in unlink() context, but the second phase is run later in a tight loop from workqueue context. But this means that fsync()s occur between the two pahses, and hence the objects modified in the two phases of unlink are modified in two separate CIL contexts. Hence they get written to the log twice. Depending on the way things work out, deferred inactivation also results in longer unlinked inode chains, resulting in more objects being logged per unlink than without deferred inactivation, as the inodes are added to the unlink chain and then immediately removed before any others are added. Hence deferred inode inactivation will increase the amount written to the log per unlink if the two phases of unlink as split across journal checkpoints. IOWs, an increase in log writes for open-write-fsync-close-unlink workloads is not unexpected. But this workload isn't really a real world workload in any way - we generally don't do data integrity writes only to immediately throw the data away. :/ Remove the fsync and we do have a real world workload - temporary files for compiles, etc. Then everything should mostly get logged in the same CIL context because all transactions are run asynchronously and aggregate in memory until the CIL is full and checkpoints itself. Hence log writes shouldn't change very much at all for such workloads. > I didn't find > big difference between various configs in push_ail stats. However I did > find notable differences in various btree stats: > > defaults-before: > abtb2 99580 228864 15129 15129 0 0 0 0 0 0 0 0 0 0 49616 > abtc2 229903 464130 96526 96526 0 0 0 0 0 0 0 0 0 0 299055 > ibt2 96099 96096 3 3 0 0 0 0 0 0 0 0 0 0 0 > fibt2 96099 96096 3 3 0 0 0 0 0 0 0 0 0 0 0 > > defaults-after: > abtb2 95532 400144 36596 36441 0 0 0 0 0 0 0 0 0 0 1421782 > abtc2 157421 667273 61135 60980 0 0 0 0 0 0 0 0 0 0 1935106 > ibt2 95420 220423 51 34 0 0 0 0 0 0 0 0 0 0 1 > fibt2 131505 230219 7708 7691 0 0 0 0 0 0 0 0 0 0 40470 Yup, the change in the finobt is indicative of slight changes in order of allocate/free of inodes. We're tracking a more free inode records because we aren't doing purely sequential inode allocation and freeing due to AGI contention managing the unlinked inode chain. IOWs, we've previously hyper-optimised create-unlink workloads to aggressively reuse inodes, and that resulted in rapid reuse of the unlinked inodes. That still happens, but deferred inactivation increases the pool of free inodes that is being cycled over by this workload. This reflects the per-ag contention occurring in this limited fs config - we're getting larger batches of unlinked inodes being chained on the AGs and so when they are freed we see more inodes being added to the finobt and then reallocated from there. I have plans to further decouple the unlinked AGI chain updates between the two phases of unlink that will help address this, but that is future work and not ready to go yet. > highag-before: > abtb2 120143 240191 24047 24047 0 0 0 0 0 0 0 0 0 0 0 > abtc2 288334 456240 120143 120143 0 0 0 0 0 0 0 0 0 0 24051 > ibt2 96143 96096 47 47 0 0 0 0 0 0 0 0 0 0 0 > fibt2 96143 96096 47 47 0 0 0 0 0 0 0 0 0 0 0 > > highag-after: > abtb2 96903 205361 20137 20117 0 0 0 0 0 0 0 0 0 0 101850 > abtc2 211742 433347 81617 81597 0 0 0 0 0 0 0 0 0 0 274068 > ibt2 96083 96035 48 47 0 0 0 0 0 0 0 0 0 0 0 > fibt2 96083 96035 48 47 0 0 0 0 0 0 0 0 0 0 0 With this, we have no AGI contention to speak of between the two phases of unlink, so unlinked inode chains remain very short in both cases and so we don't see any change to finobt residency. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Performance regression with async inode inactivation 2021-10-07 23:44 ` Dave Chinner @ 2021-10-12 13:42 ` Jan Kara 2021-10-12 21:23 ` Dave Chinner 0 siblings, 1 reply; 12+ messages in thread From: Jan Kara @ 2021-10-12 13:42 UTC (permalink / raw) To: Dave Chinner; +Cc: Jan Kara, linux-xfs, Dave Chinner, Darrick J. Wong On Fri 08-10-21 10:44:30, Dave Chinner wrote: > On Thu, Oct 07, 2021 at 02:03:57PM +0200, Jan Kara wrote: > > On Thu 07-10-21 08:58:51, Dave Chinner wrote: > > > On Wed, Oct 06, 2021 at 08:10:01PM +0200, Jan Kara wrote: > > > Hmmm, I didn't see this at first. What's the filesystem layout > > > (xfs_info) and the CLI for the test that you ran? How many CPUs was > > > the test run on? > > > > The machine has 24 cores, each core has 2 SMT siblings, so 48 logical CPUs. > > That's why I've run stress-unlink with 48 processes. > > > > xfs_info is: > > > > meta-data=/dev/sdb1 isize=512 agcount=4, agsize=29303104 blks > > = sectsz=512 attr=2, projid32bit=1 > > = crc=1 finobt=1, sparse=0, rmapbt=0 > > = reflink=0 > > data = bsize=4096 blocks=117212416, imaxpct=25 > > = sunit=0 swidth=0 blks > > naming =version 2 bsize=4096 ascii-ci=0, ftype=1 > > log =internal log bsize=4096 blocks=57232, version=2 > > OK, default log is ~230MB, fs size is ~400GB? Yep. <snip log analysis> > > We can see big differences in the amount of log writes as well as logged > > blocks. defaults-before logged 76 MB, defaults-after logged 262 MB, > > highag-before logged 113 MB, highag-after logged 148 MB. Given data writes > > for this test are 750 MB (16k * 48 processes * 1000 loops), the difference > > of 186 MB of log IO matches well with the observed difference in the amount > > of writes in block traces. > > > > I'm not sure why the amount of logged blocks differs so much. > > fsync() interactions. > > On the original code, the two unlink transactions are temporally > adjacent as both are in unlinkat() syscall context. One is directly > run by the syscall, the other in task_run context at syscall exit > when processing the last reference of the file being dropped. > > In general, that means the objects modified (inode cluster, AGI, > inode, etc) are captured by the same CIL context and so aggregate in > memory as a single change (essentially log dedupe). Then the next > fsync() from some other context runs, pushing the CIL to disk and > we only log those objects modified in unlink to the journal once. > > With deferred activation, the two phases of unlink are temporally > decoupled. We get a bunch of inodes running the first phase in > unlink() context, but the second phase is run later in a tight loop > from workqueue context. But this means that fsync()s occur between > the two pahses, and hence the objects modified in the two phases of > unlink are modified in two separate CIL contexts. Hence they get > written to the log twice. > > Depending on the way things work out, deferred inactivation also > results in longer unlinked inode chains, resulting in more objects > being logged per unlink than without deferred inactivation, as the > inodes are added to the unlink chain and then immediately removed > before any others are added. Hence deferred inode inactivation will > increase the amount written to the log per unlink if the two phases > of unlink as split across journal checkpoints. > > IOWs, an increase in log writes for open-write-fsync-close-unlink > workloads is not unexpected. But this workload isn't really a real > world workload in any way - we generally don't do data integrity > writes only to immediately throw the data away. :/ > > Remove the fsync and we do have a real world workload - temporary > files for compiles, etc. Then everything should mostly get logged > in the same CIL context because all transactions are run > asynchronously and aggregate in memory until the CIL is full and > checkpoints itself. Hence log writes shouldn't change very much at > all for such workloads. OK, that makes sense. Thanks for explanation. So to verify your theory, I've removed fsync(2) from the test program. So now it is pure create, write, unlink workload. Results of "stress-unlink 48 /mnt", now for 5000 loops of create-unlink to increase runtime (but the workload does barely any writes so it should not matter wrt log): default mkfs params: AVG STDDEV before 2.0380 0.1597 after 2.7356 0.4712 agcount=96, log size 512M AVG STDDEV before 1.0610 0.0227 after 1.2508 0.0218 So still notable regression with the async inactivation. With default mkfs params we'd need more runs to get more reliable results (note the rather high standard deviation) but with high AG count results show pretty stable 20% regression - so let's have a look at that. Looking at xfs stats there are barely any differences between before & after - 'after' writes a bit more to the log but it is ~1.5 MB over the whole benchmark run, altogether spending some 8ms doing IO so that's not it. Generally the workload seems to be CPU / memory bound now (it does barely any IO). Perf shows we are heavily contending on some spinlock in xfs_cil_commit() - I presume this is a xc_cil_lock. This actually happens both before and after, but we seem to spend some more time there with async inactivation. Likely this is related to work being done from worker threads. Perf stats for comparison: before 51,135.08 msec cpu-clock # 47.894 CPUs utilized 4,699 context-switches # 0.092 K/sec 382 cpu-migrations # 0.007 K/sec 1,228 page-faults # 0.024 K/sec 128,884,972,351 cycles # 2.520 GHz 38,517,767,839 instructions # 0.30 insn per cycle 8,337,611,468 branches # 163.051 M/sec 39,749,736 branch-misses # 0.48% of all branches 25,225,109 cache-misses 1.067666170 seconds time elapsed after 65,353.43 msec cpu-clock # 47.894 CPUs utilized 43,737 context-switches # 0.669 K/sec 1,824 cpu-migrations # 0.028 K/sec 1,953 page-faults # 0.030 K/sec 155,144,150,867 cycles # 2.374 GHz 45,280,145,337 instructions # 0.29 insn per cycle 10,027,567,384 branches # 153.436 M/sec 39,554,691 branch-misses # 0.39% of all branches 30,203,567 cache-misses 1.364539400 seconds time elapsed So we can see huge increase in context-switches, notable increase in cache-misses, decrease in cycles/s so perhaps we are bouncing cache more? Anyway I guess this is kind of expected due to the nature of async inactivation, I just wanted to highlight that there are regressions without fsync in the game as well. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Performance regression with async inode inactivation 2021-10-12 13:42 ` Jan Kara @ 2021-10-12 21:23 ` Dave Chinner 2021-10-13 9:51 ` Jan Kara 0 siblings, 1 reply; 12+ messages in thread From: Dave Chinner @ 2021-10-12 21:23 UTC (permalink / raw) To: Jan Kara; +Cc: linux-xfs, Dave Chinner, Darrick J. Wong On Tue, Oct 12, 2021 at 03:42:55PM +0200, Jan Kara wrote: > On Fri 08-10-21 10:44:30, Dave Chinner wrote: > > On Thu, Oct 07, 2021 at 02:03:57PM +0200, Jan Kara wrote: > > > On Thu 07-10-21 08:58:51, Dave Chinner wrote: > > > > On Wed, Oct 06, 2021 at 08:10:01PM +0200, Jan Kara wrote: > > > > Hmmm, I didn't see this at first. What's the filesystem layout > > > > (xfs_info) and the CLI for the test that you ran? How many CPUs was > > > > the test run on? > > > > > > The machine has 24 cores, each core has 2 SMT siblings, so 48 logical CPUs. > > > That's why I've run stress-unlink with 48 processes. > > > > > > xfs_info is: > > > > > > meta-data=/dev/sdb1 isize=512 agcount=4, agsize=29303104 blks > > > = sectsz=512 attr=2, projid32bit=1 > > > = crc=1 finobt=1, sparse=0, rmapbt=0 > > > = reflink=0 > > > data = bsize=4096 blocks=117212416, imaxpct=25 > > > = sunit=0 swidth=0 blks > > > naming =version 2 bsize=4096 ascii-ci=0, ftype=1 > > > log =internal log bsize=4096 blocks=57232, version=2 > > > > OK, default log is ~230MB, fs size is ~400GB? > > Yep. > > <snip log analysis> > > > > We can see big differences in the amount of log writes as well as logged > > > blocks. defaults-before logged 76 MB, defaults-after logged 262 MB, > > > highag-before logged 113 MB, highag-after logged 148 MB. Given data writes > > > for this test are 750 MB (16k * 48 processes * 1000 loops), the difference > > > of 186 MB of log IO matches well with the observed difference in the amount > > > of writes in block traces. > > > > > > I'm not sure why the amount of logged blocks differs so much. > > > > fsync() interactions. > > > > On the original code, the two unlink transactions are temporally > > adjacent as both are in unlinkat() syscall context. One is directly > > run by the syscall, the other in task_run context at syscall exit > > when processing the last reference of the file being dropped. > > > > In general, that means the objects modified (inode cluster, AGI, > > inode, etc) are captured by the same CIL context and so aggregate in > > memory as a single change (essentially log dedupe). Then the next > > fsync() from some other context runs, pushing the CIL to disk and > > we only log those objects modified in unlink to the journal once. > > > > With deferred activation, the two phases of unlink are temporally > > decoupled. We get a bunch of inodes running the first phase in > > unlink() context, but the second phase is run later in a tight loop > > from workqueue context. But this means that fsync()s occur between > > the two pahses, and hence the objects modified in the two phases of > > unlink are modified in two separate CIL contexts. Hence they get > > written to the log twice. > > > > Depending on the way things work out, deferred inactivation also > > results in longer unlinked inode chains, resulting in more objects > > being logged per unlink than without deferred inactivation, as the > > inodes are added to the unlink chain and then immediately removed > > before any others are added. Hence deferred inode inactivation will > > increase the amount written to the log per unlink if the two phases > > of unlink as split across journal checkpoints. > > > > IOWs, an increase in log writes for open-write-fsync-close-unlink > > workloads is not unexpected. But this workload isn't really a real > > world workload in any way - we generally don't do data integrity > > writes only to immediately throw the data away. :/ > > > > Remove the fsync and we do have a real world workload - temporary > > files for compiles, etc. Then everything should mostly get logged > > in the same CIL context because all transactions are run > > asynchronously and aggregate in memory until the CIL is full and > > checkpoints itself. Hence log writes shouldn't change very much at > > all for such workloads. > > OK, that makes sense. Thanks for explanation. So to verify your theory, > I've removed fsync(2) from the test program. So now it is pure create, > write, unlink workload. Results of "stress-unlink 48 /mnt", now for 5000 > loops of create-unlink to increase runtime (but the workload does barely > any writes so it should not matter wrt log): > > default mkfs params: > AVG STDDEV > before 2.0380 0.1597 > after 2.7356 0.4712 > > agcount=96, log size 512M > AVG STDDEV > before 1.0610 0.0227 > after 1.2508 0.0218 > > So still notable regression with the async inactivation. With default mkfs > params we'd need more runs to get more reliable results (note the rather > high standard deviation) but with high AG count results show pretty stable > 20% regression - so let's have a look at that. > > Looking at xfs stats there are barely any differences between before & > after - 'after' writes a bit more to the log but it is ~1.5 MB over the > whole benchmark run, altogether spending some 8ms doing IO so that's not > it. Generally the workload seems to be CPU / memory bound now (it does > barely any IO). Perf shows we are heavily contending on some spinlock in > xfs_cil_commit() - I presume this is a xc_cil_lock. Yes, and I have patches that fix this. It got reverted a before release because it exposed a bunch of underlying zero-fay bugs in the log code, and I haven't had time to run it through the review cycle again even though it's pretty much unchanged from commits 26-39 in this series: https://lore.kernel.org/linux-xfs/20210603052240.171998-1-david@fromorbit.com/ The profiles in this patch demonstrate the problem and the fix: https://lore.kernel.org/linux-xfs/20210603052240.171998-35-david@fromorbit.com/ I did all my perf testing of inode inactivation with the CIL scalability patches also installed, because deferred inode inactivation only made contention on the CIL lock worse in my perf tests. We simply can't evaluate the benefit of a change when the system is easily driven into catastrophic lock breakdown by user level operational concurrency. IOWs, the CIL lock is the global limiting factor for async transaction commit rates on large CPU count machines, and things that remove bottlenecks in higher layers often just increase contention on this lock and drive it into breakdown. That makes perf go backwards, not forwards, and it's not the fault of the high level change being made. That doesn't make the high level change wrong, it just means we need to peel the onion further before the improvements are fully realised. > This actually happens > both before and after, but we seem to spend some more time there with async > inactivation. Likely this is related to work being done from worker > threads. Perf stats for comparison: > > before > 51,135.08 msec cpu-clock # 47.894 CPUs utilized > 4,699 context-switches # 0.092 K/sec > 382 cpu-migrations # 0.007 K/sec > 1,228 page-faults # 0.024 K/sec > 128,884,972,351 cycles # 2.520 GHz > 38,517,767,839 instructions # 0.30 insn per cycle > 8,337,611,468 branches # 163.051 M/sec > 39,749,736 branch-misses # 0.48% of all branches > 25,225,109 cache-misses > > 1.067666170 seconds time elapsed > > after > 65,353.43 msec cpu-clock # 47.894 CPUs utilized > 43,737 context-switches # 0.669 K/sec > 1,824 cpu-migrations # 0.028 K/sec > 1,953 page-faults # 0.030 K/sec > 155,144,150,867 cycles # 2.374 GHz > 45,280,145,337 instructions # 0.29 insn per cycle > 10,027,567,384 branches # 153.436 M/sec > 39,554,691 branch-misses # 0.39% of all branches > 30,203,567 cache-misses > > 1.364539400 seconds time elapsed > > So we can see huge increase in context-switches, notable increase in > cache-misses, decrease in cycles/s so perhaps we are bouncing cache more? > Anyway I guess this is kind of expected due to the nature of async > inactivation, I just wanted to highlight that there are regressions without > fsync in the game as well. Context switches are largely noise - they are most likely just AGI locks being bounced a bit more. It's the spinlock contention that is the likely issue here. For example, on my 32p machine with vanilla 5.15-rc4 with a fsync-less, 5000 iteration test run: $ sudo perf_5.9 stat ./stress-unlink 32 /mnt/scratch Processes started. 1.290 Performance counter stats for './stress-unlink 32 /mnt/scratch': 16,856.61 msec task-clock # 12.595 CPUs utilized 48,297 context-switches # 0.003 M/sec 4,219 cpu-migrations # 0.250 K/sec 1,373 page-faults # 0.081 K/sec 39,254,798,526 cycles # 2.329 GHz 16,460,808,349 instructions # 0.42 insn per cycle 3,475,251,228 branches # 206.166 M/sec 12,129,889 branch-misses # 0.35% of all branches 1.338312347 seconds time elapsed 0.186554000 seconds user 17.247176000 seconds sys And with 5.15-rc4 + CIL scalability: $ sudo perf_5.9 stat ./stress-unlink 32 /mnt/scratch Processes started. 0.894 Performance counter stats for './stress-unlink 32 /mnt/scratch': 12,917.93 msec task-clock # 13.805 CPUs utilized 39,680 context-switches # 0.003 M/sec 2,737 cpu-migrations # 0.212 K/sec 1,402 page-faults # 0.109 K/sec 30,920,293,752 cycles # 2.394 GHz 14,472,067,501 instructions # 0.47 insn per cycle 2,700,978,247 branches # 209.087 M/sec 9,287,754 branch-misses # 0.34% of all branches 0.935710173 seconds time elapsed 0.192467000 seconds user 13.245977000 seconds sys Runtime of the fsync-less, 5,000 iteration version drops from 1.29s to 0.89s, IPC goes up, branches and branch-misses go down, context switches only go down slightly, etc. IOWs, when you take away the CIL lock contention, we get back all that perf loss and then some... FWIW, let's really hammer it for a long while. Vanilla 5.14-rc4: $ sudo perf_5.9 stat ./stress-unlink 1000 /mnt/scratch Processes started. 38.881 Performance counter stats for './stress-unlink 1000 /mnt/scratch': 733,741.06 msec task-clock # 16.004 CPUs utilized 13,131,968 context-switches # 0.018 M/sec 1,302,636 cpu-migrations # 0.002 M/sec 40,720 page-faults # 0.055 K/sec 1,195,192,185,398 cycles # 1.629 GHz 643,382,890,656 instructions # 0.54 insn per cycle 129,065,409,600 branches # 175.900 M/sec 768,146,988 branch-misses # 0.60% of all branches 45.847750477 seconds time elapsed 11.194020000 seconds user 758.762376000 seconds sys And the transaction rate is pinned at 800,000/s for the entire test. We're running at the limit of the CIL lock here. With CIL scalbility patchset: $ sudo perf_5.9 stat ./stress-unlink 1000 /mnt/scratch Processes started. 28.263 Performance counter stats for './stress-unlink 1000 /mnt/scratch': 450,547.80 msec task-clock # 15.038 CPUs utilized 5,949,268 context-switches # 0.013 M/sec 868,887 cpu-migrations # 0.002 M/sec 41,570 page-faults # 0.092 K/sec 836,933,822,425 cycles # 1.858 GHz 536,132,522,275 instructions # 0.64 insn per cycle 99,264,579,180 branches # 220.320 M/sec 506,921,132 branch-misses # 0.51% of all branches 29.961492616 seconds time elapsed 7.796758000 seconds user 471.990545000 seconds sys 30% reduction in runtime because the transaction rate is now running at 1.1M/s. Improvements in code execution across the board here, so the problem clearly isn't the deferred inode inactivation. IOWs, I'm largely not concerned about the high CPU count perf regressions that we are seeing from log code these days - the fix is largely ready, it's just lacking in available engineering time to get it over the line and merged right now... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Performance regression with async inode inactivation 2021-10-12 21:23 ` Dave Chinner @ 2021-10-13 9:51 ` Jan Kara 0 siblings, 0 replies; 12+ messages in thread From: Jan Kara @ 2021-10-13 9:51 UTC (permalink / raw) To: Dave Chinner; +Cc: Jan Kara, linux-xfs, Dave Chinner, Darrick J. Wong On Wed 13-10-21 08:23:39, Dave Chinner wrote: > On Tue, Oct 12, 2021 at 03:42:55PM +0200, Jan Kara wrote: > > On Fri 08-10-21 10:44:30, Dave Chinner wrote: > > > Remove the fsync and we do have a real world workload - temporary > > > files for compiles, etc. Then everything should mostly get logged > > > in the same CIL context because all transactions are run > > > asynchronously and aggregate in memory until the CIL is full and > > > checkpoints itself. Hence log writes shouldn't change very much at > > > all for such workloads. > > > > OK, that makes sense. Thanks for explanation. So to verify your theory, > > I've removed fsync(2) from the test program. So now it is pure create, > > write, unlink workload. Results of "stress-unlink 48 /mnt", now for 5000 > > loops of create-unlink to increase runtime (but the workload does barely > > any writes so it should not matter wrt log): > > > > default mkfs params: > > AVG STDDEV > > before 2.0380 0.1597 > > after 2.7356 0.4712 > > > > agcount=96, log size 512M > > AVG STDDEV > > before 1.0610 0.0227 > > after 1.2508 0.0218 > > > > So still notable regression with the async inactivation. With default mkfs > > params we'd need more runs to get more reliable results (note the rather > > high standard deviation) but with high AG count results show pretty stable > > 20% regression - so let's have a look at that. > > > > Looking at xfs stats there are barely any differences between before & > > after - 'after' writes a bit more to the log but it is ~1.5 MB over the > > whole benchmark run, altogether spending some 8ms doing IO so that's not > > it. Generally the workload seems to be CPU / memory bound now (it does > > barely any IO). Perf shows we are heavily contending on some spinlock in > > xfs_cil_commit() - I presume this is a xc_cil_lock. > > Yes, and I have patches that fix this. It got reverted a before > release because it exposed a bunch of underlying zero-fay bugs in > the log code, and I haven't had time to run it through the review > cycle again even though it's pretty much unchanged from commits > 26-39 in this series: > > https://lore.kernel.org/linux-xfs/20210603052240.171998-1-david@fromorbit.com/ > > The profiles in this patch demonstrate the problem and the fix: > > https://lore.kernel.org/linux-xfs/20210603052240.171998-35-david@fromorbit.com/ > > I did all my perf testing of inode inactivation with the CIL > scalability patches also installed, because deferred inode > inactivation only made contention on the CIL lock worse in my perf > tests. We simply can't evaluate the benefit of a change when the > system is easily driven into catastrophic lock breakdown by user > level operational concurrency. > > IOWs, the CIL lock is the global limiting factor for async > transaction commit rates on large CPU count machines, and things > that remove bottlenecks in higher layers often just increase > contention on this lock and drive it into breakdown. That makes perf > go backwards, not forwards, and it's not the fault of the high level > change being made. That doesn't make the high level change wrong, it > just means we need to peel the onion further before the improvements > are fully realised. OK, understood. > > This actually happens > > both before and after, but we seem to spend some more time there with async > > inactivation. Likely this is related to work being done from worker > > threads. Perf stats for comparison: > > > > before > > 51,135.08 msec cpu-clock # 47.894 CPUs utilized > > 4,699 context-switches # 0.092 K/sec > > 382 cpu-migrations # 0.007 K/sec > > 1,228 page-faults # 0.024 K/sec > > 128,884,972,351 cycles # 2.520 GHz > > 38,517,767,839 instructions # 0.30 insn per cycle > > 8,337,611,468 branches # 163.051 M/sec > > 39,749,736 branch-misses # 0.48% of all branches > > 25,225,109 cache-misses > > > > 1.067666170 seconds time elapsed > > > > after > > 65,353.43 msec cpu-clock # 47.894 CPUs utilized > > 43,737 context-switches # 0.669 K/sec > > 1,824 cpu-migrations # 0.028 K/sec > > 1,953 page-faults # 0.030 K/sec > > 155,144,150,867 cycles # 2.374 GHz > > 45,280,145,337 instructions # 0.29 insn per cycle > > 10,027,567,384 branches # 153.436 M/sec > > 39,554,691 branch-misses # 0.39% of all branches > > 30,203,567 cache-misses > > > > 1.364539400 seconds time elapsed > > > > So we can see huge increase in context-switches, notable increase in > > cache-misses, decrease in cycles/s so perhaps we are bouncing cache more? > > Anyway I guess this is kind of expected due to the nature of async > > inactivation, I just wanted to highlight that there are regressions without > > fsync in the game as well. > > Context switches are largely noise - they are most likely just AGI > locks being bounced a bit more. It's the spinlock contention that is > the likely issue here. For example, on my 32p machine with vanilla > 5.15-rc4 with a fsync-less, 5000 iteration test run: > > $ sudo perf_5.9 stat ./stress-unlink 32 /mnt/scratch > Processes started. > 1.290 > > Performance counter stats for './stress-unlink 32 /mnt/scratch': > > 16,856.61 msec task-clock # 12.595 CPUs utilized > 48,297 context-switches # 0.003 M/sec > 4,219 cpu-migrations # 0.250 K/sec > 1,373 page-faults # 0.081 K/sec > 39,254,798,526 cycles # 2.329 GHz > 16,460,808,349 instructions # 0.42 insn per cycle > 3,475,251,228 branches # 206.166 M/sec > 12,129,889 branch-misses # 0.35% of all branches > > 1.338312347 seconds time elapsed > > 0.186554000 seconds user > 17.247176000 seconds sys > > And with 5.15-rc4 + CIL scalability: > > $ sudo perf_5.9 stat ./stress-unlink 32 /mnt/scratch > Processes started. > 0.894 > > Performance counter stats for './stress-unlink 32 /mnt/scratch': > > 12,917.93 msec task-clock # 13.805 CPUs utilized > 39,680 context-switches # 0.003 M/sec > 2,737 cpu-migrations # 0.212 K/sec > 1,402 page-faults # 0.109 K/sec > 30,920,293,752 cycles # 2.394 GHz > 14,472,067,501 instructions # 0.47 insn per cycle > 2,700,978,247 branches # 209.087 M/sec > 9,287,754 branch-misses # 0.34% of all branches > > 0.935710173 seconds time elapsed > > 0.192467000 seconds user > 13.245977000 seconds sys > > Runtime of the fsync-less, 5,000 iteration version drops from 1.29s > to 0.89s, IPC goes up, branches and branch-misses go down, context > switches only go down slightly, etc. IOWs, when you take away the > CIL lock contention, we get back all that perf loss and then some... Nice results! > FWIW, let's really hammer it for a long while. Vanilla 5.14-rc4: > > $ sudo perf_5.9 stat ./stress-unlink 1000 /mnt/scratch > Processes started. > 38.881 > > Performance counter stats for './stress-unlink 1000 /mnt/scratch': > > 733,741.06 msec task-clock # 16.004 CPUs utilized > 13,131,968 context-switches # 0.018 M/sec > 1,302,636 cpu-migrations # 0.002 M/sec > 40,720 page-faults # 0.055 K/sec > 1,195,192,185,398 cycles # 1.629 GHz > 643,382,890,656 instructions # 0.54 insn per cycle > 129,065,409,600 branches # 175.900 M/sec > 768,146,988 branch-misses # 0.60% of all branches > > 45.847750477 seconds time elapsed > > 11.194020000 seconds user > 758.762376000 seconds sys > > And the transaction rate is pinned at 800,000/s for the entire test. > We're running at the limit of the CIL lock here. > > With CIL scalbility patchset: > > $ sudo perf_5.9 stat ./stress-unlink 1000 /mnt/scratch > Processes started. > 28.263 > > Performance counter stats for './stress-unlink 1000 /mnt/scratch': > > 450,547.80 msec task-clock # 15.038 CPUs utilized > 5,949,268 context-switches # 0.013 M/sec > 868,887 cpu-migrations # 0.002 M/sec > 41,570 page-faults # 0.092 K/sec > 836,933,822,425 cycles # 1.858 GHz > 536,132,522,275 instructions # 0.64 insn per cycle > 99,264,579,180 branches # 220.320 M/sec > 506,921,132 branch-misses # 0.51% of all branches > > 29.961492616 seconds time elapsed > > 7.796758000 seconds user > 471.990545000 seconds sys > > > 30% reduction in runtime because the transaction rate is now > running at 1.1M/s. Improvements in code execution across the board > here, so the problem clearly isn't the deferred inode inactivation. > > IOWs, I'm largely not concerned about the high CPU count perf > regressions that we are seeing from log code these days - the fix is > largely ready, it's just lacking in available engineering time to get > it over the line and merged right now... OK, thanks for explanation! Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2021-10-13 9:51 UTC | newest] Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-10-04 10:06 Performance regression with async inode inactivation Jan Kara 2021-10-04 18:17 ` Darrick J. Wong 2021-10-04 21:15 ` Dave Chinner 2021-10-05 8:11 ` Jan Kara 2021-10-05 21:26 ` Dave Chinner 2021-10-06 18:10 ` Jan Kara 2021-10-06 21:58 ` Dave Chinner 2021-10-07 12:03 ` Jan Kara 2021-10-07 23:44 ` Dave Chinner 2021-10-12 13:42 ` Jan Kara 2021-10-12 21:23 ` Dave Chinner 2021-10-13 9:51 ` Jan Kara
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.