From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>,
linux-xfs@vger.kernel.org, Dave Chinner <dchinner@redhat.com>,
"Darrick J. Wong" <djwong@kernel.org>
Subject: Re: Performance regression with async inode inactivation
Date: Wed, 13 Oct 2021 11:51:39 +0200 [thread overview]
Message-ID: <20211013095139.GC19200@quack2.suse.cz> (raw)
In-Reply-To: <20211012212339.GQ2361455@dread.disaster.area>
On Wed 13-10-21 08:23:39, Dave Chinner wrote:
> On Tue, Oct 12, 2021 at 03:42:55PM +0200, Jan Kara wrote:
> > On Fri 08-10-21 10:44:30, Dave Chinner wrote:
> > > Remove the fsync and we do have a real world workload - temporary
> > > files for compiles, etc. Then everything should mostly get logged
> > > in the same CIL context because all transactions are run
> > > asynchronously and aggregate in memory until the CIL is full and
> > > checkpoints itself. Hence log writes shouldn't change very much at
> > > all for such workloads.
> >
> > OK, that makes sense. Thanks for explanation. So to verify your theory,
> > I've removed fsync(2) from the test program. So now it is pure create,
> > write, unlink workload. Results of "stress-unlink 48 /mnt", now for 5000
> > loops of create-unlink to increase runtime (but the workload does barely
> > any writes so it should not matter wrt log):
> >
> > default mkfs params:
> > AVG STDDEV
> > before 2.0380 0.1597
> > after 2.7356 0.4712
> >
> > agcount=96, log size 512M
> > AVG STDDEV
> > before 1.0610 0.0227
> > after 1.2508 0.0218
> >
> > So still notable regression with the async inactivation. With default mkfs
> > params we'd need more runs to get more reliable results (note the rather
> > high standard deviation) but with high AG count results show pretty stable
> > 20% regression - so let's have a look at that.
> >
> > Looking at xfs stats there are barely any differences between before &
> > after - 'after' writes a bit more to the log but it is ~1.5 MB over the
> > whole benchmark run, altogether spending some 8ms doing IO so that's not
> > it. Generally the workload seems to be CPU / memory bound now (it does
> > barely any IO). Perf shows we are heavily contending on some spinlock in
> > xfs_cil_commit() - I presume this is a xc_cil_lock.
>
> Yes, and I have patches that fix this. It got reverted a before
> release because it exposed a bunch of underlying zero-fay bugs in
> the log code, and I haven't had time to run it through the review
> cycle again even though it's pretty much unchanged from commits
> 26-39 in this series:
>
> https://lore.kernel.org/linux-xfs/20210603052240.171998-1-david@fromorbit.com/
>
> The profiles in this patch demonstrate the problem and the fix:
>
> https://lore.kernel.org/linux-xfs/20210603052240.171998-35-david@fromorbit.com/
>
> I did all my perf testing of inode inactivation with the CIL
> scalability patches also installed, because deferred inode
> inactivation only made contention on the CIL lock worse in my perf
> tests. We simply can't evaluate the benefit of a change when the
> system is easily driven into catastrophic lock breakdown by user
> level operational concurrency.
>
> IOWs, the CIL lock is the global limiting factor for async
> transaction commit rates on large CPU count machines, and things
> that remove bottlenecks in higher layers often just increase
> contention on this lock and drive it into breakdown. That makes perf
> go backwards, not forwards, and it's not the fault of the high level
> change being made. That doesn't make the high level change wrong, it
> just means we need to peel the onion further before the improvements
> are fully realised.
OK, understood.
> > This actually happens
> > both before and after, but we seem to spend some more time there with async
> > inactivation. Likely this is related to work being done from worker
> > threads. Perf stats for comparison:
> >
> > before
> > 51,135.08 msec cpu-clock # 47.894 CPUs utilized
> > 4,699 context-switches # 0.092 K/sec
> > 382 cpu-migrations # 0.007 K/sec
> > 1,228 page-faults # 0.024 K/sec
> > 128,884,972,351 cycles # 2.520 GHz
> > 38,517,767,839 instructions # 0.30 insn per cycle
> > 8,337,611,468 branches # 163.051 M/sec
> > 39,749,736 branch-misses # 0.48% of all branches
> > 25,225,109 cache-misses
> >
> > 1.067666170 seconds time elapsed
> >
> > after
> > 65,353.43 msec cpu-clock # 47.894 CPUs utilized
> > 43,737 context-switches # 0.669 K/sec
> > 1,824 cpu-migrations # 0.028 K/sec
> > 1,953 page-faults # 0.030 K/sec
> > 155,144,150,867 cycles # 2.374 GHz
> > 45,280,145,337 instructions # 0.29 insn per cycle
> > 10,027,567,384 branches # 153.436 M/sec
> > 39,554,691 branch-misses # 0.39% of all branches
> > 30,203,567 cache-misses
> >
> > 1.364539400 seconds time elapsed
> >
> > So we can see huge increase in context-switches, notable increase in
> > cache-misses, decrease in cycles/s so perhaps we are bouncing cache more?
> > Anyway I guess this is kind of expected due to the nature of async
> > inactivation, I just wanted to highlight that there are regressions without
> > fsync in the game as well.
>
> Context switches are largely noise - they are most likely just AGI
> locks being bounced a bit more. It's the spinlock contention that is
> the likely issue here. For example, on my 32p machine with vanilla
> 5.15-rc4 with a fsync-less, 5000 iteration test run:
>
> $ sudo perf_5.9 stat ./stress-unlink 32 /mnt/scratch
> Processes started.
> 1.290
>
> Performance counter stats for './stress-unlink 32 /mnt/scratch':
>
> 16,856.61 msec task-clock # 12.595 CPUs utilized
> 48,297 context-switches # 0.003 M/sec
> 4,219 cpu-migrations # 0.250 K/sec
> 1,373 page-faults # 0.081 K/sec
> 39,254,798,526 cycles # 2.329 GHz
> 16,460,808,349 instructions # 0.42 insn per cycle
> 3,475,251,228 branches # 206.166 M/sec
> 12,129,889 branch-misses # 0.35% of all branches
>
> 1.338312347 seconds time elapsed
>
> 0.186554000 seconds user
> 17.247176000 seconds sys
>
> And with 5.15-rc4 + CIL scalability:
>
> $ sudo perf_5.9 stat ./stress-unlink 32 /mnt/scratch
> Processes started.
> 0.894
>
> Performance counter stats for './stress-unlink 32 /mnt/scratch':
>
> 12,917.93 msec task-clock # 13.805 CPUs utilized
> 39,680 context-switches # 0.003 M/sec
> 2,737 cpu-migrations # 0.212 K/sec
> 1,402 page-faults # 0.109 K/sec
> 30,920,293,752 cycles # 2.394 GHz
> 14,472,067,501 instructions # 0.47 insn per cycle
> 2,700,978,247 branches # 209.087 M/sec
> 9,287,754 branch-misses # 0.34% of all branches
>
> 0.935710173 seconds time elapsed
>
> 0.192467000 seconds user
> 13.245977000 seconds sys
>
> Runtime of the fsync-less, 5,000 iteration version drops from 1.29s
> to 0.89s, IPC goes up, branches and branch-misses go down, context
> switches only go down slightly, etc. IOWs, when you take away the
> CIL lock contention, we get back all that perf loss and then some...
Nice results!
> FWIW, let's really hammer it for a long while. Vanilla 5.14-rc4:
>
> $ sudo perf_5.9 stat ./stress-unlink 1000 /mnt/scratch
> Processes started.
> 38.881
>
> Performance counter stats for './stress-unlink 1000 /mnt/scratch':
>
> 733,741.06 msec task-clock # 16.004 CPUs utilized
> 13,131,968 context-switches # 0.018 M/sec
> 1,302,636 cpu-migrations # 0.002 M/sec
> 40,720 page-faults # 0.055 K/sec
> 1,195,192,185,398 cycles # 1.629 GHz
> 643,382,890,656 instructions # 0.54 insn per cycle
> 129,065,409,600 branches # 175.900 M/sec
> 768,146,988 branch-misses # 0.60% of all branches
>
> 45.847750477 seconds time elapsed
>
> 11.194020000 seconds user
> 758.762376000 seconds sys
>
> And the transaction rate is pinned at 800,000/s for the entire test.
> We're running at the limit of the CIL lock here.
>
> With CIL scalbility patchset:
>
> $ sudo perf_5.9 stat ./stress-unlink 1000 /mnt/scratch
> Processes started.
> 28.263
>
> Performance counter stats for './stress-unlink 1000 /mnt/scratch':
>
> 450,547.80 msec task-clock # 15.038 CPUs utilized
> 5,949,268 context-switches # 0.013 M/sec
> 868,887 cpu-migrations # 0.002 M/sec
> 41,570 page-faults # 0.092 K/sec
> 836,933,822,425 cycles # 1.858 GHz
> 536,132,522,275 instructions # 0.64 insn per cycle
> 99,264,579,180 branches # 220.320 M/sec
> 506,921,132 branch-misses # 0.51% of all branches
>
> 29.961492616 seconds time elapsed
>
> 7.796758000 seconds user
> 471.990545000 seconds sys
>
>
> 30% reduction in runtime because the transaction rate is now
> running at 1.1M/s. Improvements in code execution across the board
> here, so the problem clearly isn't the deferred inode inactivation.
>
> IOWs, I'm largely not concerned about the high CPU count perf
> regressions that we are seeing from log code these days - the fix is
> largely ready, it's just lacking in available engineering time to get
> it over the line and merged right now...
OK, thanks for explanation!
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
prev parent reply other threads:[~2021-10-13 9:51 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-10-04 10:06 Performance regression with async inode inactivation Jan Kara
2021-10-04 18:17 ` Darrick J. Wong
2021-10-04 21:15 ` Dave Chinner
2021-10-05 8:11 ` Jan Kara
2021-10-05 21:26 ` Dave Chinner
2021-10-06 18:10 ` Jan Kara
2021-10-06 21:58 ` Dave Chinner
2021-10-07 12:03 ` Jan Kara
2021-10-07 23:44 ` Dave Chinner
2021-10-12 13:42 ` Jan Kara
2021-10-12 21:23 ` Dave Chinner
2021-10-13 9:51 ` Jan Kara [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20211013095139.GC19200@quack2.suse.cz \
--to=jack@suse.cz \
--cc=david@fromorbit.com \
--cc=dchinner@redhat.com \
--cc=djwong@kernel.org \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.