From: Qi Zheng <qi.zheng@linux.dev>
To: paulmck@kernel.org, Kirill Tkhai <tkhai@ya.ru>
Cc: RCU <rcu@vger.kernel.org>, "Yujie Liu" <yujie.liu@intel.com>,
oe-lkp@lists.linux.dev, lkp@intel.com,
linux-kernel@vger.kernel.org,
"Andrew Morton" <akpm@linux-foundation.org>,
"Vlastimil Babka" <vbabka@suse.cz>,
"Roman Gushchin" <roman.gushchin@linux.dev>,
"Christian König" <christian.koenig@amd.com>,
"David Hildenbrand" <david@redhat.com>,
"Davidlohr Bueso" <dave@stgolabs.net>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Michal Hocko" <mhocko@kernel.org>,
"Muchun Song" <muchun.song@linux.dev>,
"Shakeel Butt" <shakeelb@google.com>,
"Yang Shi" <shy828301@gmail.com>,
linux-mm@kvack.org, ying.huang@intel.com, feng.tang@intel.com,
fengwei.yin@intel.com
Subject: Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression
Date: Mon, 29 May 2023 10:39:21 +0800 [thread overview]
Message-ID: <d4b1599d-14c1-071c-6205-09fe60f2ed8b@linux.dev> (raw)
In-Reply-To: <44407892-b7bc-4d6c-8e4a-6452f0ee88b9@paulmck-laptop>
Hi Paul,
On 2023/5/27 19:14, Paul E. McKenney wrote:
> On Thu, May 25, 2023 at 12:03:16PM +0800, Qi Zheng wrote:
>> On 2023/5/24 19:56, Qi Zheng wrote:
>>> On 2023/5/24 19:08, Qi Zheng wrote:
>>>
>>> [...]
>>>
>>>>
>>>> Well, I just ran the following command and reproduced the result:
>>>>
>>>> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>>>>
>>>> 1) with commit 42c9db3970483:
>>>>
>>>> stress-ng: info: [11023] setting to a 60 second run per stressor
>>>> stress-ng: info: [11023] dispatching hogs: 9 ramfs
>>>> stress-ng: info: [11023] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
>>>> stress-ng: info: [11023]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
>>>> stress-ng: info: [11023] ramfs            774966     60.00     10.18    169.45     12915.89        4314.26
>>>> stress-ng: info: [11023] for a 60.00s run time:
>>>> stress-ng: info: [11023] 1920.11s available CPU time
>>>> stress-ng: info: [11023] 10.18s user time ( 0.53%)
>>>> stress-ng: info: [11023] 169.44s system time ( 8.82%)
>>>> stress-ng: info: [11023] 179.62s total time ( 9.35%)
>>>> stress-ng: info: [11023] load average: 8.99 2.69 0.93
>>>> stress-ng: info: [11023] successful run completed in 60.00s (1 min, 0.00 secs)
>>>>
>>>> 2) with commit f95bdb700bc6b:
>>>>
>>>> stress-ng: info: [37676] dispatching hogs: 9 ramfs
>>>> stress-ng: info: [37676] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
>>>> stress-ng: info: [37676]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
>>>> stress-ng: info: [37676] ramfs            168673     60.00      1.61     39.66      2811.08        4087.47
>>>> stress-ng: info: [37676] for a 60.10s run time:
>>>> stress-ng: info: [37676] 1923.36s available CPU time
>>>> stress-ng: info: [37676] 1.60s user time ( 0.08%)
>>>> stress-ng: info: [37676] 39.66s system time ( 2.06%)
>>>> stress-ng: info: [37676] 41.26s total time ( 2.15%)
>>>> stress-ng: info: [37676] load average: 7.69 3.63 2.36
>>>> stress-ng: info: [37676] successful run completed in 60.10s (1 min, 0.10 secs)
>>>>
>>>> The bogo ops/s (real time) did drop significantly.
>>>>
>>>> And memory reclamation was not triggered at all during the run, so
>>>> theoretically no one should be in a read-side critical section of
>>>> shrinker_srcu.
>>>>
>>>> Then I found that some stress-ng-ramfs processes were in
>>>> TASK_UNINTERRUPTIBLE state for a long time:
>>>>
>>>> root     42313  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 stress-ng-ramfs [run]
>>>> root     42314  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 stress-ng-ramfs [run]
>>>> root     42315  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 stress-ng-ramfs [run]
>>>> root     42316  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 stress-ng-ramfs [run]
>>>> root     42317  7.8  0.0  69592  1812 pts/0    D    19:00   0:02 stress-ng-ramfs [run]
>>>> root     42318  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 stress-ng-ramfs [run]
>>>> root     42319  7.8  0.0  69592  1812 pts/0    D    19:00   0:02 stress-ng-ramfs [run]
>>>> root     42320  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 stress-ng-ramfs [run]
>>>> root     42321  7.8  0.0  69592  1812 pts/0    D    19:00   0:02 stress-ng-ramfs [run]
>>>> root     42322  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 stress-ng-ramfs [run]
>>>> root     42323  7.8  0.0  69592  1812 pts/0    D    19:00   0:02 stress-ng-ramfs [run]
>>>> root     42324  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 stress-ng-ramfs [run]
>>>> root     42325  7.8  0.0  69592  1812 pts/0    D    19:00   0:02 stress-ng-ramfs [run]
>>>> root     42326  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 stress-ng-ramfs [run]
>>>> root     42327  7.9  0.0  69592  1812 pts/0    D    19:00   0:02 stress-ng-ramfs [run]
>>>> root     42328  7.9  0.0  69592  1812 pts/0    D    19:00   0:02 stress-ng-ramfs [run]
>>>> root     42329  7.9  0.0  69592  1812 pts/0    D    19:00   0:02 stress-ng-ramfs [run]
>>>> root     42330  7.9  0.0  69592  1556 pts/0    D    19:00   0:02 stress-ng-ramfs [run]
>>>>
>>>> Their call stack is as follows:
>>>>
>>>> cat /proc/42330/stack
>>>>
>>>> [<0>] __synchronize_srcu.part.21+0x83/0xb0
>>>> [<0>] unregister_shrinker+0x85/0xb0
>>>> [<0>] deactivate_locked_super+0x27/0x70
>>>> [<0>] cleanup_mnt+0xb8/0x140
>>>> [<0>] task_work_run+0x65/0x90
>>>> [<0>] exit_to_user_mode_prepare+0x1ba/0x1c0
>>>> [<0>] syscall_exit_to_user_mode+0x1b/0x40
>>>> [<0>] do_syscall_64+0x44/0x80
>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>>>
>>>> + RCU folks: is this result expected? I would have thought that
>>>> synchronize_srcu() would return quickly if no one is in the
>>>> read-side critical section. :(
>
> In theory, it would indeed be nice if synchronize_srcu() would do that.
> In practice, the act of checking to see if there is anyone in an SRCU
> read-side critical section is a heavy-weight operation, involving at
> least one cache miss per CPU along with a number of full memory barriers.
>
> So SRCU has to be careful to not check too frequently.
Got it.
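
(For anyone else following along: with this series the read side in
shrink_slab() looks roughly like the sketch below -- simplified from
mm/vmscan.c, with the shrink_control setup and error handling elided.
srcu_read_lock()/srcu_read_unlock() are just per-CPU counter updates,
which is exactly why the writer has to scan every CPU to find out
whether anyone is still inside the critical section.)

    int srcu_idx;
    struct shrinker *shrinker;

    srcu_idx = srcu_read_lock(&shrinker_srcu);
    list_for_each_entry_srcu(shrinker, &shrinker_list, list,
                             srcu_read_lock_held(&shrinker_srcu)) {
            /* ... build a shrink_control and call do_shrink_slab() ... */
    }
    srcu_read_unlock(&shrinker_srcu, srcu_idx);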
>
> However, if SRCU has been idle for some time, normal synchronize_srcu()
> will do an immediate check. And this will of course mark SRCU as
> non-idle.
>
>>> With the following change, ops/s returns to previous levels:
>>
>> Or just set rcu_expedited to 1:
>> echo 1 > /sys/kernel/rcu_expedited
>
> This does cause SRCU to be much more aggressive. This can be a good
> choice for small systems, but please keep in mind that this affects normal
> RCU as well as SRCU. It will cause RCU to also be much more aggressive,
> sending IPIs to CPUs that are (or might be) in RCU read-side critical
> sections. Depending on your workload, this might or might not be what
> you want RCU to be doing. For example, if you are running aggressive
> real-time workloads, it most definitely is not what you want.
Yeah, that's not what I want; a shrinker might run for a long time.
>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index db2ed6e08f67..90f541b07cd1 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -763,7 +763,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>> debugfs_entry = shrinker_debugfs_remove(shrinker);
>>> up_write(&shrinker_rwsem);
>>>
>>> - synchronize_srcu(&shrinker_srcu);
>>> + synchronize_srcu_expedited(&shrinker_srcu);
>
> If shrinkers are unregistered only occasionally, this is an entirely
> reasonable change.
>
>>> debugfs_remove_recursive(debugfs_entry);
>>>
>>> stress-ng: info: [13159] dispatching hogs: 9 ramfs
>>> stress-ng: info: [13159] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
>>> stress-ng: info: [13159]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
>>> stress-ng: info: [13159] ramfs            710062     60.00      9.63    157.26     11834.18        4254.75
>>> stress-ng: info: [13159] for a 60.00s run time:
>>> stress-ng: info: [13159] 1920.14s available CPU time
>>> stress-ng: info: [13159] 9.62s user time ( 0.50%)
>>> stress-ng: info: [13159] 157.26s system time ( 8.19%)
>>> stress-ng: info: [13159] 166.88s total time ( 8.69%)
>>> stress-ng: info: [13159] load average: 9.49 4.02 1.65
>>> stress-ng: info: [13159] successful run completed in 60.00s (1 min, 0.00 secs)
>>>
>>> Can we make synchronize_srcu() call synchronize_srcu_expedited() when
>>> no one is in a read-side critical section?
>
> Yes, in theory we could, but this would be a bad thing in practice.
> After all, the point of having synchronize_srcu() be separate from
> synchronize_srcu_expedited() is to allow uses that are OK with longer
> latency to avoid consuming too much CPU. In addition, that longer
> SRCU grace-period latency allows the next grace period to handle more
> synchronize_srcu() and call_srcu() requests. This amortizes the
> overhead of that next grace period over a larger number of updates.
>
> However, your use of synchronize_srcu_expedited() does have that effect,
> but only for this call point, which has the advantage of avoiding
> burning excessive quantities of CPU for the other 50+ call points.
Thanks for such a detailed explanation.
Now I think we can continue trying to complete the idea[1] from
Kirill Tkhai. His patch moves the heavy synchronize_srcu() into delayed
work, so it does not affect the user-visible unregistration speed.
[1].
https://lore.kernel.org/lkml/153365636747.19074.12610817307548583381.stgit@localhost.localdomain/
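
To make the deferred-free idea concrete, the shape would be roughly the
following (my sketch, not Kirill's exact patch; the item/work names are
illustrative, and it only defers freeing ->nr_deferred -- the real
patch also has to guarantee that the struct shrinker itself and its
module stay alive until the grace period has elapsed):

    /* Illustrative names, not from the actual patch. */
    struct shrinker_free_item {
            struct list_head list;
            atomic_long_t *nr_deferred;
    };

    static LIST_HEAD(shrinker_free_list);
    static DEFINE_SPINLOCK(shrinker_free_lock);

    static void shrinker_free_work_fn(struct work_struct *work)
    {
            LIST_HEAD(list);
            struct shrinker_free_item *item, *next;

            /* Grab everything queued so far in one go. */
            spin_lock(&shrinker_free_lock);
            list_splice_init(&shrinker_free_list, &list);
            spin_unlock(&shrinker_free_lock);

            /* One grace period, amortized over all queued items. */
            synchronize_srcu(&shrinker_srcu);

            list_for_each_entry_safe(item, next, &list, list) {
                    list_del(&item->list);
                    kfree(item->nr_deferred);
                    kfree(item);
            }
    }
    static DECLARE_WORK(shrinker_free_work, shrinker_free_work_fn);

unregister_shrinker() would then queue an item under shrinker_free_lock,
call queue_work(system_wq, &shrinker_free_work), and return immediately
instead of blocking in synchronize_srcu().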
Thanks,
Qi
>
> Thanx, Paul