containers.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Oliver Sang <oliver.sang@intel.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jens Axboe <axboe@kernel.dk>, Feng Tang <feng.tang@intel.com>,
	0day robot <lkp@intel.com>,
	Kernel Hardening <kernel-hardening@lists.openwall.com>,
	Linux Containers <containers@lists.linux-foundation.org>,
	Jann Horn <jannh@google.com>, LKML <linux-kernel@vger.kernel.org>,
	Oleg Nesterov <oleg@redhat.com>, Linux-MM <linux-mm@kvack.org>,
	lkp@lists.01.org, Alexey Gladkov <legion@kernel.org>,
	"Huang, Ying" <ying.huang@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	zhengjun.xing@intel.com,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Christian Brauner <christian.brauner@ubuntu.com>,
	Kees Cook <keescook@chromium.org>
Subject: Re: 08ed4efad6: stress-ng.sigsegv.ops_per_sec -41.9% regression
Date: Fri, 23 Apr 2021 10:47:22 +0800	[thread overview]
Message-ID: <20210423024722.GA13968@xsang-OptiPlex-9020> (raw)
In-Reply-To: <m1im4wmx9g.fsf@fess.ebiederm.org>

hi, Eric,

On Thu, Apr 08, 2021 at 01:44:43PM -0500, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> > On Thu, Apr 8, 2021 at 1:32 AM kernel test robot <oliver.sang@intel.com> wrote:
> >>
> >> FYI, we noticed a -41.9% regression of stress-ng.sigsegv.ops_per_sec due to commit
> >> 08ed4efad684 ("[PATCH v10 6/9] Reimplement RLIMIT_SIGPENDING on top of ucounts")
> >
> > Ouch.
> 
> We were cautiously optimistic when no test problems showed up from
> the last posting that there was nothing to look at here.
> 
> Unfortunately it looks like the bots just missed the last posting. 

this report is upon v10. do you have newer version which hope bot test?

please be noted, sorry to say, due to various reasons, it will be a
big challenge for us to capture each version of a patch set.

e.g. we didn't make out a similar performance regression for
v8/v9 version of this one..

> 
> So it seems we are finally pretty much at correct code in need
> of performance tuning.
> 
> > I *think* this test may be testing "send so many signals that it
> > triggers the signal queue overflow case".
> >
> > And I *think* that the performance degradation may be due to lots of
> > unnecessary allocations, because ity looks like that commit changes
> > __sigqueue_alloc() to do
> >
> >         struct sigqueue *q = kmem_cache_alloc(sigqueue_cachep, flags);
> >
> > *before* checking the signal limit, and then if the signal limit was
> > exceeded, it will just be free'd instead.
> >
> > The old code would check the signal count against RLIMIT_SIGPENDING
> > *first*, and if there were m ore pending signals then it wouldn't do
> > anything at all (including not incrementing that expensive atomic
> > count).
> 
> This is an interesting test in a lot of ways as it is testing the
> synchronous signal delivery path caused by an exception.  The test
> is either executing *ptr = 0 (where ptr points to a read-only page)
> or it executes an x86 instruction that is excessively long.
> 
> I have found the code but I haven't figured out how it is being
> called yet.  The core loop is just:
> 	for(;;) {
> 		sigaction(SIGSEGV, &action, NULL);
> 		sigaction(SIGILL, &action, NULL);
> 		sigaction(SIGBUS, &action, NULL);
> 
> 		ret = sigsetjmp(jmp_env, 1);
> 		if (done())
>                 	break;
> 		if (ret) {
>                 	/* verify signal */
>                 } else {
>                 	*ptr = 0;
>                 }
> 	}
> 
> Code like that fundamentally can not be multi-threaded.  So the only way
> the sigpending limit is being hit is if there are more processes running
> that code simultaneously than the size of the limit.
> 
> Further it looks like stress-ng pushes RLIMIT_SIGPENDING as high as it
> will go before the test starts.
> 
> 
> > Also, the old code was very careful to only do the "get_user()" for
> > the *first* signal it added to the queue, and do the "put_user()" for
> > when removing the last signal. Exactly because those atomics are very
> > expensive.
> >
> > The new code just does a lot of these atomics unconditionally.
> 
> Yes. That seems a likely culprit.
> 
> > I dunno. The profile data in there is a bit hard to read, but there's
> > a lot more cachee misses, and a *lot* of node crossers:
> >
> >>    5961544          +190.4%   17314361        perf-stat.i.cache-misses
> >>   22107466          +119.2%   48457656        perf-stat.i.cache-references
> >>     163292 ą  3%   +4582.0%    7645410        perf-stat.i.node-load-misses
> >>     227388 ą  2%   +3708.8%    8660824        perf-stat.i.node-loads
> >
> > and (probably as a result) average instruction costs have gone up enormously:
> >
> >>       3.47           +66.8%       5.79        perf-stat.overall.cpi
> >>      22849           -65.6%       7866        perf-stat.overall.cycles-between-cache-misses
> >
> > and it does seem to be at least partly about "put_ucounts()":
> >
> >>       0.00            +4.5        4.46        perf-profile.calltrace.cycles-pp.put_ucounts.__sigqueue_free.get_signal.arch_do_signal_or_restart.exit_to_user_mode_prepare
> >
> > and a lot of "get_ucounts()".
> >
> > But it may also be that the new "get sigpending" is just *so* much
> > more expensive than it used to be.
> 
> That too is possible.
> 
> That node-load-misses number does look like something is bouncing back
> and forth between the nodes a lot more.  So I suspect stress-ng is
> running multiple copies of the sigsegv test in different processes at
> once.
> 
> 
> 
> That really suggests cache line ping pong from get_ucounts and
> incrementing sigpending.
> 
> It surprises me that obtaining the cache lines exclusively is
> the dominant cost on this code path but obtaining two cache lines
> exclusively instead of one cache cache line exclusively is consistent
> with a causing the exception delivery to take nearly twice as long.
> 
> For the optimization we only care about the leaf count so with a little
> care we can restore the optimization.  So that is probably the thing
> to do here.  The fewer changes to worry about the less likely to find
> surprises.
> 
> 
> 
> That said for this specific case there is a lot of potential room for
> improvement.  As this is a per thread signal the code update sigpending
> in commit_cred and never worry about needing to pin the struct
> user_struct or struct ucounts.  As this is a synchronous signal we could
> skip the sigpending increment, skip the signal queue entirely, and
> deliver the signal to user-space immediately.  The removal of all cache
> ping pongs might make it worth it.
> 
> There is also Thomas Gleixner's recent optimization to cache one
> sigqueue entry per task to give more predictable behavior.  That
> would remove the cost of the allocation.
> 
> Eric
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

  parent reply	other threads:[~2021-04-23  2:30 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-07 17:08 [PATCH v10 0/9] Count rlimits in each user namespace Alexey Gladkov
2021-04-07 17:08 ` [PATCH v10 1/9] Increase size of ucounts to atomic_long_t Alexey Gladkov
2021-04-07 17:08 ` [PATCH v10 2/9] Add a reference to ucounts for each cred Alexey Gladkov
2021-04-07 17:08 ` [PATCH v10 3/9] Use atomic_t for ucounts reference counting Alexey Gladkov
2021-04-07 17:08 ` [PATCH v10 4/9] Reimplement RLIMIT_NPROC on top of ucounts Alexey Gladkov
2021-04-07 17:08 ` [PATCH v10 5/9] Reimplement RLIMIT_MSGQUEUE " Alexey Gladkov
2021-04-07 17:08 ` [PATCH v10 6/9] Reimplement RLIMIT_SIGPENDING " Alexey Gladkov
2021-04-08  8:30   ` 08ed4efad6: stress-ng.sigsegv.ops_per_sec -41.9% regression kernel test robot
2021-04-08 16:22     ` Linus Torvalds
2021-04-08 16:44       ` Alexey Gladkov
2021-04-08 18:44       ` Eric W. Biederman
2021-04-16 11:33         ` Alexey Gladkov
2021-04-23  2:47         ` Oliver Sang [this message]
2021-04-23  7:44           ` Alexey Gladkov
2021-04-28 14:36             ` Oliver Sang
2021-04-28 15:09               ` Alexey Gladkov
2021-04-07 17:08 ` [PATCH v10 7/9] Reimplement RLIMIT_MEMLOCK on top of ucounts Alexey Gladkov
2021-04-07 21:37   ` kernel test robot
2021-04-07 17:08 ` [PATCH v10 8/9] kselftests: Add test to check for rlimit changes in different user namespaces Alexey Gladkov
2021-04-07 17:08 ` [PATCH v10 9/9] ucounts: Set ucount_max to the largest positive value the type can hold Alexey Gladkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210423024722.GA13968@xsang-OptiPlex-9020 \
    --to=oliver.sang@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=christian.brauner@ubuntu.com \
    --cc=containers@lists.linux-foundation.org \
    --cc=ebiederm@xmission.com \
    --cc=feng.tang@intel.com \
    --cc=jannh@google.com \
    --cc=keescook@chromium.org \
    --cc=kernel-hardening@lists.openwall.com \
    --cc=legion@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lkp@intel.com \
    --cc=lkp@lists.01.org \
    --cc=oleg@redhat.com \
    --cc=torvalds@linux-foundation.org \
    --cc=ying.huang@intel.com \
    --cc=zhengjun.xing@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).