Re: Severe performance regression w/ 4.4+ on Android due to cgroup locking changes

From: Peter Zijlstra <peterz@infradead.org>
To: Tejun Heo <tj@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	John Stultz <john.stultz@linaro.org>,
	Ingo Molnar <mingo@redhat.com>,
	lkml <linux-kernel@vger.kernel.org>,
	Dmitry Shmidt <dimitrysh@google.com>,
	Rom Lemarchand <romlem@google.com>,
	Colin Cross <ccross@google.com>, Todd Kjos <tkjos@google.com>,
	Oleg Nesterov <oleg@redhat.com>
Subject: Re: Severe performance regression w/ 4.4+ on Android due to cgroup locking changes
Date: Thu, 14 Jul 2016 14:20:49 +0200	[thread overview]
Message-ID: <20160714122049.GB30154@twins.programming.kicks-ass.net> (raw)
In-Reply-To: <20160714120845.GE15005@htj.duckdns.org>

On Thu, Jul 14, 2016 at 08:08:45AM -0400, Tejun Heo wrote:
> On Thu, Jul 14, 2016 at 02:04:28PM +0200, Peter Zijlstra wrote:
> > > I think it probably makes sense to make this the default on !RT at
> > > least with a separate patch w/o stable cc'd.  While most use cases
> > > will be fine with the latency on write path, it also means that the
> > > reader side is blocked for the duration which can hurt.  rwsem implies
> > > a lot more readers and thus more read lock operations than writes.
> > > It's weird to trade off higher latency for lower cpu usage when it
> > > would also slow down all readers.
> > 
> > NAK, no expedited muck by default. There's more than just RT that
> > doesn't like IPI sprays.
> 
> Can you elaborate?  

HPC doesn't use RT but still wants to minimize jitter such that all CPUs
complete their work ASAP. They use barriers to wait on the slowest CPU
to complete work.

Sending random interrupts disturbs cache and other stuff and delays
things unnecessarily.

Same with RDMA (or other) userspace poll loops which want minimal
latency, they too don't use RT, but also very much want to avoid the
kernel poking at them.

Many of this could eventually use NOHZ_FULL, but I'm not sure all of
that is suitable.

In general its very bad form to spray interrupts just because and we've
spend a lot of effort to reduce that.

> If that's the case, we have the wrong implemention
> for percpu-rwsem where very long delays for writers induce the same
> level of delays to all readers.  If expedited by default isn't
> workable, we should move away from rcu_sync for percpu_rwsem.

Just because your usecase doesn't like it, doesn't mean its not good.
Its a perfectly fine implementation for uprobes for example. The
addition/removal of uprobes is extremely rare, as global writers should
be.

And no, the writer delay isn't observed by the readers, those will
continue 'undisturbed' for most of it.