linux-kernel.vger.kernel.org archive mirror
From: Peter Zijlstra <peterz@infradead.org>
To: Tejun Heo <tj@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	John Stultz <john.stultz@linaro.org>,
	Ingo Molnar <mingo@redhat.com>,
	lkml <linux-kernel@vger.kernel.org>,
	Dmitry Shmidt <dimitrysh@google.com>,
	Rom Lemarchand <romlem@google.com>,
	Colin Cross <ccross@google.com>, Todd Kjos <tkjos@google.com>,
	Oleg Nesterov <oleg@redhat.com>
Subject: Re: Severe performance regression w/ 4.4+ on Android due to cgroup locking changes
Date: Thu, 14 Jul 2016 18:32:11 +0200
Message-ID: <20160714163211.GE30154@twins.programming.kicks-ass.net>
In-Reply-To: <20160714150715.GJ15005@htj.duckdns.org>

On Thu, Jul 14, 2016 at 11:07:15AM -0400, Tejun Heo wrote:
> How?  While write lock is pending, no new reader is allowed.

Look at the new percpu_down_write (the old one is similar in concept):

+ void percpu_down_write(struct percpu_rw_semaphore *sem)
+ {
+ 	down_write(&sem->rw_sem);
+ 
+ 	/* Notify readers to take the slow path. */
+ 	rcu_sync_enter(&sem->rss);
+ 
+ 	/*
+ 	 * Notify new readers to block; up until now, and thus throughout the
+ 	 * longish rcu_sync_enter() above, new readers could still come in.
+ 	 */
+ 	WRITE_ONCE(sem->state, readers_block);

Note how up until this point new readers could happen? So the 'entire'
synchronize_sched() call is 'invisible' to new readers.

We need the sync_sched() to ensure every new down_read caller goes
through the 'slow' down_read path and thus looks at sem->state at all.

+ 
+ 	smp_mb(); /* D matches A */
+ 
+ 	/*
+ 	 * If they don't see our store of readers_block to sem->state,
+ 	 * then we are guaranteed to see their sem->refcount increment, and
+ 	 * therefore will wait for them.
+ 	 */
+ 
+ 	/* Wait for all now active readers to complete. */
+ 	wait_event(sem->writer, readers_active_check(sem));
+ }
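The reader side that pairs with this can be modeled in plain C11 atomics. This is a userspace sketch, not the kernel code: the per-cpu counters are collapsed into one atomic, `rcu_sync` becomes a plain flag, and the slow path backs out instead of sleeping; all names here (`pcpu_rwsem_model`, `down_read_trylock_model`, `begin_write_model`) are mine, chosen for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { readers_slow, readers_block };

struct pcpu_rwsem_model {
	_Atomic int refcount;	/* stand-in for the per-cpu counters */
	_Atomic int gp_state;	/* rcu_sync state: 0 means fast path */
	_Atomic int state;	/* readers_slow / readers_block */
};

/* Reader fast/slow path, mirroring the structure discussed above. */
bool down_read_trylock_model(struct pcpu_rwsem_model *sem)
{
	atomic_fetch_add_explicit(&sem->refcount, 1, memory_order_relaxed);
	if (atomic_load_explicit(&sem->gp_state, memory_order_relaxed)) {
		atomic_thread_fence(memory_order_seq_cst);	/* A: matches D */
		if (atomic_load(&sem->state) == readers_block) {
			/* Writer pending: back out; the real code waits. */
			atomic_fetch_sub(&sem->refcount, 1);
			return false;
		}
	}
	return true;	/* fast path: one RmW plus one load */
}

/* Writer side, as in percpu_down_write (sans rcu_sync and the rwsem). */
void begin_write_model(struct pcpu_rwsem_model *sem)
{
	atomic_store(&sem->gp_state, 1);	/* rcu_sync_enter() analogue */
	atomic_store(&sem->state, readers_block);
	atomic_thread_fence(memory_order_seq_cst);	/* D: matches A */
	/* real code: wait_event(sem->writer, readers_active_check(sem)); */
}
```

While gp_state is 0 a reader touches only its counter; once the writer flips gp_state and then state, any new reader observes readers_block behind the fence pair and diverts to the slow path.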

> If reader ops are high frequency, they will surely get affected. 

Before the sync_sched() a down_read (PREEMPT=n, LOCKDEP=n) is basically:

  __this_cpu_inc(*sem->refcount)
  if (sem->rss->gp_state) /* false */
	;

This is one purely local (!atomic) RmW and one load of a shared
variable. Absolute minimal overhead.

During sync_sched() gp_state is !0 and we end up doing:

  __this_cpu_inc(*sem->refcount)
  if (sem->rss->gp_state) {
	smp_mb();
	if (sem->state != readers_block) /* true */
		return;
  }

Which is 1 smp_mb() and 1 shared state load (to the same cacheline we
touched before IIRC) more. Now full barriers are really rather
expensive, and do show up on profiles.

(and this is where the old and new code differ, the old code would end
up doing: __down_read(&sem->rwsem), which is a shared atomic RmW and is
_much_ more expensive).
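The cost gap can be felt even in a trivial userspace loop. This hypothetical micro-benchmark (function names and loop counts are mine, not from the thread) contrasts a plain increment, roughly what __this_cpu_inc compiles to, with a shared atomic RmW of the kind the old __down_read performed:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Plain, non-atomic increment: the new fast path's analogue. */
uint64_t plain_incs(uint64_t n)
{
	volatile uint64_t c = 0;	/* volatile so the loop isn't folded away */
	for (uint64_t i = 0; i < n; i++)
		c = c + 1;		/* plain RmW, no lock prefix */
	return c;
}

/* Shared atomic RmW: what the old rwsem-based reader paid. */
uint64_t atomic_incs(uint64_t n)
{
	_Atomic uint64_t c = 0;
	for (uint64_t i = 0; i < n; i++)
		atomic_fetch_add(&c, 1);	/* e.g. lock xadd on x86 */
	return c;
}
```

Timing the two loops (with perf or a simple clock) on typical x86 hardware shows the atomic version costing several times more per iteration, which is the "_much_ more expensive" above.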

> It just isn't a good design to inject RCU grace period synchronously
> into a hot path.

That's just the point: the write side of a _global_ lock can never, by
definition, be a hot path.

Now, I realize not everyone wants these same tradeoffs, and I wrote you
this patch to allow for that.

But I really think that this Android usecase invalidates the premise of
cgroups using a global lock.


Thread overview: 67+ messages
2016-07-13  0:00 Severe performance regression w/ 4.4+ on Android due to cgroup locking changes John Stultz
2016-07-13  8:21 ` Peter Zijlstra
2016-07-13 14:42   ` Paul E. McKenney
2016-07-13 18:13     ` Dmitry Shmidt
2016-07-13 18:32       ` Paul E. McKenney
2016-07-13 18:21 ` Tejun Heo
2016-07-13 18:33   ` Tejun Heo
2016-07-13 20:13     ` John Stultz
2016-07-13 20:18       ` Tejun Heo
2016-07-13 20:26         ` Peter Zijlstra
2016-07-13 20:39           ` Tejun Heo
2016-07-13 20:51             ` Peter Zijlstra
2016-07-13 21:01               ` Tejun Heo
2016-07-13 21:03               ` Paul E. McKenney
2016-07-13 21:05                 ` Tejun Heo
2016-07-13 21:18                   ` Paul E. McKenney
2016-07-13 21:42                     ` Paul E. McKenney
2016-07-13 21:46                       ` John Stultz
2016-07-13 22:17                         ` Paul E. McKenney
2016-07-13 22:39                           ` John Stultz
2016-07-13 23:02                             ` Paul E. McKenney
2016-07-13 23:04                               ` Paul E. McKenney
2016-07-14 11:35                                 ` Tejun Heo
2016-07-14 12:04                                   ` Peter Zijlstra
2016-07-14 12:08                                     ` Tejun Heo
2016-07-14 12:20                                       ` Peter Zijlstra
2016-07-14 15:07                                         ` Tejun Heo
2016-07-14 15:24                                           ` Tejun Heo
2016-07-14 16:32                                           ` Peter Zijlstra [this message]
2016-07-14 17:34                                             ` Oleg Nesterov
2016-07-14 16:54                               ` John Stultz
2016-07-13 22:25                       ` John Stultz
2016-07-13 22:01                     ` Tejun Heo
2016-07-13 22:33                       ` Paul E. McKenney
2016-07-14  6:49                       ` Peter Zijlstra
2016-07-14 11:20                         ` Tejun Heo
2016-07-14 12:11                           ` Peter Zijlstra
2016-07-14 15:14                             ` Tejun Heo
2016-07-14 13:18               ` Peter Zijlstra
2016-07-14 14:14                 ` Peter Zijlstra
2016-07-14 14:58                 ` Oleg Nesterov
2016-07-14 16:14                   ` Peter Zijlstra
2016-07-14 16:37                   ` Peter Zijlstra
2016-07-14 17:05                     ` Oleg Nesterov
2016-07-14 16:23                 ` Paul E. McKenney
2016-07-14 16:45                   ` Peter Zijlstra
2016-07-14 17:15                     ` Paul E. McKenney
2016-07-14 16:43                 ` John Stultz
2016-07-14 16:49                   ` Peter Zijlstra
2016-07-14 17:02                     ` John Stultz
2016-07-14 17:13                       ` Oleg Nesterov
2016-07-14 17:30                         ` John Stultz
2016-07-14 17:41                           ` Oleg Nesterov
2016-07-14 17:51                             ` John Stultz
2016-07-14 18:09                 ` Oleg Nesterov
2016-07-14 18:36                   ` Peter Zijlstra
2016-07-14 19:35                     ` Peter Zijlstra
2016-07-13 20:57             ` John Stultz
2016-07-13 20:52           ` Paul E. McKenney
2016-07-13 20:57             ` Peter Zijlstra
2016-07-13 21:08               ` Paul E. McKenney
2016-07-13 21:01             ` Dmitry Shmidt
2016-07-13 21:03               ` John Stultz
2016-07-13 21:05               ` Paul E. McKenney
2016-07-13 20:31     ` Dmitry Shmidt
2016-07-13 20:44   ` Colin Cross
2016-07-13 20:54     ` Tejun Heo
