From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Avi Kivity <avi@redhat.com>, Oleg Nesterov <oleg@redhat.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	KVM list <kvm@vger.kernel.org>
Subject: Re: [RFC][PATCH] srcu: Implement call_srcu()
Date: Wed, 1 Feb 2012 06:07:34 -0800	[thread overview]
Message-ID: <20120201140734.GD2488@linux.vnet.ibm.com> (raw)
In-Reply-To: <1328091749.2760.34.camel@laptop>

On Wed, Feb 01, 2012 at 11:22:29AM +0100, Peter Zijlstra wrote:
> On Tue, 2012-01-31 at 14:24 -0800, Paul E. McKenney wrote:
> 
> > > > Can we get it back to speed by scheduling a work function on all cpus? 
> > > > wouldn't that force a quiescent state and allow call_srcu() to fire?
> > > > 
> > > > In kvm's use case synchronize_srcu_expedited() is usually called when no
> > > > thread is in a critical section, so we don't have to wait for anything
> > > > except the srcu machinery.
> > > 
> > > OK, I'll try and come up with means of making it go fast again ;-)
> > 
> > I cannot resist suggesting a kthread to do the call_srcu(), which
> > would allow synchronize_srcu_expedited() to proceed with its current
> > brute-force speed.
> 
> Right, so I really don't like to add a kthread per srcu instance.
> Sharing a kthread between all SRCUs will be problematic since these sync
> things can take forever and so the thread will become a bottleneck.

It should be possible to substitute call_rcu_sched() for synchronize_sched()
and use a per-SRCU-instance state machine to keep a given instance from
blocking progress on the other instances.
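
To illustrate the idea, here is a minimal single-threaded userspace model of
such a per-instance state machine (names and states are purely illustrative,
not kernel code; in the kernel each transition would be driven by a
call_rcu_sched() callback rather than a direct function call):

```c
#include <assert.h>

/* Illustrative model only: each srcu_struct drives its own grace-period
 * state machine, so one slow instance cannot stall callbacks queued on
 * another instance. */
enum gp_state { GP_IDLE, GP_WAIT_READERS1, GP_FLIP, GP_WAIT_READERS2, GP_DONE };

struct srcu_model {
	enum gp_state state;
	int pending_cbs;	/* callbacks queued by call_srcu() */
	int done_cbs;		/* callbacks whose grace period has elapsed */
};

/* One step of the state machine. */
static void srcu_advance(struct srcu_model *sp)
{
	switch (sp->state) {
	case GP_IDLE:
		if (sp->pending_cbs)
			sp->state = GP_WAIT_READERS1;
		break;
	case GP_WAIT_READERS1:	/* wait for pre-existing readers on old idx */
		sp->state = GP_FLIP;
		break;
	case GP_FLIP:		/* flip the index new readers will use */
		sp->state = GP_WAIT_READERS2;
		break;
	case GP_WAIT_READERS2:	/* wait for readers on the other idx */
		sp->done_cbs += sp->pending_cbs;
		sp->pending_cbs = 0;
		sp->state = GP_DONE;
		break;
	case GP_DONE:		/* invoke callbacks, then go idle */
		sp->state = GP_IDLE;
		break;
	}
}
```

Because the state lives in the srcu_struct, one shared kthread (or workqueue)
can round-robin over instances without any one instance's grace period
blocking the rest.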

> Also, I'd really like to come up with a better means of sync for SRCU
> and not hammer the entire machine (3 times).

We could introduce read-side smp_mb()s and reduce to one machine-hammer
straightforwardly.  It should be possible to make it so that if a given
srcu_struct ever uses synchronize_srcu_expedited(), it is required to
use the slower version of the read-side primitives.

Hmmm..  I should go back and take a look at the 8 or 9 variants that
I posted to LKML when I first worked on this.

> One of the things I was thinking of is adding a sequence counter in the
> per-cpu data. Using that we could do something like:
> 
>   unsigned int seq1 = 0, seq2 = 0, count = 0;
>   int cpu, idx;
> 
>   idx = ACCESS_ONCE(sp->completions) & 1;
> 
>   for_each_possible_cpu(cpu)
> 	seq1 += per_cpu(sp->per_cpu_ref, cpu)->seq;
> 
>   for_each_possible_cpu(cpu)
> 	count += per_cpu(sp->per_cpu_ref, cpu)->c[idx];
> 
>   for_each_possible_cpu(cpu)
> 	seq2 += per_cpu(sp->per_cpu_ref, cpu)->seq;
> 
>   /*
>    * there's no active references and no activity, we pass
>    */
>   if (seq1 == seq2 && count == 0)
> 	return;
> 
>   synchronize_srcu_slow();
> 
> 
> This would add a fast-path which should catch the case Avi outlined
> where we call sync_srcu() when there's no other SRCU activity.

Overflow could give us false positives for workloads with frequent
readers.  Yes, I know you don't like 32-bit systems, but SRCU still has
to work on them.  ;-)

We also need some smp_mb() calls for readers and in the above sequence
for this to have half a chance of working.
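
To make the overflow concern concrete, here is a hedged single-threaded
userspace model of the proposed fast path (names like sync_fast_path_ok()
are illustrative; a uint8_t seq exaggerates the wraparound that a 32-bit
seq would hit after 2^32 reader events, and the real version would need
the smp_mb() pairing noted above):

```c
#include <assert.h>
#include <stdint.h>

#define NCPUS 4

/* Model of the proposed per-CPU state. */
struct cpu_ref {
	uint8_t seq;	/* bumped on each srcu_read_lock(); deliberately tiny */
	int c[2];	/* active reader counts, one per idx */
};

static struct cpu_ref refs[NCPUS];

/* Returns 1 when the fast path believes there are no active readers and
 * no reader activity; in the kernel, smp_mb() would be required between
 * the three scans and in the read-side primitives for this to be safe. */
static int sync_fast_path_ok(int idx)
{
	unsigned int seq1 = 0, seq2 = 0, count = 0;
	int cpu;

	for (cpu = 0; cpu < NCPUS; cpu++)
		seq1 += refs[cpu].seq;
	for (cpu = 0; cpu < NCPUS; cpu++)
		count += refs[cpu].c[idx];
	for (cpu = 0; cpu < NCPUS; cpu++)
		seq2 += refs[cpu].seq;

	return seq1 == seq2 && count == 0;
}
```

The test below shows the wraparound failure mode: a number of reader
entries equal to the counter's modulus leaves seq unchanged, so the
seq1 == seq2 comparison cannot see that any activity occurred.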

> The other thing I was hoping to be able to pull off is add a copy of idx
> into the same cacheline as c[] and abuse cache-coherency to avoid some
> of the sync_sched() calls, but that's currently hurting my brain.

I will check with some hardware architects.  Last I knew, this was not
officially supported, but it cannot hurt to ask again.

I suppose that another approach would be to use bitfields.  Fifteen bits
should be enough for each of the counters, but ouch!  Atomic instructions
would be required to make incrementing all the idx bits safe.
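
One possible packing, purely illustrative: both 15-bit counters and a copy
of idx share a single 32-bit word, so flipping idx is atomic with respect
to the counter increments on that word (all names and the bit layout below
are assumptions for the sketch, not an existing kernel layout):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout: bits 0-14 hold c[0], bits 15-29 hold c[1],
 * bit 30 holds the per-word copy of idx.  Fifteen bits caps each
 * counter at 32767 concurrent readers per CPU. */
#define C0_SHIFT	0
#define C1_SHIFT	15
#define CTR_MASK	0x7fffu
#define IDX_BIT		(1u << 30)

static inline unsigned int ctr(uint32_t word, int idx)
{
	return (word >> (idx ? C1_SHIFT : C0_SHIFT)) & CTR_MASK;
}

/* A reader increments the counter selected by the word's own idx bit.
 * In the kernel this read-modify-write would need an atomic instruction
 * (e.g. cmpxchg) so that an updater flipping IDX_BIT cannot be observed
 * half-done, which is the "ouch" above. */
static inline uint32_t reader_inc(uint32_t word)
{
	int idx = !!(word & IDX_BIT);

	return word + (1u << (idx ? C1_SHIFT : C0_SHIFT));
}
```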

							Thanx, Paul


