From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ingo Molnar
Subject: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
Date: Sun, 29 May 2011 09:35:50 +0200
Message-ID: <20110529073550.GA21254@elte.hu>
References: <20110526180518.GA3572@elte.hu> <4DDE97CE.4000302@redhat.com>
 <20110526202531.GA2765@elte.hu> <20110526230508.GA15983@Krystal>
 <20110527102533.GA24608@elte.hu> <20110527110729.GA26920@elte.hu>
 <4DE13AF0.2080001@redhat.com> <20110528183259.GA15019@elte.hu>
 <4DE1EA93.6040401@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Mathieu Desnoyers, Pekka Enberg, Sasha Levin, john@jfloren.net,
 kvm@vger.kernel.org, asias.hejun@gmail.com, gorcunov@gmail.com,
 prasadjoshi124@gmail.com, "Paul E. McKenney"
To: Avi Kivity
Return-path:
Received: from mx3.mail.elte.hu ([157.181.1.138]:32949 "EHLO mx3.mail.elte.hu"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750814Ab1E2HgF
 (ORCPT ); Sun, 29 May 2011 03:36:05 -0400
Content-Disposition: inline
In-Reply-To: <4DE1EA93.6040401@redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

* Avi Kivity wrote:

> On 05/28/2011 09:32 PM, Ingo Molnar wrote:
> >* Avi Kivity wrote:
> >
> >> > So if you set a notification signal via fcntl(F_SETOWN) on the
> >> > scheduler context switch event fd, the user-space RCU code will
> >> > get a signal on every context switch.
> >>
> >> Context switches are completely uninteresting for userspace rcu:
> >>
> >>   rcu_read_lock();
> >>   ---> context switch
> >>
> >> have we learned anything from that? no. User code is always
> >> preemptible and migratable. If rcu_read_lock() prevented migration
> >> somehow, then we'd know that a context switch means we've started a
> >> grace period for this thread. But it doesn't, so we don't.
> >
> >Well, in the next mail i mentioned that we can do migration events as
> >well, which would be useful: instead of having to keep track of
> >nr_tasks RCU grace periods we could simplify it down to nr_cpus.
>
> I don't see how a migration event helps. It is completely
> transparent from the task's point of view.

It's not transparent at all if you index RCU data structures by the
current CPU index, which the kernel implementation does.

Doing that has the advantage of being much more cache-compressed than
the TID index, and also having better worst-case grace period latency
properties than a TID index.

> > But if we indexed by the TID then we wouldnt need any scheduler
> > bindings at all - this is the simpler approach.
>
> Yes, and it maps 1:1 to the kernel implementation (cpu = task).

No, the kernel indexes grace period tracking (and the write-completion
queues) by CPU index.

> >> What's needed are explicit notifications about grace periods. For
> >> the vcpu threads, calling KVM_VCPU_RUN seems like a good point.
> >> For I/O threads, completion of processing of an event is also a
> >> good point.
> >
> > Grace period notifications are needed too, obviously.
>
> I'd think they're sufficient, no? Is something else needed?

I think you are missing the fact that in the kernel we index RCU data
structures by CPU number:

static void rcu_preempt_qs(int cpu)
{
        struct rcu_data *rdp = &per_cpu(rcu_preempt_data, cpu);
        ...

static void rcu_preempt_note_context_switch(int cpu)
{
        struct task_struct *t = current;
        unsigned long flags;
        struct rcu_data *rdp;
        struct rcu_node *rnp;

        if (t->rcu_read_lock_nesting &&
            (t->rcu_read_unlock_special & RCU_READ_UNLOCK_BLOCKED) == 0) {

                /* Possibly blocking in an RCU read-side critical section. */
                rdp = per_cpu_ptr(rcu_preempt_state.rda, cpu);
                ...

Which could be changed over to be per task in user-space by treating
the TID as a 'virtual CPU' equivalent.

This probably lengthens worst-case rcu_sync() latencies rather
significantly though - possibly turning urcu into a
stop_machine_run() equivalent in the worst-case. (but i could be
wrong about this last bit)

Thanks,

	Ingo
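To illustrate the 'TID as virtual CPU' direction described above, a
minimal quiescent-state-based sketch might look like the code below.
This is only a sketch under assumptions: the names (urcu_task_state,
rcu_register_thread, rcu_quiescent_state, the fixed-size task table)
are hypothetical and are not taken from liburcu or from the kernel
tree - they just show grace period tracking indexed per task instead
of per CPU.

/*
 * Hypothetical per-task ("TID as virtual CPU") grace period tracking
 * for a user-space RCU flavour.  Each registered thread gets one slot,
 * the way the kernel gives each CPU one rcu_data entry.
 */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define MAX_TASKS	128

struct urcu_task_state {
	atomic_ulong	seen_gp;	/* last global grace period this task observed */
};

static struct urcu_task_state	task_state[MAX_TASKS];	/* one slot per registered TID */
static atomic_int		nr_tasks;
static atomic_ulong		global_gp = 1;
static pthread_mutex_t		gp_lock = PTHREAD_MUTEX_INITIALIZER;

static __thread int my_slot = -1;

/* Register the calling thread as a "virtual CPU". */
static void rcu_register_thread(void)
{
	my_slot = atomic_fetch_add(&nr_tasks, 1);
	atomic_store(&task_state[my_slot].seen_gp, atomic_load(&global_gp));
}

/* Reader side: announce a quiescent state for this task. */
static void rcu_quiescent_state(void)
{
	atomic_store(&task_state[my_slot].seen_gp, atomic_load(&global_gp));
}

/*
 * Writer side: start a new grace period, then wait until every
 * registered task has announced a quiescent state after it began.
 */
static void rcu_sync(void)
{
	unsigned long gp;
	int i;

	pthread_mutex_lock(&gp_lock);
	gp = atomic_fetch_add(&global_gp, 1) + 1;

	for (i = 0; i < atomic_load(&nr_tasks); i++) {
		while (atomic_load(&task_state[i].seen_gp) < gp)
			sched_yield();	/* wait for this task, not this CPU */
	}
	pthread_mutex_unlock(&gp_lock);
}

The writer loop above walks nr_tasks entries rather than nr_cpus, and
has to wait for each of them in turn - which is where the worst-case
rcu_sync() latency concern mentioned above comes from.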