Re: [patch 3/3] radix-tree: RCU lockless readside

From: Nick Piggin <npiggin@suse.de>
To: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Andrew Morton <akpm@osdl.org>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Paul McKenney <Paul.McKenney@us.ibm.com>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	Linux Memory Management <linux-mm@kvack.org>
Subject: Re: [patch 3/3] radix-tree: RCU lockless readside
Date: Thu, 22 Jun 2006 18:55:51 +0200	[thread overview]
Message-ID: <20060622165551.GB23109@wotan.suse.de> (raw)
In-Reply-To: <20060622163032.GC1295@us.ibm.com>

On Thu, Jun 22, 2006 at 09:30:32AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 22, 2006 at 05:45:18PM +0200, Nick Piggin wrote:
> > 
> > I'll probably put a little table in radix-tree.h to summarise the
> > API synchronisation requirements, OK?
> 
> Makes sense to me -- except will that feed into the docbook stuff?
> It seems to me to be really important to get these sorts of requirements
> included in the docbook stuff.  I have had too many people show me
> code that assumed that RCU somehow synchronizes updates, so it is
> good to call out these requirements early and often.

I'm not a docbook expert, but that's a good point. Will RFComments
on the comments when I'm done ;)

> > 
> > Does the single rcu_dereference in radix_tree_gang_lookup look OK?
> 
> Well, it does put a memory barrier in the right place on Alpha, but the
> intent would be more clear to me if the rcu_dereference() were on the
> assignments to each element of the results array.  And there would be
> no additional overhead on most architectures.
> 
> So I would much prefer the rcu_dereference() be on the assignment to
> the results array.

No problem, will change.

> > Ah indeed, that's confusing. Yes, the lookup_tag must exclude updates.
> > I guess I got too mechanical in my conversion... however, tag lookups
> > can be RCUified without a great deal of trouble, so I might take this
> > opportunity.
> 
> The tag lookups would then find anything that (1) had been tagged in a
> prior operation and (2) had not been deleted in the meantime, right?

Yes. Where "prior" is only really prior as guaranteed by some
synchronising (or otherwise dependent) operation. But I don't
need to tell you that ;)

> And the caller could hold a lock across both the tagging and tag
> lookup if greater certainty was desired.  I could imagine this sort
> of semantic being useful for deferred operations on ranges of memory,
> where new additions would have the operation implicit in creation and any
> deletions would no longer need the operation to be performed (or might
> be performed as part of deletion operation), but have not actually used
> this sort of thing myself.
> 
> So I must defer to people who have used tagging and tagged lookups
> in anger.

We use tagged lookups for writeout and synching -- slow IO related
which is why I had not converted it over to lockless. But it could
use lockless tagged lookups: eg. so long as sync catches all the
pages that we *know* to be dirty at the time of the sync, that's
fine.

> > I've tried to get that message across in the radix_tree_lookup_slot
> > comment, if they're using RCU lookups. Enough? I guess I'll add it
> > to the locking summary too.
> 
> This is what I saw in the radix_tree_lookup_slot() comment:
> 
> + *   radix_tree_lookup_slot    -    lookup a slot in a radix tree
> + *   @root:          radix tree root
> + *   @index:         index key
> + *
> + *   Lookup the slot corresponding to the position @index in the radix tree
> + *   @root. This is useful for update-if-exists operations.
> + *
> + *   This function cannot be called under rcu_read_lock, it must be
> + *   excluded from writers, as must the returned slot.
> 
> This comment does not make the RCU-protection point clear to me.
> The constraint is that if you use RCU radix-tree lookups, then you must
> use synchronize_rcu() or call_rcu() when freeing any elements removed
> from the radix tree via radix_tree_delete().
> 
> Or am I missing something here?

Well the pagecache uses pointers to struct page. struct page is never
freed, so we can forget the whole thing ;)
(mem unplug may want to free them, so in that case they would have to
synchronize_rcu).

But I guess for less specialised users, RCU would be the usual way to
go... ah I see, I must have got out of synch somewhere. I have comments
along these lines for radix_tree_lookup_slot:

+ *     This function can be called under rcu_read_lock, however it is the
+ *     duty of the caller to manage the lifetimes of the leaf nodes (ie.
+ *     they would usually be RCU protected as well). Also, dereferencing
+ *     the slot pointer would require rcu_dereference, and modifying it
+ *     would require rcu_assign_pointer.

> > > Rough notes, FYA:
> 
> You went through all these as well?  Hope you have recovered from the
> bout of insomnia!  ;-)

Pretty well gone through them.

> 
> > > o	Don't the items placed into the radix tree need to be protected
> > > 	by RCU?  If not, how does the radix tree avoid handing a pointer
> > > 	to something that has recently been removed, that the caller to
> > > 	radix_tree_delete() might have already freed?
> > 
> > Yes, they'll need to be protected by something. In the lockless pagecache,
> > they are never freed, so that isn't an issue. But other users (Ben's
> > irq patch perhaps, unless it uses the slot pointers directly) will have to
> > be careful.
> 
> Ah!  If they are never freed, there is still a need to take care when
> reusing them.  One approach is to prevent them from being reused until
> after a grace period has elapsed, another is to use revalidation checks
> after lookup.  Either way, one needs to allow for the fact that a
> lookup might hand you something that has just been deleted.

Yep, validation checks are done after lookup. The core of it, the
``page_cache_get_speculative'' function isn't that big.

> > I'll send out an incremental diff with changes.
> 
> Looking forward to it!  Maybe not as anxiously as Ben Herrenschmidt, but
> so it goes.  ;-)

Well I couldn't ask you to spend any more time on it, but if you
get interested and take a peek, that'll be a bonus for me ;)

Thanks again

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>