From: David Sterba <dsterba@suse.cz>
To: Liu Bo <bo.liu@linux.alibaba.com>
Cc: dsterba@suse.cz, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 0/5] rb_first to rb_first_cached conversion
Date: Thu, 20 Sep 2018 17:56:29 +0200	[thread overview]
Message-ID: <20180920155629.GZ5847@twin.jikos.cz> (raw)
In-Reply-To: <20180914191456.4whx5gyjtg5hwbtt@US-160370MP2.local>

On Fri, Sep 14, 2018 at 12:14:57PM -0700, Liu Bo wrote:
> On Fri, Sep 14, 2018 at 05:11:02PM +0200, David Sterba wrote:
> > On Tue, Sep 11, 2018 at 11:31:49AM -0700, Liu Bo wrote:
> > > On Tue, Sep 11, 2018 at 05:34:03PM +0200, David Sterba wrote:
> > > > On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote:
> > > > > Several structs in btrfs use rb_first() in a while loop; it'd be
> > > > > more efficient to do this with rb_first_cached(), which has O(1)
> > > > > complexity.
> > > > 
> > > > We'd like to see some numbers, though just by reading the code I think
> > > > there's going to be a noticeable improvement for some workloads.
> > > > 
> > > > There's a common pattern:
> > > > 
> > > > while ((node = rb_first(root))) {
> > > > 	entry = rb_entry(node, ...);
> > > > 	next = rb_next(node);
> > > > 	rb_erase(node, root);
> > > > 	cleanup(entry);
> > > > }
> > > > 
> > > > rb_first needs to traverse the tree up to logN depth, rb_erase can
> > > > completely reshuffle the tree. With the caching we'll skip the traversal
> > > > in rb_first. That's a cached memory access vs looped pointer
> > > > dereference trade-off that IMHO has a clear winner.
> > > > 
> > > > As the pattern in many cases traverses the whole tree and removes all
> > > > entries, ideally we could get rid of the rebalancing too. The entries
> > > > would be cleaned up in postorder way, ie. reset the data and kfree in
> > > > the end. This requires that order of the freeing does not matter, which
> > > > might not apply to some trees.
> > > 
> > > The order of freeing does not matter, but we need the tree to return
> > > the correct next node to free, IOW, we have to maintain the rbtree
> > > until the last two nodes.
> > > 
> > > > 
> > > > I did not find suitable rbtree functions or helpers for that,
> > > > rbtree_postorder_for_each_entry_safe does not allow rb_erase and
> > > > rb_erase itself forces the rebalancing internally. But I think we can
> > > > implement that.
> > > > 
> > > > We could possibly use rb_next_postorder, which makes sure that all
> > > > children are traversed before the parent, so this is at least something
> > > > that could be considered.
> > > > 
> > > 
> > > Qu was inquiring about the same thing, I haven't got time to dig it
> > > further.
> > > 
> > > The question we need to answer is whether we care about how fast the
> > > destruction can be, as some of these paths are critical while others
> > > are not.
> > 
> > Yeah, I take __btrfs_return_cluster_to_free_space as an example where
> > the more efficient tree destruction would shorten the time a lock
> > is held.
> > 
> > In other cases it might complicate the code too much, though the
> > performance is not critical, eg. at umount time. Although, it'd be good
> > to optimize that too now that we know how.
> > 
> > And in the remaining cases it's not possible to avoid rb_erase. I had a
> > closer look at btrfs_cleanup_defrag_inodes and btrfs_put_tree_mod_seq.
> > Based on that I'd like to add this series to for-next, the first node
> > caching is clear enough and safe. We can do the other optimizations
> > later.
> 
> This needs more fine-grained profiling to see where the cost
> lies.
> 
> About the perf. number of rb_cached_node, I measured the time spent in
> extent map loop in btrfs_evict_inode(),
> 
> evict_inode_truncate_pages()
>     while (!RB_EMPTY_ROOT(&map_tree->map)) {
>         rb_first();
>         rb_erase();
>     }
> 
> 
> for an em tree containing 10,000 ems,

The maximum depth of an rbtree with this number of nodes is roughly
2 * lg(10000) = 2 * 14 = 28. That's the worst case, not the average, so
in practice it's going to be better, but we're still exchanging 10+
pointer dereferences for a single cached pointer read.

> - with rb_first_cached, the loop takes 4880454 ns,
> - with rb_first, the loop takes 4584148 ns,
> 
> looks like the difference is not that much, ~6%.  After all it's about
> scalability: the bigger the rb tree gets, the more rb_first_cached() wins.

Yeah, there's some difference, but I think cache effects make the
measurements hard to interpret, so I take the numbers as a rough
confirmation that there's not going to be a big difference.

The delayed refs and delayed inode can get some speedup from that as
there's usually only rb_first without rb_erase.

I'm going to update the first patch with the analysis we did and
reference it from the others, so we'll have a starting point for
further optimizations and can verify whether the expected speedups
happen.

Thread overview: 16+ messages
2018-08-22 19:51 [PATCH 0/5] rb_first to rb_first_cached conversion Liu Bo
2018-08-22 19:51 ` [PATCH 1/5] Btrfs: href_root: use rb_first_cached Liu Bo
2018-08-22 19:51 ` [PATCH 2/5] Btrfs: href->ref_tree: " Liu Bo
2018-08-22 19:51 ` [PATCH 3/5] Btrfs: delayed_items: " Liu Bo
2018-08-22 19:51 ` [PATCH 4/5] Btrfs: extent_map: " Liu Bo
2018-08-22 19:51 ` [PATCH 5/5] Btrfs: preftree: " Liu Bo
2018-08-23  0:27 ` [PATCH 0/5] rb_first to rb_first_cached conversion Qu Wenruo
2018-08-23 10:21 ` Holger Hoffstätte
2018-08-23 11:26 ` Nikolay Borisov
2018-09-11 15:34 ` David Sterba
2018-09-11 18:31   ` Liu Bo
2018-09-14 15:11     ` David Sterba
2018-09-14 19:14       ` Liu Bo
2018-09-20 15:56         ` David Sterba [this message]
2018-09-20 16:49 ` David Sterba
2018-09-20 17:40   ` Liu Bo
