From: Matthew Wilcox <willy@infradead.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>, Linux-MM <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-block <linux-block@vger.kernel.org>,
	Chris Mason <clm@fb.com>, Dave Chinner <david@fromorbit.com>,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
Date: Thu, 12 Dec 2019 09:52:00 -0800
Message-ID: <20191212175200.GS32169@bombadil.infradead.org>
In-Reply-To: <CAHk-=wjr1G0xXDs7R=2ZAB=YSs-WLk4GsVwLafw+96XVwo7jyg@mail.gmail.com>

On Wed, Dec 11, 2019 at 06:47:25PM -0800, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 5:56 PM Matthew Wilcox <willy@infradead.org> wrote:
> > It _should_ be the same order of complexity.  Since there's already
> > a page in the page cache, xas_create() just walks its way down to the
> > right node calling xas_descend() and then xas_store() does the equivalent
> > of __radix_tree_replace().  I don't see a bug that would make it more
> > expensive than the old code ... a 10GB file is going to have four levels
> > of radix tree node, so it shouldn't even spend that long looking for
> > the right node.
> 
> The profile seems to say that 85% of the cost of xas_create() is two
> instructions:
> 
> # lib/xarray.c:143:     return (index >> node->shift) & XA_CHUNK_MASK;
>         movzbl  (%rsi), %ecx    # MEM[(unsigned char *)node_13], MEM[(unsigned char *)node_13]
> ...
> # ./include/linux/xarray.h:1145:        return rcu_dereference_check(node->slots[offset],
> # ./include/linux/compiler.h:199:       __READ_ONCE_SIZE;
>         movq    (%rax), %rax    # MEM[(volatile __u64 *)_80], <retval>
> 
> where that first load is "node->shift", and the second load seems to
> be just the node dereference.
> 
> I think both of them are basically just xas_descend(), but it's a bit
> hard to read the several levels of inlining.

Yep, those are basically the two cache misses you get for each level of
the tree.  I'd expect the node->shift load to look slightly more costly
because node->slots[offset] is sometimes in the same cacheline as
node->shift, or in the one immediately after it, so the second load
comes almost for free.
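
For reference, the descent step Linus is profiling looks roughly like
this (a paraphrased sketch of the lib/xarray.c code, not the verbatim
source); the first load is node->shift in the offset calculation and
the second is node->slots[offset]:

/*
 * Paraphrased sketch of the xarray descent step (not verbatim kernel
 * source): each level loads node->shift to compute the slot offset,
 * then loads node->slots[offset] to fetch the child -- the two
 * potential cache misses per level mentioned above.
 */
#include <linux/xarray.h>

static unsigned int sketch_get_offset(unsigned long index,
				      struct xa_node *node)
{
	return (index >> node->shift) & XA_CHUNK_MASK;	/* first load */
}

static void *sketch_descend(struct xa_state *xas, struct xa_node *node)
{
	unsigned int offset = sketch_get_offset(xas->xa_index, node);
	void *entry = xa_entry(xas->xa, node, offset);	/* second load */

	xas->xa_node = node;
	xas->xa_offset = offset;
	return entry;
}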

> I suspect the real issue is that going through kswapd means we've lost
> almost all locality, and with the random walking the LRU list is
> basically entirely unordered so you don't get any advantage of the
> xarray having (otherwise) possibly good cache patterns.
> 
> So we're just missing in the caches all the time, making it expensive.

I have some bad ideas for improving it.

1. We could semi-sort the pages on the LRU list.  If we know we're going
to remove a bunch of pages, we could take a batch of them off the list,
sort them and remove them in-order.  This probably wouldn't be terribly
effective.
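
A minimal sketch of what (1) might look like, assuming the batch is
handed over chained through page->lru and that sorting by (mapping,
index) restores enough locality; remove_batch_sorted() is a made-up
helper, and page locking/refcounting is deliberately left out:

/*
 * Hypothetical sketch for (1): semi-sort a batch of pages pulled off
 * the LRU so that the page-cache removals walk the tree in index
 * order.  Not proposed code; locking and refcounting elided.
 */
#include <linux/list_sort.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

static int page_cmp(void *priv, struct list_head *a, struct list_head *b)
{
	struct page *pa = list_entry(a, struct page, lru);
	struct page *pb = list_entry(b, struct page, lru);

	if (pa->mapping != pb->mapping)
		return pa->mapping < pb->mapping ? -1 : 1;
	return pa->index < pb->index ? -1 : (pa->index > pb->index);
}

static void remove_batch_sorted(struct list_head *pages)
{
	struct page *page, *next;

	list_sort(NULL, pages, page_cmp);
	list_for_each_entry_safe(page, next, pages, lru) {
		list_del(&page->lru);
		delete_from_page_cache(page);	/* walks the tree in order */
	}
}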

2. We could change struct page to point to the xa_node that holds it.
Looking up the page's mapping would then be page->xa_node->array, plus
subtracting offsetof(struct address_space, i_pages) to get back to the
struct address_space.
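
Concretely, that's a container_of() on the node's array pointer, since
i_pages is the xarray embedded in struct address_space.  A hedged
sketch, assuming a new page->xa_node field that doesn't exist today:

/*
 * Hypothetical sketch for (2): page->xa_node is an assumed new field.
 * The node's array pointer is the address of i_pages inside the owning
 * struct address_space, so container_of() (i.e. the offsetof(i_pages)
 * subtraction) recovers the mapping.
 */
#include <linux/fs.h>
#include <linux/mm_types.h>
#include <linux/xarray.h>

static struct address_space *page_mapping_via_node(struct page *page)
{
	struct xarray *xa = page->xa_node->array;	/* assumed field */

	return container_of(xa, struct address_space, i_pages);
}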

3. Once the maple tree is ready, a 10GB file can actually be held more
efficiently in a maple tree than in the radix tree.  Because each node
of the tree is 128 bytes and (Intel) CPUs fetch a pair of cachelines,
there's only one cache miss per level of the tree.  So despite the tree
being deeper, it takes fewer cache misses per level.  With an average of
6 pointers per level of the tree, terminating in a dense node, we'd see:

: 6 * 6 * 6 * 6 * 6 * 6 * 6 * 15
: 4199040
(a 10GB file contains 2621440 4kB pages)

which is still eight cache misses, so to actually come out ahead we'd
need another planned change: allocating large, dense 4kB leaf nodes.
That would reduce the height of the tree by two:

: 6 * 6 * 6 * 6 * 6 * 500
: 3888000
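
To make the level counts explicit, here's the same arithmetic as a
standalone userspace sketch (assuming ~6 used pointers per 128-byte
internal node and dense leaves of 15 or 500 entries); it prints 8
levels for the small leaf and 6 for the 4kB leaf:

/*
 * Standalone check of the level counts above: a 10GB file is 2621440
 * 4kB pages; internal nodes average 6 used pointers; leaves hold 15
 * (128-byte node) or 500 (4kB node) entries.
 */
#include <stdio.h>

static unsigned long levels_needed(unsigned long leaf, unsigned long pages)
{
	unsigned long cap = leaf, levels = 1;

	while (cap < pages) {
		cap *= 6;		/* one more internal level */
		levels++;
	}
	return levels;
}

int main(void)
{
	unsigned long pages = (10UL << 30) >> 12;	/* 2621440 */

	printf("128-byte leaf: %lu levels\n", levels_needed(15, pages));
	printf("4kB leaf:      %lu levels\n", levels_needed(500, pages));
	return 0;
}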

4. For consecutive accesses, I'm working on allocating larger pages
(16kB, 64kB, 256kB, ...).  That will help by reducing the number of
deletions from the page cache by a factor of 4/16/64/...
