Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED

From: Dave Chinner <david@fromorbit.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-block <linux-block@vger.kernel.org>,
	Matthew Wilcox <willy@infradead.org>, Chris Mason <clm@fb.com>,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
Date: Fri, 13 Dec 2019 09:18:18 +1100
Message-ID: <20191212221818.GG19213@dread.disaster.area>
In-Reply-To: <e7fc6b37-8106-4fe2-479c-05c3f2b1c1f1@kernel.dk>

On Wed, Dec 11, 2019 at 06:09:14PM -0700, Jens Axboe wrote:
> On 12/11/19 4:41 PM, Jens Axboe wrote:
> > On 12/11/19 1:18 PM, Linus Torvalds wrote:
> >> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>
> >>> $ cat /proc/meminfo | grep -i active
> >>> Active:           134136 kB
> >>> Inactive:       28683916 kB
> >>> Active(anon):      97064 kB
> >>> Inactive(anon):        4 kB
> >>> Active(file):      37072 kB
> >>> Inactive(file): 28683912 kB
> >>
> >> Yeah, that should not put pressure on some swap activity. We have 28
> >> GB of basically free inactive file data, and the VM is doing something
> >> very very bad if it then doesn't just quickly free it with no real
> >> drama.
> >>
> >> In fact, I don't think it should even trigger kswapd at all, it should
> >> all be direct reclaim. Of course, some of the mm people hate that with
> >> a passion, but this does look like a prime example of why it should
> >> just be done.
> > 
> > For giggles, I ran just a single thread on the file set. We're only
> > doing about 100K IOPS at that point, yet when the page cache fills,
> > kswapd still eats 10% cpu. That seems like a lot for something that
> > slow.
> 
> Warning, the below is from the really crazy department...
> 
> Anyway, I took a closer look at the profiles for the uncached case.
> We're spending a lot of time doing memsets (this is the xa_node init,
> from the radix tree constructor), and call_rcu for the node free later
> on. All wasted time, and something that meant we weren't as close to the
> performance of O_DIRECT as we could be.
> 
> So Chris and I started talking about this, and pondered "what would
> happen if we simply bypassed the page cache completely?". Case in point,
> see below incremental patch. We still do the page cache lookup, and use
> that page to copy from if it's there. If the page isn't there, allocate
> one and do IO to it, but DON'T add it to the page cache. With that,
> we're almost at O_DIRECT levels of performance for the 4k read case,
> within 1-2%. I think 512b would look awesome, but we're reading full
> pages, so that won't really help us much. Compared to the previous
> uncached method, this is 30% faster on this device. That's substantial.

Interesting idea, but this seems like it is just direct IO with
kernel pages and a memcpy() rather than just mapping user pages, but
it has none of the advantages of direct IO (e.g. being able to run
reads and writes concurrently) because it's going through the
buffered IO path.

It also needs all the special DIO truncate/hole punch serialisation
mechanisms to be propagated into the buffered IO path - the
requirement for inode_dio_wait() serialisation is something I'm
trying to remove from XFS, not have to add into more paths. And it
introduces the same issues with other buffered read/mmap access to
the same file ranges as direct IO has.

> Obviously this has issues with truncate that would need to be resolved,
> and it's definitely dirtier. But the performance is very enticing...

At which point I have to ask: why are we considering repeating the
mistakes that were made with direct IO?  Yes, it might be faster
than a coherent RWF_UNCACHED IO implementation, but I don't think
making it more like O_DIRECT is worth the price.

And, ultimately, RWF_UNCACHED will never be as fast as direct IO
because it *requires* the CPU to copy the data at least once. Direct
IO is zero-copy, and so it's always going to have lower overhead
than RWF_UNCACHED, and so when CPU or memory bandwidth is the
limiting factor, O_DIRECT will always be faster.

IOWs, I think trying to make RWF_UNCACHED as fast as O_DIRECT is a
fool's game and attempting to do so is taking a step in the wrong
direction architecturally.  I'd much prefer a sane IO model for
RWF_UNCACHED that provides coherency w/ mmap and other buffered IO
than compromise these things in the chase for ultimate performance.

Speaking of IO path architecture, perhaps what we really need here
is an iomap_apply()->iomap_read_actor loop similar to the write
side. This would allow us to bypass all the complex readahead
shenanigans that generic_file_buffered_read() has to deal with and
directly control page cache residency and build the exact IOs we
need when RWF_UNCACHED is set. This moves it much closer to the
direct IO path in terms of IO setup overhead and physical IO patterns,
but still has all the benefits of being fully cache coherent....
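
To make the shape of that loop concrete, here is a rough userspace
toy model of the apply/actor pattern. To be clear, this is not kernel
code and none of these names exist in the tree: toy_file, toy_extent,
toy_apply(), toy_read_actor() and toy_read() are all made up for
illustration, and the fixed-size extents are an assumption to keep
the mapping step trivial. The only point is that the caller maps one
extent at a time, the actor handles the part of the range that falls
inside that extent, and the loop advances by however many bytes the
actor says it processed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: none of these names are kernel APIs. */
struct toy_extent {
	size_t offset;		/* start of the mapped range in the file */
	size_t length;		/* bytes covered by this mapping */
};

struct toy_file {
	const char *data;	/* stands in for the backing store */
	size_t size;		/* file size in bytes */
	size_t extent_size;	/* assumption: fixed-size extents, must be > 0 */
};

typedef size_t (*toy_actor_t)(const struct toy_file *f,
			      const struct toy_extent *ext,
			      size_t pos, size_t len, void *priv);

/* Map the extent containing pos, clamp len to it, run the actor once. */
static size_t toy_apply(const struct toy_file *f, size_t pos, size_t len,
			toy_actor_t actor, void *priv)
{
	struct toy_extent ext;

	if (pos >= f->size)
		return 0;	/* nothing mapped at or beyond EOF */

	ext.offset = pos - (pos % f->extent_size);
	ext.length = f->extent_size;
	if (ext.offset + ext.length > f->size)
		ext.length = f->size - ext.offset;

	if (len > ext.offset + ext.length - pos)
		len = ext.offset + ext.length - pos;

	return actor(f, &ext, pos, len, priv);
}

struct toy_read_ctx {
	char *buf;		/* caller's destination buffer */
	size_t done;		/* bytes copied so far */
};

/* Read actor: copy the mapped range straight into the caller's buffer. */
static size_t toy_read_actor(const struct toy_file *f,
			     const struct toy_extent *ext,
			     size_t pos, size_t len, void *priv)
{
	struct toy_read_ctx *ctx = priv;

	/* toy_apply() guarantees the range sits inside one extent */
	assert(pos >= ext->offset && pos + len <= ext->offset + ext->length);

	memcpy(ctx->buf + ctx->done, f->data + pos, len);
	ctx->done += len;
	return len;
}

/* The loop itself: keep applying until the range is done or we hit EOF. */
static size_t toy_read(const struct toy_file *f, size_t pos, size_t len,
		       char *buf)
{
	struct toy_read_ctx ctx = { .buf = buf };

	while (len > 0) {
		size_t ret = toy_apply(f, pos, len, toy_read_actor, &ctx);

		if (ret == 0)
			break;
		pos += ret;
		len -= ret;
	}
	return ctx.done;
}
```

With a 10 byte file and 4 byte extents, toy_read(&f, 2, 7, buf) runs
the actor three times (2, 4 and 1 bytes). In a real implementation the
actor would presumably be the place where page cache residency is
decided and the IO for each mapped range is built.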

And, really, when we are talking about high end nvme drives that can
do 5-10GB/s read each, and we can put 20+ of them in a single
machine, there's no real value in doing readahead. i.e. there's
little read IO latency to hide in the first place and such
systems have little memory bandwidth to spare to waste on readahead
IO that we don't end up using...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com