From: Chris Mason <clm@fb.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-block <linux-block@vger.kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
Date: Fri, 13 Dec 2019 01:32:10 +0000
Message-ID: <C08B7F86-C3D6-47C6-AB17-6F234EA33687@fb.com>
In-Reply-To: <20191212221818.GG19213@dread.disaster.area>

On 12 Dec 2019, at 17:18, Dave Chinner wrote:

> On Wed, Dec 11, 2019 at 06:09:14PM -0700, Jens Axboe wrote:
>>
>> So Chris and I started talking about this, and pondered "what would
>> happen if we simply bypassed the page cache completely?". Case in
>> point, see the incremental patch below. We still do the page cache
>> lookup, and use that page to copy from if it's there. If the page
>> isn't there, allocate one and do IO to it, but DON'T add it to the
>> page cache. With that, we're almost at O_DIRECT levels of performance
>> for the 4k read case, within 1-2%. I think 512b would look awesome,
>> but we're reading full pages, so that won't really help us much.
>> Compared to the previous uncached method, this is 30% faster on this
>> device. That's substantial.
>
> Interesting idea, but this seems like it is just direct IO with
> kernel pages and a memcpy() rather than mapping user pages, and it
> has none of the advantages of direct IO: because it goes through
> the buffered IO path, reads and writes can't run concurrently.
>
> It also needs all the special DIO truncate/hole punch serialisation
> mechanisms to be propagated into the buffered IO path - the
> requirement for inode_dio_wait() serialisation is something I'm
> trying to remove from XFS, not have to add into more paths. And it
> introduces the same issues with other buffered read/mmap access to
> the same file ranges as direct IO has.
>
>> Obviously this has issues with truncate that would need to be
>> resolved, and it's definitely dirtier. But the performance is very
>> enticing...
>
> At which point I have to ask: why are we considering repeating the
> mistakes that were made with direct IO?  Yes, it might be faster
> than a coherent RWF_UNCACHED IO implementation, but I don't think
> making it more like O_DIRECT is worth the price.
>
> And, ultimately, RWF_UNCACHED will never be as fast as direct IO
> because it *requires* the CPU to copy the data at least once.

They just have different tradeoffs.  O_DIRECT actively blows away 
caches and can also force writes during reads, which makes RWF_UNCACHED 
a more natural fit for some applications.  There are fewer surprises, 
and some services are willing to pay for that flexibility with a 
memcpy.  In general, they still want to do some cache management, 
because it reduces p90+ latencies across the board and gives them more 
control over which pages stay in cache.
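
For concreteness, here's roughly what consuming the new flag looks 
like from userspace, assuming it stays wired through preadv2() and 
pwritev2() like the existing RWF_* flags.  This is just a sketch, and 
the flag value below is a placeholder rather than the one from the 
patchset:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040	/* placeholder; use the value from
					   the patched uapi headers */
#endif

int main(int argc, char **argv)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	ssize_t ret;
	int fd;

	if (argc < 2)
		return 1;

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * The data is still copied into buf (that's the memcpy we're
	 * paying for), but the page cache pages backing the read are
	 * dropped, or never inserted, once the read completes.
	 */
	ret = preadv2(fd, &iov, 1, 0, RWF_UNCACHED);
	if (ret < 0)
		perror("preadv2");

	close(fd);
	return ret < 0;
}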

Most services using buffered IO here as part of their main workload 
pair it with sync_file_range() and sometimes fadvise(POSIX_FADV_DONTNEED).  
We've seen kswapd saturating cores with much slower flash than the fancy 
stuff Jens is using, and the solution is usually O_DIRECT or fadvise.
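
For reference, the pattern those services run today looks roughly 
like the sketch below; the ranges and timing vary per service, so 
treat this as illustrative rather than exact:

#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Kick writeback for a range we just wrote, then drop it from the
 * cache once it's clean.  The second sync_file_range() waits for
 * writeback to finish so POSIX_FADV_DONTNEED can actually evict the
 * pages; pages that are still dirty are quietly skipped.
 */
static void write_out_and_pitch(int fd, off_t off, off_t len)
{
	/* start async writeback without blocking the writer */
	sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);

	/* ... do other work, then later ... */
	sync_file_range(fd, off, len,
			SYNC_FILE_RANGE_WAIT_BEFORE |
			SYNC_FILE_RANGE_WRITE |
			SYNC_FILE_RANGE_WAIT_AFTER);
	posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
}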

Grepping through the code shows a wonderful assortment of helpers to 
control the cache, and RWF_UNCACHED would be both cleaner and faster 
than what we have today.  I'm on the fence about asking for 
RWF_FILE_RANGE_WRITE (naming negotiable) to force writes to start 
without pitching pages, but we can talk to some service owners to see 
how useful that would be.  They can always chain a sync_file_range() 
in io_uring, but an RWF_ flag would be lower overhead if it turned out 
to be a common pattern.
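
The io_uring version of that chain would look something like this 
liburing sketch (assuming IOSQE_IO_LINK is available; error handling 
elided):

#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>

/*
 * Queue a buffered write linked to a sync_file_range, so writeback
 * kicks off as soon as the write completes, without a second trip
 * into the kernel from userspace.
 */
static int write_then_flush(struct io_uring *ring, int fd,
			    const struct iovec *iov, off_t off,
			    unsigned int len)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_writev(sqe, fd, iov, 1, off);
	sqe->flags |= IOSQE_IO_LINK;	/* next SQE runs only if this succeeds */

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_sync_file_range(sqe, fd, len, off,
				      SYNC_FILE_RANGE_WRITE);

	return io_uring_submit(ring);
}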

With all of that said, I really agree that xfs+O_DIRECT wins on write 
concurrency.  Jens's current patches are a great first step, but I 
think that if he really loved us, he'd carve out a concurrent pageless 
write patch series before Christmas.

> Direct
> IO is zero-copy, and so it's always going to have lower overhead
> than RWF_UNCACHED, and so when CPU or memory bandwidth is the
> limiting factor, O_DIRECT will always be faster.
>
> IOWs, I think trying to make RWF_UNCACHED as fast as O_DIRECT is a
> fool's game and attempting to do so is taking a step in the wrong
> direction architecturally.  I'd much prefer a sane IO model for
> RWF_UNCACHED that provides coherency w/ mmap and other buffered IO
> than compromise these things in the chase for ultimate performance.

No matter what I wrote in my letters to Santa this year, I agree that 
avoiding the warts of O_DIRECT isn't something we should compromise on.

-chris
