From: Chris Mason <clm@fb.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-block <linux-block@vger.kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
Date: Fri, 13 Dec 2019 01:32:10 +0000
Message-ID: <C08B7F86-C3D6-47C6-AB17-6F234EA33687@fb.com>
In-Reply-To: <20191212221818.GG19213@dread.disaster.area>

On 12 Dec 2019, at 17:18, Dave Chinner wrote:

> On Wed, Dec 11, 2019 at 06:09:14PM -0700, Jens Axboe wrote:
>>
>> So Chris and I started talking about this, and pondered "what would
>> happen if we simply bypassed the page cache completely?". Case in
>> point, see the incremental patch below. We still do the page cache
>> lookup, and use that page to copy from if it's there. If the page
>> isn't there, allocate one and do IO to it, but DON'T add it to the
>> page cache. With that, we're almost at O_DIRECT levels of performance
>> for the 4k read case, within 1-2%. I think 512b would look awesome,
>> but we're reading full pages, so that won't really help us much.
>> Compared to the previous uncached method, this is 30% faster on this
>> device. That's substantial.
>
> Interesting idea, but this seems like it is just direct IO with
> kernel pages and a memcpy() rather than mapping user pages, and it
> has none of the advantages of direct IO: because it goes through
> the buffered IO path, reads and writes can't run concurrently.
>
> It also needs all the special DIO truncate/hole punch serialisation
> mechanisms to be propagated into the buffered IO path - the
> requirement for inode_dio_wait() serialisation is something I'm
> trying to remove from XFS, not have to add into more paths. And it
> introduces the same issues with other buffered read/mmap access to
> the same file ranges as direct IO has.
>
>> Obviously this has issues with truncate that would need to be
>> resolved, and it's definitely dirtier. But the performance is very
>> enticing...
>
> At which point I have to ask: why are we considering repeating the
> mistakes that were made with direct IO?  Yes, it might be faster
> than a coherent RWF_UNCACHED IO implementation, but I don't think
> making it more like O_DIRECT is worth the price.
>
> And, ultimately, RWF_UNCACHED will never be as fast as direct IO
> because it *requires* the CPU to copy the data at least once.

They just have different tradeoffs.  O_DIRECT actively blows away 
caches and can also force writes during reads, which makes RWF_UNCACHED 
a more natural fit for some applications.  There are fewer surprises, 
and some services are willing to pay for that flexibility with a 
memcpy.  In general, they still want to do some cache management, 
because it reduces p90+ latencies across the board and gives them more 
control over which pages stay in cache.
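
For concreteness, here's roughly what consuming the new flag looks 
like from userspace, assuming it stays wired through preadv2() and 
pwritev2() like the existing RWF_* flags.  This is just a sketch, and 
the flag value below is a placeholder rather than the one from the 
patchset:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040	/* placeholder; use the value from
					   the patched uapi headers */
#endif

int main(int argc, char **argv)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	ssize_t ret;
	int fd;

	if (argc < 2)
		return 1;

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * The data is still copied into buf (that's the memcpy we're
	 * paying for), but the page cache pages backing the read are
	 * dropped, or never inserted, once the read completes.
	 */
	ret = preadv2(fd, &iov, 1, 0, RWF_UNCACHED);
	if (ret < 0)
		perror("preadv2");

	close(fd);
	return ret < 0;
}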

Most services using buffered IO here as part of their main workload 
pair it with sync_file_range() and sometimes fadvise(POSIX_FADV_DONTNEED).  
We've seen kswapd saturating cores with much slower flash than the fancy 
stuff Jens is using, and the solution is usually O_DIRECT or fadvise.
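
For reference, the pattern those services run today looks roughly 
like the sketch below; the ranges and timing vary per service, so 
treat this as illustrative rather than exact:

#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Kick writeback for a range we just wrote, then drop it from the
 * cache once it's clean.  The second sync_file_range() waits for
 * writeback to finish so POSIX_FADV_DONTNEED can actually evict the
 * pages; pages that are still dirty are quietly skipped.
 */
static void write_out_and_pitch(int fd, off_t off, off_t len)
{
	/* start async writeback without blocking the writer */
	sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);

	/* ... do other work, then later ... */
	sync_file_range(fd, off, len,
			SYNC_FILE_RANGE_WAIT_BEFORE |
			SYNC_FILE_RANGE_WRITE |
			SYNC_FILE_RANGE_WAIT_AFTER);
	posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
}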

Grepping through the code shows a wonderful assortment of helpers to 
control the cache, and RWF_UNCACHED would be both cleaner and faster 
than what we have today.  I'm on the fence about asking for 
RWF_FILE_RANGE_WRITE (naming negotiable) to force writes to start 
without pitching pages, but we can talk to some service owners to see 
how useful that would be.  They can always chain a sync_file_range() 
in io_uring, but an RWF_ flag would be lower overhead if it turned out 
to be a common pattern.
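
The io_uring version of that chain would look something like this 
liburing sketch (assuming IOSQE_IO_LINK is available; error handling 
elided):

#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>

/*
 * Queue a buffered write linked to a sync_file_range, so writeback
 * kicks off as soon as the write completes, without a second trip
 * into the kernel from userspace.
 */
static int write_then_flush(struct io_uring *ring, int fd,
			    const struct iovec *iov, off_t off,
			    unsigned int len)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_writev(sqe, fd, iov, 1, off);
	sqe->flags |= IOSQE_IO_LINK;	/* next SQE runs only if this succeeds */

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_sync_file_range(sqe, fd, len, off,
				      SYNC_FILE_RANGE_WRITE);

	return io_uring_submit(ring);
}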

With all of that said, I really agree that xfs+O_DIRECT wins on write 
concurrency.  Jens's current patches are a great first step, but I 
think that if he really loved us, he'd carve out a concurrent pageless 
write patch series before Christmas.

> Direct
> IO is zero-copy, and so it's always going to have lower overhead
> than RWF_UNCACHED, and so when CPU or memory bandwidth is the
> limiting factor, O_DIRECT will always be faster.
>
> IOWs, I think trying to make RWF_UNCACHED as fast as O_DIRECT is a
> fool's game and attempting to do so is taking a step in the wrong
> direction architecturally.  I'd much prefer a sane IO model for
> RWF_UNCACHED that provides coherency w/ mmap and other buffered IO
> than compromise these things in the chase for ultimate performance.

No matter what I wrote in my letters to Santa this year, I agree that 
avoiding the warts of O_DIRECT isn't something we should compromise on.

-chris
