linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jens Axboe <axboe@kernel.dk>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	hch@infradead.org, akpm@linux-foundation.org
Subject: Re: [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT
Date: Mon, 8 Feb 2021 16:37:26 -0700	[thread overview]
Message-ID: <44fec531-b2fd-f569-538a-64449a5c371b@kernel.dk> (raw)
In-Reply-To: <20210208232846.GO4626@dread.disaster.area>

On 2/8/21 4:28 PM, Dave Chinner wrote:
> On Mon, Feb 08, 2021 at 03:18:26PM -0700, Jens Axboe wrote:
>> Hi,
>>
>> Ran into an issue with IOCB_NOWAIT and O_DIRECT, which causes a rather
>> serious performance issue. If IOCB_NOWAIT is set, the generic/iomap
>> iterators check for page cache presence in the given range, and return
>> -EAGAIN if any is there. This is rather simplistic and looks like
>> something that was never really finished. For !IOCB_NOWAIT, we simply
>> call filemap_write_and_wait_range() to issue (if any) and wait on the
>> range. The fact that we have page cache entries for this range does
>> not mean that we cannot safely do O_DIRECT IO to/from it.
>>
>> This series adds filemap_range_needs_writeback(), which checks if
>> we have pages in the range that do require us to call
>> filemap_write_and_wait_range(). If we don't, then we can proceed just
>> fine with IOCB_NOWAIT.
> 
> Not exactly. If it is a write we are doing, we _must_ invalidate
> the page cache pages over the range of the DIO write to maintain
> some level of cache coherency between the DIO write and the page
> cache contents. i.e. the DIO write makes the page cache contents
> stale, so the page cache has to be invalidated before the DIO write
> is started, and again when it completes to toss away racing updates
> (mmap) while the DIO write was in flight...
> 
> Page invalidation can block (page locks, waits on writeback, taking
> the mmap_sem to zap page tables, etc), and it can also fail because
> pages are dirty (e.g. writeback+invalidation racing with mmap).
> 
> And if it fails because dirty pages then we fall back to buffered
> IO, which serialises readers and writes and will block.

Right, not disagreeing with any of that.

>> The problem manifested itself in a production environment, where someone
>> is doing O_DIRECT on a raw block device. Due to other circumstances,
>> blkid was triggered on this device periodically, and blkid very helpfully
>> does a number of page cache reads on the device. Now the mapping has
>> page cache entries, and performance falls to pieces because we can no
>> longer reliably do IOCB_NOWAIT O_DIRECT.
> 
> If it was a DIO write, then the pages would have been invalidated
> on the first write and the second write would issued with NOWAIT
> just fine.
> 
> So the problem sounds to me like DIO reads from the block device are
> not invalidating the page cache over the read range, so they persist
> and prevent IOCB_NOWAIT IO from being submitted.

That is exactly the case I ran into indeed.

> Historically speaking, this is why XFS always used to invalidate the
> page cache for DIO - it didn't want to leave cached clean pages that
> would prevent future DIOs from being issued concurrently because
> coherency with the page cache caused performance issues. We
> optimised away this invalidation because the data in the page cache
> is still valid after a flush+DIO read, but it sounds to me like
> there are still corner cases where "always invalidate cached pages"
> is the right thing for DIO to be doing....
> 
> Not sure what the best way to go here it - the patch isn't correct
> for NOWAIT DIO writes, but it looks necessary for reads. And I'm not
> sure that we want to go back to "invalidate everything all the time"
> either....

We still do the invalidation for writes with the patch for writes,
nothing has changed there. We just skip the
filemap_write_and_wait_range() if there's nothing to write. And if
there's nothing to write, _hopefully_ the invalidation should go
smoothly unless someone dirtied/locked/put-under-writeback the page
since we did the check. But that's always going to be racy, and there's
not a whole lot we can do about that.

-- 
Jens Axboe


  reply	other threads:[~2021-02-08 23:38 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-08 22:18 [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT Jens Axboe
2021-02-08 22:18 ` [PATCH 1/3] mm: provide filemap_range_needs_writeback() helper Jens Axboe
2021-02-08 23:02   ` Matthew Wilcox
2021-02-08 23:21     ` Jens Axboe
2021-02-08 22:18 ` [PATCH 2/3] mm: use filemap_range_needs_writeback() for O_DIRECT IO Jens Axboe
2021-02-08 22:18 ` [PATCH 3/3] iomap: " Jens Axboe
2021-02-08 23:28 ` [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT Dave Chinner
2021-02-08 23:37   ` Jens Axboe [this message]
2021-02-09  0:14     ` Dave Chinner
2021-02-09  2:15       ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=44fec531-b2fd-f569-538a-64449a5c371b@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=akpm@linux-foundation.org \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).