linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jens Axboe <axboe@kernel.dk>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	hch@infradead.org, akpm@linux-foundation.org
Subject: Re: [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT
Date: Mon, 8 Feb 2021 19:15:03 -0700	[thread overview]
Message-ID: <bb23a9ed-623e-b221-05d6-2b080f1d6066@kernel.dk> (raw)
In-Reply-To: <20210209001445.GP4626@dread.disaster.area>

On 2/8/21 5:14 PM, Dave Chinner wrote:
> On Mon, Feb 08, 2021 at 04:37:26PM -0700, Jens Axboe wrote:
>> On 2/8/21 4:28 PM, Dave Chinner wrote:
>>> On Mon, Feb 08, 2021 at 03:18:26PM -0700, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> Ran into an issue with IOCB_NOWAIT and O_DIRECT, which causes a rather
>>>> serious performance issue. If IOCB_NOWAIT is set, the generic/iomap
>>>> iterators check for page cache presence in the given range, and return
>>>> -EAGAIN if any is there. This is rather simplistic and looks like
>>>> something that was never really finished. For !IOCB_NOWAIT, we simply
>>>> call filemap_write_and_wait_range() to issue (if any) and wait on the
>>>> range. The fact that we have page cache entries for this range does
>>>> not mean that we cannot safely do O_DIRECT IO to/from it.
>>>>
>>>> This series adds filemap_range_needs_writeback(), which checks if
>>>> we have pages in the range that do require us to call
>>>> filemap_write_and_wait_range(). If we don't, then we can proceed just
>>>> fine with IOCB_NOWAIT.
>>>
>>> Not exactly. If it is a write we are doing, we _must_ invalidate
>>> the page cache pages over the range of the DIO write to maintain
>>> some level of cache coherency between the DIO write and the page
>>> cache contents. i.e. the DIO write makes the page cache contents
>>> stale, so the page cache has to be invalidated before the DIO write
>>> is started, and again when it completes to toss away racing updates
>>> (mmap) while the DIO write was in flight...
>>>
>>> Page invalidation can block (page locks, waits on writeback, taking
>>> the mmap_sem to zap page tables, etc), and it can also fail because
>>> pages are dirty (e.g. writeback+invalidation racing with mmap).
>>>
>>> And if it fails because dirty pages then we fall back to buffered
>>> IO, which serialises readers and writes and will block.
>>
>> Right, not disagreeing with any of that.
>>
>>>> The problem manifested itself in a production environment, where someone
>>>> is doing O_DIRECT on a raw block device. Due to other circumstances,
>>>> blkid was triggered on this device periodically, and blkid very helpfully
>>>> does a number of page cache reads on the device. Now the mapping has
>>>> page cache entries, and performance falls to pieces because we can no
>>>> longer reliably do IOCB_NOWAIT O_DIRECT.
>>>
>>> If it was a DIO write, then the pages would have been invalidated
>>> on the first write and the second write would issued with NOWAIT
>>> just fine.
>>>
>>> So the problem sounds to me like DIO reads from the block device are
>>> not invalidating the page cache over the read range, so they persist
>>> and prevent IOCB_NOWAIT IO from being submitted.
>>
>> That is exactly the case I ran into indeed.
>>
>>> Historically speaking, this is why XFS always used to invalidate the
>>> page cache for DIO - it didn't want to leave cached clean pages that
>>> would prevent future DIOs from being issued concurrently because
>>> coherency with the page cache caused performance issues. We
>>> optimised away this invalidation because the data in the page cache
>>> is still valid after a flush+DIO read, but it sounds to me like
>>> there are still corner cases where "always invalidate cached pages"
>>> is the right thing for DIO to be doing....
>>>
>>> Not sure what the best way to go here it - the patch isn't correct
>>> for NOWAIT DIO writes, but it looks necessary for reads. And I'm not
>>> sure that we want to go back to "invalidate everything all the time"
>>> either....
>>
>> We still do the invalidation for writes with the patch for writes,
>> nothing has changed there. We just skip the
>> filemap_write_and_wait_range() if there's nothing to write. And if
>> there's nothing to write, _hopefully_ the invalidation should go
>> smoothly unless someone dirtied/locked/put-under-writeback the page
>> since we did the check. But that's always going to be racy, and there's
>> not a whole lot we can do about that.
> 
> Sure, but if someone has actually mapped the range and is accessing
> it, then PTEs will need zapping and mmap_sem needs to be taken in
> write mode. If there's continual racing access, you've now got the
> mmap_sem regularly being taken exclusively in the IOCB_NOWAIT path
> and that means it will get serialised against other threads in the
> task doing page faults and other mm context operations.  The "needs
> writeback" check you've added does nothing to alleviate this
> potential blocking point for the write path.
> 
> That's my point - you're exposing obvious new blocking points for
> IOCB_NOWAIT DIO writes, not removing them. It may not happen very
> often, but the whack-a-mole game you are playing with IOCB_NOWAIT is
> "we found an even rarer blocking condition that it is critical to
> our application". While this patch whacks this specific mole in the
> read path, it also exposes the write path to another rare blocking
> condition that will eventually end up being the mole that needs to
> be whacked...
> 
> Perhaps the needs-writeback optimisation should only be applied to
> the DIO read path?

Sure, we can do that as a first version, and then tackle the remainder
of the write side as a separate thing as we need to handle invalidate
inode pages separately too.

I'll send out a v2 with just the read side.

-- 
Jens Axboe


      reply	other threads:[~2021-02-09  2:15 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-08 22:18 [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT Jens Axboe
2021-02-08 22:18 ` [PATCH 1/3] mm: provide filemap_range_needs_writeback() helper Jens Axboe
2021-02-08 23:02   ` Matthew Wilcox
2021-02-08 23:21     ` Jens Axboe
2021-02-08 22:18 ` [PATCH 2/3] mm: use filemap_range_needs_writeback() for O_DIRECT IO Jens Axboe
2021-02-08 22:18 ` [PATCH 3/3] iomap: " Jens Axboe
2021-02-08 23:28 ` [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT Dave Chinner
2021-02-08 23:37   ` Jens Axboe
2021-02-09  0:14     ` Dave Chinner
2021-02-09  2:15       ` Jens Axboe [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bb23a9ed-623e-b221-05d6-2b080f1d6066@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=akpm@linux-foundation.org \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).