* [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT
@ 2021-02-08 22:18 Jens Axboe
  2021-02-08 22:18 ` [PATCH 1/3] mm: provide filemap_range_needs_writeback() helper Jens Axboe
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Jens Axboe @ 2021-02-08 22:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: hch, akpm

Hi,

Ran into an issue with IOCB_NOWAIT and O_DIRECT, which causes a rather
serious performance issue. If IOCB_NOWAIT is set, the generic/iomap
iterators check for page cache presence in the given range, and return
-EAGAIN if any is there. This is rather simplistic and looks like
something that was never really finished. For !IOCB_NOWAIT, we simply
call filemap_write_and_wait_range() to issue (if any) and wait on the
range. The fact that we have page cache entries for this range does
not mean that we cannot safely do O_DIRECT IO to/from it.

This series adds filemap_range_needs_writeback(), which checks if
we have pages in the range that do require us to call
filemap_write_and_wait_range(). If we don't, then we can proceed just
fine with IOCB_NOWAIT.

The problem manifested itself in a production environment, where someone
is doing O_DIRECT on a raw block device. Due to other circumstances,
blkid was triggered on this device periodically, and blkid very helpfully
does a number of page cache reads on the device. Now the mapping has
page cache entries, and performance falls to pieces because we can no
longer reliably do IOCB_NOWAIT O_DIRECT.

Patch 1 adds the helper, patch 2 uses it for the generic iterators, and
patch 3 applies the same to the iomap direct-io code.

 fs/iomap/direct-io.c | 10 ++++-----
 include/linux/fs.h   |  2 ++
 mm/filemap.c         | 52 +++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 56 insertions(+), 8 deletions(-)

-- 
Jens Axboe





* [PATCH 1/3] mm: provide filemap_range_needs_writeback() helper
  2021-02-08 22:18 [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT Jens Axboe
@ 2021-02-08 22:18 ` Jens Axboe
  2021-02-08 23:02   ` Matthew Wilcox
  2021-02-08 22:18 ` [PATCH 2/3] mm: use filemap_range_needs_writeback() for O_DIRECT IO Jens Axboe
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2021-02-08 22:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: hch, akpm, Jens Axboe

For O_DIRECT reads/writes, we check if we need to issue a call to
filemap_write_and_wait_range() to issue and/or wait for writeback for any
page in the given range. The existing mechanism just checks for a page in
the range, which is suboptimal for IOCB_NOWAIT, as we'll fall back to the
slow path (and need a retry) if there's just a clean page cache page in
the range.

Provide filemap_range_needs_writeback() which tries a little harder to
check if we actually need to issue and/or wait for writeback in the
range.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h |  2 ++
 mm/filemap.c       | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd47deea7c17..def89222dfe9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2633,6 +2633,8 @@ static inline int filemap_fdatawait(struct address_space *mapping)
 
 extern bool filemap_range_has_page(struct address_space *, loff_t lstart,
 				  loff_t lend);
+extern bool filemap_range_needs_writeback(struct address_space *,
+					  loff_t lstart, loff_t lend);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
 				        loff_t lstart, loff_t lend);
 extern int __filemap_fdatawrite_range(struct address_space *mapping,
diff --git a/mm/filemap.c b/mm/filemap.c
index aa0e0fb04670..1ed7acac8a1b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -633,6 +633,52 @@ static bool mapping_needs_writeback(struct address_space *mapping)
 	return mapping->nrpages;
 }
 
+/**
+ * filemap_range_needs_writeback - check if range potentially needs writeback
+ * @mapping:           address space within which to check
+ * @start_byte:        offset in bytes where the range starts
+ * @end_byte:          offset in bytes where the range ends (inclusive)
+ *
+ * Check if any page in the supplied range is dirty, locked, or under
+ * writeback, i.e. whether a direct IO to/from this range would first
+ * require writeback. Used by O_DIRECT read/write with IOCB_NOWAIT, to
+ * see if the caller needs to do filemap_write_and_wait_range() first.
+ *
+ * Return: %true if the caller should do filemap_write_and_wait_range() before
+ * doing O_DIRECT to a page in this range, %false otherwise.
+ */
+bool filemap_range_needs_writeback(struct address_space *mapping,
+				   loff_t start_byte, loff_t end_byte)
+{
+	struct page *page = NULL, *head;
+	XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
+	pgoff_t max = end_byte >> PAGE_SHIFT;
+
+	if (!mapping_needs_writeback(mapping))
+		return false;
+	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) &&
+	    !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
+		return false;
+	if (end_byte < start_byte)
+		return false;
+
+	rcu_read_lock();
+	xas_for_each(&xas, head, max) {
+		if (xas_retry(&xas, head))
+			continue;
+		if (xa_is_value(head))
+			continue;
+		page = find_subpage(head, xas.xa_index);
+		if (PageDirty(page) || PageLocked(page) || PageWriteback(page))
+			break;
+		page = NULL;
+	}
+	rcu_read_unlock();
+
+	return page != NULL;
+}
+EXPORT_SYMBOL_GPL(filemap_range_needs_writeback);
+
 /**
  * filemap_write_and_wait_range - write out & wait on a file range
  * @mapping:	the address_space for the pages
-- 
2.30.0



* [PATCH 2/3] mm: use filemap_range_needs_writeback() for O_DIRECT IO
  2021-02-08 22:18 [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT Jens Axboe
  2021-02-08 22:18 ` [PATCH 1/3] mm: provide filemap_range_needs_writeback() helper Jens Axboe
@ 2021-02-08 22:18 ` Jens Axboe
  2021-02-08 22:18 ` [PATCH 3/3] iomap: " Jens Axboe
  2021-02-08 23:28 ` [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT Dave Chinner
  3 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2021-02-08 22:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: hch, akpm, Jens Axboe

For the generic page cache read/write helpers, use the better variant
of checking for the need to call filemap_write_and_wait_range() when
doing O_DIRECT reads or writes. This avoids falling back to the slow
path for IOCB_NOWAIT, if there are no pages to wait for (or write out).

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/filemap.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1ed7acac8a1b..3bc76f99ddc3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2646,8 +2646,8 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 
 		size = i_size_read(inode);
 		if (iocb->ki_flags & IOCB_NOWAIT) {
-			if (filemap_range_has_page(mapping, iocb->ki_pos,
-						   iocb->ki_pos + count - 1))
+			if (filemap_range_needs_writeback(mapping, iocb->ki_pos,
+							  iocb->ki_pos + count - 1))
 				return -EAGAIN;
 		} else {
 			retval = filemap_write_and_wait_range(mapping,
@@ -3326,7 +3326,7 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 
 	if (iocb->ki_flags & IOCB_NOWAIT) {
 		/* If there are pages to writeback, return */
-		if (filemap_range_has_page(file->f_mapping, pos,
+		if (filemap_range_needs_writeback(file->f_mapping, pos,
 					   pos + write_len - 1))
 			return -EAGAIN;
 	} else {
-- 
2.30.0



* [PATCH 3/3] iomap: use filemap_range_needs_writeback() for O_DIRECT IO
  2021-02-08 22:18 [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT Jens Axboe
  2021-02-08 22:18 ` [PATCH 1/3] mm: provide filemap_range_needs_writeback() helper Jens Axboe
  2021-02-08 22:18 ` [PATCH 2/3] mm: use filemap_range_needs_writeback() for O_DIRECT IO Jens Axboe
@ 2021-02-08 22:18 ` Jens Axboe
  2021-02-08 23:28 ` [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT Dave Chinner
  3 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2021-02-08 22:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: hch, akpm, Jens Axboe

For the iomap direct IO code, use the better variant of checking for
the need to call filemap_write_and_wait_range() when doing O_DIRECT
reads or writes. This avoids falling back to the slow path for
IOCB_NOWAIT, if there are no pages to wait for (or write out).

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/iomap/direct-io.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 933f234d5bec..2561e50c763a 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -479,17 +479,17 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	}
 
 	if (iocb->ki_flags & IOCB_NOWAIT) {
-		if (filemap_range_has_page(mapping, pos, end)) {
+		if (filemap_range_needs_writeback(mapping, pos, end)) {
 			ret = -EAGAIN;
 			goto out_free_dio;
 		}
 		flags |= IOMAP_NOWAIT;
+	} else {
+		ret = filemap_write_and_wait_range(mapping, pos, end);
+		if (ret)
+			goto out_free_dio;
 	}
 
-	ret = filemap_write_and_wait_range(mapping, pos, end);
-	if (ret)
-		goto out_free_dio;
-
 	if (iov_iter_rw(iter) == WRITE) {
 		/*
 		 * Try to invalidate cache pages for the range we are writing.
-- 
2.30.0



* Re: [PATCH 1/3] mm: provide filemap_range_needs_writeback() helper
  2021-02-08 22:18 ` [PATCH 1/3] mm: provide filemap_range_needs_writeback() helper Jens Axboe
@ 2021-02-08 23:02   ` Matthew Wilcox
  2021-02-08 23:21     ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2021-02-08 23:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-mm, hch, akpm

On Mon, Feb 08, 2021 at 03:18:27PM -0700, Jens Axboe wrote:
> +	rcu_read_lock();
> +	xas_for_each(&xas, head, max) {
> +		if (xas_retry(&xas, head))
> +			continue;
> +		if (xa_is_value(head))
> +			continue;
> +		page = find_subpage(head, xas.xa_index);
> +		if (PageDirty(page) || PageLocked(page) || PageWriteback(page))
> +			break;
> +		page = NULL;
> +	}
> +	rcu_read_unlock();

There's no need to find the sub-page for any of these three conditions --
the bit will be set on the head page.


* Re: [PATCH 1/3] mm: provide filemap_range_needs_writeback() helper
  2021-02-08 23:02   ` Matthew Wilcox
@ 2021-02-08 23:21     ` Jens Axboe
  0 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2021-02-08 23:21 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, hch, akpm

On 2/8/21 4:02 PM, Matthew Wilcox wrote:
> On Mon, Feb 08, 2021 at 03:18:27PM -0700, Jens Axboe wrote:
>> +	rcu_read_lock();
>> +	xas_for_each(&xas, head, max) {
>> +		if (xas_retry(&xas, head))
>> +			continue;
>> +		if (xa_is_value(head))
>> +			continue;
>> +		page = find_subpage(head, xas.xa_index);
>> +		if (PageDirty(page) || PageLocked(page) || PageWriteback(page))
>> +			break;
>> +		page = NULL;
>> +	}
>> +	rcu_read_unlock();
> 
> There's no need to find the sub-page for any of these three conditions --
> the bit will be set on the head page.

Gotcha, that makes it simpler. I'll make the edit.

-- 
Jens Axboe



* Re: [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT
  2021-02-08 22:18 [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT Jens Axboe
                   ` (2 preceding siblings ...)
  2021-02-08 22:18 ` [PATCH 3/3] iomap: " Jens Axboe
@ 2021-02-08 23:28 ` Dave Chinner
  2021-02-08 23:37   ` Jens Axboe
  3 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2021-02-08 23:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-mm, hch, akpm

On Mon, Feb 08, 2021 at 03:18:26PM -0700, Jens Axboe wrote:
> Hi,
> 
> Ran into an issue with IOCB_NOWAIT and O_DIRECT, which causes a rather
> serious performance issue. If IOCB_NOWAIT is set, the generic/iomap
> iterators check for page cache presence in the given range, and return
> -EAGAIN if any is there. This is rather simplistic and looks like
> something that was never really finished. For !IOCB_NOWAIT, we simply
> call filemap_write_and_wait_range() to issue (if any) and wait on the
> range. The fact that we have page cache entries for this range does
> not mean that we cannot safely do O_DIRECT IO to/from it.
> 
> This series adds filemap_range_needs_writeback(), which checks if
> we have pages in the range that do require us to call
> filemap_write_and_wait_range(). If we don't, then we can proceed just
> fine with IOCB_NOWAIT.

Not exactly. If it is a write we are doing, we _must_ invalidate
the page cache pages over the range of the DIO write to maintain
some level of cache coherency between the DIO write and the page
cache contents. i.e. the DIO write makes the page cache contents
stale, so the page cache has to be invalidated before the DIO write
is started, and again when it completes to toss away racing updates
(mmap) while the DIO write was in flight...

Page invalidation can block (page locks, waits on writeback, taking
the mmap_sem to zap page tables, etc), and it can also fail because
pages are dirty (e.g. writeback+invalidation racing with mmap).

And if it fails because dirty pages then we fall back to buffered
IO, which serialises readers and writes and will block.

> The problem manifested itself in a production environment, where someone
> is doing O_DIRECT on a raw block device. Due to other circumstances,
> blkid was triggered on this device periodically, and blkid very helpfully
> does a number of page cache reads on the device. Now the mapping has
> page cache entries, and performance falls to pieces because we can no
> longer reliably do IOCB_NOWAIT O_DIRECT.

If it was a DIO write, then the pages would have been invalidated
on the first write and the second write would be issued with NOWAIT
just fine.

So the problem sounds to me like DIO reads from the block device are
not invalidating the page cache over the read range, so they persist
and prevent IOCB_NOWAIT IO from being submitted.

Historically speaking, this is why XFS always used to invalidate the
page cache for DIO - it didn't want to leave cached clean pages that
would prevent future DIOs from being issued concurrently because
coherency with the page cache caused performance issues. We
optimised away this invalidation because the data in the page cache
is still valid after a flush+DIO read, but it sounds to me like
there are still corner cases where "always invalidate cached pages"
is the right thing for DIO to be doing....

Not sure what the best way to go here is - the patch isn't correct
for NOWAIT DIO writes, but it looks necessary for reads. And I'm not
sure that we want to go back to "invalidate everything all the time"
either....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT
  2021-02-08 23:28 ` [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT Dave Chinner
@ 2021-02-08 23:37   ` Jens Axboe
  2021-02-09  0:14     ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2021-02-08 23:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-mm, hch, akpm

On 2/8/21 4:28 PM, Dave Chinner wrote:
> On Mon, Feb 08, 2021 at 03:18:26PM -0700, Jens Axboe wrote:
>> Hi,
>>
>> Ran into an issue with IOCB_NOWAIT and O_DIRECT, which causes a rather
>> serious performance issue. If IOCB_NOWAIT is set, the generic/iomap
>> iterators check for page cache presence in the given range, and return
>> -EAGAIN if any is there. This is rather simplistic and looks like
>> something that was never really finished. For !IOCB_NOWAIT, we simply
>> call filemap_write_and_wait_range() to issue (if any) and wait on the
>> range. The fact that we have page cache entries for this range does
>> not mean that we cannot safely do O_DIRECT IO to/from it.
>>
>> This series adds filemap_range_needs_writeback(), which checks if
>> we have pages in the range that do require us to call
>> filemap_write_and_wait_range(). If we don't, then we can proceed just
>> fine with IOCB_NOWAIT.
> 
> Not exactly. If it is a write we are doing, we _must_ invalidate
> the page cache pages over the range of the DIO write to maintain
> some level of cache coherency between the DIO write and the page
> cache contents. i.e. the DIO write makes the page cache contents
> stale, so the page cache has to be invalidated before the DIO write
> is started, and again when it completes to toss away racing updates
> (mmap) while the DIO write was in flight...
> 
> Page invalidation can block (page locks, waits on writeback, taking
> the mmap_sem to zap page tables, etc), and it can also fail because
> pages are dirty (e.g. writeback+invalidation racing with mmap).
> 
> And if it fails because dirty pages then we fall back to buffered
> IO, which serialises readers and writes and will block.

Right, not disagreeing with any of that.

>> The problem manifested itself in a production environment, where someone
>> is doing O_DIRECT on a raw block device. Due to other circumstances,
>> blkid was triggered on this device periodically, and blkid very helpfully
>> does a number of page cache reads on the device. Now the mapping has
>> page cache entries, and performance falls to pieces because we can no
>> longer reliably do IOCB_NOWAIT O_DIRECT.
> 
> If it was a DIO write, then the pages would have been invalidated
> on the first write and the second write would be issued with NOWAIT
> just fine.
> 
> So the problem sounds to me like DIO reads from the block device are
> not invalidating the page cache over the read range, so they persist
> and prevent IOCB_NOWAIT IO from being submitted.

That is exactly the case I ran into indeed.

> Historically speaking, this is why XFS always used to invalidate the
> page cache for DIO - it didn't want to leave cached clean pages that
> would prevent future DIOs from being issued concurrently because
> coherency with the page cache caused performance issues. We
> optimised away this invalidation because the data in the page cache
> is still valid after a flush+DIO read, but it sounds to me like
> there are still corner cases where "always invalidate cached pages"
> is the right thing for DIO to be doing....
> 
> Not sure what the best way to go here is - the patch isn't correct
> for NOWAIT DIO writes, but it looks necessary for reads. And I'm not
> sure that we want to go back to "invalidate everything all the time"
> either....

We still do the invalidation for writes with this patch,
nothing has changed there. We just skip the
filemap_write_and_wait_range() if there's nothing to write. And if
there's nothing to write, _hopefully_ the invalidation should go
smoothly unless someone dirtied/locked/put-under-writeback the page
since we did the check. But that's always going to be racy, and there's
not a whole lot we can do about that.

-- 
Jens Axboe



* Re: [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT
  2021-02-08 23:37   ` Jens Axboe
@ 2021-02-09  0:14     ` Dave Chinner
  2021-02-09  2:15       ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2021-02-09  0:14 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-mm, hch, akpm

On Mon, Feb 08, 2021 at 04:37:26PM -0700, Jens Axboe wrote:
> On 2/8/21 4:28 PM, Dave Chinner wrote:
> > On Mon, Feb 08, 2021 at 03:18:26PM -0700, Jens Axboe wrote:
> >> Hi,
> >>
> >> Ran into an issue with IOCB_NOWAIT and O_DIRECT, which causes a rather
> >> serious performance issue. If IOCB_NOWAIT is set, the generic/iomap
> >> iterators check for page cache presence in the given range, and return
> >> -EAGAIN if any is there. This is rather simplistic and looks like
> >> something that was never really finished. For !IOCB_NOWAIT, we simply
> >> call filemap_write_and_wait_range() to issue (if any) and wait on the
> >> range. The fact that we have page cache entries for this range does
> >> not mean that we cannot safely do O_DIRECT IO to/from it.
> >>
> >> This series adds filemap_range_needs_writeback(), which checks if
> >> we have pages in the range that do require us to call
> >> filemap_write_and_wait_range(). If we don't, then we can proceed just
> >> fine with IOCB_NOWAIT.
> > 
> > Not exactly. If it is a write we are doing, we _must_ invalidate
> > the page cache pages over the range of the DIO write to maintain
> > some level of cache coherency between the DIO write and the page
> > cache contents. i.e. the DIO write makes the page cache contents
> > stale, so the page cache has to be invalidated before the DIO write
> > is started, and again when it completes to toss away racing updates
> > (mmap) while the DIO write was in flight...
> > 
> > Page invalidation can block (page locks, waits on writeback, taking
> > the mmap_sem to zap page tables, etc), and it can also fail because
> > pages are dirty (e.g. writeback+invalidation racing with mmap).
> > 
> > And if it fails because dirty pages then we fall back to buffered
> > IO, which serialises readers and writes and will block.
> 
> Right, not disagreeing with any of that.
> 
> >> The problem manifested itself in a production environment, where someone
> >> is doing O_DIRECT on a raw block device. Due to other circumstances,
> >> blkid was triggered on this device periodically, and blkid very helpfully
> >> does a number of page cache reads on the device. Now the mapping has
> >> page cache entries, and performance falls to pieces because we can no
> >> longer reliably do IOCB_NOWAIT O_DIRECT.
> > 
> > If it was a DIO write, then the pages would have been invalidated
> > on the first write and the second write would be issued with NOWAIT
> > just fine.
> > 
> > So the problem sounds to me like DIO reads from the block device are
> > not invalidating the page cache over the read range, so they persist
> > and prevent IOCB_NOWAIT IO from being submitted.
> 
> That is exactly the case I ran into indeed.
> 
> > Historically speaking, this is why XFS always used to invalidate the
> > page cache for DIO - it didn't want to leave cached clean pages that
> > would prevent future DIOs from being issued concurrently because
> > coherency with the page cache caused performance issues. We
> > optimised away this invalidation because the data in the page cache
> > is still valid after a flush+DIO read, but it sounds to me like
> > there are still corner cases where "always invalidate cached pages"
> > is the right thing for DIO to be doing....
> > 
> > Not sure what the best way to go here is - the patch isn't correct
> > for NOWAIT DIO writes, but it looks necessary for reads. And I'm not
> > sure that we want to go back to "invalidate everything all the time"
> > either....
> 
> We still do the invalidation for writes with this patch,
> nothing has changed there. We just skip the
> filemap_write_and_wait_range() if there's nothing to write. And if
> there's nothing to write, _hopefully_ the invalidation should go
> smoothly unless someone dirtied/locked/put-under-writeback the page
> since we did the check. But that's always going to be racy, and there's
> not a whole lot we can do about that.

Sure, but if someone has actually mapped the range and is accessing
it, then PTEs will need zapping and mmap_sem needs to be taken in
write mode. If there's continual racing access, you've now got the
mmap_sem regularly being taken exclusively in the IOCB_NOWAIT path
and that means it will get serialised against other threads in the
task doing page faults and other mm context operations.  The "needs
writeback" check you've added does nothing to alleviate this
potential blocking point for the write path.

That's my point - you're exposing obvious new blocking points for
IOCB_NOWAIT DIO writes, not removing them. It may not happen very
often, but the whack-a-mole game you are playing with IOCB_NOWAIT is
"we found an even rarer blocking condition that is critical to
our application". While this patch whacks this specific mole in the
read path, it also exposes the write path to another rare blocking
condition that will eventually end up being the mole that needs to
be whacked...

Perhaps the needs-writeback optimisation should only be applied to
the DIO read path?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT
  2021-02-09  0:14     ` Dave Chinner
@ 2021-02-09  2:15       ` Jens Axboe
  0 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2021-02-09  2:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-mm, hch, akpm

On 2/8/21 5:14 PM, Dave Chinner wrote:
> On Mon, Feb 08, 2021 at 04:37:26PM -0700, Jens Axboe wrote:
>> On 2/8/21 4:28 PM, Dave Chinner wrote:
>>> On Mon, Feb 08, 2021 at 03:18:26PM -0700, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> Ran into an issue with IOCB_NOWAIT and O_DIRECT, which causes a rather
>>>> serious performance issue. If IOCB_NOWAIT is set, the generic/iomap
>>>> iterators check for page cache presence in the given range, and return
>>>> -EAGAIN if any is there. This is rather simplistic and looks like
>>>> something that was never really finished. For !IOCB_NOWAIT, we simply
>>>> call filemap_write_and_wait_range() to issue (if any) and wait on the
>>>> range. The fact that we have page cache entries for this range does
>>>> not mean that we cannot safely do O_DIRECT IO to/from it.
>>>>
>>>> This series adds filemap_range_needs_writeback(), which checks if
>>>> we have pages in the range that do require us to call
>>>> filemap_write_and_wait_range(). If we don't, then we can proceed just
>>>> fine with IOCB_NOWAIT.
>>>
>>> Not exactly. If it is a write we are doing, we _must_ invalidate
>>> the page cache pages over the range of the DIO write to maintain
>>> some level of cache coherency between the DIO write and the page
>>> cache contents. i.e. the DIO write makes the page cache contents
>>> stale, so the page cache has to be invalidated before the DIO write
>>> is started, and again when it completes to toss away racing updates
>>> (mmap) while the DIO write was in flight...
>>>
>>> Page invalidation can block (page locks, waits on writeback, taking
>>> the mmap_sem to zap page tables, etc), and it can also fail because
>>> pages are dirty (e.g. writeback+invalidation racing with mmap).
>>>
>>> And if it fails because dirty pages then we fall back to buffered
>>> IO, which serialises readers and writes and will block.
>>
>> Right, not disagreeing with any of that.
>>
>>>> The problem manifested itself in a production environment, where someone
>>>> is doing O_DIRECT on a raw block device. Due to other circumstances,
>>>> blkid was triggered on this device periodically, and blkid very helpfully
>>>> does a number of page cache reads on the device. Now the mapping has
>>>> page cache entries, and performance falls to pieces because we can no
>>>> longer reliably do IOCB_NOWAIT O_DIRECT.
>>>
>>> If it was a DIO write, then the pages would have been invalidated
>>> on the first write and the second write would be issued with NOWAIT
>>> just fine.
>>>
>>> So the problem sounds to me like DIO reads from the block device are
>>> not invalidating the page cache over the read range, so they persist
>>> and prevent IOCB_NOWAIT IO from being submitted.
>>
>> That is exactly the case I ran into indeed.
>>
>>> Historically speaking, this is why XFS always used to invalidate the
>>> page cache for DIO - it didn't want to leave cached clean pages that
>>> would prevent future DIOs from being issued concurrently because
>>> coherency with the page cache caused performance issues. We
>>> optimised away this invalidation because the data in the page cache
>>> is still valid after a flush+DIO read, but it sounds to me like
>>> there are still corner cases where "always invalidate cached pages"
>>> is the right thing for DIO to be doing....
>>>
>>> Not sure what the best way to go here is - the patch isn't correct
>>> for NOWAIT DIO writes, but it looks necessary for reads. And I'm not
>>> sure that we want to go back to "invalidate everything all the time"
>>> either....
>>
>> We still do the invalidation for writes with this patch,
>> nothing has changed there. We just skip the
>> filemap_write_and_wait_range() if there's nothing to write. And if
>> there's nothing to write, _hopefully_ the invalidation should go
>> smoothly unless someone dirtied/locked/put-under-writeback the page
>> since we did the check. But that's always going to be racy, and there's
>> not a whole lot we can do about that.
> 
> Sure, but if someone has actually mapped the range and is accessing
> it, then PTEs will need zapping and mmap_sem needs to be taken in
> write mode. If there's continual racing access, you've now got the
> mmap_sem regularly being taken exclusively in the IOCB_NOWAIT path
> and that means it will get serialised against other threads in the
> task doing page faults and other mm context operations.  The "needs
> writeback" check you've added does nothing to alleviate this
> potential blocking point for the write path.
> 
> That's my point - you're exposing obvious new blocking points for
> IOCB_NOWAIT DIO writes, not removing them. It may not happen very
> often, but the whack-a-mole game you are playing with IOCB_NOWAIT is
> "we found an even rarer blocking condition that is critical to
> our application". While this patch whacks this specific mole in the
> read path, it also exposes the write path to another rare blocking
> condition that will eventually end up being the mole that needs to
> be whacked...
> 
> Perhaps the needs-writeback optimisation should only be applied to
> the DIO read path?

Sure, we can do that as a first version, and then tackle the remainder
of the write side as a separate thing, as we need to handle invalidating
inode pages separately too.

I'll send out a v2 with just the read side.

-- 
Jens Axboe


