* [PATCH RFC 0/5] Writeback handling of pinned pages
@ 2023-02-09 12:31 Jan Kara
  2023-02-09 12:31 ` [PATCH 1/5] mm: Do not reclaim private data from pinned page Jan Kara
                   ` (4 more replies)
  0 siblings, 5 replies; 26+ messages in thread
From: Jan Kara @ 2023-02-09 12:31 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-block, linux-mm, John Hubbard, David Howells,
	David Hildenbrand, Jan Kara

Hello,

since we are slowly getting into a state where folios used as buffers for
[R]DMA are detectable by folio_maybe_dma_pinned(), I figured it is time we also
address the original problems filesystems had with these pages [1] - namely
that page/folio private data can get reclaimed from the page while it is being
written to by DMA, and that page contents can be modified while the page is
under writeback.

This patch series is kind of an outline of how the solution could look (so
far only compile tested). The first two patches deal with the reclaim of page
private data for pinned pages. They are IMO no-brainers and actually deal with
99% of the observed issues, so we might just separate them out and merge them
earlier. The remainder of the series deals with the concern that page contents
can be modified while the page is being written back. What it implements is
that we instead skip page cleaning writeback for pinned pages, and if we
cannot avoid writing the page (data integrity writeback), we bite the bullet
and bounce the page.
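The per-folio policy described above can be sketched as a tiny user-space
model (this is illustrative only, not kernel code; writeback_action() and the
enums are made-up names standing in for the logic spread across the series):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the proposed writeback policy. WB_SYNC_NONE stands for
 * memory-cleaning writeback, WB_SYNC_ALL for data integrity writeback.
 * All names here are illustrative, not kernel interfaces.
 */
enum wb_sync_mode { WB_SYNC_NONE, WB_SYNC_ALL };
enum wb_action { WB_WRITE, WB_SKIP, WB_WRITE_BOUNCED };

static enum wb_action writeback_action(bool maybe_pinned,
				       enum wb_sync_mode sync_mode)
{
	/* Unpinned folios are written back as before. */
	if (!maybe_pinned)
		return WB_WRITE;
	/* Cleaning writeback of a pinned folio is pointless: skip it. */
	if (sync_mode == WB_SYNC_NONE)
		return WB_SKIP;
	/* Integrity writeback cannot be skipped: bounce the folio. */
	return WB_WRITE_BOUNCED;
}
```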

Note that the conversion of clear_page_dirty_for_io() (and its folio variant)
is kind of rough, providing wbc to the function only in the obvious cases -
that will need a bit more work. OTOH, functionally, passing NULL just retains
the old behavior and WARNs if we actually see a pinned page in the writeback
path.

Opinions?

								Honza

[1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/5] mm: Do not reclaim private data from pinned page
  2023-02-09 12:31 [PATCH RFC 0/5] Writeback handling of pinned pages Jan Kara
@ 2023-02-09 12:31 ` Jan Kara
  2023-02-09 16:17   ` Matthew Wilcox
  2023-02-13  9:01   ` David Hildenbrand
  2023-02-09 12:31 ` [PATCH 2/5] ext4: Drop workaround for mm reclaiming fs private page data Jan Kara
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 26+ messages in thread
From: Jan Kara @ 2023-02-09 12:31 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-block, linux-mm, John Hubbard, David Howells,
	David Hildenbrand, Jan Kara

If the page is pinned, there's no point in trying to reclaim it.
Furthermore, if the page is from the page cache, we don't want to reclaim
fs-private data from the page because the pinning process may be writing
to the page at any time, and reclaiming fs-private info on a dirty page
can upset the filesystem (see link below).

Link: https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/vmscan.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bf3eedf0209c..ab3911a8b116 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1901,6 +1901,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			}
 		}
 
+		/*
+		 * Folio is unmapped now so it cannot be newly pinned anymore.
+		 * No point in trying to reclaim the folio if it is pinned.
+		 * Furthermore we don't want to reclaim underlying fs metadata
+		 * if the folio is pinned and thus potentially modified by the
+		 * pinning process as that may upset the filesystem.
+		 */
+		if (folio_maybe_dma_pinned(folio))
+			goto activate_locked;
+
 		mapping = folio_mapping(folio);
 		if (folio_test_dirty(folio)) {
 			/*
-- 
2.35.3



* [PATCH 2/5] ext4: Drop workaround for mm reclaiming fs private page data
  2023-02-09 12:31 [PATCH RFC 0/5] Writeback handling of pinned pages Jan Kara
  2023-02-09 12:31 ` [PATCH 1/5] mm: Do not reclaim private data from pinned page Jan Kara
@ 2023-02-09 12:31 ` Jan Kara
  2023-02-09 12:31 ` [PATCH 3/5] mm: Do not try to write pinned folio during memory cleaning writeback Jan Kara
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2023-02-09 12:31 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-block, linux-mm, John Hubbard, David Howells,
	David Hildenbrand, Jan Kara

Drop the workaround in the ext4 writeback code that handled the situation
when MM reclaims fs-private page data from a page that is (or becomes) dirty.
After the previous commit this should not happen anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9d9f414f99fe..46078651ce32 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2657,22 +2657,6 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
 			wait_on_page_writeback(page);
 			BUG_ON(PageWriteback(page));
 
-			/*
-			 * Should never happen but for buggy code in
-			 * other subsystems that call
-			 * set_page_dirty() without properly warning
-			 * the file system first.  See [1] for more
-			 * information.
-			 *
-			 * [1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
-			 */
-			if (!page_has_buffers(page)) {
-				ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index);
-				ClearPageDirty(page);
-				unlock_page(page);
-				continue;
-			}
-
 			if (mpd->map.m_len == 0)
 				mpd->first_page = page->index;
 			mpd->next_page = page->index + 1;
-- 
2.35.3



* [PATCH 3/5] mm: Do not try to write pinned folio during memory cleaning writeback
  2023-02-09 12:31 [PATCH RFC 0/5] Writeback handling of pinned pages Jan Kara
  2023-02-09 12:31 ` [PATCH 1/5] mm: Do not reclaim private data from pinned page Jan Kara
  2023-02-09 12:31 ` [PATCH 2/5] ext4: Drop workaround for mm reclaiming fs private page data Jan Kara
@ 2023-02-09 12:31 ` Jan Kara
  2023-02-10  1:54   ` John Hubbard
  2023-02-09 12:31 ` [PATCH 4/5] block: Add support for bouncing pinned pages Jan Kara
  2023-02-09 12:31 ` [PATCH 5/5] iomap: Bounce pinned pages during writeback Jan Kara
  4 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2023-02-09 12:31 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-block, linux-mm, John Hubbard, David Howells,
	David Hildenbrand, Jan Kara

When a folio is pinned, there is no point in trying to write it during
memory cleaning writeback. We cannot reclaim the folio until it is
unpinned anyway, and we cannot even be sure the folio is really clean.
On top of that, writeback of such a folio may be problematic as the data
can change while the writeback is running, causing checksum or DIF/DIX
failures. So just don't bother doing memory cleaning writeback for
pinned folios.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/9p/vfs_addr.c            |  2 +-
 fs/afs/file.c               |  2 +-
 fs/afs/write.c              |  6 +++---
 fs/btrfs/extent_io.c        | 14 +++++++-------
 fs/btrfs/free-space-cache.c |  2 +-
 fs/btrfs/inode.c            |  2 +-
 fs/btrfs/subpage.c          |  2 +-
 fs/ceph/addr.c              |  6 +++---
 fs/cifs/file.c              |  6 +++---
 fs/ext4/inode.c             |  4 ++--
 fs/f2fs/checkpoint.c        |  4 ++--
 fs/f2fs/compress.c          |  2 +-
 fs/f2fs/data.c              |  2 +-
 fs/f2fs/dir.c               |  2 +-
 fs/f2fs/gc.c                |  4 ++--
 fs/f2fs/inline.c            |  2 +-
 fs/f2fs/node.c              | 10 +++++-----
 fs/fuse/file.c              |  2 +-
 fs/gfs2/aops.c              |  2 +-
 fs/nfs/write.c              |  2 +-
 fs/nilfs2/page.c            |  2 +-
 fs/nilfs2/segment.c         |  8 ++++----
 fs/orangefs/inode.c         |  2 +-
 fs/ubifs/file.c             |  2 +-
 include/linux/pagemap.h     |  5 +++--
 mm/folio-compat.c           |  4 ++--
 mm/migrate.c                |  2 +-
 mm/page-writeback.c         | 24 ++++++++++++++++++++----
 mm/vmscan.c                 |  2 +-
 29 files changed, 73 insertions(+), 56 deletions(-)

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 97599edbc300..a14ff3c02eb1 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -221,7 +221,7 @@ static int v9fs_launder_folio(struct folio *folio)
 {
 	int retval;
 
-	if (folio_clear_dirty_for_io(folio)) {
+	if (folio_clear_dirty_for_io(NULL, folio)) {
 		retval = v9fs_vfs_write_folio_locked(folio);
 		if (retval)
 			return retval;
diff --git a/fs/afs/file.c b/fs/afs/file.c
index 68d6d5dc608d..8a81ac9c12fa 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -453,7 +453,7 @@ static void afs_invalidate_dirty(struct folio *folio, size_t offset,
 
 undirty:
 	trace_afs_folio_dirty(vnode, tracepoint_string("undirty"), folio);
-	folio_clear_dirty_for_io(folio);
+	folio_clear_dirty_for_io(NULL, folio);
 full_invalidate:
 	trace_afs_folio_dirty(vnode, tracepoint_string("inval"), folio);
 	folio_detach_private(folio);
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 19df10d63323..9a5e6d59040c 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -555,7 +555,7 @@ static void afs_extend_writeback(struct address_space *mapping,
 			folio = page_folio(pvec.pages[i]);
 			trace_afs_folio_dirty(vnode, tracepoint_string("store+"), folio);
 
-			if (!folio_clear_dirty_for_io(folio))
+			if (!folio_clear_dirty_for_io(NULL, folio))
 				BUG();
 			if (folio_start_writeback(folio))
 				BUG();
@@ -769,7 +769,7 @@ static int afs_writepages_region(struct address_space *mapping,
 			continue;
 		}
 
-		if (!folio_clear_dirty_for_io(folio))
+		if (!folio_clear_dirty_for_io(NULL, folio))
 			BUG();
 		ret = afs_write_back_from_locked_folio(mapping, wbc, folio, start, end);
 		folio_put(folio);
@@ -1000,7 +1000,7 @@ int afs_launder_folio(struct folio *folio)
 	_enter("{%lx}", folio->index);
 
 	priv = (unsigned long)folio_get_private(folio);
-	if (folio_clear_dirty_for_io(folio)) {
+	if (folio_clear_dirty_for_io(NULL, folio)) {
 		f = 0;
 		t = folio_size(folio);
 		if (folio_test_private(folio)) {
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9bd32daa9b9a..2026f567cbdd 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -215,7 +215,7 @@ void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
 	while (index <= end_index) {
 		page = find_get_page(inode->i_mapping, index);
 		BUG_ON(!page); /* Pages should be in the extent_io_tree */
-		clear_page_dirty_for_io(page);
+		clear_page_dirty_for_io(NULL, page);
 		put_page(page);
 		index++;
 	}
@@ -2590,7 +2590,7 @@ static int write_one_subpage_eb(struct extent_buffer *eb,
 	no_dirty_ebs = btrfs_subpage_clear_and_test_dirty(fs_info, page,
 							  eb->start, eb->len);
 	if (no_dirty_ebs)
-		clear_page_dirty_for_io(page);
+		clear_page_dirty_for_io(NULL, page);
 
 	bio_ctrl->end_io_func = end_bio_subpage_eb_writepage;
 
@@ -2633,7 +2633,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = eb->pages[i];
 
-		clear_page_dirty_for_io(p);
+		clear_page_dirty_for_io(NULL, p);
 		set_page_writeback(p);
 		ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc,
 					 bio_ctrl, disk_bytenr, p,
@@ -2655,7 +2655,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	if (unlikely(ret)) {
 		for (; i < num_pages; i++) {
 			struct page *p = eb->pages[i];
-			clear_page_dirty_for_io(p);
+			clear_page_dirty_for_io(NULL, p);
 			unlock_page(p);
 		}
 	}
@@ -3083,7 +3083,7 @@ static int extent_write_cache_pages(struct address_space *mapping,
 			}
 
 			if (PageWriteback(page) ||
-			    !clear_page_dirty_for_io(page)) {
+			    !clear_page_dirty_for_io(wbc, page)) {
 				unlock_page(page);
 				continue;
 			}
@@ -3174,7 +3174,7 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end)
 		 */
 		ASSERT(PageLocked(page));
 		ASSERT(PageDirty(page));
-		clear_page_dirty_for_io(page);
+		clear_page_dirty_for_io(NULL, page);
 		ret = __extent_writepage(page, &wbc_writepages, &bio_ctrl);
 		ASSERT(ret <= 0);
 		if (ret < 0) {
@@ -4698,7 +4698,7 @@ static void btree_clear_page_dirty(struct page *page)
 {
 	ASSERT(PageDirty(page));
 	ASSERT(PageLocked(page));
-	clear_page_dirty_for_io(page);
+	clear_page_dirty_for_io(NULL, page);
 	xa_lock_irq(&page->mapping->i_pages);
 	if (!PageDirty(page))
 		__xa_clear_mark(&page->mapping->i_pages,
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 0d250d052487..02ec9987fc17 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -507,7 +507,7 @@ static int io_ctl_prepare_pages(struct btrfs_io_ctl *io_ctl, bool uptodate)
 	}
 
 	for (i = 0; i < io_ctl->num_pages; i++)
-		clear_page_dirty_for_io(io_ctl->pages[i]);
+		clear_page_dirty_for_io(NULL, io_ctl->pages[i]);
 
 	return 0;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 98a800b8bd43..26820c697c5b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3002,7 +3002,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 		 */
 		mapping_set_error(page->mapping, ret);
 		end_extent_writepage(page, ret, page_start, page_end);
-		clear_page_dirty_for_io(page);
+		clear_page_dirty_for_io(NULL, page);
 		SetPageError(page);
 	}
 	btrfs_page_clear_checked(inode->root->fs_info, page, page_start, PAGE_SIZE);
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index dd46b978ac2c..f85b5a2ccdab 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -515,7 +515,7 @@ void btrfs_subpage_clear_dirty(const struct btrfs_fs_info *fs_info,
 
 	last = btrfs_subpage_clear_and_test_dirty(fs_info, page, start, len);
 	if (last)
-		clear_page_dirty_for_io(page);
+		clear_page_dirty_for_io(NULL, page);
 }
 
 void btrfs_subpage_set_writeback(const struct btrfs_fs_info *fs_info,
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index cac4083e387a..ff940c9cd1cf 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -926,7 +926,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 				     folio->index, ceph_wbc.i_size);
 				if ((ceph_wbc.size_stable ||
 				    folio_pos(folio) >= i_size_read(inode)) &&
-				    folio_clear_dirty_for_io(folio))
+				    folio_clear_dirty_for_io(wbc, folio))
 					folio_invalidate(folio, 0,
 							folio_size(folio));
 				folio_unlock(folio);
@@ -948,7 +948,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 				wait_on_page_fscache(page);
 			}
 
-			if (!clear_page_dirty_for_io(page)) {
+			if (!clear_page_dirty_for_io(wbc, page)) {
 				dout("%p !clear_page_dirty_for_io\n", page);
 				unlock_page(page);
 				continue;
@@ -1282,7 +1282,7 @@ ceph_find_incompatible(struct page *page)
 
 		/* yay, writeable, do it now (without dropping page lock) */
 		dout(" page %p snapc %p not current, but oldest\n", page, snapc);
-		if (clear_page_dirty_for_io(page)) {
+		if (clear_page_dirty_for_io(NULL, page)) {
 			int r = writepage_nounlock(page, NULL);
 			if (r < 0)
 				return ERR_PTR(r);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 22dfc1f8b4f1..93e36829896e 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2342,7 +2342,7 @@ cifs_writev_requeue(struct cifs_writedata *wdata)
 		for (j = 0; j < nr_pages; j++) {
 			wdata2->pages[j] = wdata->pages[i + j];
 			lock_page(wdata2->pages[j]);
-			clear_page_dirty_for_io(wdata2->pages[j]);
+			clear_page_dirty_for_io(NULL, wdata2->pages[j]);
 		}
 
 		wdata2->sync_mode = wdata->sync_mode;
@@ -2582,7 +2582,7 @@ wdata_prepare_pages(struct cifs_writedata *wdata, unsigned int found_pages,
 			wait_on_page_writeback(page);
 
 		if (PageWriteback(page) ||
-				!clear_page_dirty_for_io(page)) {
+				!clear_page_dirty_for_io(wbc, page)) {
 			unlock_page(page);
 			break;
 		}
@@ -5076,7 +5076,7 @@ static int cifs_launder_folio(struct folio *folio)
 
 	cifs_dbg(FYI, "Launder page: %lu\n", folio->index);
 
-	if (folio_clear_dirty_for_io(folio))
+	if (folio_clear_dirty_for_io(&wbc, folio))
 		rc = cifs_writepage_locked(&folio->page, &wbc);
 
 	folio_wait_fscache(folio);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 46078651ce32..7082b6ba8e12 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1616,7 +1616,7 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd,
 			BUG_ON(folio_test_writeback(folio));
 			if (invalidate) {
 				if (folio_mapped(folio))
-					folio_clear_dirty_for_io(folio);
+					folio_clear_dirty_for_io(NULL, folio);
 				block_invalidate_folio(folio, 0,
 						folio_size(folio));
 				folio_clear_uptodate(folio);
@@ -2106,7 +2106,7 @@ static int mpage_submit_page(struct mpage_da_data *mpd, struct page *page)
 	int err;
 
 	BUG_ON(page->index != mpd->first_page);
-	clear_page_dirty_for_io(page);
+	clear_page_dirty_for_io(NULL, page);
 	/*
 	 * We have to be very careful here!  Nothing protects writeback path
 	 * against i_size changes and the page can be writeably mapped into
diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 56f7d0d6a8b2..37f951b11153 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -435,7 +435,7 @@ long f2fs_sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type,
 
 			f2fs_wait_on_page_writeback(page, META, true, true);
 
-			if (!clear_page_dirty_for_io(page))
+			if (!clear_page_dirty_for_io(NULL, page))
 				goto continue_unlock;
 
 			if (__f2fs_write_meta_page(page, &wbc, io_type)) {
@@ -1415,7 +1415,7 @@ static void commit_checkpoint(struct f2fs_sb_info *sbi,
 	memcpy(page_address(page), src, PAGE_SIZE);
 
 	set_page_dirty(page);
-	if (unlikely(!clear_page_dirty_for_io(page)))
+	if (unlikely(!clear_page_dirty_for_io(NULL, page)))
 		f2fs_bug_on(sbi, 1);
 
 	/* writeout cp pack 2 page */
diff --git a/fs/f2fs/compress.c b/fs/f2fs/compress.c
index 2532f369cb10..efd09e280b2c 100644
--- a/fs/f2fs/compress.c
+++ b/fs/f2fs/compress.c
@@ -1459,7 +1459,7 @@ static int f2fs_write_raw_pages(struct compress_ctx *cc,
 		if (!PageDirty(cc->rpages[i]))
 			goto continue_unlock;
 
-		if (!clear_page_dirty_for_io(cc->rpages[i]))
+		if (!clear_page_dirty_for_io(NULL, cc->rpages[i]))
 			goto continue_unlock;
 
 		ret = f2fs_write_single_data_page(cc->rpages[i], &_submitted,
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 97e816590cd9..f1d622b64690 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -3102,7 +3102,7 @@ static int f2fs_write_cache_pages(struct address_space *mapping,
 					goto continue_unlock;
 			}
 
-			if (!clear_page_dirty_for_io(page))
+			if (!clear_page_dirty_for_io(NULL, page))
 				goto continue_unlock;
 
 #ifdef CONFIG_F2FS_FS_COMPRESSION
diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
index 8e025157f35c..73005e711b83 100644
--- a/fs/f2fs/dir.c
+++ b/fs/f2fs/dir.c
@@ -938,7 +938,7 @@ void f2fs_delete_entry(struct f2fs_dir_entry *dentry, struct page *page,
 	if (bit_pos == NR_DENTRY_IN_BLOCK &&
 		!f2fs_truncate_hole(dir, page->index, page->index + 1)) {
 		f2fs_clear_page_cache_dirty_tag(page);
-		clear_page_dirty_for_io(page);
+		clear_page_dirty_for_io(NULL, page);
 		ClearPageUptodate(page);
 
 		clear_page_private_gcing(page);
diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 6e2cae3d2e71..1f647287e3eb 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -1361,7 +1361,7 @@ static int move_data_block(struct inode *inode, block_t bidx,
 	f2fs_invalidate_compress_page(fio.sbi, fio.old_blkaddr);
 
 	set_page_dirty(fio.encrypted_page);
-	if (clear_page_dirty_for_io(fio.encrypted_page))
+	if (clear_page_dirty_for_io(NULL, fio.encrypted_page))
 		dec_page_count(fio.sbi, F2FS_DIRTY_META);
 
 	set_page_writeback(fio.encrypted_page);
@@ -1446,7 +1446,7 @@ static int move_data_page(struct inode *inode, block_t bidx, int gc_type,
 		f2fs_wait_on_page_writeback(page, DATA, true, true);
 
 		set_page_dirty(page);
-		if (clear_page_dirty_for_io(page)) {
+		if (clear_page_dirty_for_io(NULL, page)) {
 			inode_dec_dirty_pages(inode);
 			f2fs_remove_dirty_inode(inode);
 		}
diff --git a/fs/f2fs/inline.c b/fs/f2fs/inline.c
index 21a495234ffd..2bfade0ead67 100644
--- a/fs/f2fs/inline.c
+++ b/fs/f2fs/inline.c
@@ -170,7 +170,7 @@ int f2fs_convert_inline_page(struct dnode_of_data *dn, struct page *page)
 	set_page_dirty(page);
 
 	/* clear dirty state */
-	dirty = clear_page_dirty_for_io(page);
+	dirty = clear_page_dirty_for_io(NULL, page);
 
 	/* write data page to try to make data consistent */
 	set_page_writeback(page);
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index dde4c0458704..6f5571cac2b3 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -124,7 +124,7 @@ static void clear_node_page_dirty(struct page *page)
 {
 	if (PageDirty(page)) {
 		f2fs_clear_page_cache_dirty_tag(page);
-		clear_page_dirty_for_io(page);
+		clear_page_dirty_for_io(NULL, page);
 		dec_page_count(F2FS_P_SB(page), F2FS_DIRTY_NODES);
 	}
 	ClearPageUptodate(page);
@@ -1501,7 +1501,7 @@ static void flush_inline_data(struct f2fs_sb_info *sbi, nid_t ino)
 	if (!PageDirty(page))
 		goto page_out;
 
-	if (!clear_page_dirty_for_io(page))
+	if (!clear_page_dirty_for_io(NULL, page))
 		goto page_out;
 
 	ret = f2fs_write_inline_data(inode, page);
@@ -1696,7 +1696,7 @@ int f2fs_move_node_page(struct page *node_page, int gc_type)
 
 		set_page_dirty(node_page);
 
-		if (!clear_page_dirty_for_io(node_page)) {
+		if (!clear_page_dirty_for_io(NULL, node_page)) {
 			err = -EAGAIN;
 			goto out_page;
 		}
@@ -1803,7 +1803,7 @@ int f2fs_fsync_node_pages(struct f2fs_sb_info *sbi, struct inode *inode,
 					set_page_dirty(page);
 			}
 
-			if (!clear_page_dirty_for_io(page))
+			if (!clear_page_dirty_for_io(NULL, page))
 				goto continue_unlock;
 
 			ret = __write_node_page(page, atomic &&
@@ -2011,7 +2011,7 @@ int f2fs_sync_node_pages(struct f2fs_sb_info *sbi,
 write_node:
 			f2fs_wait_on_page_writeback(page, NODE, true, true);
 
-			if (!clear_page_dirty_for_io(page))
+			if (!clear_page_dirty_for_io(NULL, page))
 				goto continue_unlock;
 
 			set_fsync_mark(page, 0);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 875314ee6f59..cb9561128b4b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2399,7 +2399,7 @@ static int fuse_write_end(struct file *file, struct address_space *mapping,
 static int fuse_launder_folio(struct folio *folio)
 {
 	int err = 0;
-	if (folio_clear_dirty_for_io(folio)) {
+	if (folio_clear_dirty_for_io(NULL, folio)) {
 		struct inode *inode = folio->mapping->host;
 
 		/* Serialize with pending writeback for the same page */
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index e782b4f1d104..cf784dd5fd3b 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -247,7 +247,7 @@ static int gfs2_write_jdata_pagevec(struct address_space *mapping,
 		}
 
 		BUG_ON(PageWriteback(page));
-		if (!clear_page_dirty_for_io(page))
+		if (!clear_page_dirty_for_io(wbc, page))
 			goto continue_unlock;
 
 		trace_wbc_writepage(wbc, inode_to_bdi(inode));
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 80c240e50952..c07b686c84ce 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -2089,7 +2089,7 @@ int nfs_wb_page(struct inode *inode, struct page *page)
 
 	for (;;) {
 		wait_on_page_writeback(page);
-		if (clear_page_dirty_for_io(page)) {
+		if (clear_page_dirty_for_io(&wbc, page)) {
 			ret = nfs_writepage_locked(page, &wbc);
 			if (ret < 0)
 				goto out_error;
diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
index 39b7eea2642a..9c3cc20b446e 100644
--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -456,7 +456,7 @@ int __nilfs_clear_page_dirty(struct page *page)
 			__xa_clear_mark(&mapping->i_pages, page_index(page),
 					     PAGECACHE_TAG_DIRTY);
 			xa_unlock_irq(&mapping->i_pages);
-			return clear_page_dirty_for_io(page);
+			return clear_page_dirty_for_io(NULL, page);
 		}
 		xa_unlock_irq(&mapping->i_pages);
 		return 0;
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 76c3bd88b858..123494739030 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1644,7 +1644,7 @@ static void nilfs_begin_page_io(struct page *page)
 		return;
 
 	lock_page(page);
-	clear_page_dirty_for_io(page);
+	clear_page_dirty_for_io(NULL, page);
 	set_page_writeback(page);
 	unlock_page(page);
 }
@@ -1662,7 +1662,7 @@ static void nilfs_segctor_prepare_write(struct nilfs_sc_info *sci)
 			if (bh->b_page != bd_page) {
 				if (bd_page) {
 					lock_page(bd_page);
-					clear_page_dirty_for_io(bd_page);
+					clear_page_dirty_for_io(NULL, bd_page);
 					set_page_writeback(bd_page);
 					unlock_page(bd_page);
 				}
@@ -1676,7 +1676,7 @@ static void nilfs_segctor_prepare_write(struct nilfs_sc_info *sci)
 			if (bh == segbuf->sb_super_root) {
 				if (bh->b_page != bd_page) {
 					lock_page(bd_page);
-					clear_page_dirty_for_io(bd_page);
+					clear_page_dirty_for_io(NULL, bd_page);
 					set_page_writeback(bd_page);
 					unlock_page(bd_page);
 					bd_page = bh->b_page;
@@ -1691,7 +1691,7 @@ static void nilfs_segctor_prepare_write(struct nilfs_sc_info *sci)
 	}
 	if (bd_page) {
 		lock_page(bd_page);
-		clear_page_dirty_for_io(bd_page);
+		clear_page_dirty_for_io(NULL, bd_page);
 		set_page_writeback(bd_page);
 		unlock_page(bd_page);
 	}
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index 4df560894386..829de5553a77 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -501,7 +501,7 @@ static int orangefs_launder_folio(struct folio *folio)
 		.nr_to_write = 0,
 	};
 	folio_wait_writeback(folio);
-	if (folio_clear_dirty_for_io(folio)) {
+	if (folio_clear_dirty_for_io(&wbc, folio)) {
 		r = orangefs_writepage_locked(&folio->page, &wbc);
 		folio_end_writeback(folio);
 	}
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index f2353dd676ef..41d764fd5511 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1159,7 +1159,7 @@ static int do_truncation(struct ubifs_info *c, struct inode *inode,
 				 */
 				ubifs_assert(c, PagePrivate(page));
 
-				clear_page_dirty_for_io(page);
+				clear_page_dirty_for_io(NULL, page);
 				if (UBIFS_BLOCKS_PER_PAGE_SHIFT)
 					offset = new_size &
 						 (PAGE_SIZE - 1);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 29e1f9e76eb6..81c42a95cf8d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1059,8 +1059,9 @@ static inline void folio_cancel_dirty(struct folio *folio)
 	if (folio_test_dirty(folio))
 		__folio_cancel_dirty(folio);
 }
-bool folio_clear_dirty_for_io(struct folio *folio);
-bool clear_page_dirty_for_io(struct page *page);
+bool folio_clear_dirty_for_io(struct writeback_control *wbc,
+			      struct folio *folio);
+bool clear_page_dirty_for_io(struct writeback_control *wbc, struct page *page);
 void folio_invalidate(struct folio *folio, size_t offset, size_t length);
 int __must_check folio_write_one(struct folio *folio);
 static inline int __must_check write_one_page(struct page *page)
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 69ed25790c68..748f82def674 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -63,9 +63,9 @@ int __set_page_dirty_nobuffers(struct page *page)
 }
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 
-bool clear_page_dirty_for_io(struct page *page)
+bool clear_page_dirty_for_io(struct writeback_control *wbc, struct page *page)
 {
-	return folio_clear_dirty_for_io(page_folio(page));
+	return folio_clear_dirty_for_io(wbc, page_folio(page));
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index a4d3fc65085f..0bda652153b9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -870,7 +870,7 @@ static int writeout(struct address_space *mapping, struct folio *folio)
 		/* No write method for the address space */
 		return -EINVAL;
 
-	if (!folio_clear_dirty_for_io(folio))
+	if (!folio_clear_dirty_for_io(&wbc, folio))
 		/* Someone else already triggered a write */
 		return -EAGAIN;
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ad608ef2a243..2d70070e533c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2465,7 +2465,7 @@ int write_cache_pages(struct address_space *mapping,
 			}
 
 			BUG_ON(PageWriteback(page));
-			if (!clear_page_dirty_for_io(page))
+			if (!clear_page_dirty_for_io(wbc, page))
 				goto continue_unlock;
 
 			trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
@@ -2628,7 +2628,7 @@ int folio_write_one(struct folio *folio)
 
 	folio_wait_writeback(folio);
 
-	if (folio_clear_dirty_for_io(folio)) {
+	if (folio_clear_dirty_for_io(&wbc, folio)) {
 		folio_get(folio);
 		ret = mapping->a_ops->writepage(&folio->page, &wbc);
 		if (ret == 0)
@@ -2924,7 +2924,7 @@ EXPORT_SYMBOL(__folio_cancel_dirty);
 
 /*
  * Clear a folio's dirty flag, while caring for dirty memory accounting.
- * Returns true if the folio was previously dirty.
+ * Returns true if the folio was previously dirty and should be written back.
  *
  * This is for preparing to put the folio under writeout.  We leave
  * the folio tagged as dirty in the xarray so that a concurrent
@@ -2935,8 +2935,14 @@ EXPORT_SYMBOL(__folio_cancel_dirty);
  *
  * This incoherency between the folio's dirty flag and xarray tag is
  * unfortunate, but it only exists while the folio is locked.
+ *
+ * If the folio is pinned, its writeback is problematic so we just don't bother
+ * for memory cleaning writeback - this is why writeback control is passed in.
+ * If it is NULL, we assume pinned pages are not expected (e.g. this can be
+ * a metadata page) and warn if the page is actually pinned.
  */
-bool folio_clear_dirty_for_io(struct folio *folio)
+bool folio_clear_dirty_for_io(struct writeback_control *wbc,
+			      struct folio *folio)
 {
 	struct address_space *mapping = folio_mapping(folio);
 	bool ret = false;
@@ -2975,6 +2981,16 @@ bool folio_clear_dirty_for_io(struct folio *folio)
 		 */
 		if (folio_mkclean(folio))
 			folio_mark_dirty(folio);
+		/*
+		 * For pinned folios we have no chance to reclaim them anyway
+		 * and we cannot be sure the folio is ever clean. So memory
+		 * cleaning writeback is pointless. Just skip it.
+		 */
+		if (wbc && wbc->sync_mode == WB_SYNC_NONE &&
+		    folio_maybe_dma_pinned(folio))
+			return false;
+		/* Catch callers not expecting pinned pages */
+		WARN_ON_ONCE(!wbc && folio_maybe_dma_pinned(folio));
 		/*
 		 * We carefully synchronise fault handlers against
 		 * installing a dirty pte and marking the folio dirty
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ab3911a8b116..71a226b47ac6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1283,7 +1283,7 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
 	if (mapping->a_ops->writepage == NULL)
 		return PAGE_ACTIVATE;
 
-	if (folio_clear_dirty_for_io(folio)) {
+	if (folio_clear_dirty_for_io(NULL, folio)) {
 		int res;
 		struct writeback_control wbc = {
 			.sync_mode = WB_SYNC_NONE,
-- 
2.35.3



* [PATCH 4/5] block: Add support for bouncing pinned pages
  2023-02-09 12:31 [PATCH RFC 0/5] Writeback handling of pinned pages Jan Kara
                   ` (2 preceding siblings ...)
  2023-02-09 12:31 ` [PATCH 3/5] mm: Do not try to write pinned folio during memory cleaning writeback Jan Kara
@ 2023-02-09 12:31 ` Jan Kara
  2023-02-13  9:59   ` Christoph Hellwig
  2023-02-09 12:31 ` [PATCH 5/5] iomap: Bounce pinned pages during writeback Jan Kara
  4 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2023-02-09 12:31 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-block, linux-mm, John Hubbard, David Howells,
	David Hildenbrand, Jan Kara

When there is direct IO (or another DMA write) running into a page, it is
not generally safe to submit that page for another IO (such as
writeback) because this can cause checksum failures or similar issues.
However, sometimes we cannot avoid writing the contents of these pages,
as pages can be pinned for extensive amounts of time (e.g. for RDMA). For
these cases we need to just bounce the pages if we really need to write
them out. Add support for this type of bouncing to the block layer
infrastructure.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/blk.h               | 10 +++++++++-
 block/bounce.c            |  9 +++++++--
 include/linux/blk_types.h |  1 +
 mm/Kconfig                |  8 ++++----
 4 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index 4c3b3325219a..def7ab8379bc 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -384,10 +384,18 @@ static inline bool blk_queue_may_bounce(struct request_queue *q)
 		max_low_pfn >= max_pfn;
 }
 
+static inline bool bio_need_pin_bounce(struct bio *bio,
+		struct request_queue *q)
+{
+	return IS_ENABLED(CONFIG_BOUNCE) &&
+		bio->bi_flags & (1 << BIO_NEED_PIN_BOUNCE);
+}
+
 static inline struct bio *blk_queue_bounce(struct bio *bio,
 		struct request_queue *q)
 {
-	if (unlikely(blk_queue_may_bounce(q) && bio_has_data(bio)))
+	if (unlikely((blk_queue_may_bounce(q) || bio_need_pin_bounce(bio, q)) &&
+		     bio_has_data(bio)))
 		return __blk_queue_bounce(bio, q);
 	return bio;
 }
diff --git a/block/bounce.c b/block/bounce.c
index 7cfcb242f9a1..ebda95953d58 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -207,12 +207,16 @@ struct bio *__blk_queue_bounce(struct bio *bio_orig, struct request_queue *q)
 	struct bvec_iter iter;
 	unsigned i = 0, bytes = 0;
 	bool bounce = false;
+	bool pinned_bounce = bio_orig->bi_flags & (1 << BIO_NEED_PIN_BOUNCE);
+	bool highmem_bounce = blk_queue_may_bounce(q);
 	int sectors;
 
 	bio_for_each_segment(from, bio_orig, iter) {
 		if (i++ < BIO_MAX_VECS)
 			bytes += from.bv_len;
-		if (PageHighMem(from.bv_page))
+		if (highmem_bounce && PageHighMem(from.bv_page))
+			bounce = true;
+		if (pinned_bounce && page_maybe_dma_pinned(from.bv_page))
 			bounce = true;
 	}
 	if (!bounce)
@@ -241,7 +245,8 @@ struct bio *__blk_queue_bounce(struct bio *bio_orig, struct request_queue *q)
 	for (i = 0, to = bio->bi_io_vec; i < bio->bi_vcnt; to++, i++) {
 		struct page *bounce_page;
 
-		if (!PageHighMem(to->bv_page))
+		if (!((highmem_bounce && PageHighMem(to->bv_page)) ||
+		      (pinned_bounce && page_maybe_dma_pinned(to->bv_page))))
 			continue;
 
 		bounce_page = mempool_alloc(&page_pool, GFP_NOIO);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..3aa1dc5d8dc6 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -321,6 +321,7 @@ enum {
 	BIO_NO_PAGE_REF,	/* don't put release vec pages */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
+	BIO_NEED_PIN_BOUNCE,	/* bio needs to bounce pinned pages */
 	BIO_QUIET,		/* Make BIO Quiet */
 	BIO_CHAIN,		/* chained bio, ->bi_remaining in effect */
 	BIO_REFFED,		/* bio has elevated ->bi_cnt */
diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec05..eba075e959e8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -659,11 +659,11 @@ config PHYS_ADDR_T_64BIT
 config BOUNCE
 	bool "Enable bounce buffers"
 	default y
-	depends on BLOCK && MMU && HIGHMEM
+	depends on BLOCK && MMU
 	help
-	  Enable bounce buffers for devices that cannot access the full range of
-	  memory available to the CPU. Enabled by default when HIGHMEM is
-	  selected, but you may say n to override this.
+	  Enable bounce buffers. These are used for devices that cannot access
+	  the full range of memory available to the CPU, or when DMA may be
+	  modifying pages while they are submitted for writeback.
 
 config MMU_NOTIFIER
 	bool
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 5/5] iomap: Bounce pinned pages during writeback
  2023-02-09 12:31 [PATCH RFC 0/5] Writeback handling of pinned pages Jan Kara
                   ` (3 preceding siblings ...)
  2023-02-09 12:31 ` [PATCH 4/5] block: Add support for bouncing pinned pages Jan Kara
@ 2023-02-09 12:31 ` Jan Kara
  4 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2023-02-09 12:31 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-block, linux-mm, John Hubbard, David Howells,
	David Hildenbrand, Jan Kara

When there is direct IO (or another DMA write) running into a page, it is
not generally safe to submit that page for writeback because this can
cause DIF/DIX failures or similar issues. Ask the block layer to bounce
the page in this case.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/iomap/buffered-io.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 356193e44cf0..e6469e7715cc 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1563,6 +1563,14 @@ iomap_add_to_ioend(struct inode *inode, loff_t pos, struct folio *folio,
 		bio_add_folio(wpc->ioend->io_bio, folio, len, poff);
 	}
 
+	/*
+	 * The folio may be modified by the owner of the pin while we require
+	 * stable page contents; ask the block layer to bounce the bio.
+	 */
+	if (inode->i_sb->s_iflags & SB_I_STABLE_WRITES &&
+	    folio_maybe_dma_pinned(folio))
+		wpc->ioend->io_bio->bi_flags |= 1 << BIO_NEED_PIN_BOUNCE;
+
 	if (iop)
 		atomic_add(len, &iop->write_bytes_pending);
 	wpc->ioend->io_size += len;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/5] mm: Do not reclaim private data from pinned page
  2023-02-09 12:31 ` [PATCH 1/5] mm: Do not reclaim private data from pinned page Jan Kara
@ 2023-02-09 16:17   ` Matthew Wilcox
  2023-02-10 11:29     ` Jan Kara
  2023-02-13  9:01   ` David Hildenbrand
  1 sibling, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2023-02-09 16:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-block, linux-mm, John Hubbard,
	David Howells, David Hildenbrand

On Thu, Feb 09, 2023 at 01:31:53PM +0100, Jan Kara wrote:
> If the page is pinned, there's no point in trying to reclaim it.
> Furthermore if the page is from the page cache we don't want to reclaim
> fs-private data from the page because the pinning process may be writing
> to the page at any time and reclaiming fs private info on a dirty page
> can upset the filesystem (see link below).
> 
> Link: https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz

OK, but now I'm confused.  I've been told before that the reason we
can't take pinned pages off the LRU list is that they need to be written
back periodically for ... reasons.  But now the pages are going to be
skipped if they're found on the LRU list, so why is this better than
taking them off the LRU list?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] mm: Do not try to write pinned folio during memory cleaning writeback
  2023-02-09 12:31 ` [PATCH 3/5] mm: Do not try to write pinned folio during memory cleaning writeback Jan Kara
@ 2023-02-10  1:54   ` John Hubbard
  2023-02-10  2:10     ` John Hubbard
  2023-02-10 10:54     ` Jan Kara
  0 siblings, 2 replies; 26+ messages in thread
From: John Hubbard @ 2023-02-10  1:54 UTC (permalink / raw)
  To: Jan Kara, linux-fsdevel
  Cc: linux-block, linux-mm, David Howells, David Hildenbrand

On 2/9/23 04:31, Jan Kara wrote:
> When a folio is pinned, there is no point in trying to write it during
> memory cleaning writeback. We cannot reclaim the folio until it is
> unpinned anyway and we cannot even be sure the folio is really clean.
> On top of that writeback of such folio may be problematic as the data
> can change while the writeback is running thus causing checksum or
> DIF/DIX failures. So just don't bother doing memory cleaning writeback
> for pinned folios.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>   fs/9p/vfs_addr.c            |  2 +-
>   fs/afs/file.c               |  2 +-
>   fs/afs/write.c              |  6 +++---
>   fs/btrfs/extent_io.c        | 14 +++++++-------
>   fs/btrfs/free-space-cache.c |  2 +-
>   fs/btrfs/inode.c            |  2 +-
>   fs/btrfs/subpage.c          |  2 +-

Hi Jan!

Just a quick note that this breaks the btrfs build in subpage.c.
Because, unfortunately, btrfs creates 6 sets of functions via calls to a
macro: IMPLEMENT_BTRFS_PAGE_OPS(). And that expects only one argument to
the clear_page_func, and thus to clear_page_dirty_for_io().

It seems infeasible (right?) to add another argument to the other
clear_page_func functions, which include:

    ClearPageUptodate
    ClearPageError
    end_page_writeback
    ClearPageOrdered
    ClearPageChecked

, so I expect IMPLEMENT_BTRFS_PAGE_OPS() may need to be partially
unrolled, in order to pass in the new writeback control arg to
clear_page_dirty_for_io().


thanks,
-- 
John Hubbard
NVIDIA

>   fs/ceph/addr.c              |  6 +++---
>   fs/cifs/file.c              |  6 +++---
>   fs/ext4/inode.c             |  4 ++--
>   fs/f2fs/checkpoint.c        |  4 ++--
>   fs/f2fs/compress.c          |  2 +-
>   fs/f2fs/data.c              |  2 +-
>   fs/f2fs/dir.c               |  2 +-
>   fs/f2fs/gc.c                |  4 ++--
>   fs/f2fs/inline.c            |  2 +-
>   fs/f2fs/node.c              | 10 +++++-----
>   fs/fuse/file.c              |  2 +-
>   fs/gfs2/aops.c              |  2 +-
>   fs/nfs/write.c              |  2 +-
>   fs/nilfs2/page.c            |  2 +-
>   fs/nilfs2/segment.c         |  8 ++++----
>   fs/orangefs/inode.c         |  2 +-
>   fs/ubifs/file.c             |  2 +-
>   include/linux/pagemap.h     |  5 +++--
>   mm/folio-compat.c           |  4 ++--
>   mm/migrate.c                |  2 +-
>   mm/page-writeback.c         | 24 ++++++++++++++++++++----
>   mm/vmscan.c                 |  2 +-
>   29 files changed, 73 insertions(+), 56 deletions(-)
> 
> diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
> index 97599edbc300..a14ff3c02eb1 100644
> --- a/fs/9p/vfs_addr.c
> +++ b/fs/9p/vfs_addr.c
> @@ -221,7 +221,7 @@ static int v9fs_launder_folio(struct folio *folio)
>   {
>   	int retval;
>   
> -	if (folio_clear_dirty_for_io(folio)) {
> +	if (folio_clear_dirty_for_io(NULL, folio)) {
>   		retval = v9fs_vfs_write_folio_locked(folio);
>   		if (retval)
>   			return retval;
> diff --git a/fs/afs/file.c b/fs/afs/file.c
> index 68d6d5dc608d..8a81ac9c12fa 100644
> --- a/fs/afs/file.c
> +++ b/fs/afs/file.c
> @@ -453,7 +453,7 @@ static void afs_invalidate_dirty(struct folio *folio, size_t offset,
>   
>   undirty:
>   	trace_afs_folio_dirty(vnode, tracepoint_string("undirty"), folio);
> -	folio_clear_dirty_for_io(folio);
> +	folio_clear_dirty_for_io(NULL, folio);
>   full_invalidate:
>   	trace_afs_folio_dirty(vnode, tracepoint_string("inval"), folio);
>   	folio_detach_private(folio);
> diff --git a/fs/afs/write.c b/fs/afs/write.c
> index 19df10d63323..9a5e6d59040c 100644
> --- a/fs/afs/write.c
> +++ b/fs/afs/write.c
> @@ -555,7 +555,7 @@ static void afs_extend_writeback(struct address_space *mapping,
>   			folio = page_folio(pvec.pages[i]);
>   			trace_afs_folio_dirty(vnode, tracepoint_string("store+"), folio);
>   
> -			if (!folio_clear_dirty_for_io(folio))
> +			if (!folio_clear_dirty_for_io(NULL, folio))
>   				BUG();
>   			if (folio_start_writeback(folio))
>   				BUG();
> @@ -769,7 +769,7 @@ static int afs_writepages_region(struct address_space *mapping,
>   			continue;
>   		}
>   
> -		if (!folio_clear_dirty_for_io(folio))
> +		if (!folio_clear_dirty_for_io(NULL, folio))
>   			BUG();
>   		ret = afs_write_back_from_locked_folio(mapping, wbc, folio, start, end);
>   		folio_put(folio);
> @@ -1000,7 +1000,7 @@ int afs_launder_folio(struct folio *folio)
>   	_enter("{%lx}", folio->index);
>   
>   	priv = (unsigned long)folio_get_private(folio);
> -	if (folio_clear_dirty_for_io(folio)) {
> +	if (folio_clear_dirty_for_io(NULL, folio)) {
>   		f = 0;
>   		t = folio_size(folio);
>   		if (folio_test_private(folio)) {
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 9bd32daa9b9a..2026f567cbdd 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -215,7 +215,7 @@ void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
>   	while (index <= end_index) {
>   		page = find_get_page(inode->i_mapping, index);
>   		BUG_ON(!page); /* Pages should be in the extent_io_tree */
> -		clear_page_dirty_for_io(page);
> +		clear_page_dirty_for_io(NULL, page);
>   		put_page(page);
>   		index++;
>   	}
> @@ -2590,7 +2590,7 @@ static int write_one_subpage_eb(struct extent_buffer *eb,
>   	no_dirty_ebs = btrfs_subpage_clear_and_test_dirty(fs_info, page,
>   							  eb->start, eb->len);
>   	if (no_dirty_ebs)
> -		clear_page_dirty_for_io(page);
> +		clear_page_dirty_for_io(NULL, page);
>   
>   	bio_ctrl->end_io_func = end_bio_subpage_eb_writepage;
>   
> @@ -2633,7 +2633,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>   	for (i = 0; i < num_pages; i++) {
>   		struct page *p = eb->pages[i];
>   
> -		clear_page_dirty_for_io(p);
> +		clear_page_dirty_for_io(NULL, p);
>   		set_page_writeback(p);
>   		ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc,
>   					 bio_ctrl, disk_bytenr, p,
> @@ -2655,7 +2655,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>   	if (unlikely(ret)) {
>   		for (; i < num_pages; i++) {
>   			struct page *p = eb->pages[i];
> -			clear_page_dirty_for_io(p);
> +			clear_page_dirty_for_io(NULL, p);
>   			unlock_page(p);
>   		}
>   	}
> @@ -3083,7 +3083,7 @@ static int extent_write_cache_pages(struct address_space *mapping,
>   			}
>   
>   			if (PageWriteback(page) ||
> -			    !clear_page_dirty_for_io(page)) {
> +			    !clear_page_dirty_for_io(wbc, page)) {
>   				unlock_page(page);
>   				continue;
>   			}
> @@ -3174,7 +3174,7 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end)
>   		 */
>   		ASSERT(PageLocked(page));
>   		ASSERT(PageDirty(page));
> -		clear_page_dirty_for_io(page);
> +		clear_page_dirty_for_io(NULL, page);
>   		ret = __extent_writepage(page, &wbc_writepages, &bio_ctrl);
>   		ASSERT(ret <= 0);
>   		if (ret < 0) {
> @@ -4698,7 +4698,7 @@ static void btree_clear_page_dirty(struct page *page)
>   {
>   	ASSERT(PageDirty(page));
>   	ASSERT(PageLocked(page));
> -	clear_page_dirty_for_io(page);
> +	clear_page_dirty_for_io(NULL, page);
>   	xa_lock_irq(&page->mapping->i_pages);
>   	if (!PageDirty(page))
>   		__xa_clear_mark(&page->mapping->i_pages,
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 0d250d052487..02ec9987fc17 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -507,7 +507,7 @@ static int io_ctl_prepare_pages(struct btrfs_io_ctl *io_ctl, bool uptodate)
>   	}
>   
>   	for (i = 0; i < io_ctl->num_pages; i++)
> -		clear_page_dirty_for_io(io_ctl->pages[i]);
> +		clear_page_dirty_for_io(NULL, io_ctl->pages[i]);
>   
>   	return 0;
>   }
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 98a800b8bd43..26820c697c5b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3002,7 +3002,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
>   		 */
>   		mapping_set_error(page->mapping, ret);
>   		end_extent_writepage(page, ret, page_start, page_end);
> -		clear_page_dirty_for_io(page);
> +		clear_page_dirty_for_io(NULL, page);
>   		SetPageError(page);
>   	}
>   	btrfs_page_clear_checked(inode->root->fs_info, page, page_start, PAGE_SIZE);
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> index dd46b978ac2c..f85b5a2ccdab 100644
> --- a/fs/btrfs/subpage.c
> +++ b/fs/btrfs/subpage.c
> @@ -515,7 +515,7 @@ void btrfs_subpage_clear_dirty(const struct btrfs_fs_info *fs_info,
>   
>   	last = btrfs_subpage_clear_and_test_dirty(fs_info, page, start, len);
>   	if (last)
> -		clear_page_dirty_for_io(page);
> +		clear_page_dirty_for_io(NULL, page);
>   }
>   
>   void btrfs_subpage_set_writeback(const struct btrfs_fs_info *fs_info,
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index cac4083e387a..ff940c9cd1cf 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -926,7 +926,7 @@ static int ceph_writepages_start(struct address_space *mapping,
>   				     folio->index, ceph_wbc.i_size);
>   				if ((ceph_wbc.size_stable ||
>   				    folio_pos(folio) >= i_size_read(inode)) &&
> -				    folio_clear_dirty_for_io(folio))
> +				    folio_clear_dirty_for_io(wbc, folio))
>   					folio_invalidate(folio, 0,
>   							folio_size(folio));
>   				folio_unlock(folio);
> @@ -948,7 +948,7 @@ static int ceph_writepages_start(struct address_space *mapping,
>   				wait_on_page_fscache(page);
>   			}
>   
> -			if (!clear_page_dirty_for_io(page)) {
> +			if (!clear_page_dirty_for_io(wbc, page)) {
>   				dout("%p !clear_page_dirty_for_io\n", page);
>   				unlock_page(page);
>   				continue;
> @@ -1282,7 +1282,7 @@ ceph_find_incompatible(struct page *page)
>   
>   		/* yay, writeable, do it now (without dropping page lock) */
>   		dout(" page %p snapc %p not current, but oldest\n", page, snapc);
> -		if (clear_page_dirty_for_io(page)) {
> +		if (clear_page_dirty_for_io(NULL, page)) {
>   			int r = writepage_nounlock(page, NULL);
>   			if (r < 0)
>   				return ERR_PTR(r);
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 22dfc1f8b4f1..93e36829896e 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -2342,7 +2342,7 @@ cifs_writev_requeue(struct cifs_writedata *wdata)
>   		for (j = 0; j < nr_pages; j++) {
>   			wdata2->pages[j] = wdata->pages[i + j];
>   			lock_page(wdata2->pages[j]);
> -			clear_page_dirty_for_io(wdata2->pages[j]);
> +			clear_page_dirty_for_io(NULL, wdata2->pages[j]);
>   		}
>   
>   		wdata2->sync_mode = wdata->sync_mode;
> @@ -2582,7 +2582,7 @@ wdata_prepare_pages(struct cifs_writedata *wdata, unsigned int found_pages,
>   			wait_on_page_writeback(page);
>   
>   		if (PageWriteback(page) ||
> -				!clear_page_dirty_for_io(page)) {
> +				!clear_page_dirty_for_io(wbc, page)) {
>   			unlock_page(page);
>   			break;
>   		}
> @@ -5076,7 +5076,7 @@ static int cifs_launder_folio(struct folio *folio)
>   
>   	cifs_dbg(FYI, "Launder page: %lu\n", folio->index);
>   
> -	if (folio_clear_dirty_for_io(folio))
> +	if (folio_clear_dirty_for_io(&wbc, folio))
>   		rc = cifs_writepage_locked(&folio->page, &wbc);
>   
>   	folio_wait_fscache(folio);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 46078651ce32..7082b6ba8e12 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1616,7 +1616,7 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd,
>   			BUG_ON(folio_test_writeback(folio));
>   			if (invalidate) {
>   				if (folio_mapped(folio))
> -					folio_clear_dirty_for_io(folio);
> +					folio_clear_dirty_for_io(NULL, folio);
>   				block_invalidate_folio(folio, 0,
>   						folio_size(folio));
>   				folio_clear_uptodate(folio);
> @@ -2106,7 +2106,7 @@ static int mpage_submit_page(struct mpage_da_data *mpd, struct page *page)
>   	int err;
>   
>   	BUG_ON(page->index != mpd->first_page);
> -	clear_page_dirty_for_io(page);
> +	clear_page_dirty_for_io(NULL, page);
>   	/*
>   	 * We have to be very careful here!  Nothing protects writeback path
>   	 * against i_size changes and the page can be writeably mapped into
> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
> index 56f7d0d6a8b2..37f951b11153 100644
> --- a/fs/f2fs/checkpoint.c
> +++ b/fs/f2fs/checkpoint.c
> @@ -435,7 +435,7 @@ long f2fs_sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type,
>   
>   			f2fs_wait_on_page_writeback(page, META, true, true);
>   
> -			if (!clear_page_dirty_for_io(page))
> +			if (!clear_page_dirty_for_io(NULL, page))
>   				goto continue_unlock;
>   
>   			if (__f2fs_write_meta_page(page, &wbc, io_type)) {
> @@ -1415,7 +1415,7 @@ static void commit_checkpoint(struct f2fs_sb_info *sbi,
>   	memcpy(page_address(page), src, PAGE_SIZE);
>   
>   	set_page_dirty(page);
> -	if (unlikely(!clear_page_dirty_for_io(page)))
> +	if (unlikely(!clear_page_dirty_for_io(NULL, page)))
>   		f2fs_bug_on(sbi, 1);
>   
>   	/* writeout cp pack 2 page */
> diff --git a/fs/f2fs/compress.c b/fs/f2fs/compress.c
> index 2532f369cb10..efd09e280b2c 100644
> --- a/fs/f2fs/compress.c
> +++ b/fs/f2fs/compress.c
> @@ -1459,7 +1459,7 @@ static int f2fs_write_raw_pages(struct compress_ctx *cc,
>   		if (!PageDirty(cc->rpages[i]))
>   			goto continue_unlock;
>   
> -		if (!clear_page_dirty_for_io(cc->rpages[i]))
> +		if (!clear_page_dirty_for_io(NULL, cc->rpages[i]))
>   			goto continue_unlock;
>   
>   		ret = f2fs_write_single_data_page(cc->rpages[i], &_submitted,
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index 97e816590cd9..f1d622b64690 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -3102,7 +3102,7 @@ static int f2fs_write_cache_pages(struct address_space *mapping,
>   					goto continue_unlock;
>   			}
>   
> -			if (!clear_page_dirty_for_io(page))
> +			if (!clear_page_dirty_for_io(NULL, page))
>   				goto continue_unlock;
>   
>   #ifdef CONFIG_F2FS_FS_COMPRESSION
> diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
> index 8e025157f35c..73005e711b83 100644
> --- a/fs/f2fs/dir.c
> +++ b/fs/f2fs/dir.c
> @@ -938,7 +938,7 @@ void f2fs_delete_entry(struct f2fs_dir_entry *dentry, struct page *page,
>   	if (bit_pos == NR_DENTRY_IN_BLOCK &&
>   		!f2fs_truncate_hole(dir, page->index, page->index + 1)) {
>   		f2fs_clear_page_cache_dirty_tag(page);
> -		clear_page_dirty_for_io(page);
> +		clear_page_dirty_for_io(NULL, page);
>   		ClearPageUptodate(page);
>   
>   		clear_page_private_gcing(page);
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index 6e2cae3d2e71..1f647287e3eb 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -1361,7 +1361,7 @@ static int move_data_block(struct inode *inode, block_t bidx,
>   	f2fs_invalidate_compress_page(fio.sbi, fio.old_blkaddr);
>   
>   	set_page_dirty(fio.encrypted_page);
> -	if (clear_page_dirty_for_io(fio.encrypted_page))
> +	if (clear_page_dirty_for_io(NULL, fio.encrypted_page))
>   		dec_page_count(fio.sbi, F2FS_DIRTY_META);
>   
>   	set_page_writeback(fio.encrypted_page);
> @@ -1446,7 +1446,7 @@ static int move_data_page(struct inode *inode, block_t bidx, int gc_type,
>   		f2fs_wait_on_page_writeback(page, DATA, true, true);
>   
>   		set_page_dirty(page);
> -		if (clear_page_dirty_for_io(page)) {
> +		if (clear_page_dirty_for_io(NULL, page)) {
>   			inode_dec_dirty_pages(inode);
>   			f2fs_remove_dirty_inode(inode);
>   		}
> diff --git a/fs/f2fs/inline.c b/fs/f2fs/inline.c
> index 21a495234ffd..2bfade0ead67 100644
> --- a/fs/f2fs/inline.c
> +++ b/fs/f2fs/inline.c
> @@ -170,7 +170,7 @@ int f2fs_convert_inline_page(struct dnode_of_data *dn, struct page *page)
>   	set_page_dirty(page);
>   
>   	/* clear dirty state */
> -	dirty = clear_page_dirty_for_io(page);
> +	dirty = clear_page_dirty_for_io(NULL, page);
>   
>   	/* write data page to try to make data consistent */
>   	set_page_writeback(page);
> diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
> index dde4c0458704..6f5571cac2b3 100644
> --- a/fs/f2fs/node.c
> +++ b/fs/f2fs/node.c
> @@ -124,7 +124,7 @@ static void clear_node_page_dirty(struct page *page)
>   {
>   	if (PageDirty(page)) {
>   		f2fs_clear_page_cache_dirty_tag(page);
> -		clear_page_dirty_for_io(page);
> +		clear_page_dirty_for_io(NULL, page);
>   		dec_page_count(F2FS_P_SB(page), F2FS_DIRTY_NODES);
>   	}
>   	ClearPageUptodate(page);
> @@ -1501,7 +1501,7 @@ static void flush_inline_data(struct f2fs_sb_info *sbi, nid_t ino)
>   	if (!PageDirty(page))
>   		goto page_out;
>   
> -	if (!clear_page_dirty_for_io(page))
> +	if (!clear_page_dirty_for_io(NULL, page))
>   		goto page_out;
>   
>   	ret = f2fs_write_inline_data(inode, page);
> @@ -1696,7 +1696,7 @@ int f2fs_move_node_page(struct page *node_page, int gc_type)
>   
>   		set_page_dirty(node_page);
>   
> -		if (!clear_page_dirty_for_io(node_page)) {
> +		if (!clear_page_dirty_for_io(NULL, node_page)) {
>   			err = -EAGAIN;
>   			goto out_page;
>   		}
> @@ -1803,7 +1803,7 @@ int f2fs_fsync_node_pages(struct f2fs_sb_info *sbi, struct inode *inode,
>   					set_page_dirty(page);
>   			}
>   
> -			if (!clear_page_dirty_for_io(page))
> +			if (!clear_page_dirty_for_io(NULL, page))
>   				goto continue_unlock;
>   
>   			ret = __write_node_page(page, atomic &&
> @@ -2011,7 +2011,7 @@ int f2fs_sync_node_pages(struct f2fs_sb_info *sbi,
>   write_node:
>   			f2fs_wait_on_page_writeback(page, NODE, true, true);
>   
> -			if (!clear_page_dirty_for_io(page))
> +			if (!clear_page_dirty_for_io(NULL, page))
>   				goto continue_unlock;
>   
>   			set_fsync_mark(page, 0);
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 875314ee6f59..cb9561128b4b 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -2399,7 +2399,7 @@ static int fuse_write_end(struct file *file, struct address_space *mapping,
>   static int fuse_launder_folio(struct folio *folio)
>   {
>   	int err = 0;
> -	if (folio_clear_dirty_for_io(folio)) {
> +	if (folio_clear_dirty_for_io(NULL, folio)) {
>   		struct inode *inode = folio->mapping->host;
>   
>   		/* Serialize with pending writeback for the same page */
> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> index e782b4f1d104..cf784dd5fd3b 100644
> --- a/fs/gfs2/aops.c
> +++ b/fs/gfs2/aops.c
> @@ -247,7 +247,7 @@ static int gfs2_write_jdata_pagevec(struct address_space *mapping,
>   		}
>   
>   		BUG_ON(PageWriteback(page));
> -		if (!clear_page_dirty_for_io(page))
> +		if (!clear_page_dirty_for_io(wbc, page))
>   			goto continue_unlock;
>   
>   		trace_wbc_writepage(wbc, inode_to_bdi(inode));
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 80c240e50952..c07b686c84ce 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -2089,7 +2089,7 @@ int nfs_wb_page(struct inode *inode, struct page *page)
>   
>   	for (;;) {
>   		wait_on_page_writeback(page);
> -		if (clear_page_dirty_for_io(page)) {
> +		if (clear_page_dirty_for_io(&wbc, page)) {
>   			ret = nfs_writepage_locked(page, &wbc);
>   			if (ret < 0)
>   				goto out_error;
> diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
> index 39b7eea2642a..9c3cc20b446e 100644
> --- a/fs/nilfs2/page.c
> +++ b/fs/nilfs2/page.c
> @@ -456,7 +456,7 @@ int __nilfs_clear_page_dirty(struct page *page)
>   			__xa_clear_mark(&mapping->i_pages, page_index(page),
>   					     PAGECACHE_TAG_DIRTY);
>   			xa_unlock_irq(&mapping->i_pages);
> -			return clear_page_dirty_for_io(page);
> +			return clear_page_dirty_for_io(NULL, page);
>   		}
>   		xa_unlock_irq(&mapping->i_pages);
>   		return 0;
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index 76c3bd88b858..123494739030 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1644,7 +1644,7 @@ static void nilfs_begin_page_io(struct page *page)
>   		return;
>   
>   	lock_page(page);
> -	clear_page_dirty_for_io(page);
> +	clear_page_dirty_for_io(NULL, page);
>   	set_page_writeback(page);
>   	unlock_page(page);
>   }
> @@ -1662,7 +1662,7 @@ static void nilfs_segctor_prepare_write(struct nilfs_sc_info *sci)
>   			if (bh->b_page != bd_page) {
>   				if (bd_page) {
>   					lock_page(bd_page);
> -					clear_page_dirty_for_io(bd_page);
> +					clear_page_dirty_for_io(NULL, bd_page);
>   					set_page_writeback(bd_page);
>   					unlock_page(bd_page);
>   				}
> @@ -1676,7 +1676,7 @@ static void nilfs_segctor_prepare_write(struct nilfs_sc_info *sci)
>   			if (bh == segbuf->sb_super_root) {
>   				if (bh->b_page != bd_page) {
>   					lock_page(bd_page);
> -					clear_page_dirty_for_io(bd_page);
> +					clear_page_dirty_for_io(NULL, bd_page);
>   					set_page_writeback(bd_page);
>   					unlock_page(bd_page);
>   					bd_page = bh->b_page;
> @@ -1691,7 +1691,7 @@ static void nilfs_segctor_prepare_write(struct nilfs_sc_info *sci)
>   	}
>   	if (bd_page) {
>   		lock_page(bd_page);
> -		clear_page_dirty_for_io(bd_page);
> +		clear_page_dirty_for_io(NULL, bd_page);
>   		set_page_writeback(bd_page);
>   		unlock_page(bd_page);
>   	}
> diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
> index 4df560894386..829de5553a77 100644
> --- a/fs/orangefs/inode.c
> +++ b/fs/orangefs/inode.c
> @@ -501,7 +501,7 @@ static int orangefs_launder_folio(struct folio *folio)
>   		.nr_to_write = 0,
>   	};
>   	folio_wait_writeback(folio);
> -	if (folio_clear_dirty_for_io(folio)) {
> +	if (folio_clear_dirty_for_io(&wbc, folio)) {
>   		r = orangefs_writepage_locked(&folio->page, &wbc);
>   		folio_end_writeback(folio);
>   	}
> diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
> index f2353dd676ef..41d764fd5511 100644
> --- a/fs/ubifs/file.c
> +++ b/fs/ubifs/file.c
> @@ -1159,7 +1159,7 @@ static int do_truncation(struct ubifs_info *c, struct inode *inode,
>   				 */
>   				ubifs_assert(c, PagePrivate(page));
>   
> -				clear_page_dirty_for_io(page);
> +				clear_page_dirty_for_io(NULL, page);
>   				if (UBIFS_BLOCKS_PER_PAGE_SHIFT)
>   					offset = new_size &
>   						 (PAGE_SIZE - 1);
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 29e1f9e76eb6..81c42a95cf8d 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -1059,8 +1059,9 @@ static inline void folio_cancel_dirty(struct folio *folio)
>   	if (folio_test_dirty(folio))
>   		__folio_cancel_dirty(folio);
>   }
> -bool folio_clear_dirty_for_io(struct folio *folio);
> -bool clear_page_dirty_for_io(struct page *page);
> +bool folio_clear_dirty_for_io(struct writeback_control *wbc,
> +			      struct folio *folio);
> +bool clear_page_dirty_for_io(struct writeback_control *wbc, struct page *page);
>   void folio_invalidate(struct folio *folio, size_t offset, size_t length);
>   int __must_check folio_write_one(struct folio *folio);
>   static inline int __must_check write_one_page(struct page *page)
> diff --git a/mm/folio-compat.c b/mm/folio-compat.c
> index 69ed25790c68..748f82def674 100644
> --- a/mm/folio-compat.c
> +++ b/mm/folio-compat.c
> @@ -63,9 +63,9 @@ int __set_page_dirty_nobuffers(struct page *page)
>   }
>   EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>   
> -bool clear_page_dirty_for_io(struct page *page)
> +bool clear_page_dirty_for_io(struct writeback_control *wbc, struct page *page)
>   {
> -	return folio_clear_dirty_for_io(page_folio(page));
> +	return folio_clear_dirty_for_io(wbc, page_folio(page));
>   }
>   EXPORT_SYMBOL(clear_page_dirty_for_io);
>   
> diff --git a/mm/migrate.c b/mm/migrate.c
> index a4d3fc65085f..0bda652153b9 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -870,7 +870,7 @@ static int writeout(struct address_space *mapping, struct folio *folio)
>   		/* No write method for the address space */
>   		return -EINVAL;
>   
> -	if (!folio_clear_dirty_for_io(folio))
> +	if (!folio_clear_dirty_for_io(&wbc, folio))
>   		/* Someone else already triggered a write */
>   		return -EAGAIN;
>   
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index ad608ef2a243..2d70070e533c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2465,7 +2465,7 @@ int write_cache_pages(struct address_space *mapping,
>   			}
>   
>   			BUG_ON(PageWriteback(page));
> -			if (!clear_page_dirty_for_io(page))
> +			if (!clear_page_dirty_for_io(wbc, page))
>   				goto continue_unlock;
>   
>   			trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
> @@ -2628,7 +2628,7 @@ int folio_write_one(struct folio *folio)
>   
>   	folio_wait_writeback(folio);
>   
> -	if (folio_clear_dirty_for_io(folio)) {
> +	if (folio_clear_dirty_for_io(&wbc, folio)) {
>   		folio_get(folio);
>   		ret = mapping->a_ops->writepage(&folio->page, &wbc);
>   		if (ret == 0)
> @@ -2924,7 +2924,7 @@ EXPORT_SYMBOL(__folio_cancel_dirty);
>   
>   /*
>    * Clear a folio's dirty flag, while caring for dirty memory accounting.
> - * Returns true if the folio was previously dirty.
> + * Returns true if the folio was previously dirty and should be written back.
>    *
>    * This is for preparing to put the folio under writeout.  We leave
>    * the folio tagged as dirty in the xarray so that a concurrent
> @@ -2935,8 +2935,14 @@ EXPORT_SYMBOL(__folio_cancel_dirty);
>    *
>    * This incoherency between the folio's dirty flag and xarray tag is
>    * unfortunate, but it only exists while the folio is locked.
> + *
> + * If the folio is pinned, its writeback is problematic so we just don't bother
> + * for memory cleaning writeback - this is why writeback control is passed in.
> + * If it is NULL, we assume pinned pages are not expected (e.g. this can be
> + * a metadata page) and warn if the page is actually pinned.
>    */
> -bool folio_clear_dirty_for_io(struct folio *folio)
> +bool folio_clear_dirty_for_io(struct writeback_control *wbc,
> +			      struct folio *folio)
>   {
>   	struct address_space *mapping = folio_mapping(folio);
>   	bool ret = false;
> @@ -2975,6 +2981,16 @@ bool folio_clear_dirty_for_io(struct folio *folio)
>   		 */
>   		if (folio_mkclean(folio))
>   			folio_mark_dirty(folio);
> +		/*
> +		 * For pinned folios we have no chance to reclaim them anyway
> +		 * and we cannot be sure the folio is ever clean. So memory
> +		 * cleaning writeback is pointless. Just skip it.
> +		 */
> +		if (wbc && wbc->sync_mode == WB_SYNC_NONE &&
> +		    folio_maybe_dma_pinned(folio))
> +			return false;
> +		/* Catch callers not expecting pinned pages */
> +		WARN_ON_ONCE(!wbc && folio_maybe_dma_pinned(folio));
>   		/*
>   		 * We carefully synchronise fault handlers against
>   		 * installing a dirty pte and marking the folio dirty
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ab3911a8b116..71a226b47ac6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1283,7 +1283,7 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
>   	if (mapping->a_ops->writepage == NULL)
>   		return PAGE_ACTIVATE;
>   
> -	if (folio_clear_dirty_for_io(folio)) {
> +	if (folio_clear_dirty_for_io(NULL, folio)) {
>   		int res;
>   		struct writeback_control wbc = {
>   			.sync_mode = WB_SYNC_NONE,



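For reference, the skip/warn decision in the folio_clear_dirty_for_io() hunk above can be modeled in plain userspace C. Everything below uses simplified stand-in types and names (not actual kernel code); only the wbc/pin logic follows the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace model of the wbc/pin decision added to
 * folio_clear_dirty_for_io() in the hunk above. All types and
 * names here are simplified stand-ins, not kernel code.
 */
enum wb_sync_mode { WB_SYNC_NONE, WB_SYNC_ALL };

struct writeback_control { enum wb_sync_mode sync_mode; };
struct folio { bool dirty; bool maybe_dma_pinned; };

static int warned;	/* stands in for WARN_ON_ONCE() firing */

static bool clear_dirty_for_io(struct writeback_control *wbc,
			       struct folio *folio)
{
	if (!folio->dirty)
		return false;	/* someone else already triggered a write */
	/* Memory cleaning writeback is pointless for pinned folios. */
	if (wbc && wbc->sync_mode == WB_SYNC_NONE && folio->maybe_dma_pinned)
		return false;
	/* Catch callers (NULL wbc) not expecting pinned pages. */
	if (!wbc && folio->maybe_dma_pinned)
		warned = 1;
	folio->dirty = false;
	return true;		/* caller should write the folio back */
}
```

In this model a pinned folio is skipped (and stays dirty) under WB_SYNC_NONE, is still written for WB_SYNC_ALL data integrity writeback, and a NULL wbc keeps the old behavior while flagging the caller.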
^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] mm: Do not try to write pinned folio during memory cleaning writeback
  2023-02-10  1:54   ` John Hubbard
@ 2023-02-10  2:10     ` John Hubbard
  2023-02-10 10:42       ` Jan Kara
  2023-02-10 10:54     ` Jan Kara
  1 sibling, 1 reply; 26+ messages in thread
From: John Hubbard @ 2023-02-10  2:10 UTC (permalink / raw)
  To: Jan Kara, linux-fsdevel
  Cc: linux-block, linux-mm, David Howells, David Hildenbrand

On 2/9/23 17:54, John Hubbard wrote:
> On 2/9/23 04:31, Jan Kara wrote:
>> When a folio is pinned, there is no point in trying to write it during
>> memory cleaning writeback. We cannot reclaim the folio until it is
>> unpinned anyway and we cannot even be sure the folio is really clean.
>> On top of that writeback of such folio may be problematic as the data
>> can change while the writeback is running thus causing checksum or
>> DIF/DIX failures. So just don't bother doing memory cleaning writeback
>> for pinned folios.
>>
>> Signed-off-by: Jan Kara <jack@suse.cz>
>> ---
>>   fs/9p/vfs_addr.c            |  2 +-
>>   fs/afs/file.c               |  2 +-
>>   fs/afs/write.c              |  6 +++---
>>   fs/btrfs/extent_io.c        | 14 +++++++-------
>>   fs/btrfs/free-space-cache.c |  2 +-
>>   fs/btrfs/inode.c            |  2 +-
>>   fs/btrfs/subpage.c          |  2 +-
> 

Oh, and one more fix, below, is required in order to build with my local
test config. Assuming that it is reasonable to deal with pinned pages
here, which I think it is:

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index 9c759df700ca..c3279fb0edc8 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -313,7 +313,7 @@ void __shmem_writeback(size_t size, struct address_space *mapping)
  		if (!page)
  			continue;
  
-		if (!page_mapped(page) && clear_page_dirty_for_io(page)) {
+		if (!page_mapped(page) && clear_page_dirty_for_io(&wbc, page)) {
  			int ret;
  
  			SetPageReclaim(page);

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] mm: Do not try to write pinned folio during memory cleaning writeback
  2023-02-10  2:10     ` John Hubbard
@ 2023-02-10 10:42       ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2023-02-10 10:42 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, linux-fsdevel, linux-block, linux-mm, David Howells,
	David Hildenbrand

On Thu 09-02-23 18:10:23, John Hubbard wrote:
> On 2/9/23 17:54, John Hubbard wrote:
> > On 2/9/23 04:31, Jan Kara wrote:
> > > When a folio is pinned, there is no point in trying to write it during
> > > memory cleaning writeback. We cannot reclaim the folio until it is
> > > unpinned anyway and we cannot even be sure the folio is really clean.
> > > On top of that writeback of such folio may be problematic as the data
> > > can change while the writeback is running thus causing checksum or
> > > DIF/DIX failures. So just don't bother doing memory cleaning writeback
> > > for pinned folios.
> > > 
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > >   fs/9p/vfs_addr.c            |  2 +-
> > >   fs/afs/file.c               |  2 +-
> > >   fs/afs/write.c              |  6 +++---
> > >   fs/btrfs/extent_io.c        | 14 +++++++-------
> > >   fs/btrfs/free-space-cache.c |  2 +-
> > >   fs/btrfs/inode.c            |  2 +-
> > >   fs/btrfs/subpage.c          |  2 +-
> > 
> 
> Oh, and one more fix, below, is required in order to build with my local
> test config. Assuming that it is reasonable to deal with pinned pages
> here, which I think it is:
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> index 9c759df700ca..c3279fb0edc8 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> @@ -313,7 +313,7 @@ void __shmem_writeback(size_t size, struct address_space *mapping)
>  		if (!page)
>  			continue;
> -		if (!page_mapped(page) && clear_page_dirty_for_io(page)) {
> +		if (!page_mapped(page) && clear_page_dirty_for_io(&wbc, page)) {
>  			int ret;
>  			SetPageReclaim(page);
> 

Thanks, fixed up. It didn't occur to me to grep drivers/ for these
functions :).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] mm: Do not try to write pinned folio during memory cleaning writeback
  2023-02-10  1:54   ` John Hubbard
  2023-02-10  2:10     ` John Hubbard
@ 2023-02-10 10:54     ` Jan Kara
  1 sibling, 0 replies; 26+ messages in thread
From: Jan Kara @ 2023-02-10 10:54 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, linux-fsdevel, linux-block, linux-mm, David Howells,
	David Hildenbrand

On Thu 09-02-23 17:54:14, John Hubbard wrote:
> On 2/9/23 04:31, Jan Kara wrote:
> > When a folio is pinned, there is no point in trying to write it during
> > memory cleaning writeback. We cannot reclaim the folio until it is
> > unpinned anyway and we cannot even be sure the folio is really clean.
> > On top of that writeback of such folio may be problematic as the data
> > can change while the writeback is running thus causing checksum or
> > DIF/DIX failures. So just don't bother doing memory cleaning writeback
> > for pinned folios.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >   fs/9p/vfs_addr.c            |  2 +-
> >   fs/afs/file.c               |  2 +-
> >   fs/afs/write.c              |  6 +++---
> >   fs/btrfs/extent_io.c        | 14 +++++++-------
> >   fs/btrfs/free-space-cache.c |  2 +-
> >   fs/btrfs/inode.c            |  2 +-
> >   fs/btrfs/subpage.c          |  2 +-
> 
> Hi Jan!
> 
> Just a quick note that this breaks the btrfs build in subpage.c.
> Because, unfortunately, btrfs creates 6 sets of functions via calls to a
> macro: IMPLEMENT_BTRFS_PAGE_OPS(). And that expects only one argument to
> the clear_page_func, and thus to clear_page_dirty_for_io().
> 
> It seems infeasible (right?) to add another argument to the other
> clear_page_func functions, which include:
> 
>    ClearPageUptodate
>    ClearPageError
>    end_page_writeback
>    ClearPageOrdered
>    ClearPageChecked
> 
> , so I expect IMPLEMENT_BTRFS_PAGE_OPS() may need to be partially
> unrolled, in order to pass in the new writeback control arg to
> clear_page_dirty_for_io().

Aha, thanks for catching this. So it is easy to fix this to make things
compile (just a wrapper around clear_page_dirty_for_io() - done now). It
will be a bit more challenging to propagate wbc in there for a proper
decision - that will probably need these functions not to be defined by the
macros.
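
A rough userspace sketch of that wrapper idea (all types and names are simplified stand-ins, and btrfs_clear_page_dirty_for_io() is a name invented here for illustration, not the actual hunk):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch of the compile fix described above: the
 * IMPLEMENT_BTRFS_PAGE_OPS() macro expects a one-argument
 * clear_page_func, so a wrapper supplies the new wbc argument.
 * All types and names here are simplified stand-ins.
 */
struct writeback_control;
struct page { bool dirty; };

static bool clear_page_dirty_for_io(struct writeback_control *wbc,
				    struct page *page)
{
	/*
	 * NULL wbc keeps the old behavior (and, in the real kernel,
	 * would WARN if a pinned page showed up here).
	 */
	(void)wbc;
	bool was_dirty = page->dirty;

	page->dirty = false;
	return was_dirty;
}

/* One-argument wrapper matching the signature the macro expects. */
static bool btrfs_clear_page_dirty_for_io(struct page *page)
{
	return clear_page_dirty_for_io(NULL, page);
}
```

Proper wbc propagation would still require unrolling the macro, as noted above.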

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/5] mm: Do not reclaim private data from pinned page
  2023-02-09 16:17   ` Matthew Wilcox
@ 2023-02-10 11:29     ` Jan Kara
  2023-02-13  9:55       ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2023-02-10 11:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, linux-fsdevel, linux-block, linux-mm, John Hubbard,
	David Howells, David Hildenbrand

On Thu 09-02-23 16:17:47, Matthew Wilcox wrote:
> On Thu, Feb 09, 2023 at 01:31:53PM +0100, Jan Kara wrote:
> > If the page is pinned, there's no point in trying to reclaim it.
> > Furthermore if the page is from the page cache we don't want to reclaim
> > fs-private data from the page because the pinning process may be writing
> > to the page at any time and reclaiming fs private info on a dirty page
> > can upset the filesystem (see link below).
> > 
> > Link: https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
> 
> OK, but now I'm confused.  I've been told before that the reason we
> can't take pinned pages off the LRU list is that they need to be written
> back periodically for ... reasons.  But now the pages are going to be
> skipped if they're found on the LRU list, so why is this better than
> taking them off the LRU list?

You are mixing things together a bit :). Yes, we do need to writeback
pinned pages from time to time - for data integrity purposes like fsync(2).
This has nothing to do with taking the pinned page out of LRU. It would be
actually nice to be able to take pinned pages out of the LRU and
functionally that would make sense but as I've mentioned in my reply to you
[1], the problem here is the performance. I've now dug out the discussion
from 2018 where John actually tried to take pinned pages out of the LRU [2]
and the result was 20% IOPS degradation on his NVME drive because of the
cost of taking the LRU lock. I'm not even speaking how costly that would
get on any heavily parallel direct IO workload on some high-iops device...

								Honza

[1] https://lore.kernel.org/all/20230124102931.g7e33syuhfo7s36h@quack3
[2] https://lore.kernel.org/all/f5ad7210-05e0-3dc4-02df-01ce5346e198@nvidia.com

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/5] mm: Do not reclaim private data from pinned page
  2023-02-09 12:31 ` [PATCH 1/5] mm: Do not reclaim private data from pinned page Jan Kara
  2023-02-09 16:17   ` Matthew Wilcox
@ 2023-02-13  9:01   ` David Hildenbrand
  2023-02-14 13:00     ` Jan Kara
  1 sibling, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2023-02-13  9:01 UTC (permalink / raw)
  To: Jan Kara, linux-fsdevel
  Cc: linux-block, linux-mm, John Hubbard, David Howells

On 09.02.23 13:31, Jan Kara wrote:
> If the page is pinned, there's no point in trying to reclaim it.
> Furthermore if the page is from the page cache we don't want to reclaim
> fs-private data from the page because the pinning process may be writing
> to the page at any time and reclaiming fs private info on a dirty page
> can upset the filesystem (see link below).
> 
> Link: https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>   mm/vmscan.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bf3eedf0209c..ab3911a8b116 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1901,6 +1901,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>   			}
>   		}
>   
> +		/*
> +		 * Folio is unmapped now so it cannot be newly pinned anymore.
> +		 * No point in trying to reclaim folio if it is pinned.
> +		 * Furthermore we don't want to reclaim underlying fs metadata
> +		 * if the folio is pinned and thus potentially modified by the
> +		 * pinning process is that may upset the filesystem.
> +		 */
> +		if (folio_maybe_dma_pinned(folio))
> +			goto activate_locked;
> +
>   		mapping = folio_mapping(folio);
>   		if (folio_test_dirty(folio)) {
>   			/*

At this point, we made sure that the folio is completely unmapped. 
However, we specify "TTU_BATCH_FLUSH", so rmap code might defer a TLB 
flush and consequently defer an IPI sync.

I remember that this check here is fine regarding GUP-fast: even if 
concurrent GUP-fast pins the page after our check here, it should 
observe the changed PTE and unpin it again.


Checking after unmapping makes sense: we reduce the likelihood of false 
positives when a file-backed page is mapped many times (>= 1024). OTOH, 
we might unmap pinned pages because we cannot really detect them early.

For anon pages, we have an early (racy) check, which turned out "ok" in 
practice, because we don't frequently have that many anon pages that are 
shared by that many processes. I assume we don't want something similar 
for pagecache pages, because having a single page mapped by many 
processes can happen easily and would prevent reclaim.

I once had a patch lying around that documented for the existing 
folio_maybe_dma_pinned() for anon pages exactly that (racy+false 
positives with many mappings).


Long story short, I assume this change is fine.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/5] mm: Do not reclaim private data from pinned page
  2023-02-10 11:29     ` Jan Kara
@ 2023-02-13  9:55       ` Christoph Hellwig
  2023-02-14 13:06         ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2023-02-13  9:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, linux-fsdevel, linux-block, linux-mm,
	John Hubbard, David Howells, David Hildenbrand

On Fri, Feb 10, 2023 at 12:29:54PM +0100, Jan Kara wrote:
> functionally that would make sense but as I've mentioned in my reply to you
> [1], the problem here is the performance. I've now dug out the discussion
> from 2018 where John actually tried to take pinned pages out of the LRU [2]
> and the result was 20% IOPS degradation on his NVME drive because of the
> cost of taking the LRU lock. I'm not even speaking how costly that would
> get on any heavily parallel direct IO workload on some high-iops device...

I think we need to distinguish between short- and long-term pins.
For short term pins like direct I/O it doesn't make sense to take them
off the lru, or to do any other special action.  Writeback will simply
have to wait for the short term pin.

Long-term pins absolutely would make sense to be taken off the LRU list.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/5] block: Add support for bouncing pinned pages
  2023-02-09 12:31 ` [PATCH 4/5] block: Add support for bouncing pinned pages Jan Kara
@ 2023-02-13  9:59   ` Christoph Hellwig
  2023-02-14 13:56     ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2023-02-13  9:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-block, linux-mm, John Hubbard,
	David Howells, David Hildenbrand

Eww.  The block bounce code really needs to go away, so a new user
makes me very unhappy.

But independent of that I don't think this is enough anyway.  Just
copying the data out into a new page in the block layer doesn't solve
the problem that this page needs to be tracked as dirtied for fs
accounting.  e.g. every time we write this copy it needs space allocated
for COW file systems.

Which brings me back to if and when we do writeback for pinned page.
I don't think doing any I/O for short term pins like direct I/O
make sense.  These pins are defined to be unpinned after I/O
completes, so we might as well just wait for the unpin instead of doing
anything complicated.

Long term pins are more troublesome, but I really wonder what the
defined semantics for data integrity writeback like fsync on them
is to start with as the content is very much undefined.  Should
an fsync on a (partially) long term pinned file simply fail?  It's
not like we can win in that scenario.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/5] mm: Do not reclaim private data from pinned page
  2023-02-13  9:01   ` David Hildenbrand
@ 2023-02-14 13:00     ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2023-02-14 13:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jan Kara, linux-fsdevel, linux-block, linux-mm, John Hubbard,
	David Howells

On Mon 13-02-23 10:01:35, David Hildenbrand wrote:
> On 09.02.23 13:31, Jan Kara wrote:
> > If the page is pinned, there's no point in trying to reclaim it.
> > Furthermore if the page is from the page cache we don't want to reclaim
> > fs-private data from the page because the pinning process may be writing
> > to the page at any time and reclaiming fs private info on a dirty page
> > can upset the filesystem (see link below).
> > 
> > Link: https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >   mm/vmscan.c | 10 ++++++++++
> >   1 file changed, 10 insertions(+)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index bf3eedf0209c..ab3911a8b116 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1901,6 +1901,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >   			}
> >   		}
> > +		/*
> > +		 * Folio is unmapped now so it cannot be newly pinned anymore.
> > +		 * No point in trying to reclaim folio if it is pinned.
> > +		 * Furthermore we don't want to reclaim underlying fs metadata
> > +		 * if the folio is pinned and thus potentially modified by the
> > +		 * pinning process, which may upset the filesystem.
> > +		 */
> > +		if (folio_maybe_dma_pinned(folio))
> > +			goto activate_locked;
> > +
> >   		mapping = folio_mapping(folio);
> >   		if (folio_test_dirty(folio)) {
> >   			/*
> 
> At this point, we made sure that the folio is completely unmapped. However,
> we specify "TTU_BATCH_FLUSH", so rmap code might defer a TLB flush and
> consequently defer an IPI sync.
> 
> I remember that this check here is fine regarding GUP-fast: even if
> concurrent GUP-fast pins the page after our check here, it should observe
> the changed PTE and unpin it again.
>  
> Checking after unmapping makes sense: we reduce the likelihood of false
> positives when a file-backed page is mapped many times (>= 1024). OTOH, we
> might unmap pinned pages because we cannot really detect them early.
> 
> For anon pages, we have an early (racy) check, which turned out "ok" in
> practice, because we don't frequently have that many anon pages that are
> shared by that many processes. I assume we don't want something similar for
> pagecache pages, because having a single page mapped by many processes can
> happen easily and would prevent reclaim.

Yeah, I think pagecache pages shared by many processes are more likely.
Furthermore I think pinned pagecache pages are rather rare so unmapping
them before checking seems fine to me. Obviously we can reconsider if
reality would prove me wrong ;).

> I once had a patch lying around that documented for the existing
> folio_maybe_dma_pinned() for anon pages exactly that (racy+false positives
> with many mappings).
> 
> Long story short, I assume this change is fine.

Thanks for the thorough verification :)

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/5] mm: Do not reclaim private data from pinned page
  2023-02-13  9:55       ` Christoph Hellwig
@ 2023-02-14 13:06         ` Jan Kara
  2023-02-14 21:40           ` John Hubbard
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2023-02-14 13:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Matthew Wilcox, linux-fsdevel, linux-block, linux-mm,
	John Hubbard, David Howells, David Hildenbrand

On Mon 13-02-23 01:55:04, Christoph Hellwig wrote:
> On Fri, Feb 10, 2023 at 12:29:54PM +0100, Jan Kara wrote:
> > functionally that would make sense but as I've mentioned in my reply to you
> > [1], the problem here is the performance. I've now dug out the discussion
> > from 2018 where John actually tried to take pinned pages out of the LRU [2]
> > and the result was 20% IOPS degradation on his NVME drive because of the
> > cost of taking the LRU lock. I'm not even speaking of how costly that would
> > get on any heavily parallel direct IO workload on some high-iops device...
> 
> I think we need to distinguish between short- and long-term pins.
> For short term pins like direct I/O it doesn't make sense to take them
> off the lru, or to do any other special action.  Writeback will simply
> have to wait for the short term pin.
> 
> Long-term pins absolutely would make sense to be taken off the LRU list.

Yeah, I agree distinguishing these two would be nice as we could treat them
differently then. The trouble is a bit with always-crowded struct page. But
now it occurred to me that if we are going to take these long-term pinned
pages out from the LRU, we could overload the space for LRU pointers with
the counter (which is what I think John originally did). So yes, possibly
we could track separately long-term and short-term pins. John, what do you
think? Maybe time to revive your patches from 2018 in a bit different form?
;)

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/5] block: Add support for bouncing pinned pages
  2023-02-13  9:59   ` Christoph Hellwig
@ 2023-02-14 13:56     ` Jan Kara
  2023-02-15  4:59       ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2023-02-14 13:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-fsdevel, linux-block, linux-mm, John Hubbard,
	David Howells, David Hildenbrand

On Mon 13-02-23 01:59:28, Christoph Hellwig wrote:
> Eww.  The block bounce code really needs to go away, so a new user
> makes me very unhappy.
> 
> But independent of that I don't think this is enough anyway.  Just
> copying the data out into a new page in the block layer doesn't solve
> the problem that this page needs to be tracked as dirtied for fs
> accounting.  e.g. every time we write this copy it needs space allocated
> for COW file systems.

Right, I forgot about this in my RFC. My original plan was to not clear the
dirty bit in clear_page_dirty_for_io() even for WB_SYNC_ALL writeback when
we do write back the page and perhaps indicate this in the return value of
clear_page_dirty_for_io() so that the COW filesystem can keep tracking this
page as dirty.

> Which brings me back to if and when we do writeback for pinned page.
> I don't think doing any I/O for short term pins like direct I/O
> make sense.  These pins are defined to be unpinned after I/O
> completes, so we might as well just wait for the unpin instead of doing
> anything complicated.

Agreed. For short term pins we could just wait which should be quite
simple (although there's some DoS potential of this behavior if somebody
runs multiple processes that keep pinning some page with short term pins).

> Long term pins are more troublesome, but I really wonder what the
> defined semantics for data integrity writeback like fsync on them
> is to start with as the content is very much undefined.  Should
> an fsync on a (partially) long term pinned file simply fail?  It's
> not like we can win in that scenario.

Well, we have also cases like sync(2) so one would have to be careful with
error propagation and I'm afraid there are enough programs out there that
treat any error return from fsync(2) as catastrophic so I suspect this
could lead to some surprises. The case I'm most worried about is if some
application sets up RDMA to an mmaped file, runs the transfer and waits for
it to complete, doesn't bother to unpin the pages (keeps them for future
transfers) and calls fsync(2) to make data stable on local storage. That
does seem like quite sensible use and so far it works just fine. And not
writing pages with fsync(2) would break such uses.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/5] mm: Do not reclaim private data from pinned page
  2023-02-14 13:06         ` Jan Kara
@ 2023-02-14 21:40           ` John Hubbard
  2023-02-16 11:56             ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: John Hubbard @ 2023-02-14 21:40 UTC (permalink / raw)
  To: Jan Kara, Christoph Hellwig
  Cc: Matthew Wilcox, linux-fsdevel, linux-block, linux-mm,
	David Howells, David Hildenbrand

On 2/14/23 05:06, Jan Kara wrote:
> On Mon 13-02-23 01:55:04, Christoph Hellwig wrote:
>> I think we need to distinguish between short- and long-term pins.
>> For short term pins like direct I/O it doesn't make sense to take them
>> off the lru, or to do any other special action.  Writeback will simply
>> have to wait for the short term pin.
>>
>> Long-term pins absolutely would make sense to be taken off the LRU list.
> 
> Yeah, I agree distinguishing these two would be nice as we could treat them
> differently then. The trouble is a bit with always-crowded struct page. But
> now it occurred to me that if we are going to take these long-term pinned
> pages out from the LRU, we could overload the space for LRU pointers with
> the counter (which is what I think John originally did). So yes, possibly
> we could track separately long-term and short-term pins. John, what do you
> think? Maybe time to revive your patches from 2018 in a bit different form?
> ;)
> 

Oh wow, I really love this idea. We kept running into problems because
long- and short-term pins were mixed up together (except during
creation), and this, at long last, separates them. Very nice. I'd almost
forgotten about the 2018 page.lru adventures, too. ha :)

One objection might be that pinning is now going to be taking a lot of
space in struct page / folio, but I think it's warranted, based on the
long-standing, difficult problems that it would solve.

We could even leave most of these patches, and David Howells' patches,
intact, by using an approach similar to the mm_users and mm_count
technique: maintain a long-term pin count in one of the folio->lru
fields, and any non-zero count there creates a single count in
folio->_pincount.

I could put together something to do that.
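
A toy userspace model of that counting scheme (all names are hypothetical, and a real implementation would also have to close the 0->1 and 1->0 transition races):

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Toy model of the mm_users/mm_count-style scheme sketched above:
 * long-term pins are counted separately (in space that would come
 * from the folio's unused lru pointers), and any non-zero long-term
 * count contributes exactly one reference to _pincount.
 */
struct toy_folio {
	atomic_int _pincount;	/* what folio_maybe_dma_pinned() reads */
	atomic_int lt_pins;	/* long-term pins, in ex-lru space */
};

static void shortterm_pin(struct toy_folio *f)
{
	atomic_fetch_add(&f->_pincount, 1);
}

static void shortterm_unpin(struct toy_folio *f)
{
	atomic_fetch_sub(&f->_pincount, 1);
}

static void longterm_pin(struct toy_folio *f)
{
	/* The first long-term pin takes a single _pincount reference. */
	if (atomic_fetch_add(&f->lt_pins, 1) == 0)
		atomic_fetch_add(&f->_pincount, 1);
}

static void longterm_unpin(struct toy_folio *f)
{
	/* The last long-term unpin drops that single reference. */
	if (atomic_fetch_sub(&f->lt_pins, 1) == 1)
		atomic_fetch_sub(&f->_pincount, 1);
}

/* Pinned at all, by anyone. */
static int has_any_pin(struct toy_folio *f)
{
	return atomic_load(&f->_pincount) > 0;
}

/* Pinned, but only by short-term users such as direct I/O. */
static int has_only_shortterm_pins(struct toy_folio *f)
{
	return has_any_pin(f) && atomic_load(&f->lt_pins) == 0;
}
```

With this split, callers can tell apart folios with any pin from folios held only by short-term pins, which is exactly the distinction the short- vs long-term discussion above needs.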


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/5] block: Add support for bouncing pinned pages
  2023-02-14 13:56     ` Jan Kara
@ 2023-02-15  4:59       ` Dave Chinner
  2023-02-15  6:24         ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2023-02-15  4:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-fsdevel, linux-block, linux-mm,
	John Hubbard, David Howells, David Hildenbrand

On Tue, Feb 14, 2023 at 02:56:04PM +0100, Jan Kara wrote:
> On Mon 13-02-23 01:59:28, Christoph Hellwig wrote:
> > Eww.  The block bounce code really needs to go away, so a new user
> > makes me very unhappy.
> > 
> > But independent of that I don't think this is enough anyway.  Just
> > copying the data out into a new page in the block layer doesn't solve
> > the problem that this page needs to be tracked as dirtied for fs
> > accounting.  e.g. every time we write this copy it needs space allocated
> > for COW file systems.
> 
> Right, I forgot about this in my RFC. My original plan was to not clear the
> dirty bit in clear_page_dirty_for_io() even for WB_SYNC_ALL writeback when
> we do write back the page and perhaps indicate this in the return value of
> clear_page_dirty_for_io() so that the COW filesystem can keep tracking this
> page as dirty.

I don't think this works, especially if the COW mechanism relies on
delayed allocation to prevent ENOSPC during writeback. That is, we
need a write() or page fault (to run ->page_mkwrite()) after every
call to folio_clear_dirty_for_io() in the writeback path to ensure
that new space is reserved for the allocation that will occur
during a future writeback of that page.

Hence we can't just leave the page dirty on COW filesystems - it has
to go through a clean state so that the clean->dirty event can be
gated on gaining the space reservation that allows it to be written
back again.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/5] block: Add support for bouncing pinned pages
  2023-02-15  4:59       ` Dave Chinner
@ 2023-02-15  6:24         ` Christoph Hellwig
  2023-02-16 12:33           ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2023-02-15  6:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, linux-block,
	linux-mm, John Hubbard, David Howells, David Hildenbrand

On Wed, Feb 15, 2023 at 03:59:52PM +1100, Dave Chinner wrote:
> I don't think this works, especially if the COW mechanism relies on
> delayed allocation to prevent ENOSPC during writeback. That is, we
> need a write() or page fault (to run ->page_mkwrite()) after every
> call to folio_clear_dirty_for_io() in the writeback path to ensure
> that new space is reserved for the allocation that will occur
> during a future writeback of that page.
> 
> Hence we can't just leave the page dirty on COW filesystems - it has
> to go through a clean state so that the clean->dirty event can be
> gated on gaining the space reservation that allows it to be written
> back again.

Exactly.  Although if we really wanted to, we could do the redirtying
without formally moving to a clean state; it certainly would require
special new code to do the same steps as if we were redirtying.

Which is another reason why I'd prefer to avoid all that if we can.
For that we probably need an inventory of what long term pins we have
in the kernel tree that can and do operate on shared file mappings,
and what kind of I/O semantics they expect.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/5] mm: Do not reclaim private data from pinned page
  2023-02-14 21:40           ` John Hubbard
@ 2023-02-16 11:56             ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2023-02-16 11:56 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Christoph Hellwig, Matthew Wilcox, linux-fsdevel,
	linux-block, linux-mm, David Howells, David Hildenbrand

On Tue 14-02-23 13:40:17, John Hubbard wrote:
> On 2/14/23 05:06, Jan Kara wrote:
> > On Mon 13-02-23 01:55:04, Christoph Hellwig wrote:
> >> I think we need to distinguish between short- and long-term pins.
> >> For short term pins like direct I/O it doesn't make sense to take them
> >> off the lru, or to do any other special action.  Writeback will simply
> >> have to wait for the short term pin.
> >>
> >> Long-term pins absolutely would make sense to be taken off the LRU list.
> > 
> > Yeah, I agree distinguishing these two would be nice as we could treat them
> > differently then. The trouble is a bit with always-crowded struct page. But
> > now it occurred to me that if we are going to take these long-term pinned
> > pages out from the LRU, we could overload the space for LRU pointers with
> > the counter (which is what I think John originally did). So yes, possibly
> > we could track separately long-term and short-term pins. John, what do you
> > think? Maybe time to revive your patches from 2018 in a bit different form?
> > ;)
> > 
> 
> Oh wow, I really love this idea. We kept running into problems because
> long- and short-term pins were mixed up together (except during
> creation), and this, at long last, separates them. Very nice. I'd almost
> forgotten about the 2018 page.lru adventures, too. ha :)
> 
> One objection might be that pinning is now going to be taking a lot of
> space in struct page / folio, but I think it's warranted, based on the
> long-standing, difficult problems that it would solve.

Well, it doesn't need to consume more space in struct page than it
already does AFAICS. We could just mark the folio as unevictable
and make sure folio_evictable() returns false for such pages. Then we
should be safe to use the space of the lru pointers for whatever we need.

> We could even leave most of these patches, and David Howells' patches,
> intact, by using an approach similar to the mm_users and mm_count
> technique: maintain a long-term pin count in one of the folio->lru
> fields, and any non-zero count there creates a single count in
> folio->_pincount.

Oh, you mean that the first longterm pin would take one short-term pin?
Yes, that should be possible but I'm not sure it would be a huge win. I
can imagine users caring about distinguishing these states:

1) unpinned
2) has any pin
3) has only short term pins

Now distinguishing between 1 and 2+3 would still be done by
folio_maybe_dma_pinned(). Your change will allow us to not look at lru
pointers in folio_maybe_dma_pinned(), so that's some simplification and
perhaps a performance optimization (potentially it can save us the need to
pull in another cacheline, although mostly _refcount and lru will be in the
same cacheline anyway), so maybe it's worth it in the end.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 4/5] block: Add support for bouncing pinned pages
  2023-02-15  6:24         ` Christoph Hellwig
@ 2023-02-16 12:33           ` Jan Kara
  2023-02-20  6:22             ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2023-02-16 12:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Jan Kara, linux-fsdevel, linux-block, linux-mm,
	John Hubbard, David Howells, David Hildenbrand

On Tue 14-02-23 22:24:33, Christoph Hellwig wrote:
> On Wed, Feb 15, 2023 at 03:59:52PM +1100, Dave Chinner wrote:
> > I don't think this works, especially if the COW mechanism relies on
> > delayed allocation to prevent ENOSPC during writeback. That is, we
> > need a write() or page fault (to run ->page_mkwrite()) after every
> > call to folio_clear_dirty_for_io() in the writeback path to ensure
> > that new space is reserved for the allocation that will occur
> > during a future writeback of that page.
> > 
> > Hence we can't just leave the page dirty on COW filesystems - it has
> > to go through a clean state so that the clean->dirty event can be
> > gated on gaining the space reservation that allows it to be written
> > back again.
> 
> Exactly.  Although if we really wanted to, we could do the redirtying
> without formally moving to a clean state, but it would certainly require
> special new code to do the same steps as if we were redirtying.

Yes.

> Which is another reason why I'd prefer to avoid all that if we can.
> For that we probably need an inventory of what long term pins we have
> in the kernel tree that can and do operate on shared file mappings,
> and what kind of I/O semantics they expect.

I'm a bit skeptical we can reasonably assess that (as much as I would love
to just not write these pages and be done with it) because a lot of
FOLL_LONGTERM users just pin a passed-in userspace address range, then allow
userspace to manipulate it with other operations, and finally unpin it with
another call. Who knows whether shared pagecache pages are passed in and
what userspace is doing with them while they are pinned?

We have stuff like io_uring using FOLL_LONGTERM for IO buffers passed from
userspace (e.g. IORING_REGISTER_BUFFERS operation), we have V4L2 which
similarly pins buffers for video processing (and I vaguely remember one
bugreport due to some phone passing shared file pages there to acquire
screenshots from a webcam), and we have various infiniband drivers doing
this (not all of them are using FOLL_LONGTERM but they should AFAICS). We
even have vmsplice(2), which arguably should be using pinning with
FOLL_LONGTERM (at least that's the plan AFAIK), and not writing such pages
would IMO provide an interesting attack vector...

								Honza


-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 4/5] block: Add support for bouncing pinned pages
  2023-02-16 12:33           ` Jan Kara
@ 2023-02-20  6:22             ` Christoph Hellwig
  2023-02-27 11:39               ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2023-02-20  6:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-block,
	linux-mm, John Hubbard, David Howells, David Hildenbrand

On Thu, Feb 16, 2023 at 01:33:16PM +0100, Jan Kara wrote:
> I'm a bit skeptical we can reasonably assess that (as much as I would love
> to just not write these pages and be done with it) because a lot of
> FOLL_LONGTERM users just pin a passed-in userspace address range, then allow
> userspace to manipulate it with other operations, and finally unpin it with
> another call. Who knows whether shared pagecache pages are passed in and
> what userspace is doing with them while they are pinned?

True.

So what other sensible thing could we do at a higher level?

Treat MAP_SHARED buffers that are long term registered as MAP_PRIVATE
while registered, and just do writeback using in-kernel O_DIRECT on
fsync?


* Re: [PATCH 4/5] block: Add support for bouncing pinned pages
  2023-02-20  6:22             ` Christoph Hellwig
@ 2023-02-27 11:39               ` Jan Kara
  2023-02-27 13:36                 ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2023-02-27 11:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Dave Chinner, linux-fsdevel, linux-block, linux-mm,
	John Hubbard, David Howells, David Hildenbrand

On Sun 19-02-23 22:22:32, Christoph Hellwig wrote:
> On Thu, Feb 16, 2023 at 01:33:16PM +0100, Jan Kara wrote:
> > I'm a bit skeptical we can reasonably assess that (as much as I would love
> > to just not write these pages and be done with it) because a lot of
> > FOLL_LONGTERM users just pin a passed-in userspace address range, then allow
> > userspace to manipulate it with other operations, and finally unpin it with
> > another call. Who knows whether shared pagecache pages are passed in and
> > what userspace is doing with them while they are pinned?
> 
> True.
> 
> So what other sensible thing could we do at a higher level?
> 
> Treat MAP_SHARED buffers that are long term registered as MAP_PRIVATE
> while registered, and just do writeback using in-kernel O_DIRECT on
> fsync?

Do you mean to copy these pages on fsync(2) to a newly allocated buffer and
then submit it via direct IO? That looks sensible to me. We could then make
the writeback path completely ignore these long-term pinned pages and just
add this copying logic into filemap_fdatawrite() or something like that.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 4/5] block: Add support for bouncing pinned pages
  2023-02-27 11:39               ` Jan Kara
@ 2023-02-27 13:36                 ` Christoph Hellwig
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2023-02-27 13:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-block,
	linux-mm, John Hubbard, David Howells, David Hildenbrand

On Mon, Feb 27, 2023 at 12:39:26PM +0100, Jan Kara wrote:
> Do you mean to copy these pages on fsync(2) to a newly allocated buffer and
> then submit it via direct IO? That looks sensible to me. We could then make
> the writeback path completely ignore these long-term pinned pages and just
> add this copying logic into filemap_fdatawrite() or something like that.

I don't think we'd even have to copy them on fsync.  Just do an in-kernel
ITER_BVEC direct I/O on them.  The only hard part would be to come up
with a way to skip the pagecache invalidation and writeout for these
I/Os.


end of thread

Thread overview: 26+ messages
2023-02-09 12:31 [PATCH RFC 0/5] Writeback handling of pinned pages Jan Kara
2023-02-09 12:31 ` [PATCH 1/5] mm: Do not reclaim private data from pinned page Jan Kara
2023-02-09 16:17   ` Matthew Wilcox
2023-02-10 11:29     ` Jan Kara
2023-02-13  9:55       ` Christoph Hellwig
2023-02-14 13:06         ` Jan Kara
2023-02-14 21:40           ` John Hubbard
2023-02-16 11:56             ` Jan Kara
2023-02-13  9:01   ` David Hildenbrand
2023-02-14 13:00     ` Jan Kara
2023-02-09 12:31 ` [PATCH 2/5] ext4: Drop workaround for mm reclaiming fs private page data Jan Kara
2023-02-09 12:31 ` [PATCH 3/5] mm: Do not try to write pinned folio during memory cleaning writeback Jan Kara
2023-02-10  1:54   ` John Hubbard
2023-02-10  2:10     ` John Hubbard
2023-02-10 10:42       ` Jan Kara
2023-02-10 10:54     ` Jan Kara
2023-02-09 12:31 ` [PATCH 4/5] block: Add support for bouncing pinned pages Jan Kara
2023-02-13  9:59   ` Christoph Hellwig
2023-02-14 13:56     ` Jan Kara
2023-02-15  4:59       ` Dave Chinner
2023-02-15  6:24         ` Christoph Hellwig
2023-02-16 12:33           ` Jan Kara
2023-02-20  6:22             ` Christoph Hellwig
2023-02-27 11:39               ` Jan Kara
2023-02-27 13:36                 ` Christoph Hellwig
2023-02-09 12:31 ` [PATCH 5/5] iomap: Bounce pinned pages during writeback Jan Kara
