linux-btrfs.vger.kernel.org archive mirror
* improve pagecache PSI annotations v2
@ 2022-09-15  9:41 Christoph Hellwig
  2022-09-15  9:41 ` [PATCH 1/5] mm: add PSI accounting around ->read_folio and ->readahead calls Christoph Hellwig
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Christoph Hellwig @ 2022-09-15  9:41 UTC (permalink / raw)
  To: Jens Axboe, Matthew Wilcox, Johannes Weiner, Suren Baghdasaryan,
	Andrew Morton
  Cc: Chris Mason, Josef Bacik, David Sterba, Gao Xiang, Chao Yu,
	linux-block, linux-btrfs, linux-fsdevel, linux-erofs, linux-mm

Hi all,

currently the VM tries to abuse the block layer submission path for
the page cache PSI annotations.  This series instead annotates the
->read_folio and ->readahead calls in the core VM code, and then
only deals with the odd direct add_to_page_cache_lru calls manually.

Changes since v1:
 - fix a logic error in ra_alloc_folio
 - drop an unlikely()
 - spell a comment in the weird way preferred by btrfs maintainers

Diffstat:
 block/bio.c               |    8 --------
 block/blk-core.c          |   17 -----------------
 fs/btrfs/compression.c    |   14 ++++++++++++--
 fs/direct-io.c            |    2 --
 fs/erofs/zdata.c          |   13 ++++++++++++-
 include/linux/blk_types.h |    1 -
 include/linux/pagemap.h   |    2 ++
 kernel/sched/psi.c        |    2 ++
 mm/filemap.c              |    7 +++++++
 mm/readahead.c            |   22 ++++++++++++++++++----
 10 files changed, 53 insertions(+), 35 deletions(-)


* [PATCH 1/5] mm: add PSI accounting around ->read_folio and ->readahead calls
  2022-09-15  9:41 improve pagecache PSI annotations v2 Christoph Hellwig
@ 2022-09-15  9:41 ` Christoph Hellwig
  2022-09-15  9:41 ` [PATCH 2/5] sched/psi: export psi_memstall_{enter,leave} Christoph Hellwig
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2022-09-15  9:41 UTC (permalink / raw)
  To: Jens Axboe, Matthew Wilcox, Johannes Weiner, Suren Baghdasaryan,
	Andrew Morton
  Cc: Chris Mason, Josef Bacik, David Sterba, Gao Xiang, Chao Yu,
	linux-block, linux-btrfs, linux-fsdevel, linux-erofs, linux-mm

PSI tries to account for the cost of bringing back in pages discarded by
the MM LRU management.  Currently the prime place for that is hooked into
the bio submission path, which is a rather bad place:

 - it does not actually account I/O for non-block file systems, of which
   we have many
 - it adds overhead and a layering violation to the block layer

Add the accounting to the two places in the core MM code that read
pages into an address space by calling ->read_folio and ->readahead,
so that the entire file system operation is covered.  This broadens
the coverage and allows removing the accounting from the block layer
going forward.

As psi_memstall_enter can deal with nested calls, this will not lead to
double accounting even while the bio annotations are still present.
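
To illustrate the nesting argument, here is a minimal userspace sketch.
It assumes that nesting is tracked by saving a per-task "already
stalled" state into the caller's flags word, so that only the outermost
enter/leave pair opens and closes a stall section.  struct model_task
and the model_* functions are hypothetical stand-ins for illustration,
not the kernel's code.

```c
#include <stdbool.h>

/* Userspace stand-in for per-task state; not the kernel's task_struct. */
struct model_task {
	bool in_memstall;	/* task is currently inside a stall section */
	unsigned int stalls;	/* number of stall sections actually opened */
};

/* Model of psi_memstall_enter(): only the outermost call opens a section. */
static void model_memstall_enter(struct model_task *t, unsigned long *flags)
{
	*flags = t->in_memstall;	/* remember whether we were nested */
	if (*flags)
		return;
	t->in_memstall = true;
	t->stalls++;
}

/* Model of psi_memstall_leave(): a nested leave is a no-op. */
static void model_memstall_leave(struct model_task *t, unsigned long *flags)
{
	if (*flags)
		return;
	t->in_memstall = false;
}

/* An outer annotation (core MM) wrapping an inner one (old bio path). */
static unsigned int model_nested_read(void)
{
	struct model_task t = { false, 0 };
	unsigned long outer, inner;

	model_memstall_enter(&t, &outer);
	model_memstall_enter(&t, &inner);	/* nested: no new section */
	model_memstall_leave(&t, &inner);	/* nested: still stalled */
	model_memstall_leave(&t, &outer);	/* outermost: section ends */

	return t.stalls;
}
```

In this model the nested pair inside model_nested_read() does not open
a second section, which is the property the paragraph above relies on.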

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/pagemap.h |  2 ++
 mm/filemap.c            |  7 +++++++
 mm/readahead.c          | 22 ++++++++++++++++++----
 3 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 0178b2040ea38..201dc7281640b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1173,6 +1173,8 @@ struct readahead_control {
 	pgoff_t _index;
 	unsigned int _nr_pages;
 	unsigned int _batch_count;
+	bool _workingset;
+	unsigned long _pflags;
 };
 
 #define DEFINE_READAHEAD(ractl, f, r, m, i)				\
diff --git a/mm/filemap.c b/mm/filemap.c
index 15800334147b3..c943d1b90cc26 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2382,6 +2382,8 @@ static void filemap_get_read_batch(struct address_space *mapping,
 static int filemap_read_folio(struct file *file, filler_t filler,
 		struct folio *folio)
 {
+	bool workingset = folio_test_workingset(folio);
+	unsigned long pflags;
 	int error;
 
 	/*
@@ -2390,8 +2392,13 @@ static int filemap_read_folio(struct file *file, filler_t filler,
 	 * fails.
 	 */
 	folio_clear_error(folio);
+
 	/* Start the actual read. The read will unlock the page. */
+	if (unlikely(workingset))
+		psi_memstall_enter(&pflags);
 	error = filler(file, folio);
+	if (unlikely(workingset))
+		psi_memstall_leave(&pflags);
 	if (error)
 		return error;
 
diff --git a/mm/readahead.c b/mm/readahead.c
index fdcd28cbd92de..b10f0cf81d804 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -122,6 +122,7 @@
 #include <linux/task_io_accounting_ops.h>
 #include <linux/pagevec.h>
 #include <linux/pagemap.h>
+#include <linux/psi.h>
 #include <linux/syscalls.h>
 #include <linux/file.h>
 #include <linux/mm_inline.h>
@@ -152,6 +153,8 @@ static void read_pages(struct readahead_control *rac)
 	if (!readahead_count(rac))
 		return;
 
+	if (unlikely(rac->_workingset))
+		psi_memstall_enter(&rac->_pflags);
 	blk_start_plug(&plug);
 
 	if (aops->readahead) {
@@ -179,6 +182,9 @@ static void read_pages(struct readahead_control *rac)
 	}
 
 	blk_finish_plug(&plug);
+	if (unlikely(rac->_workingset))
+		psi_memstall_leave(&rac->_pflags);
+	rac->_workingset = false;
 
 	BUG_ON(readahead_count(rac));
 }
@@ -252,6 +258,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 		}
 		if (i == nr_to_read - lookahead_size)
 			folio_set_readahead(folio);
+		ractl->_workingset |= folio_test_workingset(folio);
 		ractl->_nr_pages++;
 	}
 
@@ -480,11 +487,14 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
 	if (index == mark)
 		folio_set_readahead(folio);
 	err = filemap_add_folio(ractl->mapping, folio, index, gfp);
-	if (err)
+	if (err) {
 		folio_put(folio);
-	else
-		ractl->_nr_pages += 1UL << order;
-	return err;
+		return err;
+	}
+
+	ractl->_nr_pages += 1UL << order;
+	ractl->_workingset |= folio_test_workingset(folio);
+	return 0;
 }
 
 void page_cache_ra_order(struct readahead_control *ractl,
@@ -826,6 +836,10 @@ void readahead_expand(struct readahead_control *ractl,
 			put_page(page);
 			return;
 		}
+		if (unlikely(PageWorkingset(page)) && !ractl->_workingset) {
+			ractl->_workingset = true;
+			psi_memstall_enter(&ractl->_pflags);
+		}
 		ractl->_nr_pages++;
 		if (ra) {
 			ra->size++;
-- 
2.30.2



* [PATCH 2/5] sched/psi: export psi_memstall_{enter,leave}
  2022-09-15  9:41 improve pagecache PSI annotations v2 Christoph Hellwig
  2022-09-15  9:41 ` [PATCH 1/5] mm: add PSI accounting around ->read_folio and ->readahead calls Christoph Hellwig
@ 2022-09-15  9:41 ` Christoph Hellwig
  2022-09-15  9:41 ` [PATCH 3/5] btrfs: add manual PSI accounting for compressed reads Christoph Hellwig
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2022-09-15  9:41 UTC (permalink / raw)
  To: Jens Axboe, Matthew Wilcox, Johannes Weiner, Suren Baghdasaryan,
	Andrew Morton
  Cc: Chris Mason, Josef Bacik, David Sterba, Gao Xiang, Chao Yu,
	linux-block, linux-btrfs, linux-fsdevel, linux-erofs, linux-mm

To properly account for all refaults from file system logic, file systems
need to call psi_memstall_enter and psi_memstall_leave directly, so export
both.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/psi.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ecb4b4ff4ce0a..7f6030091aeee 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -917,6 +917,7 @@ void psi_memstall_enter(unsigned long *flags)
 
 	rq_unlock_irq(rq, &rf);
 }
+EXPORT_SYMBOL_GPL(psi_memstall_enter);
 
 /**
  * psi_memstall_leave - mark the end of an memory stall section
@@ -946,6 +947,7 @@ void psi_memstall_leave(unsigned long *flags)
 
 	rq_unlock_irq(rq, &rf);
 }
+EXPORT_SYMBOL_GPL(psi_memstall_leave);
 
 #ifdef CONFIG_CGROUPS
 int psi_cgroup_alloc(struct cgroup *cgroup)
-- 
2.30.2



* [PATCH 3/5] btrfs: add manual PSI accounting for compressed reads
  2022-09-15  9:41 improve pagecache PSI annotations v2 Christoph Hellwig
  2022-09-15  9:41 ` [PATCH 1/5] mm: add PSI accounting around ->read_folio and ->readahead calls Christoph Hellwig
  2022-09-15  9:41 ` [PATCH 2/5] sched/psi: export psi_memstall_{enter,leave} Christoph Hellwig
@ 2022-09-15  9:41 ` Christoph Hellwig
  2022-11-03 10:46   ` [REGESSION] systemd-oomd overreacting due to PSI changes for Btrfs (was: Re: [PATCH 3/5] btrfs: add manual PSI accounting for compressed reads) Thorsten Leemhuis
  2022-09-15  9:41 ` [PATCH 4/5] erofs: add manual PSI accounting for the compressed address space Christoph Hellwig
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2022-09-15  9:41 UTC (permalink / raw)
  To: Jens Axboe, Matthew Wilcox, Johannes Weiner, Suren Baghdasaryan,
	Andrew Morton
  Cc: Chris Mason, Josef Bacik, David Sterba, Gao Xiang, Chao Yu,
	linux-block, linux-btrfs, linux-fsdevel, linux-erofs, linux-mm

btrfs compressed reads try to always read the entire compressed chunk,
even if only a subset is requested.  Currently this is covered by the
magic PSI accounting underneath submit_bio, but that is about to go
away. Instead add manual psi_memstall_{enter,leave} annotations.

Note that for readahead this really should be using readahead_expand,
but the additional reads are also done for plain ->read_folio where
readahead_expand can't work, so this overall logic is left as-is for
now.
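
The pflags handling used for the manual annotations relies on a
sentinel value, which a small userspace sketch may make easier to
follow.  It assumes that enter stores the previous "already stalled"
state in the flags word and that leave does nothing when that word is
non-zero.  All sketch_* names are hypothetical.  As a simplification
that is specific to this sketch, enter is called at most once per read
so that the saved flags word is not overwritten.

```c
#include <stdbool.h>
#include <stddef.h>

static bool sketch_in_memstall;	/* stand-in for the per-task stall state */

static void sketch_memstall_enter(unsigned long *flags)
{
	*flags = sketch_in_memstall;	/* 0 if we open the section ourselves */
	if (*flags)
		return;
	sketch_in_memstall = true;
}

static void sketch_memstall_leave(unsigned long *flags)
{
	if (*flags)	/* nested, or enter never ran: nothing to close */
		return;
	sketch_in_memstall = false;
}

/*
 * Hypothetical read path.  Returns true if any page was part of the
 * workingset, i.e. if the read was treated as a memory stall.
 */
static bool sketch_compressed_read(const bool *workingset, size_t nr_pages)
{
	/* initialize to 1 so that leave is skipped unless enter ran */
	unsigned long pflags = 1;
	bool stalled = false;
	size_t i;

	for (i = 0; i < nr_pages; i++) {
		if (workingset[i] && !stalled) {
			sketch_memstall_enter(&pflags);
			stalled = true;
		}
	}

	/* the actual read submission would happen here */

	if (!pflags)
		sketch_memstall_leave(&pflags);
	return stalled;
}
```

With no workingset page, pflags keeps its initial value of 1 and leave
is never called.  With at least one such page and no outer stall
section, enter sets pflags to 0 and the final check closes the section.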

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/btrfs/compression.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index e84d22c5c6a83..370788b9b1249 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -15,6 +15,7 @@
 #include <linux/string.h>
 #include <linux/backing-dev.h>
 #include <linux/writeback.h>
+#include <linux/psi.h>
 #include <linux/slab.h>
 #include <linux/sched/mm.h>
 #include <linux/log2.h>
@@ -519,7 +520,8 @@ static u64 bio_end_offset(struct bio *bio)
  */
 static noinline int add_ra_bio_pages(struct inode *inode,
 				     u64 compressed_end,
-				     struct compressed_bio *cb)
+				     struct compressed_bio *cb,
+				     unsigned long *pflags)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	unsigned long end_index;
@@ -588,6 +590,9 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 			continue;
 		}
 
+		if (PageWorkingset(page))
+			psi_memstall_enter(pflags);
+
 		ret = set_page_extent_mapped(page);
 		if (ret < 0) {
 			unlock_page(page);
@@ -674,6 +679,8 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	u64 em_len;
 	u64 em_start;
 	struct extent_map *em;
+	/* Initialize to 1 to make skip psi_memstall_leave unless needed */
+	unsigned long pflags = 1;
 	blk_status_t ret;
 	int ret2;
 	int i;
@@ -729,7 +736,7 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		goto fail;
 	}
 
-	add_ra_bio_pages(inode, em_start + em_len, cb);
+	add_ra_bio_pages(inode, em_start + em_len, cb, &pflags);
 
 	/* include any pages we added in add_ra-bio_pages */
 	cb->len = bio->bi_iter.bi_size;
@@ -810,6 +817,9 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		}
 	}
 
+	if (!pflags)
+		psi_memstall_leave(&pflags);
+
 	if (refcount_dec_and_test(&cb->pending_ios))
 		finish_compressed_bio_read(cb);
 	return;
-- 
2.30.2



* [PATCH 4/5] erofs: add manual PSI accounting for the compressed address space
  2022-09-15  9:41 improve pagecache PSI annotations v2 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2022-09-15  9:41 ` [PATCH 3/5] btrfs: add manual PSI accounting for compressed reads Christoph Hellwig
@ 2022-09-15  9:41 ` Christoph Hellwig
  2022-09-15  9:42 ` [PATCH 5/5] block: remove PSI accounting from the bio layer Christoph Hellwig
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2022-09-15  9:41 UTC (permalink / raw)
  To: Jens Axboe, Matthew Wilcox, Johannes Weiner, Suren Baghdasaryan,
	Andrew Morton
  Cc: Chris Mason, Josef Bacik, David Sterba, Gao Xiang, Chao Yu,
	linux-block, linux-btrfs, linux-fsdevel, linux-erofs, linux-mm,
	Gao Xiang

erofs uses an additional address space for compressed data read from disk
in addition to the one directly associated with the inode.  Reading into
the lower address space is open coded using add_to_page_cache_lru instead
of using the filemap.c helper for page allocation micro-optimizations,
which means it is not covered by the MM PSI annotations for ->read_folio
and ->readahead, so add manual ones instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
 fs/erofs/zdata.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 5792ca9e0d5ef..143a101a36887 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -7,6 +7,7 @@
 #include "zdata.h"
 #include "compress.h"
 #include <linux/prefetch.h>
+#include <linux/psi.h>
 
 #include <trace/events/erofs.h>
 
@@ -1365,6 +1366,8 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
 	struct block_device *last_bdev;
 	unsigned int nr_bios = 0;
 	struct bio *bio = NULL;
+	/* initialize to 1 to make skip psi_memstall_leave unless needed */
+	unsigned long pflags = 1;
 
 	bi_private = jobqueueset_init(sb, q, fgq, force_fg);
 	qtail[JQ_BYPASS] = &q[JQ_BYPASS]->head;
@@ -1414,10 +1417,15 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
 			if (bio && (cur != last_index + 1 ||
 				    last_bdev != mdev.m_bdev)) {
 submit_bio_retry:
+				if (!pflags)
+					psi_memstall_leave(&pflags);
 				submit_bio(bio);
 				bio = NULL;
 			}
 
+			if (unlikely(PageWorkingset(page)))
+				psi_memstall_enter(&pflags);
+
 			if (!bio) {
 				bio = bio_alloc(mdev.m_bdev, BIO_MAX_VECS,
 						REQ_OP_READ, GFP_NOIO);
@@ -1445,8 +1453,11 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
 			move_to_bypass_jobqueue(pcl, qtail, owned_head);
 	} while (owned_head != Z_EROFS_PCLUSTER_TAIL);
 
-	if (bio)
+	if (bio) {
+		if (!pflags)
+			psi_memstall_leave(&pflags);
 		submit_bio(bio);
+	}
 
 	/*
 	 * although background is preferred, no one is pending for submission.
-- 
2.30.2



* [PATCH 5/5] block: remove PSI accounting from the bio layer
  2022-09-15  9:41 improve pagecache PSI annotations v2 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2022-09-15  9:41 ` [PATCH 4/5] erofs: add manual PSI accounting for the compressed address space Christoph Hellwig
@ 2022-09-15  9:42 ` Christoph Hellwig
  2022-09-15 13:01 ` improve pagecache PSI annotations v2 David Sterba
  2022-09-20 14:24 ` Jens Axboe
  6 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2022-09-15  9:42 UTC (permalink / raw)
  To: Jens Axboe, Matthew Wilcox, Johannes Weiner, Suren Baghdasaryan,
	Andrew Morton
  Cc: Chris Mason, Josef Bacik, David Sterba, Gao Xiang, Chao Yu,
	linux-block, linux-btrfs, linux-fsdevel, linux-erofs, linux-mm

PSI accounting is now done by the VM code, where it should have been
since the beginning.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 block/bio.c               |  8 --------
 block/blk-core.c          | 17 -----------------
 fs/direct-io.c            |  2 --
 include/linux/blk_types.h |  1 -
 4 files changed, 28 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 3d3a2678fea25..d10c4e888cdcf 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1065,9 +1065,6 @@ void __bio_add_page(struct bio *bio, struct page *page,
 
 	bio->bi_iter.bi_size += len;
 	bio->bi_vcnt++;
-
-	if (!bio_flagged(bio, BIO_WORKINGSET) && unlikely(PageWorkingset(page)))
-		bio_set_flag(bio, BIO_WORKINGSET);
 }
 EXPORT_SYMBOL_GPL(__bio_add_page);
 
@@ -1276,9 +1273,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
  * fit into the bio, or are requested in @iter, whatever is smaller. If
  * MM encounters an error pinning the requested pages, it stops. Error
  * is returned only if 0 pages could be pinned.
- *
- * It's intended for direct IO, so doesn't do PSI tracking, the caller is
- * responsible for setting BIO_WORKINGSET if necessary.
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1294,8 +1288,6 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		ret = __bio_iov_iter_get_pages(bio, iter);
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
 
-	/* don't account direct I/O as memory stall */
-	bio_clear_flag(bio, BIO_WORKINGSET);
 	return bio->bi_vcnt ? 0 : ret;
 }
 EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
diff --git a/block/blk-core.c b/block/blk-core.c
index a0d1104c5590c..9e19195af6f5b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -37,7 +37,6 @@
 #include <linux/t10-pi.h>
 #include <linux/debugfs.h>
 #include <linux/bpf.h>
-#include <linux/psi.h>
 #include <linux/part_stat.h>
 #include <linux/sched/sysctl.h>
 #include <linux/blk-crypto.h>
@@ -829,22 +828,6 @@ void submit_bio(struct bio *bio)
 		count_vm_events(PGPGOUT, bio_sectors(bio));
 	}
 
-	/*
-	 * If we're reading data that is part of the userspace workingset, count
-	 * submission time as memory stall.  When the device is congested, or
-	 * the submitting cgroup IO-throttled, submission can be a significant
-	 * part of overall IO time.
-	 */
-	if (unlikely(bio_op(bio) == REQ_OP_READ &&
-	    bio_flagged(bio, BIO_WORKINGSET))) {
-		unsigned long pflags;
-
-		psi_memstall_enter(&pflags);
-		submit_bio_noacct(bio);
-		psi_memstall_leave(&pflags);
-		return;
-	}
-
 	submit_bio_noacct(bio);
 }
 EXPORT_SYMBOL(submit_bio);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index f669163d5860f..03d381377ae10 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -421,8 +421,6 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 	unsigned long flags;
 
 	bio->bi_private = dio;
-	/* don't account direct I/O as memory stall */
-	bio_clear_flag(bio, BIO_WORKINGSET);
 
 	spin_lock_irqsave(&dio->bio_lock, flags);
 	dio->refcount++;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1ef99790f6ed3..8b1858df21752 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -321,7 +321,6 @@ enum {
 	BIO_NO_PAGE_REF,	/* don't put release vec pages */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
-	BIO_WORKINGSET,		/* contains userspace workingset pages */
 	BIO_QUIET,		/* Make BIO Quiet */
 	BIO_CHAIN,		/* chained bio, ->bi_remaining in effect */
 	BIO_REFFED,		/* bio has elevated ->bi_cnt */
-- 
2.30.2



* Re: improve pagecache PSI annotations v2
  2022-09-15  9:41 improve pagecache PSI annotations v2 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2022-09-15  9:42 ` [PATCH 5/5] block: remove PSI accounting from the bio layer Christoph Hellwig
@ 2022-09-15 13:01 ` David Sterba
  2022-09-19 15:45   ` Christoph Hellwig
  2022-09-20 14:24 ` Jens Axboe
  6 siblings, 1 reply; 16+ messages in thread
From: David Sterba @ 2022-09-15 13:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Matthew Wilcox, Johannes Weiner, Suren Baghdasaryan,
	Andrew Morton, Chris Mason, Josef Bacik, David Sterba, Gao Xiang,
	Chao Yu, linux-block, linux-btrfs, linux-fsdevel, linux-erofs,
	linux-mm

On Thu, Sep 15, 2022 at 10:41:55AM +0100, Christoph Hellwig wrote:
> 
>  - spell a comment in the weird way preferred by btrfs maintainers

What? A comment is a standalone sentence or a full paragraph, so it's
formatted as such. I hope it's not weird to expect literacy either in
the language of comments or the programming language. The same style can
be found in many other kernel parts so please stop the nags at btrfs.


* Re: improve pagecache PSI annotations v2
  2022-09-15 13:01 ` improve pagecache PSI annotations v2 David Sterba
@ 2022-09-19 15:45   ` Christoph Hellwig
  0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2022-09-19 15:45 UTC (permalink / raw)
  To: David Sterba
  Cc: Christoph Hellwig, Jens Axboe, Matthew Wilcox, Johannes Weiner,
	Suren Baghdasaryan, Andrew Morton, Chris Mason, Josef Bacik,
	David Sterba, Gao Xiang, Chao Yu, linux-block, linux-btrfs,
	linux-fsdevel, linux-erofs, linux-mm

On Thu, Sep 15, 2022 at 03:01:38PM +0200, David Sterba wrote:
> On Thu, Sep 15, 2022 at 10:41:55AM +0100, Christoph Hellwig wrote:
> > 
> >  - spell a comment in the weird way preferred by btrfs maintainers
> 
> What? A comment is a standalone sentence or a full paragraph, so it's
> formatted as such. I hope it's not weird to expect literacy either in
> the language of comments or the programming language. The same style can
> be found in many other kernel parts so please stop the nags at btrfs.

That is not what most of the kernel seems to think.  The usual style is
to have multi-line comments start with a capitalized word and end with a
dot and thus form one or more complete sentences, but single line
comments most of the time do not form complete sentences and thus
do not start with a capitalized word and do not end with a dot.
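
A short sketch of the two comment styles described above.  The function
clamp_to_zero() is a hypothetical example and does not come from the
kernel tree.

```c
/*
 * Multi-line comments start with a capitalized word and end with a dot.
 * They form one or more complete sentences.
 */
static int clamp_to_zero(int v)
{
	/* negative values are not meaningful here */
	return v < 0 ? 0 : v;
}
```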


* Re: improve pagecache PSI annotations v2
  2022-09-15  9:41 improve pagecache PSI annotations v2 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2022-09-15 13:01 ` improve pagecache PSI annotations v2 David Sterba
@ 2022-09-20 14:24 ` Jens Axboe
  2022-09-20 17:21   ` Christoph Hellwig
  6 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2022-09-20 14:24 UTC (permalink / raw)
  To: Suren Baghdasaryan, Christoph Hellwig, Johannes Weiner,
	Matthew Wilcox, Andrew Morton
  Cc: linux-fsdevel, David Sterba, Gao Xiang, Chris Mason, linux-block,
	Josef Bacik, Chao Yu, linux-mm, linux-btrfs, linux-erofs

On Thu, 15 Sep 2022 10:41:55 +0100, Christoph Hellwig wrote:
> currently the VM tries to abuse the block layer submission path for
> the page cache PSI annotations.  This series instead annotates the
> ->read_folio and ->readahead calls in the core VM code, and then
> only deals with the odd direct add_to_page_cache_lru calls manually.
> 
> Changes since v1:
>  - fix a logic error in ra_alloc_folio
>  - drop an unlikely()
>  - spell a comment in the weird way preferred by btrfs maintainers
> 
> [...]

Applied, thanks!

[1/5] mm: add PSI accounting around ->read_folio and ->readahead calls
      commit: 176042404ee6a96ba7e9054e1bda6220360a26ad
[2/5] sched/psi: export psi_memstall_{enter,leave}
      commit: 527eb453bbfe65e5a55a90edfb1f30b477e36b8c
[3/5] btrfs: add manual PSI accounting for compressed reads
      commit: 4088a47e78f95a5fea683cf67e0be006b13831fd
[4/5] erofs: add manual PSI accounting for the compressed address space
      commit: 99486c511f686c799bb4e60b79d79808bb9440f4
[5/5] block: remove PSI accounting from the bio layer
      commit: 118f3663fbc658e9ad6165e129076981c7b685c5

Best regards,
-- 
Jens Axboe




* Re: improve pagecache PSI annotations v2
  2022-09-20 14:24 ` Jens Axboe
@ 2022-09-20 17:21   ` Christoph Hellwig
  0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2022-09-20 17:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Suren Baghdasaryan, Christoph Hellwig, Johannes Weiner,
	Matthew Wilcox, Andrew Morton, linux-fsdevel, David Sterba,
	Gao Xiang, Chris Mason, linux-block, Josef Bacik, Chao Yu,
	linux-mm, linux-btrfs, linux-erofs

On Tue, Sep 20, 2022 at 08:24:58AM -0600, Jens Axboe wrote:
> Applied, thanks!

I think Andrew applied these as well already.  I'll let you two fight
that out as I'm happy as long as it goes in :)



* [REGESSION] systemd-oomd overreacting due to PSI changes for Btrfs (was: Re: [PATCH 3/5] btrfs: add manual PSI accounting for compressed reads)
  2022-09-15  9:41 ` [PATCH 3/5] btrfs: add manual PSI accounting for compressed reads Christoph Hellwig
@ 2022-11-03 10:46   ` Thorsten Leemhuis
  2022-11-03 11:08     ` [REGESSION] systemd-oomd overreacting due to PSI changes for Btrfs #forregzbot Thorsten Leemhuis
                       ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Thorsten Leemhuis @ 2022-11-03 10:46 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Matthew Wilcox, Johannes Weiner,
	Suren Baghdasaryan, Andrew Morton
  Cc: Chris Mason, Josef Bacik, David Sterba, Gao Xiang, Chao Yu,
	linux-block, linux-btrfs, linux-fsdevel, linux-erofs, linux-mm,
	regressions

Hi Christoph!

On 15.09.22 11:41, Christoph Hellwig wrote:
> btrfs compressed reads try to always read the entire compressed chunk,
> even if only a subset is requested.  Currently this is covered by the
> magic PSI accounting underneath submit_bio, but that is about to go
> away. Instead add manual psi_memstall_{enter,leave} annotations.
> 
> Note that for readahead this really should be using readahead_expand,
> but the additional reads are also done for plain ->read_folio where
> readahead_expand can't work, so this overall logic is left as-is for
> now.

It seems this patch makes systemd-oomd overreact on my day-to-day
machine and aggressively kill applications. I'm not the only one who
noticed such behavior with 6.1 pre-releases:
https://bugzilla.redhat.com/show_bug.cgi?id=2133829
https://bugzilla.redhat.com/show_bug.cgi?id=2134971

I think I have a pretty reliable way to trigger the issue that involves
starting the apps that I normally use and a VM that I occasionally use,
which up to now never resulted in such a behaviour.

On master as of today (8e5423e991e8) I can trigger the problem within a
minute or two. But I fail to trigger it with v6.0.6 or when I revert
4088a47e78f9 ("btrfs: add manual PSI accounting for compressed reads").
And yes, I use btrfs with compression for / and /home/.

See [1] for a log msg from systemd-oomd.

Note, I had some trouble with bisecting[2]. This series looked
suspicious, so I removed it completely on top of master and the problem
went away. Then I tried reverting only 4088a47e78f9, which helped, too.
Let me know if you want me to try another combination or need more data.

Ciao, Thorsten


[1] just one example:
```
> 10:52:29 t14s systemd-oomd[1261]: Considered 60 cgroups for killing, top candidates were:
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/packagekit.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 93.66 Avg60: 38.22 Avg300: 9.48 Total: 29s
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 276.9M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 181098
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 178926
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/firewalld.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 34.6M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 13035
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 12854
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/sssd-kcm.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 184us
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 32.9M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 7667
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 7501
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/systemd-journald.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 8ms
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 14.5M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 13020
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 12914
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/libvirtd.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 18.9M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 12983
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 12896
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/geoclue.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 18.0M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 3625
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 3550
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/polkit.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 2ms
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 15.9M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 10664
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 10596
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/NetworkManager.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 3ms
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 6.6M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 2515
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 2492
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/abrt-xorg.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 5.2M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 35154
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 35131
> 10:52:29 t14s systemd-oomd[1261]:         Path: /system.slice/dbus-broker.service
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Pressure Limit: 0.00%
> 10:52:29 t14s systemd-oomd[1261]:                 Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
> 10:52:29 t14s systemd-oomd[1261]:                 Current Memory Usage: 7.5M
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Min: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Memory Low: 0B
> 10:52:29 t14s systemd-oomd[1261]:                 Pgscan: 1183
> 10:52:29 t14s systemd-oomd[1261]:                 Last Pgscan: 1161
> 10:52:29 t14s systemd-oomd[1261]: Killed /system.slice/packagekit.service due to memory pressure for /system.slice being 91.73% > 50.00% for > 20s with reclaim activity
```
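
For readers wondering where the percentages in the oomd log come from: PSI reports pressure as text lines of the form `some avg10=... avg60=... avg300=... total=...`, both in /proc/pressure/memory and in each cgroup's memory.pressure file. The sketch below parses one such line and checks it against a limit. The struct, the helper names, and the choice of avg10 as the compared field are illustrative assumptions and are not taken from systemd-oomd's source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed PSI line; struct and field names are hypothetical. */
struct psi_line {
	char kind[16];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* percent of wall time stalled */
	unsigned long long total;	/* cumulative stall time, microseconds */
};

/* Parse one PSI line. Returns 0 on success, -1 on malformed input. */
static int psi_parse_line(const char *line, struct psi_line *out)
{
	int n;

	assert(line && out);
	n = sscanf(line, "%15s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   out->kind, &out->avg10, &out->avg60, &out->avg300,
		   &out->total);
	if (n != 5)
		return -1;
	/* Only the two line kinds that PSI defines are accepted. */
	if (strcmp(out->kind, "some") && strcmp(out->kind, "full"))
		return -1;
	return 0;
}

/* Nonzero if the 10s average exceeds limit_pct (assumed comparison). */
static int psi_over_limit(const struct psi_line *p, double limit_pct)
{
	return p->avg10 > limit_pct;
}
```

A monitor along these lines would call psi_parse_line() on each line it reads and act only if psi_over_limit() stays true for its configured duration.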

[2]

I have no idea what went wrong. 0a78a376ef3c (the last merge before the
one with this series) was fine, but c6ea70604249 (which, afaics, is
basically the next commit, if I understand things right) was not. I tried
reverting it, which should give me the merge base (v6.0-rc2), but it was
broken, too. I guess I must have done something wrong, though I have no
idea what; I tried again and got the same result. :-/

/me must be missing something and/or not understand git properly...


> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Acked-by: David Sterba <dsterba@suse.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  fs/btrfs/compression.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index e84d22c5c6a83..370788b9b1249 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -15,6 +15,7 @@
>  #include <linux/string.h>
>  #include <linux/backing-dev.h>
>  #include <linux/writeback.h>
> +#include <linux/psi.h>
>  #include <linux/slab.h>
>  #include <linux/sched/mm.h>
>  #include <linux/log2.h>
> @@ -519,7 +520,8 @@ static u64 bio_end_offset(struct bio *bio)
>   */
>  static noinline int add_ra_bio_pages(struct inode *inode,
>  				     u64 compressed_end,
> -				     struct compressed_bio *cb)
> +				     struct compressed_bio *cb,
> +				     unsigned long *pflags)
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	unsigned long end_index;
> @@ -588,6 +590,9 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>  			continue;
>  		}
>  
> +		if (PageWorkingset(page))
> +			psi_memstall_enter(pflags);
> +
>  		ret = set_page_extent_mapped(page);
>  		if (ret < 0) {
>  			unlock_page(page);
> @@ -674,6 +679,8 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  	u64 em_len;
>  	u64 em_start;
>  	struct extent_map *em;
> +	/* Initialize to 1 to make skip psi_memstall_leave unless needed */
> +	unsigned long pflags = 1;
>  	blk_status_t ret;
>  	int ret2;
>  	int i;
> @@ -729,7 +736,7 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  		goto fail;
>  	}
>  
> -	add_ra_bio_pages(inode, em_start + em_len, cb);
> +	add_ra_bio_pages(inode, em_start + em_len, cb, &pflags);
>  
>  	/* include any pages we added in add_ra-bio_pages */
>  	cb->len = bio->bi_iter.bi_size;
> @@ -810,6 +817,9 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  		}
>  	}
>  
> +	if (!pflags)
> +		psi_memstall_leave(&pflags);
> +
>  	if (refcount_dec_and_test(&cb->pending_ios))
>  		finish_compressed_bio_read(cb);
>  	return;


* Re: [REGESSION] systemd-oomd overreacting due to PSI changes for Btrfs #forregzbot
  2022-11-03 10:46   ` [REGESSION] systemd-oomd overreacting due to PSI changes for Btrfs (was: Re: [PATCH 3/5] btrfs: add manual PSI accounting for compressed reads) Thorsten Leemhuis
@ 2022-11-03 11:08     ` Thorsten Leemhuis
  2022-11-03 12:40     ` [REGESSION] systemd-oomd overreacting due to PSI changes for Btrfs (was: Re: [PATCH 3/5] btrfs: add manual PSI accounting for compressed reads) Christoph Hellwig
  2022-11-03 22:20     ` Johannes Weiner
  2 siblings, 0 replies; 16+ messages in thread
From: Thorsten Leemhuis @ 2022-11-03 11:08 UTC (permalink / raw)
  To: regressions
  Cc: linux-block, linux-btrfs, linux-fsdevel, linux-erofs, linux-mm

On 03.11.22 11:46, Thorsten Leemhuis wrote:
> On 15.09.22 11:41, Christoph Hellwig wrote:
>> btrfs compressed reads try to always read the entire compressed chunk,
>> even if only a subset is requested.  Currently this is covered by the
>> magic PSI accounting underneath submit_bio, but that is about to go
>> away. Instead add manual psi_memstall_{enter,leave} annotations.
>>
>> Note that for readahead this really should be using readahead_expand,
>> but the additionals reads are also done for plain ->read_folio where
>> readahead_expand can't work, so this overall logic is left as-is for
>> now.
> 
> It seems this patch makes systemd-oomd overreact on my day-to-day
> machine and aggressively kill applications. I'm not the only one that
> noticed such a behavior with 6.1 pre-releases:
> https://bugzilla.redhat.com/show_bug.cgi?id=2133829
> https://bugzilla.redhat.com/show_bug.cgi?id=2134971

Great, the kernel's regression tracker reports a regression and forgets
to tell his regression tracking bot about it to ensure it's tracked... :-D

#regzbot ^introduced 4088a47e78f9
#regzbot title mm/btrfs: systemd-oomd overreacting due to PSI changes for Btrfs
#regzbot ignore-activity

Ciao, Thorsten


* Re: [REGESSION] systemd-oomd overreacting due to PSI changes for Btrfs (was: Re: [PATCH 3/5] btrfs: add manual PSI accounting for compressed reads)
  2022-11-03 10:46   ` [REGESSION] systemd-oomd overreacting due to PSI changes for Btrfs (was: Re: [PATCH 3/5] btrfs: add manual PSI accounting for compressed reads) Thorsten Leemhuis
  2022-11-03 11:08     ` [REGESSION] systemd-oomd overreacting due to PSI changes for Btrfs #forregzbot Thorsten Leemhuis
@ 2022-11-03 12:40     ` Christoph Hellwig
  2022-11-03 22:20     ` Johannes Weiner
  2 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2022-11-03 12:40 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Christoph Hellwig, Jens Axboe, Matthew Wilcox, Johannes Weiner,
	Suren Baghdasaryan, Andrew Morton, Chris Mason, Josef Bacik,
	David Sterba, Gao Xiang, Chao Yu, linux-block, linux-btrfs,
	linux-fsdevel, linux-erofs, linux-mm, regressions

On Thu, Nov 03, 2022 at 11:46:52AM +0100, Thorsten Leemhuis wrote:
> It seems this patch makes systemd-oomd overreact on my day-to-day
> machine and aggressively kill applications. I'm not the only one that
> noticed such a behavior with 6.1 pre-releases:
> https://bugzilla.redhat.com/show_bug.cgi?id=2133829
> https://bugzilla.redhat.com/show_bug.cgi?id=2134971
> 
> I think I have a pretty reliable way to trigger the issue that involves
> starting the apps that I normally use and a VM that I occasionally use,
> which up to now never resulted in such a behaviour.
> 
> On master as of today (8e5423e991e8) I can trigger the problem within a
> minute or two. But I fail to trigger it with v6.0.6 or when I revert
> 4088a47e78f9 ("btrfs: add manual PSI accounting for compressed reads").
> And yes, I use btrfs with compression for / and /home/.

So, I did not actually want to include this patch, because it is a little
iffy: it includes PSI accounting for reads where btrfs just does
aggressive readaround for compression. But Johannes asked for it to be
added.  I'd be perfectly fine with just reverting it.


* Re: [REGESSION] systemd-oomd overreacting due to PSI changes for Btrfs (was: Re: [PATCH 3/5] btrfs: add manual PSI accounting for compressed reads)
  2022-11-03 10:46   ` [REGESSION] systemd-oomd overreacting due to PSI changes for Btrfs (was: Re: [PATCH 3/5] btrfs: add manual PSI accounting for compressed reads) Thorsten Leemhuis
  2022-11-03 11:08     ` [REGESSION] systemd-oomd overreacting due to PSI changes for Btrfs #forregzbot Thorsten Leemhuis
  2022-11-03 12:40     ` [REGESSION] systemd-oomd overreacting due to PSI changes for Btrfs (was: Re: [PATCH 3/5] btrfs: add manual PSI accounting for compressed reads) Christoph Hellwig
@ 2022-11-03 22:20     ` Johannes Weiner
  2022-11-04  7:32       ` Thorsten Leemhuis
  2 siblings, 1 reply; 16+ messages in thread
From: Johannes Weiner @ 2022-11-03 22:20 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Christoph Hellwig, Jens Axboe, Matthew Wilcox,
	Suren Baghdasaryan, Andrew Morton, Chris Mason, Josef Bacik,
	David Sterba, Gao Xiang, Chao Yu, linux-block, linux-btrfs,
	linux-fsdevel, linux-erofs, linux-mm, regressions

On Thu, Nov 03, 2022 at 11:46:52AM +0100, Thorsten Leemhuis wrote:
> Hi Christoph!
> 
> On 15.09.22 11:41, Christoph Hellwig wrote:
> > btrfs compressed reads try to always read the entire compressed chunk,
> > even if only a subset is requested.  Currently this is covered by the
> > magic PSI accounting underneath submit_bio, but that is about to go
> > away. Instead add manual psi_memstall_{enter,leave} annotations.
> > 
> > Note that for readahead this really should be using readahead_expand,
> > but the additionals reads are also done for plain ->read_folio where
> > readahead_expand can't work, so this overall logic is left as-is for
> > now.
> 
> It seems this patch makes systemd-oomd overreact on my day-to-day
> machine and aggressively kill applications. I'm not the only one that
> noticed such a behavior with 6.1 pre-releases:
> https://bugzilla.redhat.com/show_bug.cgi?id=2133829
> https://bugzilla.redhat.com/show_bug.cgi?id=2134971
> 
> I think I have a pretty reliable way to trigger the issue that involves
> starting the apps that I normally use and a VM that I occasionally use,
> which up to now never resulted in such a behaviour.
> 
> On master as of today (8e5423e991e8) I can trigger the problem within a
> minute or two. But I fail to trigger it with v6.0.6 or when I revert
> 4088a47e78f9 ("btrfs: add manual PSI accounting for compressed reads").
> And yes, I use btrfs with compression for / and /home/.
> 
> See [1] for a log msg from systemd-oomd.
> 
> Note, I had some trouble with bisecting[2]. This series looked
> suspicious, so I removed it completely ontop of master and the problem
> went away. Then I tried reverting only 4088a47e78f9 which helped, too.
> Let me know if you want me to try another combination or need more data.

Oh, I think I see the bug. We can leak pressure state from the bio
submission, which causes the task to permanently drive up pressure.

Can you try this patch?

From 499e5cab7b39fc4c90a0f96e33cdc03274b316fd Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 3 Nov 2022 17:34:31 -0400
Subject: [PATCH] fs: btrfs: fix leaked psi pressure state

When psi annotations were added to btrfs compression reads, the psi
state tracking over add_ra_bio_pages and btrfs_submit_compressed_read
was faulty. The task can remain in a stall state after the read. This
results in incorrectly elevated pressure, which triggers OOM kills.

pflags record the *previous* memstall state when we enter a new
one. The code tried to initialize pflags to 1, and then optimize the
leave call when we either didn't enter a memstall, or were already
inside a nested stall. However, there can be multiple PageWorkingset
pages in the bio, at which point it's that path itself that re-enters
the state and overwrites pflags. This causes us to miss the exit.

Enter the stall only once if needed, then unwind correctly.

Reported-by: Thorsten Leemhuis <linux@leemhuis.info>
Fixes: 4088a47e78f9 ("btrfs: add manual PSI accounting for compressed reads")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/btrfs/compression.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index f1f051ad3147..e6635fe70067 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -512,7 +512,7 @@ static u64 bio_end_offset(struct bio *bio)
 static noinline int add_ra_bio_pages(struct inode *inode,
 				     u64 compressed_end,
 				     struct compressed_bio *cb,
-				     unsigned long *pflags)
+				     int *memstall, unsigned long *pflags)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	unsigned long end_index;
@@ -581,8 +581,10 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 			continue;
 		}
 
-		if (PageWorkingset(page))
+		if (!*memstall && PageWorkingset(page)) {
 			psi_memstall_enter(pflags);
+			*memstall = 1;
+		}
 
 		ret = set_page_extent_mapped(page);
 		if (ret < 0) {
@@ -670,8 +672,8 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	u64 em_len;
 	u64 em_start;
 	struct extent_map *em;
-	/* Initialize to 1 to make skip psi_memstall_leave unless needed */
-	unsigned long pflags = 1;
+	unsigned long pflags;
+	int memstall = 0;
 	blk_status_t ret;
 	int ret2;
 	int i;
@@ -727,7 +729,7 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		goto fail;
 	}
 
-	add_ra_bio_pages(inode, em_start + em_len, cb, &pflags);
+	add_ra_bio_pages(inode, em_start + em_len, cb, &memstall, &pflags);
 
 	/* include any pages we added in add_ra-bio_pages */
 	cb->len = bio->bi_iter.bi_size;
@@ -807,7 +809,7 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		}
 	}
 
-	if (!pflags)
+	if (memstall)
 		psi_memstall_leave(&pflags);
 
 	if (refcount_dec_and_test(&cb->pending_ios))
-- 
2.38.1
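
The pflags failure described in the commit message can be reproduced in a small userspace model. This is a sketch, not kernel code: `struct task_model` and the `model_*` helpers are hypothetical stand-ins that mirror only the save-the-previous-state behaviour of psi_memstall_enter() and psi_memstall_leave() as described above, and the page loop stands in for add_ra_bio_pages().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-task "in memstall" state. */
struct task_model {
	bool in_memstall;
};

/* Save the previous state in *flags; only the outermost enter sets it. */
static void model_memstall_enter(struct task_model *t, unsigned long *flags)
{
	*flags = t->in_memstall;
	if (*flags)
		return;		/* already stalled: nested enter is a no-op */
	t->in_memstall = true;
}

/* Only a caller whose saved state was "not stalled" clears the state. */
static void model_memstall_leave(struct task_model *t, unsigned long *flags)
{
	if (*flags)
		return;		/* nested section: the outer caller leaves */
	t->in_memstall = false;
}

/* Original scheme: pflags == 1 doubles as "never entered" sentinel. */
static bool leaks_with_pflags_sentinel(const bool *workingset, size_t npages)
{
	struct task_model t = { .in_memstall = false };
	unsigned long pflags = 1;

	assert(workingset != NULL || npages == 0);
	for (size_t i = 0; i < npages; i++)
		if (workingset[i])
			model_memstall_enter(&t, &pflags);
	if (!pflags)
		model_memstall_leave(&t, &pflags);
	return t.in_memstall;	/* true: the stall state leaked */
}

/* Fixed scheme: a separate flag records whether we entered, once. */
static bool leaks_with_memstall_flag(const bool *workingset, size_t npages)
{
	struct task_model t = { .in_memstall = false };
	unsigned long pflags = 0;
	int memstall = 0;

	assert(workingset != NULL || npages == 0);
	for (size_t i = 0; i < npages; i++) {
		if (!memstall && workingset[i]) {
			model_memstall_enter(&t, &pflags);
			memstall = 1;
		}
	}
	if (memstall)
		model_memstall_leave(&t, &pflags);
	return t.in_memstall;
}
```

With two workingset pages, the second enter in the first function overwrites pflags with 1 (already stalled), so the `if (!pflags)` check skips the leave and the function reports a leak. The second function returns false for the same input.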


* Re: [REGESSION] systemd-oomd overreacting due to PSI changes for Btrfs (was: Re: [PATCH 3/5] btrfs: add manual PSI accounting for compressed reads)
  2022-11-03 22:20     ` Johannes Weiner
@ 2022-11-04  7:32       ` Thorsten Leemhuis
  2022-11-04 12:36         ` Johannes Weiner
  0 siblings, 1 reply; 16+ messages in thread
From: Thorsten Leemhuis @ 2022-11-04  7:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Christoph Hellwig, Jens Axboe, Matthew Wilcox,
	Suren Baghdasaryan, Andrew Morton, Chris Mason, Josef Bacik,
	David Sterba, Gao Xiang, Chao Yu, linux-block, linux-btrfs,
	linux-fsdevel, linux-erofs, linux-mm, regressions

On 03.11.22 23:20, Johannes Weiner wrote:
> On Thu, Nov 03, 2022 at 11:46:52AM +0100, Thorsten Leemhuis wrote:
>> On 15.09.22 11:41, Christoph Hellwig wrote:
>>> btrfs compressed reads try to always read the entire compressed chunk,
>>> even if only a subset is requested.  Currently this is covered by the
>>> magic PSI accounting underneath submit_bio, but that is about to go
>>> away. Instead add manual psi_memstall_{enter,leave} annotations.
>>>
>>> Note that for readahead this really should be using readahead_expand,
>>> but the additionals reads are also done for plain ->read_folio where
>>> readahead_expand can't work, so this overall logic is left as-is for
>>> now.
>>
>> It seems this patch makes systemd-oomd overreact on my day-to-day
>> machine and aggressively kill applications. I'm not the only one that
>> noticed such a behavior with 6.1 pre-releases:
>> https://bugzilla.redhat.com/show_bug.cgi?id=2133829
>> https://bugzilla.redhat.com/show_bug.cgi?id=2134971
> [...]
>> On master as of today (8e5423e991e8) I can trigger the problem within a
>> minute or two. But I fail to trigger it with v6.0.6 or when I revert
>> 4088a47e78f9 ("btrfs: add manual PSI accounting for compressed reads").
>> And yes, I use btrfs with compression for / and /home/.
> [...]
> 
> Oh, I think I see the bug. We can leak pressure state from the bio
> submission, which causes the task to permanently drive up pressure.

Thx for looking into this.

> Can you try this patch?

It apparently does the trick -- at least my test setup that usually
triggers the bug within a minute or two survived for nearly an hour now, so:

Tested-by: Thorsten Leemhuis <linux@leemhuis.info>

Can you please also add this tag to help future archeologists, as
explained by the kernel docs (for details see
Documentation/process/submitting-patches.rst and
Documentation/process/5.Posting.rst):

Link: https://lore.kernel.org/r/d20a0a85-e415-cf78-27f9-77dd7a94bc8d@leemhuis.info/

It also will make my regression tracking bot see further postings of
this patch and mark the issue as resolved once the patch lands in mainline.

tia and thx again for the patch!

Ciao, Thorsten

> From 499e5cab7b39fc4c90a0f96e33cdc03274b316fd Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Thu, 3 Nov 2022 17:34:31 -0400
> Subject: [PATCH] fs: btrfs: fix leaked psi pressure state
> 
> When psi annotations were added to btrfs compression reads, the psi
> state tracking over add_ra_bio_pages and btrfs_submit_compressed_read
> was faulty. The task can remain in a stall state after the read. This
> results in incorrectly elevated pressure, which triggers OOM kills.
> 
> pflags record the *previous* memstall state when we enter a new
> one. The code tried to initialize pflags to 1, and then optimize the
> leave call when we either didn't enter a memstall, or were already
> inside a nested stall. However, there can be multiple PageWorkingset
> pages in the bio, at which point it's that path itself that re-enters
> the state and overwrites pflags. This causes us to miss the exit.
> 
> Enter the stall only once if needed, then unwind correctly.
> 
> Reported-by: Thorsten Leemhuis <linux@leemhuis.info>
> Fixes: 4088a47e78f9 ("btrfs: add manual PSI accounting for compressed reads")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  fs/btrfs/compression.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index f1f051ad3147..e6635fe70067 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -512,7 +512,7 @@ static u64 bio_end_offset(struct bio *bio)
>  static noinline int add_ra_bio_pages(struct inode *inode,
>  				     u64 compressed_end,
>  				     struct compressed_bio *cb,
> -				     unsigned long *pflags)
> +				     int *memstall, unsigned long *pflags)
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	unsigned long end_index;
> @@ -581,8 +581,10 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>  			continue;
>  		}
>  
> -		if (PageWorkingset(page))
> +		if (!*memstall && PageWorkingset(page)) {
>  			psi_memstall_enter(pflags);
> +			*memstall = 1;
> +		}
>  
>  		ret = set_page_extent_mapped(page);
>  		if (ret < 0) {
> @@ -670,8 +672,8 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  	u64 em_len;
>  	u64 em_start;
>  	struct extent_map *em;
> -	/* Initialize to 1 to make skip psi_memstall_leave unless needed */
> -	unsigned long pflags = 1;
> +	unsigned long pflags;
> +	int memstall = 0;
>  	blk_status_t ret;
>  	int ret2;
>  	int i;
> @@ -727,7 +729,7 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  		goto fail;
>  	}
>  
> -	add_ra_bio_pages(inode, em_start + em_len, cb, &pflags);
> +	add_ra_bio_pages(inode, em_start + em_len, cb, &memstall, &pflags);
>  
>  	/* include any pages we added in add_ra-bio_pages */
>  	cb->len = bio->bi_iter.bi_size;
> @@ -807,7 +809,7 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  		}
>  	}
>  
> -	if (!pflags)
> +	if (memstall)
>  		psi_memstall_leave(&pflags);
>  
>  	if (refcount_dec_and_test(&cb->pending_ios))


* Re: [REGESSION] systemd-oomd overreacting due to PSI changes for Btrfs (was: Re: [PATCH 3/5] btrfs: add manual PSI accounting for compressed reads)
  2022-11-04  7:32       ` Thorsten Leemhuis
@ 2022-11-04 12:36         ` Johannes Weiner
  0 siblings, 0 replies; 16+ messages in thread
From: Johannes Weiner @ 2022-11-04 12:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Thorsten Leemhuis, Christoph Hellwig, Jens Axboe, Matthew Wilcox,
	Suren Baghdasaryan, Chris Mason, Josef Bacik, David Sterba,
	Gao Xiang, Chao Yu, linux-block, linux-btrfs, linux-fsdevel,
	linux-erofs, linux-mm, regressions

On Fri, Nov 04, 2022 at 08:32:22AM +0100, Thorsten Leemhuis wrote:
> On 03.11.22 23:20, Johannes Weiner wrote:
> > Can you try this patch?
> 
> It apparently does the trick -- at least my test setup that usually
> triggers the bug within a minute or two survived for nearly an hour now, so:
> 
> Tested-by: Thorsten Leemhuis <linux@leemhuis.info>

Great, thanks Thorsten.

> Can you please also add this tag to help future archeologists, as
> explained by the kernel docs (for details see
> Documentation/process/submitting-patches.rst and
> Documentation/process/5.Posting.rst):
> 
> Link: https://lore.kernel.org/r/d20a0a85-e415-cf78-27f9-77dd7a94bc8d@leemhuis.info/
> 
> It also will make my regression tracking bot see further postings of
> this patch and mark the issue as resolved once the patch lands in mainline.

Done.

Looks like erofs has the same issue, I included a fix for that.

Andrew, would you mind picking this up and sending it Linusward? Jens
routed the series originally, but I believe he is out today.

Thanks

From b668b261ed18105e91745f3d7676b6bca968476d Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 3 Nov 2022 17:34:31 -0400
Subject: [PATCH] fs: fix leaked psi pressure state

When psi annotations were added to btrfs compression reads, the psi
state tracking over add_ra_bio_pages and btrfs_submit_compressed_read
was faulty. A pressure state, once entered, is never left. This
results in incorrectly elevated pressure, which triggers OOM kills.

pflags record the *previous* memstall state when we enter a new
one. The code tried to initialize pflags to 1, and then optimize the
leave call when we either didn't enter a memstall, or were already
inside a nested stall. However, there can be multiple PageWorkingset
pages in the bio, at which point it's that path itself that enters
repeatedly and overwrites pflags. This causes us to miss the exit.

Enter the stall only once if needed, then unwind correctly.

erofs has the same problem, fix that up too. And move the memstall
exit past submit_bio() to restore submit accounting originally added
by b8e24a9300b0 ("block: annotate refault stalls from IO submission").

Fixes: 4088a47e78f9 ("btrfs: add manual PSI accounting for compressed reads")
Fixes: 99486c511f68 ("erofs: add manual PSI accounting for the compressed address space")
Fixes: 118f3663fbc6 ("block: remove PSI accounting from the bio layer")
Link: https://lore.kernel.org/r/d20a0a85-e415-cf78-27f9-77dd7a94bc8d@leemhuis.info/
Reported-by: Thorsten Leemhuis <linux@leemhuis.info>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Thorsten Leemhuis <linux@leemhuis.info>
---
 fs/btrfs/compression.c | 14 ++++++++------
 fs/erofs/zdata.c       | 18 +++++++++++-------
 2 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index f1f051ad3147..e6635fe70067 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -512,7 +512,7 @@ static u64 bio_end_offset(struct bio *bio)
 static noinline int add_ra_bio_pages(struct inode *inode,
 				     u64 compressed_end,
 				     struct compressed_bio *cb,
-				     unsigned long *pflags)
+				     int *memstall, unsigned long *pflags)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	unsigned long end_index;
@@ -581,8 +581,10 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 			continue;
 		}
 
-		if (PageWorkingset(page))
+		if (!*memstall && PageWorkingset(page)) {
 			psi_memstall_enter(pflags);
+			*memstall = 1;
+		}
 
 		ret = set_page_extent_mapped(page);
 		if (ret < 0) {
@@ -670,8 +672,8 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	u64 em_len;
 	u64 em_start;
 	struct extent_map *em;
-	/* Initialize to 1 to make skip psi_memstall_leave unless needed */
-	unsigned long pflags = 1;
+	unsigned long pflags;
+	int memstall = 0;
 	blk_status_t ret;
 	int ret2;
 	int i;
@@ -727,7 +729,7 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		goto fail;
 	}
 
-	add_ra_bio_pages(inode, em_start + em_len, cb, &pflags);
+	add_ra_bio_pages(inode, em_start + em_len, cb, &memstall, &pflags);
 
 	/* include any pages we added in add_ra-bio_pages */
 	cb->len = bio->bi_iter.bi_size;
@@ -807,7 +809,7 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		}
 	}
 
-	if (!pflags)
+	if (memstall)
 		psi_memstall_leave(&pflags);
 
 	if (refcount_dec_and_test(&cb->pending_ios))
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index c7f24fc7efd5..064a166324a7 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1412,8 +1412,8 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
 	struct block_device *last_bdev;
 	unsigned int nr_bios = 0;
 	struct bio *bio = NULL;
-	/* initialize to 1 to make skip psi_memstall_leave unless needed */
-	unsigned long pflags = 1;
+	unsigned long pflags;
+	int memstall = 0;
 
 	bi_private = jobqueueset_init(sb, q, fgq, force_fg);
 	qtail[JQ_BYPASS] = &q[JQ_BYPASS]->head;
@@ -1463,14 +1463,18 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
 			if (bio && (cur != last_index + 1 ||
 				    last_bdev != mdev.m_bdev)) {
 submit_bio_retry:
-				if (!pflags)
-					psi_memstall_leave(&pflags);
 				submit_bio(bio);
+				if (memstall) {
+					psi_memstall_leave(&pflags);
+					memstall = 0;
+				}
 				bio = NULL;
 			}
 
-			if (unlikely(PageWorkingset(page)))
+			if (unlikely(PageWorkingset(page)) && !memstall) {
 				psi_memstall_enter(&pflags);
+				memstall = 1;
+			}
 
 			if (!bio) {
 				bio = bio_alloc(mdev.m_bdev, BIO_MAX_VECS,
@@ -1500,9 +1504,9 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
 	} while (owned_head != Z_EROFS_PCLUSTER_TAIL);
 
 	if (bio) {
-		if (!pflags)
-			psi_memstall_leave(&pflags);
 		submit_bio(bio);
+		if (memstall)
+			psi_memstall_leave(&pflags);
 	}
 
 	/*
-- 
2.38.1
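
The erofs hunk has one extra twist compared with btrfs: z_erofs_submit_queue() can submit several bios in one call, so the stall is entered at most once per bio and left only after submit_bio(). Below is a rough userspace model of that control flow. It assumes the task is not already inside a stall on entry, and every name in it (`page_model`, `submit_stats`, `submit_queue_model`) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page_model {
	int bio_id;		/* pages with the same id share one bio */
	bool workingset;	/* stand-in for PageWorkingset(page) */
};

struct submit_stats {
	int submitted;		/* number of bios submitted */
	int submitted_stalled;	/* bios submitted while marked stalled */
	bool leaked;		/* stall state still set on return */
};

static struct submit_stats submit_queue_model(const struct page_model *pages,
					       size_t n)
{
	struct submit_stats st = { 0, 0, false };
	bool in_memstall = false;	/* models the task's stall state */
	bool have_bio = false;
	int cur_bio = 0;
	int memstall = 0;

	assert(pages != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (have_bio && pages[i].bio_id != cur_bio) {
			/* Submit first, then leave, so submission is accounted. */
			st.submitted++;
			st.submitted_stalled += in_memstall;
			if (memstall) {
				in_memstall = false;
				memstall = 0;
			}
			have_bio = false;
		}
		/* Enter at most once per bio. */
		if (pages[i].workingset && !memstall) {
			in_memstall = true;
			memstall = 1;
		}
		if (!have_bio) {
			have_bio = true;
			cur_bio = pages[i].bio_id;
		}
	}
	if (have_bio) {
		st.submitted++;
		st.submitted_stalled += in_memstall;
		if (memstall)
			in_memstall = false;
	}
	st.leaked = in_memstall;
	return st;
}
```

Resetting `memstall` after each submit keeps enter and leave paired per bio, which is why the model never returns with the stall state still set.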



Thread overview: 16+ messages
2022-09-15  9:41 improve pagecache PSI annotations v2 Christoph Hellwig
2022-09-15  9:41 ` [PATCH 1/5] mm: add PSI accounting around ->read_folio and ->readahead calls Christoph Hellwig
2022-09-15  9:41 ` [PATCH 2/5] sched/psi: export psi_memstall_{enter,leave} Christoph Hellwig
2022-09-15  9:41 ` [PATCH 3/5] btrfs: add manual PSI accounting for compressed reads Christoph Hellwig
2022-11-03 10:46   ` [REGESSION] systemd-oomd overreacting due to PSI changes for Btrfs (was: Re: [PATCH 3/5] btrfs: add manual PSI accounting for compressed reads) Thorsten Leemhuis
2022-11-03 11:08     ` [REGESSION] systemd-oomd overreacting due to PSI changes for Btrfs #forregzbot Thorsten Leemhuis
2022-11-03 12:40     ` [REGESSION] systemd-oomd overreacting due to PSI changes for Btrfs (was: Re: [PATCH 3/5] btrfs: add manual PSI accounting for compressed reads) Christoph Hellwig
2022-11-03 22:20     ` Johannes Weiner
2022-11-04  7:32       ` Thorsten Leemhuis
2022-11-04 12:36         ` Johannes Weiner
2022-09-15  9:41 ` [PATCH 4/5] erofs: add manual PSI accounting for the compressed address space Christoph Hellwig
2022-09-15  9:42 ` [PATCH 5/5] block: remove PSI accounting from the bio layer Christoph Hellwig
2022-09-15 13:01 ` improve pagecache PSI annotations v2 David Sterba
2022-09-19 15:45   ` Christoph Hellwig
2022-09-20 14:24 ` Jens Axboe
2022-09-20 17:21   ` Christoph Hellwig
