* [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/
@ 2021-09-20  8:54 Mel Gorman
  2021-09-20  8:54 ` [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested Mel Gorman
                   ` (6 more replies)
  0 siblings, 7 replies; 33+ messages in thread
From: Mel Gorman @ 2021-09-20  8:54 UTC (permalink / raw)
  To: Linux-MM
  Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML, Mel Gorman

Cc list similar to "congestion_wait() and GFP_NOFAIL" as they're loosely
related.

This is a prototype series that removes all calls to congestion_wait
in mm/ and deletes wait_iff_congested. It's not a clever
implementation but congestion_wait has been broken for a long time
(https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/).
Even if it worked, it was never a great idea. Excessive dirty/writeback
pages at the tail of the LRU are one reason that reclaim may be slow,
but reclaim can also fail for other reasons (elevated references, too
many pages isolated, excessive LRU contention etc).

This series replaces the reclaim conditions with event-driven ones:

o If there are too many dirty/writeback pages, sleep until a timeout
  or enough pages get cleaned
o If too many pages are isolated, sleep until enough isolated pages
  are either reclaimed or put back on the LRU
o If no progress is being made, let direct reclaim tasks sleep until
  another task makes progress
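
As a rough illustration of the three conditions above, here is a minimal
userspace model in which each condition has its own set of waiters, so an
event for one condition wakes only the tasks sleeping on that condition.
This is a sketch, not kernel code: the names (throttle_reason, node_model,
model_throttle, model_wake_all) are made up for the example, and the
timeout path is not modelled at all.

```c
/* Userspace model only: one waiter counter per throttle reason. */
enum throttle_reason {
	THROTTLE_WRITEBACK,	/* too many dirty/writeback pages */
	THROTTLE_ISOLATED,	/* too many isolated pages */
	THROTTLE_NOPROGRESS,	/* reclaim is not making progress */
	NR_THROTTLE
};

/* Hypothetical stand-in for per-node state. */
struct node_model {
	int waiters[NR_THROTTLE];
};

/* A task starts sleeping for the given reason. */
static void model_throttle(struct node_model *node, enum throttle_reason reason)
{
	node->waiters[reason]++;
}

/*
 * The event for one reason occurred: wake only the tasks waiting on
 * that reason and return how many were woken.
 */
static int model_wake_all(struct node_model *node, enum throttle_reason reason)
{
	int woken = node->waiters[reason];

	node->waiters[reason] = 0;
	return woken;
}
```

With two tasks throttled on writeback and one on isolation, an isolation
event wakes one task and leaves the two writeback waiters asleep.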

This has only been lightly tested and the testing was useless as the
relevant code was not executed. The workload configurations I had that
used to trigger these corner cases no longer work (yey?) and I'll need
to implement a new synthetic workload. If someone is aware of a realistic
workload that forces reclaim activity to the point where reclaim stalls
then kindly share the details.

-- 
2.31.1

Mel Gorman (5):
  mm/vmscan: Throttle reclaim until some writeback completes if
    congested
  mm/vmscan: Throttle reclaim and compaction when too many pages are
    isolated
  mm/vmscan: Throttle reclaim when no progress is being made
  mm/writeback: Throttle based on page writeback instead of congestion
  mm/page_alloc: Remove the throttling logic from the page allocator

 include/linux/backing-dev.h      |   1 -
 include/linux/mmzone.h           |  12 ++++
 include/trace/events/vmscan.h    |  38 +++++++++++
 include/trace/events/writeback.h |   7 --
 mm/backing-dev.c                 |  48 --------------
 mm/compaction.c                  |   2 +-
 mm/filemap.c                     |   1 +
 mm/internal.h                    |  11 ++++
 mm/memcontrol.c                  |  10 +--
 mm/page-writeback.c              |  11 +++-
 mm/page_alloc.c                  |  26 ++------
 mm/vmscan.c                      | 110 ++++++++++++++++++++++++++++---
 mm/vmstat.c                      |   1 +
 13 files changed, 180 insertions(+), 98 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested
  2021-09-20  8:54 [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/ Mel Gorman
@ 2021-09-20  8:54 ` Mel Gorman
  2021-09-20 23:19   ` NeilBrown
                     ` (2 more replies)
  2021-09-20  8:54 ` [PATCH 2/5] mm/vmscan: Throttle reclaim and compaction when too many pages are isolated Mel Gorman
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 33+ messages in thread
From: Mel Gorman @ 2021-09-20  8:54 UTC (permalink / raw)
  To: Linux-MM
  Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML, Mel Gorman

Page reclaim throttles on wait_iff_congested under the following conditions:

o kswapd is encountering pages under writeback and marked for immediate
  reclaim implying that pages are cycling through the LRU faster than
  pages can be cleaned.

o Direct reclaim will stall if all dirty pages are backed by congested
  inodes.

wait_iff_congested is almost completely broken with few exceptions. This
patch adds a new node-based waitqueue and tracks the number of throttled
tasks and pages written back since throttling started. If enough pages
belonging to the node are written back then the throttled tasks will wake
early. If not, the throttled tasks sleep until the timeout expires.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
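
Illustration only, not part of the patch: a small userspace model of the
early-wake accounting described above. The struct and function names are
hypothetical, and MODEL_SWAP_CLUSTER_MAX is assumed to be 32 to mirror
the kernel's SWAP_CLUSTER_MAX. As in the patch, each new throttler
overwrites the start snapshot, and completed writeback is only counted
while at least one task is throttled.

```c
#include <stdbool.h>

/* Assumed value for the model; mirrors the kernel's SWAP_CLUSTER_MAX. */
#define MODEL_SWAP_CLUSTER_MAX 32UL

/* Hypothetical stand-in for the per-node fields added by the patch. */
struct throttle_model {
	unsigned long nr_throttled_written;	/* running counter */
	unsigned long nr_reclaim_start;		/* snapshot when throttling began */
	int nr_reclaim_throttled;		/* tasks currently throttled */
};

/* A task begins throttling: bump the count and snapshot the counter. */
static void model_start_throttle(struct throttle_model *m)
{
	m->nr_reclaim_throttled++;
	m->nr_reclaim_start = m->nr_throttled_written;
}

/*
 * One page finished writeback. Returns true when enough pages have been
 * written since the snapshot that the throttled tasks should be woken.
 */
static bool model_acct_writeback(struct throttle_model *m)
{
	unsigned long nr_written;

	/* Nothing is throttled, so nothing is accounted. */
	if (!m->nr_reclaim_throttled)
		return false;

	m->nr_throttled_written++;
	nr_written = m->nr_throttled_written - m->nr_reclaim_start;

	return nr_written >
		MODEL_SWAP_CLUSTER_MAX * (unsigned long)m->nr_reclaim_throttled;
}
```

With one throttled task, the first 32 completions do not trigger a wakeup
and the 33rd does, because the comparison is strictly greater-than.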
 include/linux/backing-dev.h      |  1 -
 include/linux/mmzone.h           |  9 +++++
 include/trace/events/vmscan.h    | 34 +++++++++++++++++++
 include/trace/events/writeback.h |  7 ----
 mm/backing-dev.c                 | 48 --------------------------
 mm/filemap.c                     |  1 +
 mm/internal.h                    |  9 +++++
 mm/page_alloc.c                  |  1 +
 mm/vmscan.c                      | 58 +++++++++++++++++++++++++++-----
 mm/vmstat.c                      |  1 +
 10 files changed, 105 insertions(+), 64 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index ac7f231b8825..9fb1f0ae273c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -154,7 +154,6 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
 }
 
 long congestion_wait(int sync, long timeout);
-long wait_iff_congested(int sync, long timeout);
 
 static inline bool mapping_can_writeback(struct address_space *mapping)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6a1d79d84675..ef0a63ebd21d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -199,6 +199,7 @@ enum node_stat_item {
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
+	NR_THROTTLED_WRITTEN,	/* NR_WRITTEN while reclaim throttled */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
 	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
 	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
@@ -272,6 +273,10 @@ enum lru_list {
 	NR_LRU_LISTS
 };
 
+enum vmscan_throttle_state {
+	VMSCAN_THROTTLE_WRITEBACK,
+};
+
 #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
 
 #define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
@@ -841,6 +846,10 @@ typedef struct pglist_data {
 	int node_id;
 	wait_queue_head_t kswapd_wait;
 	wait_queue_head_t pfmemalloc_wait;
+	wait_queue_head_t reclaim_wait;	/* wq for throttling reclaim */
+	atomic_t nr_reclaim_throttled;	/* nr of throtted tasks */
+	unsigned long nr_reclaim_start;	/* nr pages written while throttled
+					 * when throttling started. */
 	struct task_struct *kswapd;	/* Protected by
 					   mem_hotplug_begin/end() */
 	int kswapd_order;
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 88faf2400ec2..c317f9fe0d17 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -27,6 +27,14 @@
 		{RECLAIM_WB_ASYNC,	"RECLAIM_WB_ASYNC"}	\
 		) : "RECLAIM_WB_NONE"
 
+#define _VMSCAN_THROTTLE_WRITEBACK	(1 << VMSCAN_THROTTLE_WRITEBACK)
+
+#define show_throttle_flags(flags)						\
+	(flags) ? __print_flags(flags, "|",					\
+		{_VMSCAN_THROTTLE_WRITEBACK,	"VMSCAN_THROTTLE_WRITEBACK"}	\
+		) : "VMSCAN_THROTTLE_NONE"
+
+
 #define trace_reclaim_flags(file) ( \
 	(file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
 	(RECLAIM_WB_ASYNC) \
@@ -454,6 +462,32 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_node_reclaim_end,
 	TP_ARGS(nr_reclaimed)
 );
 
+TRACE_EVENT(mm_vmscan_throttled,
+
+	TP_PROTO(int nid, int usec_timeout, int usec_delayed, int reason),
+
+	TP_ARGS(nid, usec_timeout, usec_delayed, reason),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(int, usec_timeout)
+		__field(int, usec_delayed)
+		__field(int, reason)
+	),
+
+	TP_fast_assign(
+		__entry->nid = nid;
+		__entry->usec_timeout = usec_timeout;
+		__entry->usec_delayed = usec_delayed;
+		__entry->reason = 1U << reason;
+	),
+
+	TP_printk("nid=%d usec_timeout=%d usec_delayed=%d reason=%s",
+		__entry->nid,
+		__entry->usec_timeout,
+		__entry->usec_delayed,
+		show_throttle_flags(__entry->reason))
+);
 #endif /* _TRACE_VMSCAN_H */
 
 /* This part must be outside protection */
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 840d1ba84cf5..3bc759b81897 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -763,13 +763,6 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
 	TP_ARGS(usec_timeout, usec_delayed)
 );
 
-DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
-
-	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
-
-	TP_ARGS(usec_timeout, usec_delayed)
-);
-
 DECLARE_EVENT_CLASS(writeback_single_inode_template,
 
 	TP_PROTO(struct inode *inode,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 4a9d4e27d0d9..0ea1a105eae5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1041,51 +1041,3 @@ long congestion_wait(int sync, long timeout)
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait);
-
-/**
- * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
- * @sync: SYNC or ASYNC IO
- * @timeout: timeout in jiffies
- *
- * In the event of a congested backing_dev (any backing_dev) this waits
- * for up to @timeout jiffies for either a BDI to exit congestion of the
- * given @sync queue or a write to complete.
- *
- * The return value is 0 if the sleep is for the full timeout. Otherwise,
- * it is the number of jiffies that were still remaining when the function
- * returned. return_value == timeout implies the function did not sleep.
- */
-long wait_iff_congested(int sync, long timeout)
-{
-	long ret;
-	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
-
-	/*
-	 * If there is no congestion, yield if necessary instead
-	 * of sleeping on the congestion queue
-	 */
-	if (atomic_read(&nr_wb_congested[sync]) == 0) {
-		cond_resched();
-
-		/* In case we scheduled, work out time remaining */
-		ret = timeout - (jiffies - start);
-		if (ret < 0)
-			ret = 0;
-
-		goto out;
-	}
-
-	/* Sleep until uncongested or a write happens */
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
-
-out:
-	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
-					jiffies_to_usecs(jiffies - start));
-
-	return ret;
-}
-EXPORT_SYMBOL(wait_iff_congested);
diff --git a/mm/filemap.c b/mm/filemap.c
index dae481293b5d..59187787fbfc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1605,6 +1605,7 @@ void end_page_writeback(struct page *page)
 
 	smp_mb__after_atomic();
 	wake_up_page(page, PG_writeback);
+	acct_reclaim_writeback(page);
 	put_page(page);
 }
 EXPORT_SYMBOL(end_page_writeback);
diff --git a/mm/internal.h b/mm/internal.h
index cf3cb933eba3..e25b3686bfab 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -34,6 +34,15 @@
 
 void page_writeback_init(void);
 
+void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page);
+static inline void acct_reclaim_writeback(struct page *page)
+{
+	pg_data_t *pgdat = page_pgdat(page);
+
+	if (atomic_read(&pgdat->nr_reclaim_throttled))
+		__acct_reclaim_writeback(pgdat, page);
+}
+
 vm_fault_t do_swap_page(struct vm_fault *vmf);
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c274cf..d849ddfc1e51 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7396,6 +7396,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
+	init_waitqueue_head(&pgdat->reclaim_wait);
 
 	pgdat_page_ext_init(pgdat);
 	lruvec_init(&pgdat->__lruvec);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 74296c2d1fed..b58ea0b13286 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1006,6 +1006,47 @@ static void handle_write_error(struct address_space *mapping,
 	unlock_page(page);
 }
 
+static void
+reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
+							long timeout)
+{
+	wait_queue_head_t *wqh = &pgdat->reclaim_wait;
+	unsigned long start = jiffies;
+	long ret;
+	DEFINE_WAIT(wait);
+
+	atomic_inc(&pgdat->nr_reclaim_throttled);
+	WRITE_ONCE(pgdat->nr_reclaim_start,
+		 node_page_state(pgdat, NR_THROTTLED_WRITTEN));
+
+	prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
+	ret = schedule_timeout(timeout);
+	finish_wait(wqh, &wait);
+	atomic_dec(&pgdat->nr_reclaim_throttled);
+
+	trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
+				jiffies_to_usecs(jiffies - start),
+				reason);
+}
+
+/*
+ * Account for pages written if tasks are throttled waiting on dirty
+ * pages to clean. If enough pages have been cleaned since throttling
+ * started then wakeup the throttled tasks.
+ */
+void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page)
+{
+	unsigned long nr_written;
+	int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled);
+
+	__inc_node_page_state(page, NR_THROTTLED_WRITTEN);
+	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
+		READ_ONCE(pgdat->nr_reclaim_start);
+
+	if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
+		wake_up_interruptible_all(&pgdat->reclaim_wait);
+}
+
 /* possible outcome of pageout() */
 typedef enum {
 	/* failed to write page out, page is locked */
@@ -1412,9 +1453,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 
 		/*
 		 * The number of dirty pages determines if a node is marked
-		 * reclaim_congested which affects wait_iff_congested. kswapd
-		 * will stall and start writing pages if the tail of the LRU
-		 * is all dirty unqueued pages.
+		 * reclaim_congested. kswapd will stall and start writing
+		 * pages if the tail of the LRU is all dirty unqueued pages.
 		 */
 		page_check_dirty_writeback(page, &dirty, &writeback);
 		if (dirty || writeback)
@@ -3180,19 +3220,20 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		 * If kswapd scans pages marked for immediate
 		 * reclaim and under writeback (nr_immediate), it
 		 * implies that pages are cycling through the LRU
-		 * faster than they are written so also forcibly stall.
+		 * faster than they are written so forcibly stall
+		 * until some pages complete writeback.
 		 */
 		if (sc->nr.immediate)
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
 	}
 
 	/*
 	 * Tag a node/memcg as congested if all the dirty pages
 	 * scanned were backed by a congested BDI and
-	 * wait_iff_congested will stall.
+	 * non-kswapd tasks will stall on reclaim_throttle.
 	 *
 	 * Legacy memcg will stall in page writeback so avoid forcibly
-	 * stalling in wait_iff_congested().
+	 * stalling in reclaim_throttle().
 	 */
 	if ((current_is_kswapd() ||
 	     (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
@@ -3208,7 +3249,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	if (!current_is_kswapd() && current_may_throttle() &&
 	    !sc->hibernation_mode &&
 	    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
-		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
 
 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 				    sc))
@@ -4286,6 +4327,7 @@ static int kswapd(void *p)
 
 	WRITE_ONCE(pgdat->kswapd_order, 0);
 	WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES);
+	atomic_set(&pgdat->nr_reclaim_throttled, 0);
 	for ( ; ; ) {
 		bool ret;
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8ce2620344b2..9b2bc9d61d4b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1225,6 +1225,7 @@ const char * const vmstat_text[] = {
 	"nr_vmscan_immediate_reclaim",
 	"nr_dirtied",
 	"nr_written",
+	"nr_throttled_written",
 	"nr_kernel_misc_reclaimable",
 	"nr_foll_pin_acquired",
 	"nr_foll_pin_released",
-- 
2.31.1



* [PATCH 2/5] mm/vmscan: Throttle reclaim and compaction when too many pages are isolated
  2021-09-20  8:54 [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/ Mel Gorman
  2021-09-20  8:54 ` [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested Mel Gorman
@ 2021-09-20  8:54 ` Mel Gorman
  2021-09-20 23:27   ` NeilBrown
  2021-09-21 18:45     ` Yang Shi
  2021-09-20  8:54 ` [PATCH 3/5] mm/vmscan: Throttle reclaim when no progress is being made Mel Gorman
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 33+ messages in thread
From: Mel Gorman @ 2021-09-20  8:54 UTC (permalink / raw)
  To: Linux-MM
  Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML, Mel Gorman

Page reclaim throttles on congestion if too many parallel reclaim instances
have isolated too many pages. This makes no sense as excessive parallelisation
has nothing to do with writeback or congestion.

This patch creates an additional waitqueue to sleep on when too many
pages are isolated. The throttled tasks are woken when the number
of isolated pages is reduced or a timeout occurs. There may be
some false positive wakeups for GFP_NOIO/GFP_NOFS callers but
the tasks will throttle again if necessary.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
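
Illustration only, not part of the patch: a userspace model of the
too_many_isolated() comparison, to show where the false positive wakeups
come from. The flag values and the function name are hypothetical. A
caller with both IO and FS allowed uses one eighth of the inactive count
as its limit, while a NOIO/NOFS caller compares against the full count,
so the latter can conclude "not too many" and issue a wakeup while the
former would still consider the node over its limit.

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration; not the kernel's values. */
#define MODEL_GFP_IO 0x1u
#define MODEL_GFP_FS 0x2u

/*
 * Returns true if the caller should throttle. Callers that may do both
 * IO and FS operations get a limit of inactive / 8; restricted callers
 * keep the full inactive count as their limit.
 */
static bool model_too_many_isolated(unsigned long inactive,
				    unsigned long isolated,
				    unsigned int gfp_mask)
{
	unsigned int full = MODEL_GFP_IO | MODEL_GFP_FS;

	if ((gfp_mask & full) == full)
		inactive >>= 3;

	return isolated > inactive;
}
```

For example, with 800 inactive and 150 isolated pages an unrestricted
caller sees 150 > 100 and throttles, while a NOFS caller sees 150 <= 800
and would wake the waiters, which then re-check and throttle again.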
 include/linux/mmzone.h        |  4 +++-
 include/trace/events/vmscan.h |  4 +++-
 mm/compaction.c               |  2 +-
 mm/internal.h                 |  2 ++
 mm/page_alloc.c               |  6 +++++-
 mm/vmscan.c                   | 22 ++++++++++++++++------
 6 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ef0a63ebd21d..ca65d6a64bdd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -275,6 +275,8 @@ enum lru_list {
 
 enum vmscan_throttle_state {
 	VMSCAN_THROTTLE_WRITEBACK,
+	VMSCAN_THROTTLE_ISOLATED,
+	NR_VMSCAN_THROTTLE,
 };
 
 #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
@@ -846,7 +848,7 @@ typedef struct pglist_data {
 	int node_id;
 	wait_queue_head_t kswapd_wait;
 	wait_queue_head_t pfmemalloc_wait;
-	wait_queue_head_t reclaim_wait;	/* wq for throttling reclaim */
+	wait_queue_head_t reclaim_wait[NR_VMSCAN_THROTTLE];
 	atomic_t nr_reclaim_throttled;	/* nr of throtted tasks */
 	unsigned long nr_reclaim_start;	/* nr pages written while throttled
 					 * when throttling started. */
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index c317f9fe0d17..d4905bd9e9c4 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -28,10 +28,12 @@
 		) : "RECLAIM_WB_NONE"
 
 #define _VMSCAN_THROTTLE_WRITEBACK	(1 << VMSCAN_THROTTLE_WRITEBACK)
+#define _VMSCAN_THROTTLE_ISOLATED	(1 << VMSCAN_THROTTLE_ISOLATED)
 
 #define show_throttle_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",					\
-		{_VMSCAN_THROTTLE_WRITEBACK,	"VMSCAN_THROTTLE_WRITEBACK"}	\
+		{_VMSCAN_THROTTLE_WRITEBACK,	"VMSCAN_THROTTLE_WRITEBACK"},	\
+		{_VMSCAN_THROTTLE_ISOLATED,	"VMSCAN_THROTTLE_ISOLATED"}	\
 		) : "VMSCAN_THROTTLE_NONE"
 
 
diff --git a/mm/compaction.c b/mm/compaction.c
index bfc93da1c2c7..221c9c10ad7e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -822,7 +822,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		if (cc->mode == MIGRATE_ASYNC)
 			return -EAGAIN;
 
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
 
 		if (fatal_signal_pending(current))
 			return -EINTR;
diff --git a/mm/internal.h b/mm/internal.h
index e25b3686bfab..e6cd22fb5a43 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -118,6 +118,8 @@ extern unsigned long highest_memmap_pfn;
  */
 extern int isolate_lru_page(struct page *page);
 extern void putback_lru_page(struct page *page);
+extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
+								long timeout);
 
 /*
  * in mm/rmap.c:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d849ddfc1e51..78e538067651 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7389,6 +7389,8 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
 
 static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 {
+	int i;
+
 	pgdat_resize_init(pgdat);
 
 	pgdat_init_split_queue(pgdat);
@@ -7396,7 +7398,9 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
-	init_waitqueue_head(&pgdat->reclaim_wait);
+
+	for (i = 0; i < NR_VMSCAN_THROTTLE; i++)
+		init_waitqueue_head(&pgdat->reclaim_wait[i]);
 
 	pgdat_page_ext_init(pgdat);
 	lruvec_init(&pgdat->__lruvec);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b58ea0b13286..eb81dcac15b2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1006,11 +1006,10 @@ static void handle_write_error(struct address_space *mapping,
 	unlock_page(page);
 }
 
-static void
-reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
+void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
 							long timeout)
 {
-	wait_queue_head_t *wqh = &pgdat->reclaim_wait;
+	wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
 	unsigned long start = jiffies;
 	long ret;
 	DEFINE_WAIT(wait);
@@ -1044,7 +1043,7 @@ void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page)
 		READ_ONCE(pgdat->nr_reclaim_start);
 
 	if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
-		wake_up_interruptible_all(&pgdat->reclaim_wait);
+		wake_up_interruptible_all(&pgdat->reclaim_wait[VMSCAN_THROTTLE_WRITEBACK]);
 }
 
 /* possible outcome of pageout() */
@@ -2159,6 +2158,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 		struct scan_control *sc)
 {
 	unsigned long inactive, isolated;
+	bool too_many;
 
 	if (current_is_kswapd())
 		return 0;
@@ -2182,6 +2182,17 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
 		inactive >>= 3;
 
+	too_many = isolated > inactive;
+
+	/* Wake up tasks throttled due to too_many_isolated. */
+	if (!too_many) {
+		wait_queue_head_t *wqh;
+
+		wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_ISOLATED];
+		if (waitqueue_active(wqh))
+			wake_up_interruptible_all(wqh);
+	}
+
 	return isolated > inactive;
 }
 
@@ -2291,8 +2302,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 			return 0;
 
 		/* wait a bit for the reclaimer. */
-		msleep(100);
-		stalled = true;
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
-- 
2.31.1



* [PATCH 3/5] mm/vmscan: Throttle reclaim when no progress is being made
  2021-09-20  8:54 [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/ Mel Gorman
  2021-09-20  8:54 ` [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested Mel Gorman
  2021-09-20  8:54 ` [PATCH 2/5] mm/vmscan: Throttle reclaim and compaction when too many pages are isolated Mel Gorman
@ 2021-09-20  8:54 ` Mel Gorman
  2021-09-20 23:31   ` NeilBrown
  2021-09-20  8:54 ` [PATCH 4/5] mm/writeback: Throttle based on page writeback instead of congestion Mel Gorman
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2021-09-20  8:54 UTC (permalink / raw)
  To: Linux-MM
  Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML, Mel Gorman

Memcg reclaim throttles on congestion if no reclaim progress is made.
This makes little sense as the lack of progress might be due to writeback
or a host of other factors.

For !memcg reclaim, it's messy. Direct reclaim is primarily throttled
in the page allocator if it is failing to make progress. Kswapd
throttles if too many pages are under writeback and marked for
immediate reclaim.

This patch explicitly throttles if reclaim is failing to make progress.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
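
Illustration only, not part of the patch: the decision taken by
consider_reclaim_throttle() written as a pure function in userspace.
The enum and function names are hypothetical, and MODEL_DEF_PRIORITY is
assumed to be 12 to mirror the kernel's DEF_PRIORITY. Note that a lower
numeric priority means reclaim is scanning more aggressively. The model
reports "wake" whenever progress was made, whereas the patch only issues
the wakeup if the waitqueue is active.

```c
#include <stdbool.h>

/* Assumed value for the model; mirrors the kernel's DEF_PRIORITY. */
#define MODEL_DEF_PRIORITY 12

enum throttle_action {
	ACTION_WAKE_WAITERS,	/* progress was made, wake NOPROGRESS waiters */
	ACTION_NONE,		/* neither wake nor throttle */
	ACTION_THROTTLE		/* sleep on the NOPROGRESS queue */
};

static enum throttle_action
model_consider_throttle(unsigned long nr_reclaimed, bool is_kswapd, int priority)
{
	/* Any progress means waiters can retry. */
	if (nr_reclaimed)
		return ACTION_WAKE_WAITERS;

	/* kswapd throttles on writeback elsewhere, never on NOPROGRESS. */
	if (is_kswapd)
		return ACTION_NONE;

	/* Only throttle once the scan priority has dropped below 10. */
	if (priority < MODEL_DEF_PRIORITY - 2)
		return ACTION_THROTTLE;

	return ACTION_NONE;
}
```

A direct reclaimer that reclaimed nothing at priority 9 throttles, while
the same situation at priority 10 or higher does not.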
 include/linux/mmzone.h        |  1 +
 include/trace/events/vmscan.h |  4 +++-
 mm/memcontrol.c               | 10 +--------
 mm/vmscan.c                   | 38 +++++++++++++++++++++++++++++++++++
 4 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ca65d6a64bdd..7c08cc91d526 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -276,6 +276,7 @@ enum lru_list {
 enum vmscan_throttle_state {
 	VMSCAN_THROTTLE_WRITEBACK,
 	VMSCAN_THROTTLE_ISOLATED,
+	VMSCAN_THROTTLE_NOPROGRESS,
 	NR_VMSCAN_THROTTLE,
 };
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index d4905bd9e9c4..f25a6149d3ba 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -29,11 +29,13 @@
 
 #define _VMSCAN_THROTTLE_WRITEBACK	(1 << VMSCAN_THROTTLE_WRITEBACK)
 #define _VMSCAN_THROTTLE_ISOLATED	(1 << VMSCAN_THROTTLE_ISOLATED)
+#define _VMSCAN_THROTTLE_NOPROGRESS	(1 << VMSCAN_THROTTLE_NOPROGRESS)
 
 #define show_throttle_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",					\
 		{_VMSCAN_THROTTLE_WRITEBACK,	"VMSCAN_THROTTLE_WRITEBACK"},	\
-		{_VMSCAN_THROTTLE_ISOLATED,	"VMSCAN_THROTTLE_ISOLATED"}	\
+		{_VMSCAN_THROTTLE_ISOLATED,	"VMSCAN_THROTTLE_ISOLATED"},	\
+		{_VMSCAN_THROTTLE_NOPROGRESS,	"VMSCAN_THROTTLE_NOPROGRESS"}	\
 		) : "VMSCAN_THROTTLE_NONE"
 
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b762215d73eb..8479919a633c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3470,19 +3470,11 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 
 	/* try to free all pages in this cgroup */
 	while (nr_retries && page_counter_read(&memcg->memory)) {
-		int progress;
-
 		if (signal_pending(current))
 			return -EINTR;
 
-		progress = try_to_free_mem_cgroup_pages(memcg, 1,
-							GFP_KERNEL, true);
-		if (!progress) {
+		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true))
 			nr_retries--;
-			/* maybe some writeback is necessary */
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
-		}
-
 	}
 
 	return 0;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eb81dcac15b2..18b9826953a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3307,6 +3307,33 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 	return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
 }
 
+static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
+{
+	/* If reclaim is making progress, wake any throttled tasks. */
+	if (sc->nr_reclaimed) {
+		wait_queue_head_t *wqh;
+
+		wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_NOPROGRESS];
+		if (waitqueue_active(wqh))
+			wake_up_interruptible_all(wqh);
+
+		return;
+	}
+
+	/*
+	 * Do not throttle kswapd on NOPROGRESS as it will throttle on
+	 * VMSCAN_THROTTLE_WRITEBACK if there are too many pages under
+	 * writeback and marked for immediate reclaim at the tail of
+	 * the LRU.
+	 */
+	if (current_is_kswapd())
+		return;
+
+	/* Throttle if making no progress at high priorities. */
+	if (sc->priority < DEF_PRIORITY - 2)
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
+}
+
 /*
  * This is the direct reclaim path, for page-allocating processes.  We only
  * try to reclaim pages from zones which will satisfy the caller's allocation
@@ -3391,6 +3418,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			continue;
 		last_pgdat = zone->zone_pgdat;
 		shrink_node(zone->zone_pgdat, sc);
+		consider_reclaim_throttle(zone->zone_pgdat, sc);
 	}
 
 	/*
@@ -3765,6 +3793,16 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 	set_task_reclaim_state(current, NULL);
 
+	if (!nr_reclaimed) {
+		struct zoneref *z;
+		pg_data_t *pgdat;
+
+		z = first_zones_zonelist(zonelist, sc.reclaim_idx, sc.nodemask);
+		pgdat = zonelist_zone(z)->zone_pgdat;
+
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
+	}
+
 	return nr_reclaimed;
 }
 #endif
-- 
2.31.1



* [PATCH 4/5] mm/writeback: Throttle based on page writeback instead of congestion
  2021-09-20  8:54 [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/ Mel Gorman
                   ` (2 preceding siblings ...)
  2021-09-20  8:54 ` [PATCH 3/5] mm/vmscan: Throttle reclaim when no progress is being made Mel Gorman
@ 2021-09-20  8:54 ` Mel Gorman
  2021-09-20  8:54 ` [PATCH 5/5] mm/page_alloc: Remove the throttling logic from the page allocator Mel Gorman
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2021-09-20  8:54 UTC (permalink / raw)
  To: Linux-MM
  Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML, Mel Gorman

do_writepages throttles on congestion if writepages() fails due to a
lack of memory but congestion_wait() is partially broken as the congestion
state is not updated for all BDIs.

This patch stalls waiting for a number of pages that are located on the
local node to complete writeback. The main weakness is that there is no
correlation between the location of the inode's pages and the local node
but that is still better than congestion_wait.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
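
Illustration only, not part of the patch: a userspace model of the retry
loop in do_writepages(), showing where the throttle sits. The callback
types, the demo_* helpers and model_do_writepages() are hypothetical
names for this sketch. The loop retries only when the write attempt
fails with -ENOMEM and the caller asked for synchronous writeback, and
it throttles once between attempts.

```c
#include <errno.h>
#include <stdbool.h>

typedef int (*model_write_fn)(void *ctx);
typedef void (*model_throttle_fn)(void *ctx);

/*
 * Retry the write while it fails with -ENOMEM and sync_all is set,
 * calling the throttle hook between attempts. Returns the last result.
 */
static int model_do_writepages(model_write_fn do_write,
			       model_throttle_fn do_throttle,
			       void *ctx, bool sync_all)
{
	int ret;

	for (;;) {
		ret = do_write(ctx);
		if (ret != -ENOMEM || !sync_all)
			break;
		do_throttle(ctx);
	}

	return ret;
}

/* Demo state: fail with -ENOMEM a fixed number of times, then succeed. */
struct demo_state {
	int failures_left;
	int throttles;
};

static int demo_write(void *ctx)
{
	struct demo_state *s = ctx;

	if (s->failures_left > 0) {
		s->failures_left--;
		return -ENOMEM;
	}
	return 0;
}

static void demo_throttle(void *ctx)
{
	struct demo_state *s = ctx;

	s->throttles++;
}
```

With two simulated failures and sync_all set, the model throttles twice
and then returns 0. With sync_all clear, it returns -ENOMEM after the
first attempt without throttling.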
 mm/page-writeback.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4812a17b288c..f34f54fcd5b4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2366,8 +2366,15 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 			ret = generic_writepages(mapping, wbc);
 		if ((ret != -ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL))
 			break;
-		cond_resched();
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
+
+		/*
+		 * Lacking an allocation context or the locality or writeback
+		 * state of any of the inode's pages, throttle based on
+		 * writeback activity on the local node. It's as good a
+		 * guess as any.
+		 */
+		reclaim_throttle(NODE_DATA(numa_node_id()),
+			VMSCAN_THROTTLE_WRITEBACK, HZ/50);
 	}
 	/*
 	 * Usually few pages are written by now from those we've just submitted
-- 
2.31.1



* [PATCH 5/5] mm/page_alloc: Remove the throttling logic from the page allocator
  2021-09-20  8:54 [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/ Mel Gorman
                   ` (3 preceding siblings ...)
  2021-09-20  8:54 ` [PATCH 4/5] mm/writeback: Throttle based on page writeback instead of congestion Mel Gorman
@ 2021-09-20  8:54 ` Mel Gorman
  2021-09-20 11:42 ` [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/ Matthew Wilcox
  2021-09-21 20:46 ` Dave Chinner
  6 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2021-09-20  8:54 UTC (permalink / raw)
  To: Linux-MM
  Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML, Mel Gorman

The page allocator stalls based on the number of pages that are
waiting for writeback to start but this should now be redundant.
shrink_inactive_list() will wake flusher threads if the LRU tail consists
of unqueued dirty pages so the flusher should be active. If reclaim fails
to make progress because pages under writeback are not being completed
quickly then it should stall on VMSCAN_THROTTLE_WRITEBACK.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 21 +--------------------
 1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 78e538067651..8fa0109ff417 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4795,30 +4795,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		trace_reclaim_retry_zone(z, order, reclaimable,
 				available, min_wmark, *no_progress_loops, wmark);
 		if (wmark) {
-			/*
-			 * If we didn't make any progress and have a lot of
-			 * dirty + writeback pages then we should wait for
-			 * an IO to complete to slow down the reclaim and
-			 * prevent from pre mature OOM
-			 */
-			if (!did_some_progress) {
-				unsigned long write_pending;
-
-				write_pending = zone_page_state_snapshot(zone,
-							NR_ZONE_WRITE_PENDING);
-
-				if (2 * write_pending > reclaimable) {
-					congestion_wait(BLK_RW_ASYNC, HZ/10);
-					return true;
-				}
-			}
-
 			ret = true;
-			goto out;
+			break;
 		}
 	}
 
-out:
 	/*
 	 * Memory allocation/reclaim might be called from a WQ context and the
 	 * current implementation of the WQ concurrency control doesn't
-- 
2.31.1



* Re: [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/
  2021-09-20  8:54 [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/ Mel Gorman
                   ` (4 preceding siblings ...)
  2021-09-20  8:54 ` [PATCH 5/5] mm/page_alloc: Remove the throttling logic from the page allocator Mel Gorman
@ 2021-09-20 11:42 ` Matthew Wilcox
  2021-09-20 12:50   ` Mel Gorman
  2021-09-20 19:51   ` Mel Gorman
  2021-09-21 20:46 ` Dave Chinner
  6 siblings, 2 replies; 33+ messages in thread
From: Matthew Wilcox @ 2021-09-20 11:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, NeilBrown, Theodore Ts'o, Andreas Dilger,
	Darrick J . Wong, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML

On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> This has been lightly tested only and the testing was useless as the
> relevant code was not executed. The workload configurations I had that
> used to trigger these corner cases no longer work (yey?) and I'll need
> to implement a new synthetic workload. If someone is aware of a realistic
> workload that forces reclaim activity to the point where reclaim stalls
> then kindly share the details.

The stereotypical "stalling on I/O" problem is to plug in one of the
crap USB drives you were given at a trade show and simply
	dd if=/dev/zero of=/dev/sdb
	sync

You can also set up qemu to have extremely slow I/O performance:
https://serverfault.com/questions/675704/extremely-slow-qemu-storage-performance-with-qcow2-images



* Re: [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/
  2021-09-20 11:42 ` [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/ Matthew Wilcox
@ 2021-09-20 12:50   ` Mel Gorman
  2021-09-20 14:11     ` David Sterba
  2021-09-20 19:51   ` Mel Gorman
  1 sibling, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2021-09-20 12:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linux-MM, NeilBrown, Theodore Ts'o, Andreas Dilger,
	Darrick J . Wong, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML

On Mon, Sep 20, 2021 at 12:42:44PM +0100, Matthew Wilcox wrote:
> On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> > This has been lightly tested only and the testing was useless as the
> > relevant code was not executed. The workload configurations I had that
> > used to trigger these corner cases no longer work (yey?) and I'll need
> > to implement a new synthetic workload. If someone is aware of a realistic
> > workload that forces reclaim activity to the point where reclaim stalls
> > then kindly share the details.
> 
> The stereotypical "stalling on I/O" problem is to plug in one of the
> crap USB drives you were given at a trade show and simply
> 	dd if=/dev/zero of=/dev/sdb
> 	sync
> 

The test machines are 1500KM away so plugging in a USB stick is not
practical but, if worst comes to the worst, I could test it on a laptop.
I considered using the IO controller but I'm not sure that would throttle
background writeback.
I dismissed doing this for a few reasons though -- the dirtying should
be rate limited based on the speed of the BDI so it will not necessarily
trigger the condition. It also misses the other interesting cases --
throttling due to excessive isolation and throttling due to failing to
make progress.

I've prototyped a synthetic case that uses 4..(NR_CPUS*4) workers. 1
worker measures mmap/munmap latency. 1 worker under fio is randomly reading
files. The remaining workers are split between fio doing random write IO
on separate files and anonymous memory hogs reading large mappings every
5 seconds. The aggregate WSS is approximately totalmem*2 split between 60%
anon and 40% file-backed (40% to be 2xdirty_ratio). After a warmup period
based on the writeback speed, it runs for 5 minutes per number of workers.
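
As a rough illustration of the sizing arithmetic above, here is a minimal
userspace sketch. The struct and function names are hypothetical and this
is not the actual test harness; it only restates the totalmem*2 working
set with the 60%/40% anon/file split and the upper bound on workers.

```c
#include <stdint.h>

/* Hypothetical helper types, not part of any real benchmark harness. */
struct wss_split {
	uint64_t total;	/* aggregate working set size in bytes */
	uint64_t anon;	/* anonymous memory share (60%) */
	uint64_t file;	/* file-backed share (40%) */
};

/*
 * Aggregate WSS is approximately totalmem*2, split 60% anon and
 * 40% file-backed. Divide before multiplying to avoid overflow on
 * large memory sizes; the file share takes the rounding remainder.
 */
static struct wss_split workload_wss(uint64_t totalmem_bytes)
{
	struct wss_split w;

	w.total = totalmem_bytes * 2;
	w.anon = w.total / 100 * 60;
	w.file = w.total - w.anon;
	return w;
}

/* Worker counts range over 4..(NR_CPUS*4); this is the upper bound. */
static unsigned int max_workers(unsigned int nr_cpus)
{
	return nr_cpus * 4;
}
```

How the remaining workers are divided between fio writers and memory
hogs is not covered by this sketch.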

The primary metric of "goodness" will be the mmap latency because it's
the smallest worker that should be able to make quick progress and I
want to see how much it is interfered with during reclaim. I'll be
graphing the throttling times to see what processes get throttled and
for how long.

I was hoping though that there was a canonical realistic case that the
FS people use to stress the paths where the allocator fails to return
memory.  While my synthetic workload *might* work to trigger the cases,
I would prefer to have something that can compare this basic approach
with anything that is more clever.

Similarly, it would be nice to have a reasonable test case that phase
changes what memory is hot while there is heavy IO in the background to
detect whether the hot WSS is being properly protected. I used to use
memcached and a heavy writer to simulate this but it's weak because there
is no phase change so it's poor at evaluating vmscan.

> You can also set up qemu to have extremely slow I/O performance:
> https://serverfault.com/questions/675704/extremely-slow-qemu-storage-performance-with-qcow2-images
> 

Similar problem to the slow USB case, it's only catching one part of the
picture except now I have to worry about differences that are related
to the VM configuration (e.g. pinning virtual CPUs to physical CPUs
and replicating topology). Fine for a functional test, not so fine for
measuring if the patch is any good performance-wise.

-- 
Mel Gorman
SUSE Labs


* Re: [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/
  2021-09-20 12:50   ` Mel Gorman
@ 2021-09-20 14:11     ` David Sterba
  2021-09-21 11:18       ` Mel Gorman
  0 siblings, 1 reply; 33+ messages in thread
From: David Sterba @ 2021-09-20 14:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Matthew Wilcox, Linux-MM, NeilBrown, Theodore Ts'o,
	Andreas Dilger, Darrick J . Wong, Michal Hocko, Dave Chinner,
	Rik van Riel, Vlastimil Babka, Johannes Weiner, Jonathan Corbet,
	Linux-fsdevel, LKML

On Mon, Sep 20, 2021 at 01:50:58PM +0100, Mel Gorman wrote:
> On Mon, Sep 20, 2021 at 12:42:44PM +0100, Matthew Wilcox wrote:
> > On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> > > This has been lightly tested only and the testing was useless as the
> > > relevant code was not executed. The workload configurations I had that
> > > used to trigger these corner cases no longer work (yey?) and I'll need
> > > to implement a new synthetic workload. If someone is aware of a realistic
> > > workload that forces reclaim activity to the point where reclaim stalls
> > > then kindly share the details.
> > 
> > The stereotypical "stalling on I/O" problem is to plug in one of the
> > crap USB drives you were given at a trade show and simply
> > 	dd if=/dev/zero of=/dev/sdb
> > 	sync
> > 
> 
> The test machines are 1500KM away so plugging in a USB stick is not
> practical but, if worst comes to the worst, I could test it on a laptop.

There's a device mapper target dm-delay [1] that as it says delays the
reads and writes, so you could try to emulate the slow USB that way.

[1] https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/delay.html


* Re: [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/
  2021-09-20 11:42 ` [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/ Matthew Wilcox
  2021-09-20 12:50   ` Mel Gorman
@ 2021-09-20 19:51   ` Mel Gorman
  1 sibling, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2021-09-20 19:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linux-MM, NeilBrown, Theodore Ts'o, Andreas Dilger,
	Darrick J . Wong, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML

On Mon, Sep 20, 2021 at 12:42:44PM +0100, Matthew Wilcox wrote:
> On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> > This has been lightly tested only and the testing was useless as the
> > relevant code was not executed. The workload configurations I had that
> > used to trigger these corner cases no longer work (yey?) and I'll need
> > to implement a new synthetic workload. If someone is aware of a realistic
> > workload that forces reclaim activity to the point where reclaim stalls
> > then kindly share the details.
> 
> The stereotypical "stalling on I/O" problem is to plug in one of the
> crap USB drives you were given at a trade show and simply
> 	dd if=/dev/zero of=/dev/sdb
> 	sync
> 
> You can also set up qemu to have extremely slow I/O performance:
> https://serverfault.com/questions/675704/extremely-slow-qemu-storage-performance-with-qcow2-images
> 

Ok, I managed to get something working and nothing blew up.

The workload was similar to what I described except the dirty file data
is related to dirty_ratio, the memory hogs no longer sleep and I disabled
the parallel readers. There is still a configuration with the parallel
readers but I won't have the results till tomorrow.

Surprising no one, vanilla kernel throttling barely works.

      1 writeback_wait_iff_congested: usec_delayed=4000
      3 writeback_congestion_wait: usec_delayed=108000
    196 writeback_congestion_wait: usec_delayed=104000
  16697 writeback_wait_iff_congested: usec_delayed=0

too_many_isolated is not tracked at all so we don't know what that looks
like but kswapd "blocking" on dirty pages at the tail basically never
stalls. The few congestion_wait's that did happen stalled for the full
duration as the bdi is not tracking congestion at all.
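
To put the usec_delayed values in context, a small userspace sketch of
the tick arithmetic follows. The helper name is hypothetical, and HZ=250
is an assumption inferred from the 4000 usec granularity of the trace
output; the configured HZ is not stated anywhere in this thread.

```c
#include <stdint.h>

/*
 * Convert a timeout in jiffies to microseconds for a given HZ.
 * Exact only when hz divides 1000000 evenly (true for 100, 250, 1000).
 * Hypothetical userspace helper, not a kernel function.
 */
static uint64_t jiffies_to_usec(uint64_t jiffies, unsigned int hz)
{
	return jiffies * (1000000ULL / hz);
}
```

Under that assumption, HZ/10 is 25 jiffies or 100000 usec, so the 104000
and 108000 usec stalls are the full timeout plus one or two ticks.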

With the series, the breakdown of reasons to stall was

   5703 reason=VMSCAN_THROTTLE_WRITEBACK
  29644 reason=VMSCAN_THROTTLE_NOPROGRESS
1979999 reason=VMSCAN_THROTTLE_ISOLATED

kswapd stalls were rare but they did happen and surprise surprise, it
was dirty pages

    914 reason=VMSCAN_THROTTLE_WRITEBACK

All of them stalled for the full timeout so there might be a bug in
patch 1 because that sounds suspicious.

As "too many pages isolated" was the top reason, the frequency of each
stall time is as follows

      1 usect_delayed=164000
      1 usect_delayed=192000
      1 usect_delayed=200000
      1 usect_delayed=208000
      1 usect_delayed=220000
      1 usect_delayed=244000
      1 usect_delayed=308000
      1 usect_delayed=312000
      1 usect_delayed=316000
      1 usect_delayed=332000
      1 usect_delayed=588000
      1 usect_delayed=620000
      1 usect_delayed=836000
      3 usect_delayed=116000
      4 usect_delayed=124000
      4 usect_delayed=128000
      6 usect_delayed=120000
      9 usect_delayed=112000
     11 usect_delayed=100000
     13 usect_delayed=48000
     13 usect_delayed=96000
     14 usect_delayed=40000
     15 usect_delayed=88000
     15 usect_delayed=92000
     16 usect_delayed=80000
     18 usect_delayed=68000
     19 usect_delayed=76000
     22 usect_delayed=84000
     23 usect_delayed=108000
     23 usect_delayed=60000
     25 usect_delayed=44000
     25 usect_delayed=52000
     29 usect_delayed=36000
     30 usect_delayed=56000
     30 usect_delayed=64000
     33 usect_delayed=72000
     57 usect_delayed=32000
     91 usect_delayed=20000
    107 usect_delayed=24000
    125 usect_delayed=28000
    131 usect_delayed=16000
    180 usect_delayed=12000
    186 usect_delayed=8000
   1379 usect_delayed=104000
  16493 usect_delayed=4000
1960837 usect_delayed=0

In other words, the vast majority of stalls were for 0 time and the task
was immediately woken again. The next most common stall time was 1 tick
but a sizable number reach the full timeout. Everything else is somewhere
in between so the event trigger appears to be ok.
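
As a sanity check on "the vast majority", the shares can be worked out
from the histogram with integer maths. This is only a sketch of the
arithmetic with hypothetical names; it is not the analysis script
mentioned below.

```c
#include <stddef.h>
#include <stdint.h>

/* One line of the histogram above: a stall count and its delay. */
struct stall_bucket {
	uint64_t count;
	uint64_t usec_delayed;
};

/* Sum the stall counts over all buckets. */
static uint64_t stalls_total(const struct stall_bucket *b, size_t n)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += b[i].count;
	return total;
}

/* Share of part in total, in parts per million (integer maths only). */
static uint64_t share_ppm(uint64_t part, uint64_t total)
{
	return total ? part * 1000000ULL / total : 0;
}
```

With the figures above, share_ppm(1960837, 1979999) gives 990322, so
roughly 99% of the VMSCAN_THROTTLE_ISOLATED stalls had a zero delay.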

I don't know how the application itself performed as I still have to
write the analysis script and assuming I can look at this tomorrow, I'll
probably start with why VMSCAN_THROTTLE_WRITEBACK always stalled for the
full timeout.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested
  2021-09-20  8:54 ` [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested Mel Gorman
@ 2021-09-20 23:19   ` NeilBrown
  2021-09-21 11:12     ` Mel Gorman
  2021-09-21  0:13   ` NeilBrown
  2021-09-22 12:16   ` Hillf Danton
  2 siblings, 1 reply; 33+ messages in thread
From: NeilBrown @ 2021-09-20 23:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML, Mel Gorman

On Mon, 20 Sep 2021, Mel Gorman wrote:
>  
> +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page);
> +static inline void acct_reclaim_writeback(struct page *page)
> +{
> +	pg_data_t *pgdat = page_pgdat(page);
> +
> +	if (atomic_read(&pgdat->nr_reclaim_throttled))
> +		__acct_reclaim_writeback(pgdat, page);

The first thing __acct_reclaim_writeback() does is repeat that
atomic_read().
Should we read it once and pass the value in to
__acct_reclaim_writeback(), or is that an unnecessary
micro-optimisation?


> +/*
> + * Account for pages written if tasks are throttled waiting on dirty
> + * pages to clean. If enough pages have been cleaned since throttling
> + * started then wakeup the throttled tasks.
> + */
> +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page)
> +{
> +	unsigned long nr_written;
> +	int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled);
> +
> +	__inc_node_page_state(page, NR_THROTTLED_WRITTEN);
> +	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
> +		READ_ONCE(pgdat->nr_reclaim_start);
> +
> +	if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
> +		wake_up_interruptible_all(&pgdat->reclaim_wait);

A simple wake_up() could be used here.  "interruptible" is only needed
if non-interruptible waiters should be left alone.  "_all" is only needed
if there are some exclusive waiters.  Neither of these apply, so I think
the simpler interface is best.
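
For reference, the wakeup condition in the quoted hunk can be modelled in
userspace as below. This is a sketch only: the function name is
hypothetical, the counters are passed in as plain values instead of being
read from the pgdat, and SWAP_CLUSTER_MAX is redefined locally with the
kernel's value of 32.

```c
#include <stdbool.h>

/* SWAP_CLUSTER_MAX is 32 in the kernel; redefined here for userspace. */
#define SWAP_CLUSTER_MAX 32UL

/*
 * Userspace model of the check in the quoted hunk: wake the throttled
 * tasks once more than SWAP_CLUSTER_MAX pages per throttled task have
 * been written since throttling started.
 */
static bool should_wake_throttled(unsigned long nr_written_now,
				  unsigned long nr_reclaim_start,
				  int nr_throttled)
{
	unsigned long nr_written = nr_written_now - nr_reclaim_start;

	return nr_written > SWAP_CLUSTER_MAX * nr_throttled;
}
```

With one throttled task the wakeup fires on the 33rd page written after
throttling started, with two tasks on the 65th.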


> +}
> +
>  /* possible outcome of pageout() */
>  typedef enum {
>  	/* failed to write page out, page is locked */
> @@ -1412,9 +1453,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>  
>  		/*
>  		 * The number of dirty pages determines if a node is marked
> -		 * reclaim_congested which affects wait_iff_congested. kswapd
> -		 * will stall and start writing pages if the tail of the LRU
> -		 * is all dirty unqueued pages.
> +		 * reclaim_congested. kswapd will stall and start writing
> +		 * pages if the tail of the LRU is all dirty unqueued pages.
>  		 */
>  		page_check_dirty_writeback(page, &dirty, &writeback);
>  		if (dirty || writeback)
> @@ -3180,19 +3220,20 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  		 * If kswapd scans pages marked for immediate
>  		 * reclaim and under writeback (nr_immediate), it
>  		 * implies that pages are cycling through the LRU
> -		 * faster than they are written so also forcibly stall.
> +		 * faster than they are written so forcibly stall
> +		 * until some pages complete writeback.
>  		 */
>  		if (sc->nr.immediate)
> -			congestion_wait(BLK_RW_ASYNC, HZ/10);
> +			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
>  	}
>  
>  	/*
>  	 * Tag a node/memcg as congested if all the dirty pages
>  	 * scanned were backed by a congested BDI and

"congested BDI" doesn't mean anything any more.  Is this a good time to
correct that comment?
This comment seems to refer to the test

      sc->nr.dirty && sc->nr.dirty == sc->nr.congested)

a few lines down.  But nr.congested is set from nr_congested which
counts when inode_write_congested() is true - almost never - and when 
"writeback and PageReclaim()".

Is that last test the sign that we are cycling through the LRU too fast?
So the comment could become:

   Tag a node/memcg as congested if all the dirty pages were
   already marked for writeback and immediate reclaim (counted in
   nr.congested).

??

Patch seems to make sense to me, but I'm not an expert in this area.

Thanks!
NeilBrown


* Re: [PATCH 2/5] mm/vmscan: Throttle reclaim and compaction when too may pages are isolated
  2021-09-20  8:54 ` [PATCH 2/5] mm/vmscan: Throttle reclaim and compaction when too may pages are isolated Mel Gorman
@ 2021-09-20 23:27   ` NeilBrown
  2021-09-21 11:03     ` Mel Gorman
  2021-09-21 18:45     ` Yang Shi
  1 sibling, 1 reply; 33+ messages in thread
From: NeilBrown @ 2021-09-20 23:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML, Mel Gorman

On Mon, 20 Sep 2021, Mel Gorman wrote:
> @@ -2291,8 +2302,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  			return 0;
>  
>  		/* wait a bit for the reclaimer. */
> -		msleep(100);
> -		stalled = true;
> +		reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);

Why drop the assignment to "stalled"?
Doing that changes the character of the loop - and makes the 'stalled'
variable always 'false'.

NeilBrown


* Re: [PATCH 3/5] mm/vmscan: Throttle reclaim when no progress is being made
  2021-09-20  8:54 ` [PATCH 3/5] mm/vmscan: Throttle reclaim when no progress is being made Mel Gorman
@ 2021-09-20 23:31   ` NeilBrown
  2021-09-21 11:16     ` Mel Gorman
  0 siblings, 1 reply; 33+ messages in thread
From: NeilBrown @ 2021-09-20 23:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML, Mel Gorman

On Mon, 20 Sep 2021, Mel Gorman wrote:
> +
> +		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);

We always seem to pass "HZ/10" to reclaim_throttle().  Should we just
hard-code that in the one place inside reclaim_throttle() itself?

NeilBrown


* Re: [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested
  2021-09-20  8:54 ` [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested Mel Gorman
  2021-09-20 23:19   ` NeilBrown
@ 2021-09-21  0:13   ` NeilBrown
  2021-09-21 10:58     ` Mel Gorman
  2021-09-22 12:16   ` Hillf Danton
  2 siblings, 1 reply; 33+ messages in thread
From: NeilBrown @ 2021-09-21  0:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML, Mel Gorman

On Mon, 20 Sep 2021, Mel Gorman wrote:
> -long wait_iff_congested(int sync, long timeout)
> -{
> -	long ret;
> -	unsigned long start = jiffies;
> -	DEFINE_WAIT(wait);
> -	wait_queue_head_t *wqh = &congestion_wqh[sync];
> -
> -	/*
> -	 * If there is no congestion, yield if necessary instead
> -	 * of sleeping on the congestion queue
> -	 */
> -	if (atomic_read(&nr_wb_congested[sync]) == 0) {
> -		cond_resched();
> -
> -		/* In case we scheduled, work out time remaining */
> -		ret = timeout - (jiffies - start);
> -		if (ret < 0)
> -			ret = 0;
> -
> -		goto out;
> -	}
> -
> -	/* Sleep until uncongested or a write happens */
> -	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);

Uninterruptible wait.

....
> +static void
> +reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> +							long timeout)
> +{
> +	wait_queue_head_t *wqh = &pgdat->reclaim_wait;
> +	unsigned long start = jiffies;
> +	long ret;
> +	DEFINE_WAIT(wait);
> +
> +	atomic_inc(&pgdat->nr_reclaim_throttled);
> +	WRITE_ONCE(pgdat->nr_reclaim_start,
> +		 node_page_state(pgdat, NR_THROTTLED_WRITTEN));
> +
> +	prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);

Interruptible wait.

Why the change?  I think these waits really need to be TASK_UNINTERRUPTIBLE.

Thanks,
NeilBrown


* Re: [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested
  2021-09-21  0:13   ` NeilBrown
@ 2021-09-21 10:58     ` Mel Gorman
  2021-09-21 21:40       ` NeilBrown
  2021-09-22  6:04       ` Dave Chinner
  0 siblings, 2 replies; 33+ messages in thread
From: Mel Gorman @ 2021-09-21 10:58 UTC (permalink / raw)
  To: NeilBrown
  Cc: Linux-MM, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML

On Tue, Sep 21, 2021 at 10:13:17AM +1000, NeilBrown wrote:
> On Mon, 20 Sep 2021, Mel Gorman wrote:
> > -long wait_iff_congested(int sync, long timeout)
> > -{
> > -	long ret;
> > -	unsigned long start = jiffies;
> > -	DEFINE_WAIT(wait);
> > -	wait_queue_head_t *wqh = &congestion_wqh[sync];
> > -
> > -	/*
> > -	 * If there is no congestion, yield if necessary instead
> > -	 * of sleeping on the congestion queue
> > -	 */
> > -	if (atomic_read(&nr_wb_congested[sync]) == 0) {
> > -		cond_resched();
> > -
> > -		/* In case we scheduled, work out time remaining */
> > -		ret = timeout - (jiffies - start);
> > -		if (ret < 0)
> > -			ret = 0;
> > -
> > -		goto out;
> > -	}
> > -
> > -	/* Sleep until uncongested or a write happens */
> > -	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> 
> Uninterruptible wait.
> 
> ....
> > +static void
> > +reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> > +							long timeout)
> > +{
> > +	wait_queue_head_t *wqh = &pgdat->reclaim_wait;
> > +	unsigned long start = jiffies;
> > +	long ret;
> > +	DEFINE_WAIT(wait);
> > +
> > +	atomic_inc(&pgdat->nr_reclaim_throttled);
> > +	WRITE_ONCE(pgdat->nr_reclaim_start,
> > +		 node_page_state(pgdat, NR_THROTTLED_WRITTEN));
> > +
> > +	prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
> 
> Interruptible wait.
> 
> Why the change?  I think these waits really need to be TASK_UNINTERRUPTIBLE.
> 

Because from mm/ context, I saw no reason why the task *should* be
uninterruptible. It's waiting on other tasks to complete IO and it is not
protecting device state, filesystem state or anything else. If it gets
a signal, it's safe to wake up, particularly if that signal is KILL and
the context is a direct reclaimer.

The original TASK_UNINTERRUPTIBLE is almost certainly a copy&paste from
congestion_wait which may be called because a filesystem operation must
complete before it can return to userspace so a signal waking it up is
pointless.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 2/5] mm/vmscan: Throttle reclaim and compaction when too may pages are isolated
  2021-09-20 23:27   ` NeilBrown
@ 2021-09-21 11:03     ` Mel Gorman
  0 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2021-09-21 11:03 UTC (permalink / raw)
  To: NeilBrown
  Cc: Linux-MM, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML

On Tue, Sep 21, 2021 at 09:27:56AM +1000, NeilBrown wrote:
> On Mon, 20 Sep 2021, Mel Gorman wrote:
> > @@ -2291,8 +2302,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> >  			return 0;
> >  
> >  		/* wait a bit for the reclaimer. */
> > -		msleep(100);
> > -		stalled = true;
> > +		reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
> 
> Why drop the assignment to "stalled"?
> Doing that changes the character of the loop - and makes the 'stalled'
> variable always 'false'.
> 

This was a thought that was never completed. The intent was that if
there are too many pages isolated that it should not return prematurely
and do busy work elsewhere. It potentially means an allocation request
moves to lower zones or remote nodes prematurely but I never did the
full removal. Even if I had, on reflection, that type of behavioural
change does not belong in this series.

I've restored the "stalled = true".

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested
  2021-09-20 23:19   ` NeilBrown
@ 2021-09-21 11:12     ` Mel Gorman
  2021-09-21 21:27       ` NeilBrown
  0 siblings, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2021-09-21 11:12 UTC (permalink / raw)
  To: NeilBrown
  Cc: Linux-MM, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML

On Tue, Sep 21, 2021 at 09:19:07AM +1000, NeilBrown wrote:
> On Mon, 20 Sep 2021, Mel Gorman wrote:
> >  
> > +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page);
> > +static inline void acct_reclaim_writeback(struct page *page)
> > +{
> > +	pg_data_t *pgdat = page_pgdat(page);
> > +
> > +	if (atomic_read(&pgdat->nr_reclaim_throttled))
> > +		__acct_reclaim_writeback(pgdat, page);
> 
> The first thing __acct_reclaim_writeback() does is repeat that
> atomic_read().
> Should we read it once and pass the value in to
> __acct_reclaim_writeback(), or is that an unnecessary
> micro-optimisation?
> 

I think it's a micro-optimisation but I can still do it.

> 
> > +/*
> > + * Account for pages written if tasks are throttled waiting on dirty
> > + * pages to clean. If enough pages have been cleaned since throttling
> > + * started then wakeup the throttled tasks.
> > + */
> > +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page)
> > +{
> > +	unsigned long nr_written;
> > +	int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled);
> > +
> > +	__inc_node_page_state(page, NR_THROTTLED_WRITTEN);
> > +	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
> > +		READ_ONCE(pgdat->nr_reclaim_start);
> > +
> > +	if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
> > +		wake_up_interruptible_all(&pgdat->reclaim_wait);
> 
> A simple wake_up() could be used here.  "interruptible" is only needed
> if non-interruptible waiters should be left alone.  "_all" is only needed
> if there are some exclusive waiters.  Neither of these apply, so I think
> the simpler interface is best.
> 

You're right.

> 
> > +}
> > +
> >  /* possible outcome of pageout() */
> >  typedef enum {
> >  	/* failed to write page out, page is locked */
> > @@ -1412,9 +1453,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
> >  
> >  		/*
> >  		 * The number of dirty pages determines if a node is marked
> > -		 * reclaim_congested which affects wait_iff_congested. kswapd
> > -		 * will stall and start writing pages if the tail of the LRU
> > -		 * is all dirty unqueued pages.
> > +		 * reclaim_congested. kswapd will stall and start writing
> > +		 * pages if the tail of the LRU is all dirty unqueued pages.
> >  		 */
> >  		page_check_dirty_writeback(page, &dirty, &writeback);
> >  		if (dirty || writeback)
> > @@ -3180,19 +3220,20 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  		 * If kswapd scans pages marked for immediate
> >  		 * reclaim and under writeback (nr_immediate), it
> >  		 * implies that pages are cycling through the LRU
> > -		 * faster than they are written so also forcibly stall.
> > +		 * faster than they are written so forcibly stall
> > +		 * until some pages complete writeback.
> >  		 */
> >  		if (sc->nr.immediate)
> > -			congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
> >  	}
> >  
> >  	/*
> >  	 * Tag a node/memcg as congested if all the dirty pages
> >  	 * scanned were backed by a congested BDI and
> 
> "congested BDI" doesn't mean anything any more.  Is this a good time to
> correct that comment?
> This comment seems to refer to the test
> 
>       sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> 
> a few lines down.  But nr.congested is set from nr_congested which
> counts when inode_write_congested() is true - almost never - and when 
> "writeback and PageReclaim()".
> 
> Is that last test the sign that we are cycling through the LRU too fast?
> So the comment could become:
> 
>    Tag a node/memcg as congested if all the dirty pages were
>    already marked for writeback and immediate reclaim (counted in
>    nr.congested).
> 
> ??
> 
> Patch seems to make sense to me, but I'm not expert in this area.
> 

Comments updated.

Diff on top looks like

diff --git a/mm/internal.h b/mm/internal.h
index e25b3686bfab..90764d646e02 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -34,13 +34,15 @@
 
 void page_writeback_init(void);
 
-void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page);
+void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
+						int nr_throttled);
 static inline void acct_reclaim_writeback(struct page *page)
 {
 	pg_data_t *pgdat = page_pgdat(page);
+	int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled);
 
-	if (atomic_read(&pgdat->nr_reclaim_throttled))
-		__acct_reclaim_writeback(pgdat, page);
+	if (nr_throttled)
+		__acct_reclaim_writeback(pgdat, page, nr_throttled);
 }
 
 vm_fault_t do_swap_page(struct vm_fault *vmf);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b58ea0b13286..2dc17de91d32 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1034,10 +1034,10 @@ reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
  * pages to clean. If enough pages have been cleaned since throttling
  * started then wakeup the throttled tasks.
  */
-void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page)
+void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
+							int nr_throttled)
 {
 	unsigned long nr_written;
-	int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled);
 
 	__inc_node_page_state(page, NR_THROTTLED_WRITTEN);
 	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
@@ -3228,9 +3228,8 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	}
 
 	/*
-	 * Tag a node/memcg as congested if all the dirty pages
-	 * scanned were backed by a congested BDI and
-	 * non-kswapd tasks will stall on reclaim_throttle.
+	 * Tag a node/memcg as congested if all the dirty pages were marked
+	 * for writeback and immediate reclaim (counted in nr.congested).
 	 *
 	 * Legacy memcg will stall in page writeback so avoid forcibly
 	 * stalling in reclaim_throttle().
@@ -3241,8 +3240,8 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
 
 	/*
-	 * Stall direct reclaim for IO completions if underlying BDIs
-	 * and node is congested. Allow kswapd to continue until it
+	 * Stall direct reclaim for IO completions if the lruvec is
+	 * node is congested. Allow kswapd to continue until it
 	 * starts encountering unqueued dirty pages or cycling through
 	 * the LRU too quickly.
 	 */
@@ -4427,7 +4426,7 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, highest_zoneidx, order,
 				      gfp_flags);
-	wake_up_interruptible(&pgdat->kswapd_wait);
+	wake_up_all(&pgdat->kswapd_wait);
 }
 
 #ifdef CONFIG_HIBERNATION
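For anyone following the wakeup condition in the diff above, here is a rough userspace model of it. It is only a sketch: struct node_model and the model_* helpers are hypothetical names, not kernel interfaces, and SWAP_CLUSTER_MAX is assumed to be 32 as in include/linux/swap.h. It shows the throttled count being read once in the caller and passed down, and the wakeup firing only after more than SWAP_CLUSTER_MAX pages per throttled task have been written since throttling started.

```c
#include <stdbool.h>

/* Assumed value; the kernel defines SWAP_CLUSTER_MAX in include/linux/swap.h. */
#define SWAP_CLUSTER_MAX 32UL

/* Hypothetical stand-in for the per-node state used by the patch. */
struct node_model {
	unsigned long nr_throttled_written; /* models NR_THROTTLED_WRITTEN */
	unsigned long nr_reclaim_start;     /* snapshot taken when throttling began */
	int nr_reclaim_throttled;           /* number of throttled reclaimers */
};

/* Models the check at the end of __acct_reclaim_writeback(). */
static bool model_should_wake(const struct node_model *n, int nr_throttled)
{
	unsigned long nr_written = n->nr_throttled_written - n->nr_reclaim_start;

	return nr_written > SWAP_CLUSTER_MAX * (unsigned long)nr_throttled;
}

/*
 * Models acct_reclaim_writeback(): read the throttled count once, do
 * nothing if nobody is throttled, otherwise account one written page.
 * Returns true when the throttled tasks would be woken.
 */
static bool model_acct_writeback(struct node_model *n)
{
	int nr_throttled = n->nr_reclaim_throttled;

	if (!nr_throttled)
		return false;

	n->nr_throttled_written++;
	return model_should_wake(n, nr_throttled);
}
```

With two throttled tasks the threshold is 64 pages, so in this model the wakeup is issued on the 65th completed page.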

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/5] mm/vmscan: Throttle reclaim when no progress is being made
  2021-09-20 23:31   ` NeilBrown
@ 2021-09-21 11:16     ` Mel Gorman
  2021-09-21 21:46       ` NeilBrown
  0 siblings, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2021-09-21 11:16 UTC (permalink / raw)
  To: NeilBrown
  Cc: Linux-MM, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML

On Tue, Sep 21, 2021 at 09:31:30AM +1000, NeilBrown wrote:
> On Mon, 20 Sep 2021, Mel Gorman wrote:
> > +
> > +		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
> 
> We always seem to pass "HZ/10" to reclaim_throttle().  Should we just
> hard-code that in the one place inside reclaim_throttle() itself?
> 

do_writepages passes in HZ/50. I'm not sure these values have any
special meaning; more likely they were pulled out of the air based on
the speed of some disk in the past and then copied. That is another
reason why I want the wakeups to be based on events within the mm as
much as possible.
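
To put numbers on those timeouts, the small sketch below converts a timeout written as HZ/div into jiffies and milliseconds. The model_* helper names are hypothetical and this is plain userspace arithmetic, not the kernel's own conversion routines. For the common HZ settings, HZ/10 works out to about 100ms and HZ/50 to about 20ms.

```c
/* Timeout in jiffies for an expression of the form HZ/div. */
static long model_timeout_jiffies(long hz, long div)
{
	return hz / div;
}

/* Convert a jiffies count to milliseconds for a given HZ (integer maths). */
static long model_jiffies_to_ms(long hz, long jiffies)
{
	return jiffies * 1000 / hz;
}
```

For example, with HZ=250, HZ/10 is 25 jiffies (100ms) and HZ/50 is 5 jiffies (20ms).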

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/
  2021-09-20 14:11     ` David Sterba
@ 2021-09-21 11:18       ` Mel Gorman
  0 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2021-09-21 11:18 UTC (permalink / raw)
  To: dsterba, Matthew Wilcox, Linux-MM, NeilBrown, Theodore Ts'o,
	Andreas Dilger, Darrick J . Wong, Michal Hocko, Dave Chinner,
	Rik van Riel, Vlastimil Babka, Johannes Weiner, Jonathan Corbet,
	Linux-fsdevel, LKML

On Mon, Sep 20, 2021 at 04:11:52PM +0200, David Sterba wrote:
> On Mon, Sep 20, 2021 at 01:50:58PM +0100, Mel Gorman wrote:
> > On Mon, Sep 20, 2021 at 12:42:44PM +0100, Matthew Wilcox wrote:
> > > On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> > > > This has been lightly tested only and the testing was useless as the
> > > > relevant code was not executed. The workload configurations I had that
> > > > used to trigger these corner cases no longer work (yey?) and I'll need
> > > > to implement a new synthetic workload. If someone is aware of a realistic
> > > > workload that forces reclaim activity to the point where reclaim stalls
> > > > then kindly share the details.
> > > 
> > > > The stereotypical "stalling on I/O" problem is to plug in one of the
> > > crap USB drives you were given at a trade show and simply
> > > 	dd if=/dev/zero of=/dev/sdb
> > > 	sync
> > > 
> > 
> > The test machines are 1500KM away so plugging in a USB stick is not
> > practical, but if worst comes to the worst, I could test it on a laptop.
> 
> There's a device mapper target dm-delay [1] that as it says delays the
> reads and writes, so you could try to emulate the slow USB that way.
> 
> [1] https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/delay.html

Ah, thanks for that tip. I wondered if something like this existed and
clearly did not search hard enough. I was able to reproduce the problem
without throttling but this could still be useful if examining cases
where there are 2 or more BDIs with variable speeds.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/5] mm/vmscan: Throttle reclaim and compaction when too may pages are isolated
  2021-09-20  8:54 ` [PATCH 2/5] mm/vmscan: Throttle reclaim and compaction when too may pages are isolated Mel Gorman
@ 2021-09-21 18:45     ` Yang Shi
  2021-09-21 18:45     ` Yang Shi
  1 sibling, 0 replies; 33+ messages in thread
From: Yang Shi @ 2021-09-21 18:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, NeilBrown, Theodore Ts'o, Andreas Dilger,
	Darrick J . Wong, Matthew Wilcox, Michal Hocko, Dave Chinner,
	Rik van Riel, Vlastimil Babka, Johannes Weiner, Jonathan Corbet,
	Linux-fsdevel, LKML

On Mon, Sep 20, 2021 at 1:55 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Page reclaim throttles on congestion if too many parallel reclaim instances
> have isolated too many pages. This makes no sense, excessive parallelisation
> has nothing to do with writeback or congestion.
>
> This patch creates an additional waitqueue to sleep on when too many
> pages are isolated. The throttled tasks are woken when the number
> of isolated pages is reduced or a timeout occurs. There may be
> some false positive wakeups for GFP_NOIO/GFP_NOFS callers but
> the tasks will throttle again if necessary.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/linux/mmzone.h        |  4 +++-
>  include/trace/events/vmscan.h |  4 +++-
>  mm/compaction.c               |  2 +-
>  mm/internal.h                 |  2 ++
>  mm/page_alloc.c               |  6 +++++-
>  mm/vmscan.c                   | 22 ++++++++++++++++------
>  6 files changed, 30 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index ef0a63ebd21d..ca65d6a64bdd 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -275,6 +275,8 @@ enum lru_list {
>
>  enum vmscan_throttle_state {
>         VMSCAN_THROTTLE_WRITEBACK,
> +       VMSCAN_THROTTLE_ISOLATED,
> +       NR_VMSCAN_THROTTLE,
>  };
>
>  #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
> @@ -846,7 +848,7 @@ typedef struct pglist_data {
>         int node_id;
>         wait_queue_head_t kswapd_wait;
>         wait_queue_head_t pfmemalloc_wait;
> -       wait_queue_head_t reclaim_wait; /* wq for throttling reclaim */
> +       wait_queue_head_t reclaim_wait[NR_VMSCAN_THROTTLE];
>         atomic_t nr_reclaim_throttled;  /* nr of throtted tasks */
>         unsigned long nr_reclaim_start; /* nr pages written while throttled
>                                          * when throttling started. */
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index c317f9fe0d17..d4905bd9e9c4 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -28,10 +28,12 @@
>                 ) : "RECLAIM_WB_NONE"
>
>  #define _VMSCAN_THROTTLE_WRITEBACK     (1 << VMSCAN_THROTTLE_WRITEBACK)
> +#define _VMSCAN_THROTTLE_ISOLATED      (1 << VMSCAN_THROTTLE_ISOLATED)
>
>  #define show_throttle_flags(flags)                                             \
>         (flags) ? __print_flags(flags, "|",                                     \
> -               {_VMSCAN_THROTTLE_WRITEBACK,    "VMSCAN_THROTTLE_WRITEBACK"}    \
> +               {_VMSCAN_THROTTLE_WRITEBACK,    "VMSCAN_THROTTLE_WRITEBACK"},   \
> +               {_VMSCAN_THROTTLE_ISOLATED,     "VMSCAN_THROTTLE_ISOLATED"}     \
>                 ) : "VMSCAN_THROTTLE_NONE"
>
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index bfc93da1c2c7..221c9c10ad7e 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -822,7 +822,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>                 if (cc->mode == MIGRATE_ASYNC)
>                         return -EAGAIN;
>
> -               congestion_wait(BLK_RW_ASYNC, HZ/10);
> +               reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);

It seems the wakeup is missing from compaction's too_many_isolated().
There are two too_many_isolated() functions, one for compaction and one
for the reclaimer. The wakeup code was only added to the reclaimer's
version below. Or was the compaction one left out intentionally?

>
>                 if (fatal_signal_pending(current))
>                         return -EINTR;
> diff --git a/mm/internal.h b/mm/internal.h
> index e25b3686bfab..e6cd22fb5a43 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -118,6 +118,8 @@ extern unsigned long highest_memmap_pfn;
>   */
>  extern int isolate_lru_page(struct page *page);
>  extern void putback_lru_page(struct page *page);
> +extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> +                                                               long timeout);
>
>  /*
>   * in mm/rmap.c:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d849ddfc1e51..78e538067651 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7389,6 +7389,8 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
>
>  static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
>  {
> +       int i;
> +
>         pgdat_resize_init(pgdat);
>
>         pgdat_init_split_queue(pgdat);
> @@ -7396,7 +7398,9 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
>
>         init_waitqueue_head(&pgdat->kswapd_wait);
>         init_waitqueue_head(&pgdat->pfmemalloc_wait);
> -       init_waitqueue_head(&pgdat->reclaim_wait);
> +
> +       for (i = 0; i < NR_VMSCAN_THROTTLE; i++)
> +               init_waitqueue_head(&pgdat->reclaim_wait[i]);
>
>         pgdat_page_ext_init(pgdat);
>         lruvec_init(&pgdat->__lruvec);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b58ea0b13286..eb81dcac15b2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1006,11 +1006,10 @@ static void handle_write_error(struct address_space *mapping,
>         unlock_page(page);
>  }
>
> -static void
> -reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> +void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
>                                                         long timeout)
>  {
> -       wait_queue_head_t *wqh = &pgdat->reclaim_wait;
> +       wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
>         unsigned long start = jiffies;
>         long ret;
>         DEFINE_WAIT(wait);
> @@ -1044,7 +1043,7 @@ void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page)
>                 READ_ONCE(pgdat->nr_reclaim_start);
>
>         if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
> -               wake_up_interruptible_all(&pgdat->reclaim_wait);
> +               wake_up_interruptible_all(&pgdat->reclaim_wait[VMSCAN_THROTTLE_WRITEBACK]);
>  }
>
>  /* possible outcome of pageout() */
> @@ -2159,6 +2158,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
>                 struct scan_control *sc)
>  {
>         unsigned long inactive, isolated;
> +       bool too_many;
>
>         if (current_is_kswapd())
>                 return 0;
> @@ -2182,6 +2182,17 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
>         if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
>                 inactive >>= 3;
>
> +       too_many = isolated > inactive;
> +
> +       /* Wake up tasks throttled due to too_many_isolated. */
> +       if (!too_many) {
> +               wait_queue_head_t *wqh;
> +
> +               wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_ISOLATED];
> +               if (waitqueue_active(wqh))
> +                       wake_up_interruptible_all(wqh);
> +       }
> +
>         return isolated > inactive;

Just return too_many?
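
To illustrate the suggestion, here is a minimal userspace model of the reclaimer's check with the comparison evaluated once and reused for the return value. struct isolated_model and model_too_many_isolated() are hypothetical names, the kswapd and cgroup early returns are left out, and a counter stands in for the real wakeup.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the inputs of too_many_isolated(). */
struct isolated_model {
	unsigned long inactive;
	unsigned long isolated;
	bool may_io_and_fs; /* models gfp_mask having both __GFP_IO and __GFP_FS */
	bool waiters;       /* models waitqueue_active() on the ISOLATED queue */
	int wakeups;        /* counts the wakeups this model would issue */
};

static bool model_too_many_isolated(struct isolated_model *m)
{
	unsigned long inactive = m->inactive;
	bool too_many;

	/* Callers allowed to do IO and FS work get a stricter limit. */
	if (m->may_io_and_fs)
		inactive >>= 3;

	too_many = m->isolated > inactive;

	/* Wake throttled tasks once this caller is under its own limit. */
	if (!too_many && m->waiters)
		m->wakeups++;

	/* Reuse the result instead of repeating the comparison. */
	return too_many;
}
```

Because the limit depends on the caller's GFP flags, a GFP_NOIO/GFP_NOFS caller can issue a wakeup while normal callers are still over their limit, which matches the false positive wakeups the changelog mentions.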

>  }
>
> @@ -2291,8 +2302,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>                         return 0;
>
>                 /* wait a bit for the reclaimer. */
> -               msleep(100);
> -               stalled = true;
> +               reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
>
>                 /* We are about to die and free our memory. Return now. */
>                 if (fatal_signal_pending(current))
> --
> 2.31.1
>
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/
  2021-09-20  8:54 [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/ Mel Gorman
                   ` (5 preceding siblings ...)
  2021-09-20 11:42 ` [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/ Matthew Wilcox
@ 2021-09-21 20:46 ` Dave Chinner
  2021-09-22 17:52   ` Mel Gorman
  6 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2021-09-21 20:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, NeilBrown, Theodore Ts'o, Andreas Dilger,
	Darrick J . Wong, Matthew Wilcox, Michal Hocko, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML

On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> Cc list similar to "congestion_wait() and GFP_NOFAIL" as they're loosely
> related.
> 
> This is a prototype series that removes all calls to congestion_wait
> in mm/ and deletes wait_iff_congested. It's not a clever
> implementation but congestion_wait has been broken for a long time
> (https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/).
> Even if it worked, it was never a great idea. While excessive
> dirty/writeback pages at the tail of the LRU is one possibility that
> reclaim may be slow, there is also the problem of too many pages being
> isolated and reclaim failing for other reasons (elevated references,
> too many pages isolated, excessive LRU contention etc).
> 
> This series replaces the reclaim conditions with event driven ones
> 
> o If there are too many dirty/writeback pages, sleep until a timeout
>   or enough pages get cleaned
> o If too many pages are isolated, sleep until enough isolated pages
>   are either reclaimed or put back on the LRU
> o If no progress is being made, let direct reclaim tasks sleep until
>   another task makes progress
> 
> This has been lightly tested only and the testing was useless as the
> relevant code was not executed. The workload configurations I had that
> used to trigger these corner cases no longer work (yey?) and I'll need
> to implement a new synthetic workload. If someone is aware of a realistic
> workload that forces reclaim activity to the point where reclaim stalls
> then kindly share the details.

Got a git tree pointer so I can pull it into a test kernel so I can
see what impact it has on behaviour before I try to make sense of
the code?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested
  2021-09-21 11:12     ` Mel Gorman
@ 2021-09-21 21:27       ` NeilBrown
  0 siblings, 0 replies; 33+ messages in thread
From: NeilBrown @ 2021-09-21 21:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML

On Tue, 21 Sep 2021, Mel Gorman wrote:
> On Tue, Sep 21, 2021 at 09:19:07AM +1000, NeilBrown wrote:
> > On Mon, 20 Sep 2021, Mel Gorman wrote:
> > >  
> > > +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page);
> > > +static inline void acct_reclaim_writeback(struct page *page)
> > > +{
> > > +	pg_data_t *pgdat = page_pgdat(page);
> > > +
> > > +	if (atomic_read(&pgdat->nr_reclaim_throttled))
> > > +		__acct_reclaim_writeback(pgdat, page);
> > 
> > The first thing __acct_reclaim_writeback() does is repeat that
> > atomic_read().
> > Should we read it once and pass the value in to
> > __acct_reclaim_writeback(), or is that an unnecessary
> > micro-optimisation?
> > 
> 
> I think it's a micro-optimisation but I can still do it.
> 
> > 
> > > +/*
> > > + * Account for pages written if tasks are throttled waiting on dirty
> > > + * pages to clean. If enough pages have been cleaned since throttling
> > > + * started then wakeup the throttled tasks.
> > > + */
> > > +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page)
> > > +{
> > > +	unsigned long nr_written;
> > > +	int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled);
> > > +
> > > +	__inc_node_page_state(page, NR_THROTTLED_WRITTEN);
> > > +	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
> > > +		READ_ONCE(pgdat->nr_reclaim_start);
> > > +
> > > +	if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
> > > +		wake_up_interruptible_all(&pgdat->reclaim_wait);
> > 
> > A simple wake_up() could be used here.  "interruptible" is only needed
> > if non-interruptible waiters should be left alone.  "_all" is only needed
> > if there are some exclusive waiters.  Neither of these apply, so I think
> > the simpler interface is best.
> > 
> 
> You're right.
> 
> > 
> > > +}
> > > +
> > >  /* possible outcome of pageout() */
> > >  typedef enum {
> > >  	/* failed to write page out, page is locked */
> > > @@ -1412,9 +1453,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
> > >  
> > >  		/*
> > >  		 * The number of dirty pages determines if a node is marked
> > > -		 * reclaim_congested which affects wait_iff_congested. kswapd
> > > -		 * will stall and start writing pages if the tail of the LRU
> > > -		 * is all dirty unqueued pages.
> > > +		 * reclaim_congested. kswapd will stall and start writing
> > > +		 * pages if the tail of the LRU is all dirty unqueued pages.
> > >  		 */
> > >  		page_check_dirty_writeback(page, &dirty, &writeback);
> > >  		if (dirty || writeback)
> > > @@ -3180,19 +3220,20 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> > >  		 * If kswapd scans pages marked for immediate
> > >  		 * reclaim and under writeback (nr_immediate), it
> > >  		 * implies that pages are cycling through the LRU
> > > -		 * faster than they are written so also forcibly stall.
> > > +		 * faster than they are written so forcibly stall
> > > +		 * until some pages complete writeback.
> > >  		 */
> > >  		if (sc->nr.immediate)
> > > -			congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > +			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
> > >  	}
> > >  
> > >  	/*
> > >  	 * Tag a node/memcg as congested if all the dirty pages
> > >  	 * scanned were backed by a congested BDI and
> > 
> > "congested BDI" doesn't mean anything any more.  Is this a good time to
> > correct that comment.
> > This comment seems to refer to the test
> > 
> >       sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> > 
> > a few lines down.  But nr.congested is set from nr_congested which
> > counts when inode_write_congested() is true - almost never - and when 
> > "writeback and PageReclaim()".
> > 
> > > Is that last test the sign that we are cycling through the LRU too fast?
> > So the comment could become:
> > 
> > >    Tag a node/memcg as congested if all the dirty pages were
> >    already marked for writeback and immediate reclaim (counted in
> >    nr.congested).
> > 
> > ??
> > 
> > > Patch seems to make sense to me, but I'm not an expert in this area.
> > 
> 
> Comments updated.
> 
> Diff on top looks like
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index e25b3686bfab..90764d646e02 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -34,13 +34,15 @@
>  
>  void page_writeback_init(void);
>  
> -void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page);
> +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
> +						int nr_throttled);
>  static inline void acct_reclaim_writeback(struct page *page)
>  {
>  	pg_data_t *pgdat = page_pgdat(page);
> +	int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled);
>  
> -	if (atomic_read(&pgdat->nr_reclaim_throttled))
> -		__acct_reclaim_writeback(pgdat, page);
> +	if (nr_throttled)
> +		__acct_reclaim_writeback(pgdat, page, nr_throttled);
>  }
>  
>  vm_fault_t do_swap_page(struct vm_fault *vmf);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b58ea0b13286..2dc17de91d32 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1034,10 +1034,10 @@ reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
>   * pages to clean. If enough pages have been cleaned since throttling
>   * started then wakeup the throttled tasks.
>   */
> -void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page)
> +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
> +							int nr_throttled)
>  {
>  	unsigned long nr_written;
> -	int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled);
>  
>  	__inc_node_page_state(page, NR_THROTTLED_WRITTEN);
>  	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
> @@ -3228,9 +3228,8 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	}
>  
>  	/*
> -	 * Tag a node/memcg as congested if all the dirty pages
> -	 * scanned were backed by a congested BDI and
> -	 * non-kswapd tasks will stall on reclaim_throttle.
> +	 * Tag a node/memcg as congested if all the dirty pages were marked
> +	 * for writeback and immediate reclaim (counted in nr.congested).
>  	 *
>  	 * Legacy memcg will stall in page writeback so avoid forcibly
>  	 * stalling in reclaim_throttle().
> @@ -3241,8 +3240,8 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  		set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
>  
>  	/*
> -	 * Stall direct reclaim for IO completions if underlying BDIs
> -	 * and node is congested. Allow kswapd to continue until it
> +	 * Stall direct reclaim for IO completions if the lruvec is
> +	 * node is congested. Allow kswapd to continue until it
>  	 * starts encountering unqueued dirty pages or cycling through
>  	 * the LRU too quickly.
>  	 */
> @@ -4427,7 +4426,7 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
>  
>  	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, highest_zoneidx, order,
>  				      gfp_flags);
> -	wake_up_interruptible(&pgdat->kswapd_wait);
> +	wake_up_all(&pgdat->kswapd_wait);

???

That isn't the wake_up that I pointed to.

Other changes look good - thanks.

NeilBrown


>  }
>  
>  #ifdef CONFIG_HIBERNATION
> 
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested
  2021-09-21 10:58     ` Mel Gorman
@ 2021-09-21 21:40       ` NeilBrown
  2021-09-22  6:04       ` Dave Chinner
  1 sibling, 0 replies; 33+ messages in thread
From: NeilBrown @ 2021-09-21 21:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML

On Tue, 21 Sep 2021, Mel Gorman wrote:
> On Tue, Sep 21, 2021 at 10:13:17AM +1000, NeilBrown wrote:
> > On Mon, 20 Sep 2021, Mel Gorman wrote:
> > > -long wait_iff_congested(int sync, long timeout)
> > > -{
> > > -	long ret;
> > > -	unsigned long start = jiffies;
> > > -	DEFINE_WAIT(wait);
> > > -	wait_queue_head_t *wqh = &congestion_wqh[sync];
> > > -
> > > -	/*
> > > -	 * If there is no congestion, yield if necessary instead
> > > -	 * of sleeping on the congestion queue
> > > -	 */
> > > -	if (atomic_read(&nr_wb_congested[sync]) == 0) {
> > > -		cond_resched();
> > > -
> > > -		/* In case we scheduled, work out time remaining */
> > > -		ret = timeout - (jiffies - start);
> > > -		if (ret < 0)
> > > -			ret = 0;
> > > -
> > > -		goto out;
> > > -	}
> > > -
> > > -	/* Sleep until uncongested or a write happens */
> > > -	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> > 
> > Uninterruptible wait.
> > 
> > ....
> > > +static void
> > > +reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> > > +							long timeout)
> > > +{
> > > +	wait_queue_head_t *wqh = &pgdat->reclaim_wait;
> > > +	unsigned long start = jiffies;
> > > +	long ret;
> > > +	DEFINE_WAIT(wait);
> > > +
> > > +	atomic_inc(&pgdat->nr_reclaim_throttled);
> > > +	WRITE_ONCE(pgdat->nr_reclaim_start,
> > > +		 node_page_state(pgdat, NR_THROTTLED_WRITTEN));
> > > +
> > > +	prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
> > 
> > Interruptible wait.
> > 
> > Why the change?  I think these waits really need to be TASK_UNINTERRUPTIBLE.
> > 
> 
> Because from mm/ context, I saw no reason why the task *should* be
> uninterruptible. It's waiting on other tasks to complete IO and it is not
> protecting device state, filesystem state or anything else. If it gets
> a signal, it's safe to wake up, particularly if that signal is KILL and
> the context is a direct reclaimer.

I disagree.  An Interruptible sleep only makes sense if the "was
interrupted" status can propagate up to user-space (or to some in-kernel
handler that will clear the signal).
In particular, if reclaim_throttle() is called in a loop (which it is),
and if that loop doesn't check for signal_pending (which it doesn't),
then the next time around the loop after receiving a signal, it won't
sleep at all.  That would be bad.

In general, if you don't return an error, then you probably shouldn't
sleep Interruptible.

I notice that tasks sleep on kswapd_wait as TASK_INTERRUPTIBLE, but they
don't have any signal handling.  I suspect this isn't actually a defect
because I suspect that it is not even possible to SIGKILL kswapd.  But
the code seems misleading.  I guess I should write a patch.

Unless reclaim knows to abort completely on a signal (__GFP_KILLABLE
???) this must be an UNINTERRUPTIBLE wait.
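
A minimal userspace sketch of the argument above, not kernel code. It
assumes a toy model in which an interruptible sleeper with a pending
signal returns at once, and a retry loop that never checks or clears the
signal. All toy_* names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not kernel code: one task with a sticky "signal pending" flag. */
struct toy_task {
	bool signal_pending;	/* stays set until something handles the signal */
	long jiffies_slept;	/* total time actually spent asleep */
};

/*
 * Model of schedule_timeout() after prepare_to_wait(): an interruptible
 * sleeper with a pending signal returns at once, any other sleeper waits
 * for the full timeout (early wakeups are not modelled).
 */
static void toy_throttle(struct toy_task *t, bool interruptible, long timeout)
{
	assert(timeout >= 0);
	if (interruptible && t->signal_pending)
		return;		/* woken immediately, nothing was waited for */
	t->jiffies_slept += timeout;
}

/*
 * Model of a reclaim retry loop that never checks for signals: it calls
 * the throttle once per retry and returns the total time slept.
 */
static long toy_retry_loop(bool interruptible, bool signalled, int retries,
			   long timeout)
{
	struct toy_task t = { .signal_pending = signalled, .jiffies_slept = 0 };

	for (int i = 0; i < retries; i++)
		toy_throttle(&t, interruptible, timeout);
	return t.jiffies_slept;
}
```

In this model a signalled task using the interruptible variant goes
around the loop five times without sleeping at all, while the
uninterruptible variant still throttles on every retry.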

Thanks,
NeilBrown

> 
> The original TASK_UNINTERRUPTIBLE is almost certainly a copy&paste from
> congestion_wait which may be called because a filesystem operation must
> complete before it can return to userspace so a signal waking it up is
> pointless.
> 
> -- 
> Mel Gorman
> SUSE Labs
> 
> 


* Re: [PATCH 3/5] mm/vmscan: Throttle reclaim when no progress is being made
  2021-09-21 11:16     ` Mel Gorman
@ 2021-09-21 21:46       ` NeilBrown
  2021-09-22  9:21         ` Mel Gorman
  0 siblings, 1 reply; 33+ messages in thread
From: NeilBrown @ 2021-09-21 21:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML

On Tue, 21 Sep 2021, Mel Gorman wrote:
> On Tue, Sep 21, 2021 at 09:31:30AM +1000, NeilBrown wrote:
> > On Mon, 20 Sep 2021, Mel Gorman wrote:
> > > +
> > > +		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
> > 
> > We always seem to pass "HZ/10" to reclaim_throttle().  Should we just
> > hard-code that in the one place inside reclaim_throttle() itself?
> > 
> 
> do_writepages passes in HZ/50. I'm not sure if these values even have
> any special meaning, I think it's more likely they were pulled out of
> the air based on the speed of some disk in the past and then copied.
> It's another reason why I want the wakeups to be based on events within
> the mm as much as possible.

Yes, I saw the HZ/50 shortly after writing that email :-)
I agree with your guess for the source of these numbers.  I still think
we should pull them all from the same piece of air.
Hopefully, once these changes are properly understood and the events
reliably come as expected, we can make it quite large (HZ?) with minimal
cost.

Thanks,
NeilBrown


> 
> -- 
> Mel Gorman
> SUSE Labs
> 
> 


* Re: [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested
  2021-09-21 10:58     ` Mel Gorman
  2021-09-21 21:40       ` NeilBrown
@ 2021-09-22  6:04       ` Dave Chinner
  2021-09-22  8:03         ` Mel Gorman
  1 sibling, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2021-09-22  6:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: NeilBrown, Linux-MM, Theodore Ts'o, Andreas Dilger,
	Darrick J . Wong, Matthew Wilcox, Michal Hocko, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML

On Tue, Sep 21, 2021 at 11:58:31AM +0100, Mel Gorman wrote:
> On Tue, Sep 21, 2021 at 10:13:17AM +1000, NeilBrown wrote:
> > On Mon, 20 Sep 2021, Mel Gorman wrote:
> > > -long wait_iff_congested(int sync, long timeout)
> > > -{
> > > -	long ret;
> > > -	unsigned long start = jiffies;
> > > -	DEFINE_WAIT(wait);
> > > -	wait_queue_head_t *wqh = &congestion_wqh[sync];
> > > -
> > > -	/*
> > > -	 * If there is no congestion, yield if necessary instead
> > > -	 * of sleeping on the congestion queue
> > > -	 */
> > > -	if (atomic_read(&nr_wb_congested[sync]) == 0) {
> > > -		cond_resched();
> > > -
> > > -		/* In case we scheduled, work out time remaining */
> > > -		ret = timeout - (jiffies - start);
> > > -		if (ret < 0)
> > > -			ret = 0;
> > > -
> > > -		goto out;
> > > -	}
> > > -
> > > -	/* Sleep until uncongested or a write happens */
> > > -	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> > 
> > Uninterruptible wait.
> > 
> > ....
> > > +static void
> > > +reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> > > +							long timeout)
> > > +{
> > > +	wait_queue_head_t *wqh = &pgdat->reclaim_wait;
> > > +	unsigned long start = jiffies;
> > > +	long ret;
> > > +	DEFINE_WAIT(wait);
> > > +
> > > +	atomic_inc(&pgdat->nr_reclaim_throttled);
> > > +	WRITE_ONCE(pgdat->nr_reclaim_start,
> > > +		 node_page_state(pgdat, NR_THROTTLED_WRITTEN));
> > > +
> > > +	prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
> > 
> > Interruptible wait.
> > 
> > Why the change?  I think these waits really need to be TASK_UNINTERRUPTIBLE.
> > 
> 
> Because from mm/ context, I saw no reason why the task *should* be
> uninterruptible. It's waiting on other tasks to complete IO and it is not
> protecting device state, filesystem state or anything else. If it gets
> a signal, it's safe to wake up, particularly if that signal is KILL and
> the context is a direct reclaimer.

I disagree. Whether the sleep should be interruptible or
not is entirely dependent on whether the caller can handle failure
or not. If this is GFP_NOFAIL, allocation must not fail no matter
what the context is, so signals and the like are irrelevant.

For a context that can handle allocation failure, then it makes
sense to wake on events that will result in the allocation failing
immediately. But if all this does is make the allocation code go
around another retry loop sooner, then an interruptible sleep still
doesn't make any sense at all here...

> The original TASK_UNINTERRUPTIBLE is almost certainly a copy&paste from
> congestion_wait which may be called because a filesystem operation must
> complete before it can return to userspace so a signal waking it up is
> pointless.

Yup, but AFAICT that same logic still applies. Only now it's
the allocation context that determines whether signal waking is
pointless or not...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested
  2021-09-22  6:04       ` Dave Chinner
@ 2021-09-22  8:03         ` Mel Gorman
  0 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2021-09-22  8:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: NeilBrown, Linux-MM, Theodore Ts'o, Andreas Dilger,
	Darrick J . Wong, Matthew Wilcox, Michal Hocko, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML

On Wed, Sep 22, 2021 at 04:04:47PM +1000, Dave Chinner wrote:
> On Tue, Sep 21, 2021 at 11:58:31AM +0100, Mel Gorman wrote:
> > On Tue, Sep 21, 2021 at 10:13:17AM +1000, NeilBrown wrote:
> > > On Mon, 20 Sep 2021, Mel Gorman wrote:
> > > > -long wait_iff_congested(int sync, long timeout)
> > > > -{
> > > > -	long ret;
> > > > -	unsigned long start = jiffies;
> > > > -	DEFINE_WAIT(wait);
> > > > -	wait_queue_head_t *wqh = &congestion_wqh[sync];
> > > > -
> > > > -	/*
> > > > -	 * If there is no congestion, yield if necessary instead
> > > > -	 * of sleeping on the congestion queue
> > > > -	 */
> > > > -	if (atomic_read(&nr_wb_congested[sync]) == 0) {
> > > > -		cond_resched();
> > > > -
> > > > -		/* In case we scheduled, work out time remaining */
> > > > -		ret = timeout - (jiffies - start);
> > > > -		if (ret < 0)
> > > > -			ret = 0;
> > > > -
> > > > -		goto out;
> > > > -	}
> > > > -
> > > > -	/* Sleep until uncongested or a write happens */
> > > > -	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> > > 
> > > Uninterruptible wait.
> > > 
> > > ....
> > > > +static void
> > > > +reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> > > > +							long timeout)
> > > > +{
> > > > +	wait_queue_head_t *wqh = &pgdat->reclaim_wait;
> > > > +	unsigned long start = jiffies;
> > > > +	long ret;
> > > > +	DEFINE_WAIT(wait);
> > > > +
> > > > +	atomic_inc(&pgdat->nr_reclaim_throttled);
> > > > +	WRITE_ONCE(pgdat->nr_reclaim_start,
> > > > +		 node_page_state(pgdat, NR_THROTTLED_WRITTEN));
> > > > +
> > > > +	prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
> > > 
> > > Interruptible wait.
> > > 
> > > Why the change?  I think these waits really need to be TASK_UNINTERRUPTIBLE.
> > > 
> > 
> > Because from mm/ context, I saw no reason why the task *should* be
> > uninterruptible. It's waiting on other tasks to complete IO and it is not
> > protecting device state, filesystem state or anything else. If it gets
> > a signal, it's safe to wake up, particularly if that signal is KILL and
> > the context is a direct reclaimer.
> 
> I disagree. Whether the sleep should be interruptible or
> not is entirely dependent on whether the caller can handle failure
> or not. If this is GFP_NOFAIL, allocation must not fail no matter
> what the context is, so signals and the like are irrelevant.
> 
> For a context that can handle allocation failure, then it makes
> sense to wake on events that will result in the allocation failing
> immediately. But if all this does is make the allocation code go
> around another retry loop sooner, then an interruptible sleep still
> doesn't make any sense at all here...
> 

Ok, between this and Neil's mail on the same topic, I'm convinced.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 2/5] mm/vmscan: Throttle reclaim and compaction when too may pages are isolated
  2021-09-21 18:45     ` Yang Shi
  (?)
@ 2021-09-22  8:11     ` Mel Gorman
  -1 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2021-09-22  8:11 UTC (permalink / raw)
  To: Yang Shi
  Cc: Linux-MM, NeilBrown, Theodore Ts'o, Andreas Dilger,
	Darrick J . Wong, Matthew Wilcox, Michal Hocko, Dave Chinner,
	Rik van Riel, Vlastimil Babka, Johannes Weiner, Jonathan Corbet,
	Linux-fsdevel, LKML

On Tue, Sep 21, 2021 at 11:45:19AM -0700, Yang Shi wrote:
> On Mon, Sep 20, 2021 at 1:55 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > Page reclaim throttles on congestion if too many parallel reclaim instances
> > have isolated too many pages. This makes no sense, excessive parallelisation
> > has nothing to do with writeback or congestion.
> >
> > This patch creates an additional workqueue to sleep on when too many
> > pages are isolated. The throttled tasks are woken when the number
> > of isolated pages is reduced or a timeout occurs. There may be
> > some false positive wakeups for GFP_NOIO/GFP_NOFS callers but
> > the tasks will throttle again if necessary.
> >
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> >  include/linux/mmzone.h        |  4 +++-
> >  include/trace/events/vmscan.h |  4 +++-
> >  mm/compaction.c               |  2 +-
> >  mm/internal.h                 |  2 ++
> >  mm/page_alloc.c               |  6 +++++-
> >  mm/vmscan.c                   | 22 ++++++++++++++++------
> >  6 files changed, 30 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index ef0a63ebd21d..ca65d6a64bdd 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -275,6 +275,8 @@ enum lru_list {
> >
> >  enum vmscan_throttle_state {
> >         VMSCAN_THROTTLE_WRITEBACK,
> > +       VMSCAN_THROTTLE_ISOLATED,
> > +       NR_VMSCAN_THROTTLE,
> >  };
> >
> >  #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
> > @@ -846,7 +848,7 @@ typedef struct pglist_data {
> >         int node_id;
> >         wait_queue_head_t kswapd_wait;
> >         wait_queue_head_t pfmemalloc_wait;
> > -       wait_queue_head_t reclaim_wait; /* wq for throttling reclaim */
> > +       wait_queue_head_t reclaim_wait[NR_VMSCAN_THROTTLE];
> >         atomic_t nr_reclaim_throttled;  /* nr of throtted tasks */
> >         unsigned long nr_reclaim_start; /* nr pages written while throttled
> >                                          * when throttling started. */
> > diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> > index c317f9fe0d17..d4905bd9e9c4 100644
> > --- a/include/trace/events/vmscan.h
> > +++ b/include/trace/events/vmscan.h
> > @@ -28,10 +28,12 @@
> >                 ) : "RECLAIM_WB_NONE"
> >
> >  #define _VMSCAN_THROTTLE_WRITEBACK     (1 << VMSCAN_THROTTLE_WRITEBACK)
> > +#define _VMSCAN_THROTTLE_ISOLATED      (1 << VMSCAN_THROTTLE_ISOLATED)
> >
> >  #define show_throttle_flags(flags)                                             \
> >         (flags) ? __print_flags(flags, "|",                                     \
> > -               {_VMSCAN_THROTTLE_WRITEBACK,    "VMSCAN_THROTTLE_WRITEBACK"}    \
> > +               {_VMSCAN_THROTTLE_WRITEBACK,    "VMSCAN_THROTTLE_WRITEBACK"},   \
> > +               {_VMSCAN_THROTTLE_ISOLATED,     "VMSCAN_THROTTLE_ISOLATED"}     \
> >                 ) : "VMSCAN_THROTTLE_NONE"
> >
> >
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index bfc93da1c2c7..221c9c10ad7e 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -822,7 +822,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >                 if (cc->mode == MIGRATE_ASYNC)
> >                         return -EAGAIN;
> >
> > -               congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +               reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
> 
> It seems waking up tasks is missed in compaction's
> too_many_isolated(). There are two too_many_isolated(), one is for
> compaction, the other is for reclaimer. I saw the waking up code was
> added to the reclaimer's in the below. Or the compaction one is left
> out intentionally?
> 

Compaction one was left out accidentally, I'll fix it. Thanks.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 3/5] mm/vmscan: Throttle reclaim when no progress is being made
  2021-09-21 21:46       ` NeilBrown
@ 2021-09-22  9:21         ` Mel Gorman
  0 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2021-09-22  9:21 UTC (permalink / raw)
  To: NeilBrown
  Cc: Linux-MM, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML

On Wed, Sep 22, 2021 at 07:46:58AM +1000, NeilBrown wrote:
> On Tue, 21 Sep 2021, Mel Gorman wrote:
> > On Tue, Sep 21, 2021 at 09:31:30AM +1000, NeilBrown wrote:
> > > On Mon, 20 Sep 2021, Mel Gorman wrote:
> > > > +
> > > > +		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
> > > 
> > > We always seem to pass "HZ/10" to reclaim_throttle().  Should we just
> > > hard-code that in the one place inside reclaim_throttle() itself?
> > > 
> > 
> > do_writepages passes in HZ/50. I'm not sure if these values even have
> > any special meaning, I think it's more likely they were pulled out of
> > the air based on the speed of some disk in the past and then copied.
> > It's another reason why I want the wakeups to be based on events within
> > the mm as much as possible.
> 
> Yes, I saw the HZ/50 shortly after writing that email :-)
> I agree with your guess for the source of these numbers.  I still think
> we should pull them all from the same piece of air.
> Hopefully, once these changes are properly understood and the events
> reliably come as expected, we can make it quite large (HZ?) with minimal
> cost.
> 

I'd prefer to do it as a separate patch. At some point congestion_wait
worked and the original timeouts may have been selected based on testing
(I severely doubt it but I'm trying to be optimistic). However, we can
at least centralise the decision based on "reason" with this

---8<---
From 11e5197c0c569e89145475afd511efe3ce61711c Mon Sep 17 00:00:00 2001
From: Mel Gorman <mgorman@techsingularity.net>
Date: Wed, 22 Sep 2021 10:16:33 +0100
Subject: [PATCH] mm/vmscan: Centralise timeout values for reclaim_throttle

Neil Brown raised concerns about callers of reclaim_throttle specifying
a timeout value. The original timeout values to congestion_wait() were
probably pulled out of thin air or copy&pasted from somewhere else.
This patch centralises the timeout values and selects a timeout based
on the reason for reclaim throttling. These figures are also pulled
out of the same thin air but better values may be derived.

Running a workload that is throttling for inappropriate periods
and tracing mm_vmscan_throttled can be used to pick a more appropriate
value. Excessive throttling would pick a lower timeout whereas
excessive CPU usage in reclaim context would select a larger timeout.
Ideally a large value would always be used and the wakeups would
occur before a timeout but that requires careful testing.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/compaction.c     |  2 +-
 mm/internal.h       |  3 +--
 mm/page-writeback.c |  2 +-
 mm/vmscan.c         | 39 +++++++++++++++++++++++++++++++--------
 4 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 7359093d8ac0..151b04c4dab3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -828,7 +828,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		if (cc->mode == MIGRATE_ASYNC)
 			return -EAGAIN;
 
-		reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED);
 
 		if (fatal_signal_pending(current))
 			return -EINTR;
diff --git a/mm/internal.h b/mm/internal.h
index 06d0c376efcd..f8d203cfd4e1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -129,8 +129,7 @@ extern unsigned long highest_memmap_pfn;
  */
 extern int isolate_lru_page(struct page *page);
 extern void putback_lru_page(struct page *page);
-extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
-								long timeout);
+extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);
 
 /*
  * in mm/rmap.c:
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f34f54fcd5b4..7d08706c541a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2374,7 +2374,7 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 		 * guess as any.
 		 */
 		reclaim_throttle(NODE_DATA(numa_node_id()),
-			VMSCAN_THROTTLE_WRITEBACK, HZ/50);
+						VMSCAN_THROTTLE_WRITEBACK);
 	}
 	/*
 	 * Usually few pages are written by now from those we've just submitted
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b0012f9536e1..36b21549a3a4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1006,14 +1006,37 @@ static void handle_write_error(struct address_space *mapping,
 	unlock_page(page);
 }
 
-void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
-							long timeout)
+void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 {
 	wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
 	unsigned long start = jiffies;
-	long ret;
+	long timeout, ret;
 	DEFINE_WAIT(wait);
 
+	/*
+	 * These figures are pulled out of thin air.
+	 * VMSCAN_THROTTLE_ISOLATED is a transient condition based on too many
+	 * parallel reclaimers which is a short-lived event so the timeout is
+	 * short. Failing to make progress or waiting on writeback are
+	 * potentially long-lived events so use a longer timeout. This is shaky
+	 * logic as a failure to make progress could be due to anything from
+	 * writeback to a slow device to excessive referenced pages at the tail
+	 * of the inactive LRU.
+	 */
+	switch(reason) {
+	case VMSCAN_THROTTLE_NOPROGRESS:
+	case VMSCAN_THROTTLE_WRITEBACK:
+		timeout = HZ/10;
+		break;
+	case VMSCAN_THROTTLE_ISOLATED:
+		timeout = HZ/50;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		timeout = HZ;
+		break;
+	}
+
 	atomic_inc(&pgdat->nr_reclaim_throttled);
 	WRITE_ONCE(pgdat->nr_reclaim_start,
 		 node_page_state(pgdat, NR_THROTTLED_WRITTEN));
@@ -2298,7 +2321,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 		/* wait a bit for the reclaimer. */
 		stalled = true;
-		reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED);
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
@@ -3230,7 +3253,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		 * until some pages complete writeback.
 		 */
 		if (sc->nr.immediate)
-			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
+			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
 	}
 
 	/*
@@ -3254,7 +3277,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	if (!current_is_kswapd() && current_may_throttle() &&
 	    !sc->hibernation_mode &&
 	    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
-		reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
 
 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 				    sc))
@@ -3326,7 +3349,7 @@ static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
 
 	/* Throttle if making no progress at high prioities. */
 	if (sc->priority < DEF_PRIORITY - 2)
-		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS);
 }
 
 /*
@@ -3795,7 +3818,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		z = first_zones_zonelist(zonelist, sc.reclaim_idx, sc.nodemask);
 		pgdat = zonelist_zone(z)->zone_pgdat;
 
-		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS);
 	}
 
 	return nr_reclaimed;
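
For reference, a userspace sketch of what the centralised timeouts work
out to in microseconds, the unit reported by the mm_vmscan_throttled
tracepoint. It mirrors the switch in the patch but is not kernel code:
TOY_HZ = 250 is an assumption (the kernel's HZ is a build-time option)
and every toy_* name is hypothetical.

```c
#include <assert.h>

#define TOY_HZ 250	/* assumption: 4 ms per jiffy */

enum toy_throttle_reason {
	TOY_THROTTLE_WRITEBACK,
	TOY_THROTTLE_ISOLATED,
	TOY_THROTTLE_NOPROGRESS,
};

/* Same mapping as the switch in the patch, in jiffies. */
static long toy_throttle_timeout(enum toy_throttle_reason reason)
{
	switch (reason) {
	case TOY_THROTTLE_NOPROGRESS:
	case TOY_THROTTLE_WRITEBACK:
		return TOY_HZ / 10;
	case TOY_THROTTLE_ISOLATED:
		return TOY_HZ / 50;
	default:
		return TOY_HZ;
	}
}

/* Userspace stand-in for jiffies_to_usecs() at the assumed tick rate. */
static long toy_jiffies_to_usecs(long j)
{
	assert(j >= 0);
	return j * (1000000L / TOY_HZ);
}
```

Under that assumption the writeback and no-progress cases time out after
25 jiffies (100000 usec) and the isolated case after 5 jiffies
(20000 usec).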


* Re: [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested
  2021-09-20  8:54 ` [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested Mel Gorman
  2021-09-20 23:19   ` NeilBrown
  2021-09-21  0:13   ` NeilBrown
@ 2021-09-22 12:16   ` Hillf Danton
  2021-09-22 14:13     ` Mel Gorman
  2 siblings, 1 reply; 33+ messages in thread
From: Hillf Danton @ 2021-09-22 12:16 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, NeilBrown, Dave Chinner, LKML

On Mon, 20 Sep 2021 09:54:32 +0100 Mel Gorman wrote:
> +static void
> +reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> +							long timeout)
> +{
> +	wait_queue_head_t *wqh = &pgdat->reclaim_wait;
> +	unsigned long start = jiffies;
> +	long ret;
> +	DEFINE_WAIT(wait);
> +
> +	atomic_inc(&pgdat->nr_reclaim_throttled);
> +	WRITE_ONCE(pgdat->nr_reclaim_start,
> +		 node_page_state(pgdat, NR_THROTTLED_WRITTEN));

Missing wakeup could happen if the current sleeper overwrites 
pgdat->nr_reclaim_start set by the existing sleeper.

	if (1 == atomic_inc_return(&pgdat->nr_reclaim_throttled))
		WRITE_ONCE(pgdat->nr_reclaim_start,
				node_page_state(pgdat, NR_THROTTLED_WRITTEN));

> +
> +	prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
> +	ret = schedule_timeout(timeout);
> +	finish_wait(wqh, &wait);
> +	atomic_dec(&pgdat->nr_reclaim_throttled);
> +
> +	trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
> +				jiffies_to_usecs(jiffies - start),
> +				reason);
> +}
> +
> +/*
> + * Account for pages written if tasks are throttled waiting on dirty
> + * pages to clean. If enough pages have been cleaned since throttling
> + * started then wakeup the throttled tasks.
> + */
> +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page)
> +{
> +	unsigned long nr_written;
> +	int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled);
> +
> +	__inc_node_page_state(page, NR_THROTTLED_WRITTEN);
> +	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
> +		READ_ONCE(pgdat->nr_reclaim_start);
> +
> +	if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
> +		wake_up_interruptible_all(&pgdat->reclaim_wait);
> +}
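
A rough userspace model of the race described above, under stated
assumptions: plain counters stand in for NR_THROTTLED_WRITTEN,
nr_reclaim_start and nr_reclaim_throttled, the wake-up test is the one
quoted from __acct_reclaim_writeback(), and TOY_SWAP_CLUSTER_MAX is
assumed to be 32. The toy_* names are hypothetical; this is a sketch of
the accounting only, not of the real wait queue.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SWAP_CLUSTER_MAX 32	/* assumed value for the sketch */

struct toy_node {
	unsigned long written;	/* models NR_THROTTLED_WRITTEN */
	unsigned long start;	/* models nr_reclaim_start */
	int nr_throttled;	/* models nr_reclaim_throttled */
};

/* A task starts throttling. first_only selects the suggested fix. */
static void toy_throttle_begin(struct toy_node *n, bool first_only)
{
	n->nr_throttled++;
	if (!first_only || n->nr_throttled == 1)
		n->start = n->written;
}

/* One page finishes writeback; returns true if sleepers would be woken. */
static bool toy_acct_writeback(struct toy_node *n)
{
	n->written++;
	return n->written - n->start >
	       (unsigned long)TOY_SWAP_CLUSTER_MAX * n->nr_throttled;
}

/*
 * Scenario: task A throttles, `before` pages are written, then task B
 * throttles. Returns how many further pages must be written before the
 * wake-up condition becomes true.
 */
static int toy_pages_until_wake(bool first_only, int before)
{
	struct toy_node n = { 0, 0, 0 };
	int pages = 0;

	assert(before >= 0 && before <= TOY_SWAP_CLUSTER_MAX);
	toy_throttle_begin(&n, first_only);
	for (int i = 0; i < before; i++)
		toy_acct_writeback(&n);
	toy_throttle_begin(&n, first_only);
	do {
		pages++;
	} while (!toy_acct_writeback(&n));
	return pages;
}
```

With 30 pages written before the second sleeper arrives, the
overwriting version discards that progress and needs 65 more pages
before waking anyone; recording the start only for the first sleeper
needs 35.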



* Re: [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested
  2021-09-22 12:16   ` Hillf Danton
@ 2021-09-22 14:13     ` Mel Gorman
  0 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2021-09-22 14:13 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Linux-MM, NeilBrown, Dave Chinner, LKML

On Wed, Sep 22, 2021 at 08:16:20PM +0800, Hillf Danton wrote:
> On Mon, 20 Sep 2021 09:54:32 +0100 Mel Gorman wrote:
> > +static void
> > +reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> > +							long timeout)
> > +{
> > +	wait_queue_head_t *wqh = &pgdat->reclaim_wait;
> > +	unsigned long start = jiffies;
> > +	long ret;
> > +	DEFINE_WAIT(wait);
> > +
> > +	atomic_inc(&pgdat->nr_reclaim_throttled);
> > +	WRITE_ONCE(pgdat->nr_reclaim_start,
> > +		 node_page_state(pgdat, NR_THROTTLED_WRITTEN));
> 
> Missing wakeup could happen if the current sleeper overwrites 
> pgdat->nr_reclaim_start set by the existing sleeper.
> 
> 	if (1 == atomic_inc_return(&pgdat->nr_reclaim_throttled))
> 		WRITE_ONCE(pgdat->nr_reclaim_start,
> 				node_page_state(pgdat, NR_THROTTLED_WRITTEN));
> 

Good spot, will fix.

-- 
Mel Gorman
SUSE Labs


* Re: [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/
  2021-09-21 20:46 ` Dave Chinner
@ 2021-09-22 17:52   ` Mel Gorman
  0 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2021-09-22 17:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linux-MM, NeilBrown, Theodore Ts'o, Andreas Dilger,
	Darrick J . Wong, Matthew Wilcox, Michal Hocko, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML

On Wed, Sep 22, 2021 at 06:46:21AM +1000, Dave Chinner wrote:
> On Mon, Sep 20, 2021 at 09:54:31AM +0100, Mel Gorman wrote:
> > Cc list similar to "congestion_wait() and GFP_NOFAIL" as they're loosely
> > related.
> > 
> > This is a prototype series that removes all calls to congestion_wait
> > in mm/ and deletes wait_iff_congested. It's not a clever
> > implementation but congestion_wait has been broken for a long time
> > (https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/).
> > Even if it worked, it was never a great idea. While excessive
> > dirty/writeback pages at the tail of the LRU is one possibility that
> > reclaim may be slow, there is also the problem of too many pages being
> > isolated and reclaim failing for other reasons (elevated references,
> > too many pages isolated, excessive LRU contention etc).
> > 
> > This series replaces the reclaim conditions with event driven ones
> > 
> > o If there are too many dirty/writeback pages, sleep until a timeout
> >   or enough pages get cleaned
> > o If too many pages are isolated, sleep until enough isolated pages
> >   are either reclaimed or put back on the LRU
> > o If no progress is being made, let direct reclaim tasks sleep until
> >   another task makes progress
> > 
> > This has been lightly tested only and the testing was useless as the
> > relevant code was not executed. The workload configurations I had that
> > used to trigger these corner cases no longer work (yey?) and I'll need
> > to implement a new synthetic workload. If someone is aware of a realistic
> > workload that forces reclaim activity to the point where reclaim stalls
> > then kindly share the details.
> 
> Got a git tree pointer so I can pull it into a test kernel so I can
> see what impact it has on behaviour before I try to make sense of
> the code?
> 

The current version I'm testing is at

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaimcongest-v2r5

Only one test has completed and I won't be able to analyse the results
in detail for a few days but it's doing *something* for the workload that
is hammering reclaim

                  5.15.0-rc1              5.15.0-rc1
                     vanilla  mm-reclaimcongest-v2r5
Duration User       10891.30                 9945.59
Duration System      5673.78                 2649.43
Duration Elapsed     2402.85                 2407.96

System CPU usage dropped by a lot. The workload runs for a fixed
duration so a difference in elapsed time is not interesting

Ops Direct pages scanned           518791317.00   219956338.00
Ops Kswapd pages scanned           128555233.00   165439373.00
Ops Kswapd pages reclaimed          87830801.00    72216420.00
Ops Direct pages reclaimed          16114049.00    10408389.00
Ops Kswapd efficiency %                   68.32          43.65
Ops Kswapd velocity                    53501.15       68705.20
Ops Direct efficiency %                    3.11           4.73
Ops Direct velocity                   215906.66       91345.5
Ops Percentage direct scans               80.14          57.07
Ops Page writes by reclaim           4225921.00     2032865.00

Large reductions in direct pages scanned. The rate kswapd scans
(velocity) is in the same ballpark, up slightly probably because direct
reclaimers were throttled, whereas direct velocity is down (presumably
because it's getting throttled). Pages written from reclaim context are
about halved. Reclaim efficiency is low but that's expected given the
workload is basically trying to make it as hard as possible for reclaim
to make progress.
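
The derived rows in the table can be cross-checked with a small
userspace sketch. The formulas below are an assumption about how the
figures were computed (efficiency as reclaimed over scanned, velocity as
pages scanned per second of elapsed time, direct share as direct scans
over all scans); they do reproduce the reported values to two decimal
places. The toy_* names are hypothetical.

```c
#include <assert.h>

/* Percentage of scanned pages that were reclaimed. */
static double toy_efficiency(double reclaimed, double scanned)
{
	assert(scanned > 0);
	return 100.0 * reclaimed / scanned;
}

/* Pages scanned per second of elapsed time. */
static double toy_velocity(double scanned, double elapsed_secs)
{
	assert(elapsed_secs > 0);
	return scanned / elapsed_secs;
}

/* Share of all scanning done by direct reclaim, as a percentage. */
static double toy_direct_share(double direct_scanned, double kswapd_scanned)
{
	assert(direct_scanned + kswapd_scanned > 0);
	return 100.0 * direct_scanned / (direct_scanned + kswapd_scanned);
}

/* True if a and b differ by less than tol. */
static int toy_close(double a, double b, double tol)
{
	double d = a - b;

	return (d < 0 ? -d : d) < tol;
}
```

For example, the vanilla kswapd efficiency is 87830801 / 128555233,
about 68.32%, and the vanilla direct share is
518791317 / (518791317 + 128555233), about 80.14%.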

Kswapd is only getting throttled on writeback and is being woken before
the timeout of 100000 usec

      1 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
      2 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
      6 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
     12 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
     17 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
    129 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
    205 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK

The numbers of throttle events for direct reclaimers were

  16909 reason=VMSCAN_THROTTLE_ISOLATED
  77844 reason=VMSCAN_THROTTLE_NOPROGRESS
 113415 reason=VMSCAN_THROTTLE_WRITEBACK

For the throttle events, 33% of them were NOPROGRESS hitting the full
timeout and 33% were WRITEBACK hitting the full timeout. If anything,
that would suggest increasing the max timeout as presumably they woke up
uselessly like Neil had suggested.

-- 
Mel Gorman
SUSE Labs


* [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested
  2021-09-29 10:09 [PATCH 0/5] Remove dependency on congestion_wait in mm/ v2 Mel Gorman
@ 2021-09-29 10:09 ` Mel Gorman
  0 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2021-09-29 10:09 UTC (permalink / raw)
  To: Linux-MM
  Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
	Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-fsdevel,
	LKML, Mel Gorman

Page reclaim throttles on wait_iff_congested under the following conditions

o kswapd is encountering pages under writeback and marked for immediate
  reclaim implying that pages are cycling through the LRU faster than
  pages can be cleaned.

o Direct reclaim will stall if all dirty pages are backed by congested
  inodes.

wait_iff_congested is almost completely broken with few exceptions. This
patch adds a new node-based waitqueue and tracks the number of throttled
tasks and pages written back since throttling started. If enough pages
belonging to the node are written back then the throttled tasks will wake
early. If not, the throttled tasks sleep until the timeout expires.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/backing-dev.h      |  1 -
 include/linux/mmzone.h           |  9 +++++
 include/trace/events/vmscan.h    | 34 +++++++++++++++++++
 include/trace/events/writeback.h |  7 ----
 mm/backing-dev.c                 | 48 --------------------------
 mm/filemap.c                     |  1 +
 mm/internal.h                    |  9 +++++
 mm/page_alloc.c                  |  1 +
 mm/vmscan.c                      | 58 +++++++++++++++++++++++++++-----
 mm/vmstat.c                      |  1 +
 10 files changed, 105 insertions(+), 64 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index ac7f231b8825..9fb1f0ae273c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -154,7 +154,6 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
 }
 
 long congestion_wait(int sync, long timeout);
-long wait_iff_congested(int sync, long timeout);
 
 static inline bool mapping_can_writeback(struct address_space *mapping)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6a1d79d84675..ef0a63ebd21d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -199,6 +199,7 @@ enum node_stat_item {
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
+	NR_THROTTLED_WRITTEN,	/* NR_WRITTEN while reclaim throttled */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
 	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
 	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
@@ -272,6 +273,10 @@ enum lru_list {
 	NR_LRU_LISTS
 };
 
+enum vmscan_throttle_state {
+	VMSCAN_THROTTLE_WRITEBACK,
+};
+
 #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
 
 #define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
@@ -841,6 +846,10 @@ typedef struct pglist_data {
 	int node_id;
 	wait_queue_head_t kswapd_wait;
 	wait_queue_head_t pfmemalloc_wait;
+	wait_queue_head_t reclaim_wait;	/* wq for throttling reclaim */
+	atomic_t nr_reclaim_throttled;	/* nr of throttled tasks */
+	unsigned long nr_reclaim_start;	/* nr pages written while throttled
+					 * when throttling started. */
 	struct task_struct *kswapd;	/* Protected by
 					   mem_hotplug_begin/end() */
 	int kswapd_order;
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 88faf2400ec2..c317f9fe0d17 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -27,6 +27,14 @@
 		{RECLAIM_WB_ASYNC,	"RECLAIM_WB_ASYNC"}	\
 		) : "RECLAIM_WB_NONE"
 
+#define _VMSCAN_THROTTLE_WRITEBACK	(1 << VMSCAN_THROTTLE_WRITEBACK)
+
+#define show_throttle_flags(flags)						\
+	(flags) ? __print_flags(flags, "|",					\
+		{_VMSCAN_THROTTLE_WRITEBACK,	"VMSCAN_THROTTLE_WRITEBACK"}	\
+		) : "VMSCAN_THROTTLE_NONE"
+
+
 #define trace_reclaim_flags(file) ( \
 	(file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
 	(RECLAIM_WB_ASYNC) \
@@ -454,6 +462,32 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_node_reclaim_end,
 	TP_ARGS(nr_reclaimed)
 );
 
+TRACE_EVENT(mm_vmscan_throttled,
+
+	TP_PROTO(int nid, int usec_timeout, int usec_delayed, int reason),
+
+	TP_ARGS(nid, usec_timeout, usec_delayed, reason),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(int, usec_timeout)
+		__field(int, usec_delayed)
+		__field(int, reason)
+	),
+
+	TP_fast_assign(
+		__entry->nid = nid;
+		__entry->usec_timeout = usec_timeout;
+		__entry->usec_delayed = usec_delayed;
+		__entry->reason = 1U << reason;
+	),
+
+	TP_printk("nid=%d usec_timeout=%d usect_delayed=%d reason=%s",
+		__entry->nid,
+		__entry->usec_timeout,
+		__entry->usec_delayed,
+		show_throttle_flags(__entry->reason))
+);
 #endif /* _TRACE_VMSCAN_H */
 
 /* This part must be outside protection */
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 840d1ba84cf5..3bc759b81897 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -763,13 +763,6 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
 	TP_ARGS(usec_timeout, usec_delayed)
 );
 
-DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
-
-	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
-
-	TP_ARGS(usec_timeout, usec_delayed)
-);
-
 DECLARE_EVENT_CLASS(writeback_single_inode_template,
 
 	TP_PROTO(struct inode *inode,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 4a9d4e27d0d9..0ea1a105eae5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1041,51 +1041,3 @@ long congestion_wait(int sync, long timeout)
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait);
-
-/**
- * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
- * @sync: SYNC or ASYNC IO
- * @timeout: timeout in jiffies
- *
- * In the event of a congested backing_dev (any backing_dev) this waits
- * for up to @timeout jiffies for either a BDI to exit congestion of the
- * given @sync queue or a write to complete.
- *
- * The return value is 0 if the sleep is for the full timeout. Otherwise,
- * it is the number of jiffies that were still remaining when the function
- * returned. return_value == timeout implies the function did not sleep.
- */
-long wait_iff_congested(int sync, long timeout)
-{
-	long ret;
-	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
-
-	/*
-	 * If there is no congestion, yield if necessary instead
-	 * of sleeping on the congestion queue
-	 */
-	if (atomic_read(&nr_wb_congested[sync]) == 0) {
-		cond_resched();
-
-		/* In case we scheduled, work out time remaining */
-		ret = timeout - (jiffies - start);
-		if (ret < 0)
-			ret = 0;
-
-		goto out;
-	}
-
-	/* Sleep until uncongested or a write happens */
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
-
-out:
-	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
-					jiffies_to_usecs(jiffies - start));
-
-	return ret;
-}
-EXPORT_SYMBOL(wait_iff_congested);
diff --git a/mm/filemap.c b/mm/filemap.c
index dae481293b5d..59187787fbfc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1605,6 +1605,7 @@ void end_page_writeback(struct page *page)
 
 	smp_mb__after_atomic();
 	wake_up_page(page, PG_writeback);
+	acct_reclaim_writeback(page);
 	put_page(page);
 }
 EXPORT_SYMBOL(end_page_writeback);
diff --git a/mm/internal.h b/mm/internal.h
index cf3cb933eba3..e25b3686bfab 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -34,6 +34,15 @@
 
 void page_writeback_init(void);
 
+void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page);
+static inline void acct_reclaim_writeback(struct page *page)
+{
+	pg_data_t *pgdat = page_pgdat(page);
+
+	if (atomic_read(&pgdat->nr_reclaim_throttled))
+		__acct_reclaim_writeback(pgdat, page);
+}
+
 vm_fault_t do_swap_page(struct vm_fault *vmf);
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c274cf..d849ddfc1e51 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7396,6 +7396,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
+	init_waitqueue_head(&pgdat->reclaim_wait);
 
 	pgdat_page_ext_init(pgdat);
 	lruvec_init(&pgdat->__lruvec);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 74296c2d1fed..b58ea0b13286 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1006,6 +1006,47 @@ static void handle_write_error(struct address_space *mapping,
 	unlock_page(page);
 }
 
+static void
+reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
+							long timeout)
+{
+	wait_queue_head_t *wqh = &pgdat->reclaim_wait;
+	unsigned long start = jiffies;
+	long ret;
+	DEFINE_WAIT(wait);
+
+	atomic_inc(&pgdat->nr_reclaim_throttled);
+	WRITE_ONCE(pgdat->nr_reclaim_start,
+		 node_page_state(pgdat, NR_THROTTLED_WRITTEN));
+
+	prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
+	ret = schedule_timeout(timeout);
+	finish_wait(wqh, &wait);
+	atomic_dec(&pgdat->nr_reclaim_throttled);
+
+	trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
+				jiffies_to_usecs(jiffies - start),
+				reason);
+}
+
+/*
+ * Account for pages written if tasks are throttled waiting on dirty
+ * pages to clean. If enough pages have been cleaned since throttling
+ * started then wakeup the throttled tasks.
+ */
+void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page)
+{
+	unsigned long nr_written;
+	int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled);
+
+	__inc_node_page_state(page, NR_THROTTLED_WRITTEN);
+	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
+		READ_ONCE(pgdat->nr_reclaim_start);
+
+	if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
+		wake_up_interruptible_all(&pgdat->reclaim_wait);
+}
+
 /* possible outcome of pageout() */
 typedef enum {
 	/* failed to write page out, page is locked */
@@ -1412,9 +1453,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 
 		/*
 		 * The number of dirty pages determines if a node is marked
-		 * reclaim_congested which affects wait_iff_congested. kswapd
-		 * will stall and start writing pages if the tail of the LRU
-		 * is all dirty unqueued pages.
+		 * reclaim_congested. kswapd will stall and start writing
+		 * pages if the tail of the LRU is all dirty unqueued pages.
 		 */
 		page_check_dirty_writeback(page, &dirty, &writeback);
 		if (dirty || writeback)
@@ -3180,19 +3220,20 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		 * If kswapd scans pages marked for immediate
 		 * reclaim and under writeback (nr_immediate), it
 		 * implies that pages are cycling through the LRU
-		 * faster than they are written so also forcibly stall.
+		 * faster than they are written so forcibly stall
+		 * until some pages complete writeback.
 		 */
 		if (sc->nr.immediate)
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
 	}
 
 	/*
 	 * Tag a node/memcg as congested if all the dirty pages
 	 * scanned were backed by a congested BDI and
-	 * wait_iff_congested will stall.
+	 * non-kswapd tasks will stall on reclaim_throttle.
 	 *
 	 * Legacy memcg will stall in page writeback so avoid forcibly
-	 * stalling in wait_iff_congested().
+	 * stalling in reclaim_throttle().
 	 */
 	if ((current_is_kswapd() ||
 	     (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
@@ -3208,7 +3249,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	if (!current_is_kswapd() && current_may_throttle() &&
 	    !sc->hibernation_mode &&
 	    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
-		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
 
 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 				    sc))
@@ -4286,6 +4327,7 @@ static int kswapd(void *p)
 
 	WRITE_ONCE(pgdat->kswapd_order, 0);
 	WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES);
+	atomic_set(&pgdat->nr_reclaim_throttled, 0);
 	for ( ; ; ) {
 		bool ret;
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8ce2620344b2..9b2bc9d61d4b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1225,6 +1225,7 @@ const char * const vmstat_text[] = {
 	"nr_vmscan_immediate_reclaim",
 	"nr_dirtied",
 	"nr_written",
+	"nr_throttled_written",
 	"nr_kernel_misc_reclaimable",
 	"nr_foll_pin_acquired",
 	"nr_foll_pin_released",
-- 
2.31.1


end of thread, other threads:[~2021-09-29 10:09 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-20  8:54 [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/ Mel Gorman
2021-09-20  8:54 ` [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested Mel Gorman
2021-09-20 23:19   ` NeilBrown
2021-09-21 11:12     ` Mel Gorman
2021-09-21 21:27       ` NeilBrown
2021-09-21  0:13   ` NeilBrown
2021-09-21 10:58     ` Mel Gorman
2021-09-21 21:40       ` NeilBrown
2021-09-22  6:04       ` Dave Chinner
2021-09-22  8:03         ` Mel Gorman
2021-09-22 12:16   ` Hillf Danton
2021-09-22 14:13     ` Mel Gorman
2021-09-20  8:54 ` [PATCH 2/5] mm/vmscan: Throttle reclaim and compaction when too may pages are isolated Mel Gorman
2021-09-20 23:27   ` NeilBrown
2021-09-21 11:03     ` Mel Gorman
2021-09-21 18:45   ` Yang Shi
2021-09-21 18:45     ` Yang Shi
2021-09-22  8:11     ` Mel Gorman
2021-09-20  8:54 ` [PATCH 3/5] mm/vmscan: Throttle reclaim when no progress is being made Mel Gorman
2021-09-20 23:31   ` NeilBrown
2021-09-21 11:16     ` Mel Gorman
2021-09-21 21:46       ` NeilBrown
2021-09-22  9:21         ` Mel Gorman
2021-09-20  8:54 ` [PATCH 4/5] mm/writeback: Throttle based on page writeback instead of congestion Mel Gorman
2021-09-20  8:54 ` [PATCH 5/5] mm/page_alloc: Remove the throttling logic from the page allocator Mel Gorman
2021-09-20 11:42 ` [RFC PATCH 0/5] Remove dependency on congestion_wait in mm/ Matthew Wilcox
2021-09-20 12:50   ` Mel Gorman
2021-09-20 14:11     ` David Sterba
2021-09-21 11:18       ` Mel Gorman
2021-09-20 19:51   ` Mel Gorman
2021-09-21 20:46 ` Dave Chinner
2021-09-22 17:52   ` Mel Gorman
2021-09-29 10:09 [PATCH 0/5] Remove dependency on congestion_wait in mm/ v2 Mel Gorman
2021-09-29 10:09 ` [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested Mel Gorman
