* [PATCH 1/8] mm/vmscan: Throttle reclaim until some writeback completes if congested
2021-10-19 9:01 [PATCH v4 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
@ 2021-10-19 9:01 ` Mel Gorman
2021-10-19 9:01 ` [PATCH 2/8] mm/vmscan: Throttle reclaim and compaction when too many pages are isolated Mel Gorman
` (8 subsequent siblings)
9 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2021-10-19 9:01 UTC (permalink / raw)
To: Andrew Morton
Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
Linux-fsdevel, LKML, Mel Gorman
Page reclaim throttles on wait_iff_congested under the following conditions:
o kswapd is encountering pages under writeback and marked for immediate
  reclaim, implying that pages are cycling through the LRU faster than
  pages can be cleaned.
o Direct reclaim will stall if all dirty pages are backed by congested
  inodes.
wait_iff_congested is almost completely broken, with few exceptions. This
patch adds a new node-based waitqueue and tracks the number of throttled
tasks and pages written back since throttling started. If enough pages
belonging to the node are written back then the throttled tasks will wake
early. If not, the throttled tasks sleep until the timeout expires.
[neilb@suse.de: Uninterruptible sleep and simpler wakeups]
[hdanton@sina.com: Avoid race when reclaim starts]
[vbabka@suse.cz: vmstat irq-safe api, clarifications]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
include/linux/backing-dev.h | 1 -
include/linux/mmzone.h | 9 ++++
include/trace/events/vmscan.h | 34 +++++++++++++
include/trace/events/writeback.h | 7 ---
mm/backing-dev.c | 48 -------------------
mm/filemap.c | 1 +
mm/internal.h | 11 +++++
mm/page_alloc.c | 1 +
mm/vmscan.c | 82 +++++++++++++++++++++++++++-----
mm/vmstat.c | 1 +
10 files changed, 127 insertions(+), 68 deletions(-)
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index ac7f231b8825..9fb1f0ae273c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -154,7 +154,6 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
}
long congestion_wait(int sync, long timeout);
-long wait_iff_congested(int sync, long timeout);
static inline bool mapping_can_writeback(struct address_space *mapping)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6a1d79d84675..ef0a63ebd21d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -199,6 +199,7 @@ enum node_stat_item {
NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */
NR_DIRTIED, /* page dirtyings since bootup */
NR_WRITTEN, /* page writings since bootup */
+ NR_THROTTLED_WRITTEN, /* NR_WRITTEN while reclaim throttled */
NR_KERNEL_MISC_RECLAIMABLE, /* reclaimable non-slab kernel pages */
NR_FOLL_PIN_ACQUIRED, /* via: pin_user_page(), gup flag: FOLL_PIN */
NR_FOLL_PIN_RELEASED, /* pages returned via unpin_user_page() */
@@ -272,6 +273,10 @@ enum lru_list {
NR_LRU_LISTS
};
+enum vmscan_throttle_state {
+ VMSCAN_THROTTLE_WRITEBACK,
+};
+
#define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
@@ -841,6 +846,10 @@ typedef struct pglist_data {
int node_id;
wait_queue_head_t kswapd_wait;
wait_queue_head_t pfmemalloc_wait;
+ wait_queue_head_t reclaim_wait; /* wq for throttling reclaim */
+ atomic_t nr_reclaim_throttled; /* nr of throtted tasks */
+ unsigned long nr_reclaim_start; /* nr pages written while throttled
+ * when throttling started. */
struct task_struct *kswapd; /* Protected by
mem_hotplug_begin/end() */
int kswapd_order;
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 88faf2400ec2..c317f9fe0d17 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -27,6 +27,14 @@
{RECLAIM_WB_ASYNC, "RECLAIM_WB_ASYNC"} \
) : "RECLAIM_WB_NONE"
+#define _VMSCAN_THROTTLE_WRITEBACK (1 << VMSCAN_THROTTLE_WRITEBACK)
+
+#define show_throttle_flags(flags) \
+ (flags) ? __print_flags(flags, "|", \
+ {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"} \
+ ) : "VMSCAN_THROTTLE_NONE"
+
+
#define trace_reclaim_flags(file) ( \
(file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
(RECLAIM_WB_ASYNC) \
@@ -454,6 +462,32 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_node_reclaim_end,
TP_ARGS(nr_reclaimed)
);
+TRACE_EVENT(mm_vmscan_throttled,
+
+ TP_PROTO(int nid, int usec_timeout, int usec_delayed, int reason),
+
+ TP_ARGS(nid, usec_timeout, usec_delayed, reason),
+
+ TP_STRUCT__entry(
+ __field(int, nid)
+ __field(int, usec_timeout)
+ __field(int, usec_delayed)
+ __field(int, reason)
+ ),
+
+ TP_fast_assign(
+ __entry->nid = nid;
+ __entry->usec_timeout = usec_timeout;
+ __entry->usec_delayed = usec_delayed;
+ __entry->reason = 1U << reason;
+ ),
+
+ TP_printk("nid=%d usec_timeout=%d usect_delayed=%d reason=%s",
+ __entry->nid,
+ __entry->usec_timeout,
+ __entry->usec_delayed,
+ show_throttle_flags(__entry->reason))
+);
#endif /* _TRACE_VMSCAN_H */
/* This part must be outside protection */
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 840d1ba84cf5..3bc759b81897 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -763,13 +763,6 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
TP_ARGS(usec_timeout, usec_delayed)
);
-DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
-
- TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
-
- TP_ARGS(usec_timeout, usec_delayed)
-);
-
DECLARE_EVENT_CLASS(writeback_single_inode_template,
TP_PROTO(struct inode *inode,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 4a9d4e27d0d9..0ea1a105eae5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1041,51 +1041,3 @@ long congestion_wait(int sync, long timeout)
return ret;
}
EXPORT_SYMBOL(congestion_wait);
-
-/**
- * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
- * @sync: SYNC or ASYNC IO
- * @timeout: timeout in jiffies
- *
- * In the event of a congested backing_dev (any backing_dev) this waits
- * for up to @timeout jiffies for either a BDI to exit congestion of the
- * given @sync queue or a write to complete.
- *
- * The return value is 0 if the sleep is for the full timeout. Otherwise,
- * it is the number of jiffies that were still remaining when the function
- * returned. return_value == timeout implies the function did not sleep.
- */
-long wait_iff_congested(int sync, long timeout)
-{
- long ret;
- unsigned long start = jiffies;
- DEFINE_WAIT(wait);
- wait_queue_head_t *wqh = &congestion_wqh[sync];
-
- /*
- * If there is no congestion, yield if necessary instead
- * of sleeping on the congestion queue
- */
- if (atomic_read(&nr_wb_congested[sync]) == 0) {
- cond_resched();
-
- /* In case we scheduled, work out time remaining */
- ret = timeout - (jiffies - start);
- if (ret < 0)
- ret = 0;
-
- goto out;
- }
-
- /* Sleep until uncongested or a write happens */
- prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
- ret = io_schedule_timeout(timeout);
- finish_wait(wqh, &wait);
-
-out:
- trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
- jiffies_to_usecs(jiffies - start));
-
- return ret;
-}
-EXPORT_SYMBOL(wait_iff_congested);
diff --git a/mm/filemap.c b/mm/filemap.c
index dae481293b5d..59187787fbfc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1605,6 +1605,7 @@ void end_page_writeback(struct page *page)
smp_mb__after_atomic();
wake_up_page(page, PG_writeback);
+ acct_reclaim_writeback(page);
put_page(page);
}
EXPORT_SYMBOL(end_page_writeback);
diff --git a/mm/internal.h b/mm/internal.h
index cf3cb933eba3..90764d646e02 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -34,6 +34,17 @@
void page_writeback_init(void);
+void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
+ int nr_throttled);
+static inline void acct_reclaim_writeback(struct page *page)
+{
+ pg_data_t *pgdat = page_pgdat(page);
+ int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled);
+
+ if (nr_throttled)
+ __acct_reclaim_writeback(pgdat, page, nr_throttled);
+}
+
vm_fault_t do_swap_page(struct vm_fault *vmf);
void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c274cf..d849ddfc1e51 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7396,6 +7396,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
init_waitqueue_head(&pgdat->kswapd_wait);
init_waitqueue_head(&pgdat->pfmemalloc_wait);
+ init_waitqueue_head(&pgdat->reclaim_wait);
pgdat_page_ext_init(pgdat);
lruvec_init(&pgdat->__lruvec);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 74296c2d1fed..735b1f2b5d9e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1006,6 +1006,64 @@ static void handle_write_error(struct address_space *mapping,
unlock_page(page);
}
+static void
+reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
+ long timeout)
+{
+ wait_queue_head_t *wqh = &pgdat->reclaim_wait;
+ long ret;
+ DEFINE_WAIT(wait);
+
+ /*
+ * Do not throttle IO workers, kthreads other than kswapd or
+ * workqueues. They may be required for reclaim to make
+ * forward progress (e.g. journalling workqueues or kthreads).
+ */
+ if (!current_is_kswapd() &&
+ current->flags & (PF_IO_WORKER|PF_KTHREAD))
+ return;
+
+ if (atomic_inc_return(&pgdat->nr_reclaim_throttled) == 1) {
+ WRITE_ONCE(pgdat->nr_reclaim_start,
+ node_page_state(pgdat, NR_THROTTLED_WRITTEN));
+ }
+
+ prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+ ret = schedule_timeout(timeout);
+ finish_wait(wqh, &wait);
+ atomic_dec(&pgdat->nr_reclaim_throttled);
+
+ trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
+ jiffies_to_usecs(timeout - ret),
+ reason);
+}
+
+/*
+ * Account for pages written if tasks are throttled waiting on dirty
+ * pages to clean. If enough pages have been cleaned since throttling
+ * started then wakeup the throttled tasks.
+ */
+void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
+ int nr_throttled)
+{
+ unsigned long nr_written;
+
+ inc_node_page_state(page, NR_THROTTLED_WRITTEN);
+
+ /*
+ * This is an inaccurate read as the per-cpu deltas may not
+ * be synchronised. However, given that the system is
+ * writeback throttled, it is not worth taking the penalty
+ * of getting an accurate count. At worst, the throttle
+ * timeout guarantees forward progress.
+ */
+ nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
+ READ_ONCE(pgdat->nr_reclaim_start);
+
+ if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
+ wake_up_all(&pgdat->reclaim_wait);
+}
+
/* possible outcome of pageout() */
typedef enum {
/* failed to write page out, page is locked */
@@ -1412,9 +1470,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
/*
* The number of dirty pages determines if a node is marked
- * reclaim_congested which affects wait_iff_congested. kswapd
- * will stall and start writing pages if the tail of the LRU
- * is all dirty unqueued pages.
+ * reclaim_congested. kswapd will stall and start writing
+ * pages if the tail of the LRU is all dirty unqueued pages.
*/
page_check_dirty_writeback(page, &dirty, &writeback);
if (dirty || writeback)
@@ -3180,19 +3237,19 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
* If kswapd scans pages marked for immediate
* reclaim and under writeback (nr_immediate), it
* implies that pages are cycling through the LRU
- * faster than they are written so also forcibly stall.
+ * faster than they are written so forcibly stall
+ * until some pages complete writeback.
*/
if (sc->nr.immediate)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
}
/*
- * Tag a node/memcg as congested if all the dirty pages
- * scanned were backed by a congested BDI and
- * wait_iff_congested will stall.
+ * Tag a node/memcg as congested if all the dirty pages were marked
+ * for writeback and immediate reclaim (counted in nr.congested).
*
* Legacy memcg will stall in page writeback so avoid forcibly
- * stalling in wait_iff_congested().
+ * stalling in reclaim_throttle().
*/
if ((current_is_kswapd() ||
(cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
@@ -3200,15 +3257,15 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
/*
- * Stall direct reclaim for IO completions if underlying BDIs
- * and node is congested. Allow kswapd to continue until it
+ * Stall direct reclaim for IO completions if the lruvec is
+ * node is congested. Allow kswapd to continue until it
* starts encountering unqueued dirty pages or cycling through
* the LRU too quickly.
*/
if (!current_is_kswapd() && current_may_throttle() &&
!sc->hibernation_mode &&
test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
- wait_iff_congested(BLK_RW_ASYNC, HZ/10);
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
sc))
@@ -4286,6 +4343,7 @@ static int kswapd(void *p)
WRITE_ONCE(pgdat->kswapd_order, 0);
WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES);
+ atomic_set(&pgdat->nr_reclaim_throttled, 0);
for ( ; ; ) {
bool ret;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8ce2620344b2..9b2bc9d61d4b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1225,6 +1225,7 @@ const char * const vmstat_text[] = {
"nr_vmscan_immediate_reclaim",
"nr_dirtied",
"nr_written",
+ "nr_throttled_written",
"nr_kernel_misc_reclaimable",
"nr_foll_pin_acquired",
"nr_foll_pin_released",
--
2.31.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH 2/8] mm/vmscan: Throttle reclaim and compaction when too many pages are isolated
From: Mel Gorman @ 2021-10-19 9:01 UTC (permalink / raw)
To: Andrew Morton
Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
Linux-fsdevel, LKML, Mel Gorman
Page reclaim throttles on congestion if too many parallel reclaim instances
have isolated too many pages. This makes no sense; excessive parallelisation
has nothing to do with writeback or congestion.
This patch creates an additional waitqueue to sleep on when too many
pages are isolated. The throttled tasks are woken when the number
of isolated pages is reduced or a timeout occurs. There may be
some false-positive wakeups for GFP_NOIO/GFP_NOFS callers but
the tasks will throttle again if necessary.
[shy828301@gmail.com: Wake up from compaction context]
[vbabka@suse.cz: Account number of throttled tasks only for writeback]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
include/linux/mmzone.h | 6 ++++--
include/trace/events/vmscan.h | 4 +++-
mm/compaction.c | 10 ++++++++--
mm/internal.h | 13 ++++++++++++-
mm/page_alloc.c | 6 +++++-
mm/vmscan.c | 28 +++++++++++++++++++---------
6 files changed, 51 insertions(+), 16 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ef0a63ebd21d..58a25d42c31c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -275,6 +275,8 @@ enum lru_list {
enum vmscan_throttle_state {
VMSCAN_THROTTLE_WRITEBACK,
+ VMSCAN_THROTTLE_ISOLATED,
+ NR_VMSCAN_THROTTLE,
};
#define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
@@ -846,8 +848,8 @@ typedef struct pglist_data {
int node_id;
wait_queue_head_t kswapd_wait;
wait_queue_head_t pfmemalloc_wait;
- wait_queue_head_t reclaim_wait; /* wq for throttling reclaim */
- atomic_t nr_reclaim_throttled; /* nr of throtted tasks */
+ wait_queue_head_t reclaim_wait[NR_VMSCAN_THROTTLE];
+ atomic_t nr_writeback_throttled;/* nr of writeback-throttled tasks */
unsigned long nr_reclaim_start; /* nr pages written while throttled
* when throttling started. */
struct task_struct *kswapd; /* Protected by
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index c317f9fe0d17..d4905bd9e9c4 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -28,10 +28,12 @@
) : "RECLAIM_WB_NONE"
#define _VMSCAN_THROTTLE_WRITEBACK (1 << VMSCAN_THROTTLE_WRITEBACK)
+#define _VMSCAN_THROTTLE_ISOLATED (1 << VMSCAN_THROTTLE_ISOLATED)
#define show_throttle_flags(flags) \
(flags) ? __print_flags(flags, "|", \
- {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"} \
+ {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"}, \
+ {_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"} \
) : "VMSCAN_THROTTLE_NONE"
diff --git a/mm/compaction.c b/mm/compaction.c
index bfc93da1c2c7..7359093d8ac0 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -761,6 +761,8 @@ isolate_freepages_range(struct compact_control *cc,
/* Similar to reclaim, but different enough that they don't share logic */
static bool too_many_isolated(pg_data_t *pgdat)
{
+ bool too_many;
+
unsigned long active, inactive, isolated;
inactive = node_page_state(pgdat, NR_INACTIVE_FILE) +
@@ -770,7 +772,11 @@ static bool too_many_isolated(pg_data_t *pgdat)
isolated = node_page_state(pgdat, NR_ISOLATED_FILE) +
node_page_state(pgdat, NR_ISOLATED_ANON);
- return isolated > (inactive + active) / 2;
+ too_many = isolated > (inactive + active) / 2;
+ if (!too_many)
+ wake_throttle_isolated(pgdat);
+
+ return too_many;
}
/**
@@ -822,7 +828,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (cc->mode == MIGRATE_ASYNC)
return -EAGAIN;
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
if (fatal_signal_pending(current))
return -EINTR;
diff --git a/mm/internal.h b/mm/internal.h
index 90764d646e02..3461a1055975 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -39,12 +39,21 @@ void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
static inline void acct_reclaim_writeback(struct page *page)
{
pg_data_t *pgdat = page_pgdat(page);
- int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled);
+ int nr_throttled = atomic_read(&pgdat->nr_writeback_throttled);
if (nr_throttled)
__acct_reclaim_writeback(pgdat, page, nr_throttled);
}
+static inline void wake_throttle_isolated(pg_data_t *pgdat)
+{
+ wait_queue_head_t *wqh;
+
+ wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_ISOLATED];
+ if (waitqueue_active(wqh))
+ wake_up_all(wqh);
+}
+
vm_fault_t do_swap_page(struct vm_fault *vmf);
void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
@@ -120,6 +129,8 @@ extern unsigned long highest_memmap_pfn;
*/
extern int isolate_lru_page(struct page *page);
extern void putback_lru_page(struct page *page);
+extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
+ long timeout);
/*
* in mm/rmap.c:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d849ddfc1e51..78e538067651 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7389,6 +7389,8 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
{
+ int i;
+
pgdat_resize_init(pgdat);
pgdat_init_split_queue(pgdat);
@@ -7396,7 +7398,9 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
init_waitqueue_head(&pgdat->kswapd_wait);
init_waitqueue_head(&pgdat->pfmemalloc_wait);
- init_waitqueue_head(&pgdat->reclaim_wait);
+
+ for (i = 0; i < NR_VMSCAN_THROTTLE; i++)
+ init_waitqueue_head(&pgdat->reclaim_wait[i]);
pgdat_page_ext_init(pgdat);
lruvec_init(&pgdat->__lruvec);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 735b1f2b5d9e..29434d4fc1c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1006,12 +1006,12 @@ static void handle_write_error(struct address_space *mapping,
unlock_page(page);
}
-static void
-reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
+void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
long timeout)
{
- wait_queue_head_t *wqh = &pgdat->reclaim_wait;
+ wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
long ret;
+ bool acct_writeback = (reason == VMSCAN_THROTTLE_WRITEBACK);
DEFINE_WAIT(wait);
/*
@@ -1023,7 +1023,8 @@ reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
current->flags & (PF_IO_WORKER|PF_KTHREAD))
return;
- if (atomic_inc_return(&pgdat->nr_reclaim_throttled) == 1) {
+ if (acct_writeback &&
+ atomic_inc_return(&pgdat->nr_writeback_throttled) == 1) {
WRITE_ONCE(pgdat->nr_reclaim_start,
node_page_state(pgdat, NR_THROTTLED_WRITTEN));
}
@@ -1031,7 +1032,9 @@ reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = schedule_timeout(timeout);
finish_wait(wqh, &wait);
- atomic_dec(&pgdat->nr_reclaim_throttled);
+
+ if (acct_writeback)
+ atomic_dec(&pgdat->nr_writeback_throttled);
trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
jiffies_to_usecs(timeout - ret),
@@ -1061,7 +1064,7 @@ void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
READ_ONCE(pgdat->nr_reclaim_start);
if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
- wake_up_all(&pgdat->reclaim_wait);
+ wake_up_all(&pgdat->reclaim_wait[VMSCAN_THROTTLE_WRITEBACK]);
}
/* possible outcome of pageout() */
@@ -2176,6 +2179,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
struct scan_control *sc)
{
unsigned long inactive, isolated;
+ bool too_many;
if (current_is_kswapd())
return 0;
@@ -2199,7 +2203,13 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
inactive >>= 3;
- return isolated > inactive;
+ too_many = isolated > inactive;
+
+ /* Wake up tasks throttled due to too_many_isolated. */
+ if (!too_many)
+ wake_throttle_isolated(pgdat);
+
+ return too_many;
}
/*
@@ -2308,8 +2318,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
return 0;
/* wait a bit for the reclaimer. */
- msleep(100);
stalled = true;
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
/* We are about to die and free our memory. Return now. */
if (fatal_signal_pending(current))
@@ -4343,7 +4353,7 @@ static int kswapd(void *p)
WRITE_ONCE(pgdat->kswapd_order, 0);
WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES);
- atomic_set(&pgdat->nr_reclaim_throttled, 0);
+ atomic_set(&pgdat->nr_writeback_throttled, 0);
for ( ; ; ) {
bool ret;
--
2.31.1
* Re: [PATCH 2/8] mm/vmscan: Throttle reclaim and compaction when too many pages are isolated
From: Yang Shi @ 2021-10-19 17:12 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, NeilBrown, Theodore Ts'o, Andreas Dilger,
Darrick J . Wong, Matthew Wilcox, Michal Hocko, Dave Chinner,
Rik van Riel, Vlastimil Babka, Johannes Weiner, Jonathan Corbet,
Linux-MM, Linux-fsdevel, LKML
On Tue, Oct 19, 2021 at 2:01 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Page reclaim throttles on congestion if too many parallel reclaim instances
> have isolated too many pages. This makes no sense, excessive parallelisation
> has nothing to do with writeback or congestion.
>
> This patch creates an additional workqueue to sleep on when too many
> pages are isolated. The throttled tasks are woken when the number
> of isolated pages is reduced or a timeout occurs. There may be
> some false positive wakeups for GFP_NOIO/GFP_NOFS callers but
> the tasks will throttle again if necessary.
>
> [shy828301@gmail.com: Wake up from compaction context]
Reviewed-by: Yang Shi <shy828301@gmail.com>
> [vbabka@suse.cz: Account number of throttled tasks only for writeback]
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
* [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made
2021-10-19 9:01 [PATCH v4 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
2021-10-19 9:01 ` [PATCH 1/8] mm/vmscan: Throttle reclaim until some writeback completes if congested Mel Gorman
2021-10-19 9:01 ` [PATCH 2/8] mm/vmscan: Throttle reclaim and compaction when too may pages are isolated Mel Gorman
@ 2021-10-19 9:01 ` Mel Gorman
2021-10-19 9:01 ` [PATCH 4/8] mm/writeback: Throttle based on page writeback instead of congestion Mel Gorman
` (6 subsequent siblings)
9 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2021-10-19 9:01 UTC (permalink / raw)
To: Andrew Morton
Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
Linux-fsdevel, LKML, Mel Gorman
Memcg reclaim throttles on congestion if no reclaim progress is made.
This makes little sense; the lack of progress might be due to writeback
or a host of other factors.
For !memcg reclaim, it's messy. Direct reclaim is primarily throttled
in the page allocator if it is failing to make progress. Kswapd
throttles if too many pages are under writeback and marked for
immediate reclaim.
This patch explicitly throttles if reclaim is failing to make progress.
[vbabka@suse.cz: Remove redundant code]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
include/linux/mmzone.h | 1 +
include/trace/events/vmscan.h | 4 +++-
mm/memcontrol.c | 10 +---------
mm/vmscan.c | 28 ++++++++++++++++++++++++++++
4 files changed, 33 insertions(+), 10 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 58a25d42c31c..2ffcf2410b66 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -276,6 +276,7 @@ enum lru_list {
enum vmscan_throttle_state {
VMSCAN_THROTTLE_WRITEBACK,
VMSCAN_THROTTLE_ISOLATED,
+ VMSCAN_THROTTLE_NOPROGRESS,
NR_VMSCAN_THROTTLE,
};
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index d4905bd9e9c4..f25a6149d3ba 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -29,11 +29,13 @@
#define _VMSCAN_THROTTLE_WRITEBACK (1 << VMSCAN_THROTTLE_WRITEBACK)
#define _VMSCAN_THROTTLE_ISOLATED (1 << VMSCAN_THROTTLE_ISOLATED)
+#define _VMSCAN_THROTTLE_NOPROGRESS (1 << VMSCAN_THROTTLE_NOPROGRESS)
#define show_throttle_flags(flags) \
(flags) ? __print_flags(flags, "|", \
{_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"}, \
- {_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"} \
+ {_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"}, \
+ {_VMSCAN_THROTTLE_NOPROGRESS, "VMSCAN_THROTTLE_NOPROGRESS"} \
) : "VMSCAN_THROTTLE_NONE"
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6da5020a8656..8b33152c9b85 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3465,19 +3465,11 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
/* try to free all pages in this cgroup */
while (nr_retries && page_counter_read(&memcg->memory)) {
- int progress;
-
if (signal_pending(current))
return -EINTR;
- progress = try_to_free_mem_cgroup_pages(memcg, 1,
- GFP_KERNEL, true);
- if (!progress) {
+ if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true))
nr_retries--;
- /* maybe some writeback is necessary */
- congestion_wait(BLK_RW_ASYNC, HZ/10);
- }
-
}
return 0;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 29434d4fc1c7..14127bbf2c3b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3323,6 +3323,33 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
}
+static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
+{
+ /* If reclaim is making progress, wake any throttled tasks. */
+ if (sc->nr_reclaimed) {
+ wait_queue_head_t *wqh;
+
+ wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_NOPROGRESS];
+ if (waitqueue_active(wqh))
+ wake_up_all(wqh);
+
+ return;
+ }
+
+ /*
+ * Do not throttle kswapd on NOPROGRESS as it will throttle on
+ * VMSCAN_THROTTLE_WRITEBACK if there are too many pages under
+ * writeback and marked for immediate reclaim at the tail of
+ * the LRU.
+ */
+ if (current_is_kswapd())
+ return;
+
+ /* Throttle if making no progress at high priorities. */
+ if (sc->priority < DEF_PRIORITY - 2)
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
+}
+
/*
* This is the direct reclaim path, for page-allocating processes. We only
* try to reclaim pages from zones which will satisfy the caller's allocation
@@ -3407,6 +3434,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
continue;
last_pgdat = zone->zone_pgdat;
shrink_node(zone->zone_pgdat, sc);
+ consider_reclaim_throttle(zone->zone_pgdat, sc);
}
/*
--
2.31.1
* [PATCH 4/8] mm/writeback: Throttle based on page writeback instead of congestion
2021-10-19 9:01 [PATCH v4 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
` (2 preceding siblings ...)
2021-10-19 9:01 ` [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made Mel Gorman
@ 2021-10-19 9:01 ` Mel Gorman
2021-10-19 9:01 ` [PATCH 5/8] mm/page_alloc: Remove the throttling logic from the page allocator Mel Gorman
` (5 subsequent siblings)
9 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2021-10-19 9:01 UTC (permalink / raw)
To: Andrew Morton
Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
Linux-fsdevel, LKML, Mel Gorman
do_writepages throttles on congestion if writepages() fails due to a
lack of memory, but congestion_wait() is partially broken as the congestion
state is not updated for all BDIs.
This patch stalls waiting for a number of pages to complete writeback
that are located on the local node. The main weakness is that there is no
correlation between the location of the inode's pages and the local node,
but that is still better than congestion_wait.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/page-writeback.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4812a17b288c..f34f54fcd5b4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2366,8 +2366,15 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
ret = generic_writepages(mapping, wbc);
if ((ret != -ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL))
break;
- cond_resched();
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+
+ /*
+ * Lacking an allocation context or the locality or writeback
+ * state of any of the inode's pages, throttle based on
+ * writeback activity on the local node. It's as good a
+ * guess as any.
+ */
+ reclaim_throttle(NODE_DATA(numa_node_id()),
+ VMSCAN_THROTTLE_WRITEBACK, HZ/50);
}
/*
* Usually few pages are written by now from those we've just submitted
--
2.31.1
* [PATCH 5/8] mm/page_alloc: Remove the throttling logic from the page allocator
2021-10-19 9:01 [PATCH v4 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
` (3 preceding siblings ...)
2021-10-19 9:01 ` [PATCH 4/8] mm/writeback: Throttle based on page writeback instead of congestion Mel Gorman
@ 2021-10-19 9:01 ` Mel Gorman
2021-10-19 9:20 ` Vlastimil Babka
2021-10-19 9:01 ` [PATCH 6/8] mm/vmscan: Centralise timeout values for reclaim_throttle Mel Gorman
` (4 subsequent siblings)
9 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2021-10-19 9:01 UTC (permalink / raw)
To: Andrew Morton
Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
Linux-fsdevel, LKML, Mel Gorman
The page allocator stalls based on the number of pages that are
waiting for writeback to start but this should now be redundant.
shrink_inactive_list() will wake flusher threads if the LRU tail consists
of unqueued dirty pages, so the flusher should be active. If it fails to make
progress due to pages under writeback not being completed quickly then
it should stall on VMSCAN_THROTTLE_WRITEBACK.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
mm/page_alloc.c | 21 +--------------------
1 file changed, 1 insertion(+), 20 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 78e538067651..8fa0109ff417 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4795,30 +4795,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
trace_reclaim_retry_zone(z, order, reclaimable,
available, min_wmark, *no_progress_loops, wmark);
if (wmark) {
- /*
- * If we didn't make any progress and have a lot of
- * dirty + writeback pages then we should wait for
- * an IO to complete to slow down the reclaim and
- * prevent from pre mature OOM
- */
- if (!did_some_progress) {
- unsigned long write_pending;
-
- write_pending = zone_page_state_snapshot(zone,
- NR_ZONE_WRITE_PENDING);
-
- if (2 * write_pending > reclaimable) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
- return true;
- }
- }
-
ret = true;
- goto out;
+ break;
}
}
-out:
/*
* Memory allocation/reclaim might be called from a WQ context and the
* current implementation of the WQ concurrency control doesn't
--
2.31.1
* Re: [PATCH 5/8] mm/page_alloc: Remove the throttling logic from the page allocator
2021-10-19 9:01 ` [PATCH 5/8] mm/page_alloc: Remove the throttling logic from the page allocator Mel Gorman
@ 2021-10-19 9:20 ` Vlastimil Babka
0 siblings, 0 replies; 26+ messages in thread
From: Vlastimil Babka @ 2021-10-19 9:20 UTC (permalink / raw)
To: Mel Gorman, Andrew Morton
Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
Johannes Weiner, Jonathan Corbet, Linux-MM, Linux-fsdevel, LKML
On 10/19/21 11:01, Mel Gorman wrote:
> The page allocator stalls based on the number of pages that are
> waiting for writeback to start but this should now be redundant.
> shrink_inactive_list() will wake flusher threads if the LRU tail consists
> of unqueued dirty pages, so the flusher should be active. If it fails to make
> progress due to pages under writeback not being completed quickly then
> it should stall on VMSCAN_THROTTLE_WRITEBACK.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
(did in v3 already)
* [PATCH 6/8] mm/vmscan: Centralise timeout values for reclaim_throttle
2021-10-19 9:01 [PATCH v4 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
` (4 preceding siblings ...)
2021-10-19 9:01 ` [PATCH 5/8] mm/page_alloc: Remove the throttling logic from the page allocator Mel Gorman
@ 2021-10-19 9:01 ` Mel Gorman
2021-10-22 1:06 ` NeilBrown
2021-10-19 9:01 ` [PATCH 7/8] mm/vmscan: Increase the timeout if page reclaim is not making progress Mel Gorman
` (3 subsequent siblings)
9 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2021-10-19 9:01 UTC (permalink / raw)
To: Andrew Morton
Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
Linux-fsdevel, LKML, Mel Gorman
Neil Brown raised concerns about callers of reclaim_throttle specifying
a timeout value. The original timeout values to congestion_wait() were
probably pulled out of thin air or copy&pasted from somewhere else.
This patch centralises the timeout values and selects a timeout based
on the reason for reclaim throttling. These figures are also pulled
out of the same thin air, but better values may be derived by running
a workload that is throttling for inappropriate periods and tracing
mm_vmscan_throttled to pick a more appropriate value. Excessive
throttling would suggest a lower timeout, whereas excessive CPU usage
in reclaim context would suggest a larger one. Ideally a large value
would always be used and the wakeups would occur before the timeout,
but that requires careful testing.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/compaction.c | 2 +-
mm/internal.h | 3 +--
mm/page-writeback.c | 2 +-
mm/vmscan.c | 48 +++++++++++++++++++++++++++++++++------------
4 files changed, 38 insertions(+), 17 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 7359093d8ac0..151b04c4dab3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -828,7 +828,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (cc->mode == MIGRATE_ASYNC)
return -EAGAIN;
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED);
if (fatal_signal_pending(current))
return -EINTR;
diff --git a/mm/internal.h b/mm/internal.h
index 3461a1055975..63d8ebbc5a6d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -129,8 +129,7 @@ extern unsigned long highest_memmap_pfn;
*/
extern int isolate_lru_page(struct page *page);
extern void putback_lru_page(struct page *page);
-extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
- long timeout);
+extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);
/*
* in mm/rmap.c:
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f34f54fcd5b4..4b01a6872f9e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2374,7 +2374,7 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
* guess as any.
*/
reclaim_throttle(NODE_DATA(numa_node_id()),
- VMSCAN_THROTTLE_WRITEBACK, HZ/50);
+ VMSCAN_THROTTLE_WRITEBACK);
}
/*
* Usually few pages are written by now from those we've just submitted
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 14127bbf2c3b..1f5c467dc83c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1006,12 +1006,10 @@ static void handle_write_error(struct address_space *mapping,
unlock_page(page);
}
-void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
- long timeout)
+void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
{
wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
- long ret;
- bool acct_writeback = (reason == VMSCAN_THROTTLE_WRITEBACK);
+ long timeout, ret;
DEFINE_WAIT(wait);
/*
@@ -1023,17 +1021,41 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
current->flags & (PF_IO_WORKER|PF_KTHREAD))
return;
- if (acct_writeback &&
- atomic_inc_return(&pgdat->nr_writeback_throttled) == 1) {
- WRITE_ONCE(pgdat->nr_reclaim_start,
- node_page_state(pgdat, NR_THROTTLED_WRITTEN));
+ /*
+ * These figures are pulled out of thin air.
+ * VMSCAN_THROTTLE_ISOLATED is a transient condition based on too many
+ * parallel reclaimers which is a short-lived event so the timeout is
+ * short. Failing to make progress or waiting on writeback are
+ * potentially long-lived events so use a longer timeout. This is shaky
+ * logic as a failure to make progress could be due to anything from
+ * writeback to a slow device to excessive referenced pages at the tail
+ * of the inactive LRU.
+ */
+ switch(reason) {
+ case VMSCAN_THROTTLE_NOPROGRESS:
+ case VMSCAN_THROTTLE_WRITEBACK:
+ timeout = HZ/10;
+
+ if (atomic_inc_return(&pgdat->nr_writeback_throttled) == 1) {
+ WRITE_ONCE(pgdat->nr_reclaim_start,
+ node_page_state(pgdat, NR_THROTTLED_WRITTEN));
+ }
+
+ break;
+ case VMSCAN_THROTTLE_ISOLATED:
+ timeout = HZ/50;
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ timeout = HZ;
+ break;
}
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = schedule_timeout(timeout);
finish_wait(wqh, &wait);
- if (acct_writeback)
+ if (reason == VMSCAN_THROTTLE_ISOLATED)
atomic_dec(&pgdat->nr_writeback_throttled);
trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
@@ -2319,7 +2341,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
/* wait a bit for the reclaimer. */
stalled = true;
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED);
/* We are about to die and free our memory. Return now. */
if (fatal_signal_pending(current))
@@ -3251,7 +3273,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
* until some pages complete writeback.
*/
if (sc->nr.immediate)
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
}
/*
@@ -3275,7 +3297,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
if (!current_is_kswapd() && current_may_throttle() &&
!sc->hibernation_mode &&
test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
sc))
@@ -3347,7 +3369,7 @@ static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
/* Throttle if making no progress at high priorities. */
if (sc->priority < DEF_PRIORITY - 2)
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS);
}
/*
--
2.31.1
* Re: [PATCH 6/8] mm/vmscan: Centralise timeout values for reclaim_throttle
2021-10-19 9:01 ` [PATCH 6/8] mm/vmscan: Centralise timeout values for reclaim_throttle Mel Gorman
@ 2021-10-22 1:06 ` NeilBrown
2021-10-22 8:12 ` Mel Gorman
0 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2021-10-22 1:06 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
Darrick J . Wong, Matthew Wilcox, Michal Hocko, Dave Chinner,
Rik van Riel, Vlastimil Babka, Johannes Weiner, Jonathan Corbet,
Linux-MM, Linux-fsdevel, LKML, Mel Gorman
On Tue, 19 Oct 2021, Mel Gorman wrote:
...
> + switch(reason) {
> + case VMSCAN_THROTTLE_NOPROGRESS:
> + case VMSCAN_THROTTLE_WRITEBACK:
> + timeout = HZ/10;
> +
> + if (atomic_inc_return(&pgdat->nr_writeback_throttled) == 1) {
> + WRITE_ONCE(pgdat->nr_reclaim_start,
> + node_page_state(pgdat, NR_THROTTLED_WRITTEN));
You have introduced a behaviour change that wasn't flagged in the commit
message.
Previously nr_writeback_throttled was only incremented for
VMSCAN_THROTTLE_WRITEBACK, now it is incremented for
VMSCAN_THROTTLE_NOPROGRESS as well.
Some justification would be good.
> + }
> +
> + break;
> + case VMSCAN_THROTTLE_ISOLATED:
> + timeout = HZ/50;
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + timeout = HZ;
> + break;
> }
>
> prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> ret = schedule_timeout(timeout);
> finish_wait(wqh, &wait);
>
> - if (acct_writeback)
> + if (reason == VMSCAN_THROTTLE_ISOLATED)
(defect) I think you want "!=" there.
While the numbers are still magic, they are now well documented and all in
one place - a definite improvement!
Thanks,
NeilBrown
* Re: [PATCH 6/8] mm/vmscan: Centralise timeout values for reclaim_throttle
2021-10-22 1:06 ` NeilBrown
@ 2021-10-22 8:12 ` Mel Gorman
0 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2021-10-22 8:12 UTC (permalink / raw)
To: NeilBrown
Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
Darrick J . Wong, Matthew Wilcox, Michal Hocko, Dave Chinner,
Rik van Riel, Vlastimil Babka, Johannes Weiner, Jonathan Corbet,
Linux-MM, Linux-fsdevel, LKML
On Fri, Oct 22, 2021 at 12:06:13PM +1100, NeilBrown wrote:
> On Tue, 19 Oct 2021, Mel Gorman wrote:
> ...
> > + switch(reason) {
> > + case VMSCAN_THROTTLE_NOPROGRESS:
> > + case VMSCAN_THROTTLE_WRITEBACK:
> > + timeout = HZ/10;
> > +
> > + if (atomic_inc_return(&pgdat->nr_writeback_throttled) == 1) {
> > + WRITE_ONCE(pgdat->nr_reclaim_start,
> > + node_page_state(pgdat, NR_THROTTLED_WRITTEN));
>
> You have introduced a behaviour change that wasn't flagged in the commit
> message.
> Previously nr_writeback_throttled was only incremented for
> VMSCAN_THROTTLE_WRITEBACK, now it is incremented for
> VMSCAN_THROTTLE_NOPROGRESS as well.
>
> Some justification would be good.
>
This is the result of a rebase near the end of the day going sideways. There
is no justification; it's just wrong.
I'm rerunning the entire series, will update the leader and resend the
series.
--8<--
mm/vmscan: Centralise timeout values for reclaim_throttle -fix
Neil Brown spotted the fallthrough-logic for reclaim_throttle was wrong --
only VMSCAN_THROTTLE_WRITEBACK affects pgdat->nr_writeback_throttled. This
was the result of a rebase going sideways and only happens to work
sometimes by coincidence.
This is a fix to the mmotm patch
mm-vmscan-centralise-timeout-values-for-reclaim_throttle.patch
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
mm/vmscan.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1f5c467dc83c..64c38979b7df 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1032,7 +1032,6 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
* of the inactive LRU.
*/
switch(reason) {
- case VMSCAN_THROTTLE_NOPROGRESS:
case VMSCAN_THROTTLE_WRITEBACK:
timeout = HZ/10;
@@ -1041,6 +1040,9 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
node_page_state(pgdat, NR_THROTTLED_WRITTEN));
}
+ break;
+ case VMSCAN_THROTTLE_NOPROGRESS:
+ timeout = HZ/10;
break;
case VMSCAN_THROTTLE_ISOLATED:
timeout = HZ/50;
@@ -1055,7 +1057,7 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
ret = schedule_timeout(timeout);
finish_wait(wqh, &wait);
- if (reason == VMSCAN_THROTTLE_ISOLATED)
+ if (reason == VMSCAN_THROTTLE_WRITEBACK)
atomic_dec(&pgdat->nr_writeback_throttled);
trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
* [PATCH 7/8] mm/vmscan: Increase the timeout if page reclaim is not making progress
2021-10-19 9:01 [PATCH v4 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
` (5 preceding siblings ...)
2021-10-19 9:01 ` [PATCH 6/8] mm/vmscan: Centralise timeout values for reclaim_throttle Mel Gorman
@ 2021-10-19 9:01 ` Mel Gorman
2021-10-22 1:07 ` NeilBrown
2021-10-19 9:01 ` [PATCH 8/8] mm/vmscan: Delay waking of tasks throttled on NOPROGRESS Mel Gorman
` (2 subsequent siblings)
9 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2021-10-19 9:01 UTC (permalink / raw)
To: Andrew Morton
Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
Linux-fsdevel, LKML, Mel Gorman
Tracing of the stutterp workload showed the following delays
1 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usect_delayed=536000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usect_delayed=544000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usect_delayed=556000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usect_delayed=624000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usect_delayed=716000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usect_delayed=772000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usect_delayed=512000 reason=VMSCAN_THROTTLE_NOPROGRESS
16 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
53 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
116 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
5907 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
71741 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
All the throttling hit the full timeout and then there were wakeup delays,
meaning that the wakeups are premature as no other reclaimer such as
kswapd has made progress. This patch increases the maximum timeout.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/vmscan.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1f5c467dc83c..ec2006680242 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1033,6 +1033,8 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
*/
switch(reason) {
case VMSCAN_THROTTLE_NOPROGRESS:
+ timeout = HZ/2;
+ break;
case VMSCAN_THROTTLE_WRITEBACK:
timeout = HZ/10;
--
2.31.1
* Re: [PATCH 7/8] mm/vmscan: Increase the timeout if page reclaim is not making progress
2021-10-19 9:01 ` [PATCH 7/8] mm/vmscan: Increase the timeout if page reclaim is not making progress Mel Gorman
@ 2021-10-22 1:07 ` NeilBrown
2021-10-22 8:14 ` Mel Gorman
0 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2021-10-22 1:07 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
Darrick J . Wong, Matthew Wilcox, Michal Hocko, Dave Chinner,
Rik van Riel, Vlastimil Babka, Johannes Weiner, Jonathan Corbet,
Linux-MM, Linux-fsdevel, LKML, Mel Gorman
On Tue, 19 Oct 2021, Mel Gorman wrote:
> Tracing of the stutterp workload showed the following delays
>
> 1 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 1 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 1 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 1 usect_delayed=536000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 1 usect_delayed=544000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 1 usect_delayed=556000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 1 usect_delayed=624000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 1 usect_delayed=716000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 1 usect_delayed=772000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 2 usect_delayed=512000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 16 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 53 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 116 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 5907 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 71741 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
>
> All the throttling hit the full timeout and then there were wakeup delays
> meaning that the wakeups are premature as no other reclaimer such as
> kswapd has made progress. This patch increases the maximum timeout.
Would love to see the comparable tracing results for after the patch.
Thanks,
NeilBrown
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/vmscan.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1f5c467dc83c..ec2006680242 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1033,6 +1033,8 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
> */
> switch(reason) {
> case VMSCAN_THROTTLE_NOPROGRESS:
> + timeout = HZ/2;
> + break;
> case VMSCAN_THROTTLE_WRITEBACK:
> timeout = HZ/10;
>
> --
> 2.31.1
>
>
* Re: [PATCH 7/8] mm/vmscan: Increase the timeout if page reclaim is not making progress
2021-10-22 1:07 ` NeilBrown
@ 2021-10-22 8:14 ` Mel Gorman
0 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2021-10-22 8:14 UTC (permalink / raw)
To: NeilBrown
Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
Darrick J . Wong, Matthew Wilcox, Michal Hocko, Dave Chinner,
Rik van Riel, Vlastimil Babka, Johannes Weiner, Jonathan Corbet,
Linux-MM, Linux-fsdevel, LKML
On Fri, Oct 22, 2021 at 12:07:43PM +1100, NeilBrown wrote:
> On Tue, 19 Oct 2021, Mel Gorman wrote:
> > Tracing of the stutterp workload showed the following delays
> >
> > 1 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
> > 1 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
> > 1 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
> > 1 usect_delayed=536000 reason=VMSCAN_THROTTLE_NOPROGRESS
> > 1 usect_delayed=544000 reason=VMSCAN_THROTTLE_NOPROGRESS
> > 1 usect_delayed=556000 reason=VMSCAN_THROTTLE_NOPROGRESS
> > 1 usect_delayed=624000 reason=VMSCAN_THROTTLE_NOPROGRESS
> > 1 usect_delayed=716000 reason=VMSCAN_THROTTLE_NOPROGRESS
> > 1 usect_delayed=772000 reason=VMSCAN_THROTTLE_NOPROGRESS
> > 2 usect_delayed=512000 reason=VMSCAN_THROTTLE_NOPROGRESS
> > 16 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
> > 53 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
> > 116 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
> > 5907 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
> > 71741 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
> >
> > All the throttling hit the full timeout and then there were wakeup delays
> > meaning that the wakeups are premature as no other reclaimer such as
> > kswapd has made progress. This patch increases the maximum timeout.
>
> Would love to see the comparable tracing results for after the patch.
>
They're in the leader. The trace figures in the changelog are the ones I
had at the time the patch was developed and I didn't keep them up to date
to reduce overall test time. At the last set of results, some throttling
was still hitting the full timeout;
[....]
843 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
1299 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
2839 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
10111 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
21492 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
36441 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS
--
Mel Gorman
SUSE Labs
* [PATCH 8/8] mm/vmscan: Delay waking of tasks throttled on NOPROGRESS
2021-10-19 9:01 [PATCH v4 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
` (6 preceding siblings ...)
2021-10-19 9:01 ` [PATCH 7/8] mm/vmscan: Increase the timeout if page reclaim is not making progress Mel Gorman
@ 2021-10-19 9:01 ` Mel Gorman
2021-10-19 22:00 ` [PATCH v4 0/8] Remove dependency on congestion_wait in mm/ Andrew Morton
2021-10-22 1:15 ` NeilBrown
9 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2021-10-19 9:01 UTC (permalink / raw)
To: Andrew Morton
Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
Linux-fsdevel, LKML, Mel Gorman
Tracing indicates that tasks throttled on NOPROGRESS are woken
prematurely resulting in occasional massive spikes in direct
reclaim activity. This patch wakes tasks throttled on NOPROGRESS
if reclaim efficiency is at least 12%.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/vmscan.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ec2006680242..28adc196353d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1057,7 +1057,7 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
ret = schedule_timeout(timeout);
finish_wait(wqh, &wait);
- if (reason == VMSCAN_THROTTLE_ISOLATED)
+ if (reason == VMSCAN_THROTTLE_WRITEBACK)
atomic_dec(&pgdat->nr_writeback_throttled);
trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
@@ -3349,8 +3349,11 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
{
- /* If reclaim is making progress, wake any throttled tasks. */
- if (sc->nr_reclaimed) {
+ /*
+ * If reclaim is making progress greater than 12% efficiency then
+ * wake all the NOPROGRESS throttled tasks.
+ */
+ if (sc->nr_reclaimed > (sc->nr_scanned >> 3)) {
wait_queue_head_t *wqh;
wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_NOPROGRESS];
--
2.31.1
* Re: [PATCH v4 0/8] Remove dependency on congestion_wait in mm/
2021-10-19 9:01 [PATCH v4 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
` (7 preceding siblings ...)
2021-10-19 9:01 ` [PATCH 8/8] mm/vmscan: Delay waking of tasks throttled on NOPROGRESS Mel Gorman
@ 2021-10-19 22:00 ` Andrew Morton
2021-10-20 8:44 ` Mel Gorman
2021-10-22 1:15 ` NeilBrown
9 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2021-10-19 22:00 UTC (permalink / raw)
To: Mel Gorman
Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
Linux-fsdevel, LKML
On Tue, 19 Oct 2021 10:01:00 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
> Changelog since v3
> o Count writeback completions for NR_THROTTLED_WRITTEN only
> o Use IRQ-safe inc_node_page_state
> o Remove redundant throttling
>
> This series is also available at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaimcongest-v4r2
>
> This series removes all calls to congestion_wait
> in mm/ and deletes wait_iff_congested. It's not a clever
> implementation but congestion_wait has been broken for a long time
> (https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/).
The block layer doesn't call clear_bdi_congested() at all. I never
knew this until recent discussions :(
So this means that congestion_wait() will always time out, yes?
> Even if congestion throttling worked, it was never a great idea.
Well. It was a good idea until things like isolation got added!
> While
> excessive dirty/writeback pages at the tail of the LRU is one possibility
> that reclaim may be slow, there is also the problem of too many pages
> being isolated and reclaim failing for other reasons (elevated references,
> too many pages isolated, excessive LRU contention etc).
>
> This series replaces the "congestion" throttling with 3 different types.
>
> o If there are too many dirty/writeback pages, sleep until a timeout
> or enough pages get cleaned
> o If too many pages are isolated, sleep until enough isolated pages
> are either reclaimed or put back on the LRU
> o If no progress is being made, direct reclaim tasks sleep until
> another task makes progress with acceptable efficiency.
>
> This was initially tested with a mix of workloads that used to trigger
> corner cases that no longer work.
Mix of workloads is nice, but a mix of devices is more important here.
I trust some testing was done on plain old spinning disks? And USB
storage, please! And NFS plays with BDI congestion. Ceph and FUSE also.
We've had complaints about this stuff forever. Usually of the form of
interactive tasks getting horridly stalled by writeout/swap activity.
* Re: [PATCH v4 0/8] Remove dependency on congestion_wait in mm/
2021-10-19 22:00 ` [PATCH v4 0/8] Remove dependency on congestion_wait in mm/ Andrew Morton
@ 2021-10-20 8:44 ` Mel Gorman
0 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2021-10-20 8:44 UTC (permalink / raw)
To: Andrew Morton
Cc: NeilBrown, Theodore Ts'o, Andreas Dilger, Darrick J . Wong,
Matthew Wilcox, Michal Hocko, Dave Chinner, Rik van Riel,
Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Linux-MM,
Linux-fsdevel, LKML
On Tue, Oct 19, 2021 at 03:00:25PM -0700, Andrew Morton wrote:
> On Tue, 19 Oct 2021 10:01:00 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
>
> > Changelog since v3
> > o Count writeback completions for NR_THROTTLED_WRITTEN only
> > o Use IRQ-safe inc_node_page_state
> > o Remove redundant throttling
> >
> > This series is also available at
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaimcongest-v4r2
> >
> > This series removes all calls to congestion_wait
> > in mm/ and deletes wait_iff_congested. It's not a clever
> > implementation but congestion_wait has been broken for a long time
> > (https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/).
>
> The block layer doesn't call clear_bdi_congested() at all. I never
> knew this until recent discussions :(
>
> So this means that congestion_wait() will always time out, yes?
>
Unfortunately, yes, except for filesystems that call
[set|clear]_bdi_congested. For the test case in the series leader,
congestion_wait always hit the full timeout.
> > Even if congestion throttling worked, it was never a great idea.
>
> Well. It was a good idea until things like isolation got added!
>
Well, true to an extent although it was always true that reclaim could fail
to make progress for reasons other than pages under writeback. But you're
right, saying it was "never a great idea" is overkill. congestion_wait
used to work and I expect it was particularly helpful before IO-less write
throttling, accurate dirty page tracking and immediate reclaim existed.
> > While
> > excessive dirty/writeback pages at the tail of the LRU is one possibility
> > that reclaim may be slow, there is also the problem of too many pages
> > being isolated and reclaim failing for other reasons (elevated references,
> > too many pages isolated, excessive LRU contention etc).
> >
> > This series replaces the "congestion" throttling with 3 different types.
> >
> > o If there are too many dirty/writeback pages, sleep until a timeout
> > or enough pages get cleaned
> > o If too many pages are isolated, sleep until enough isolated pages
> > are either reclaimed or put back on the LRU
> > o If no progress is being made, direct reclaim tasks sleep until
> > another task makes progress with acceptable efficiency.
> >
> > This was initially tested with a mix of workloads that used to trigger
> > corner cases that no longer work.
>
> Mix of workloads is nice, but a mix of devices is more important here.
I tested as much as I could, but as well as storage devices, different
memory sizes are also relevant.
> I trust some testing was done on plain old spinning disks? And USB
> storage, please! And NFS plays with BDI congestion. Ceph and FUSE also.
>
Plain old spinning disk was tested. Basic USB testing didn't show many
problems, although given that it was my desktop machine, it might have
had too much memory, as no amount of IO to the USB key triggered a
problem where reclaim failed to make progress and tasks got throttled.
There was basic NFS testing
although I didn't try running stutterp over NFS. Given the original thread
motivating this was NFS-related and they are cc'd, I'm hoping they'll
give it a realistic kick. I don't have a realistic setup for ceph and
didn't try fuse.
> We've had complaints about this stuff forever. Usually of the form of
> interactive tasks getting horridly stalled by writeout/swap activity.
I know and there is no guarantee it won't happen again. The main problem
I was trying to solve was that congestion-based throttling is not suitable
for mm/.
From reclaim context, there isn't a good way of detecting "interactive"
tasks. At best, under reclaim pressure we could try tracking allocation
rates and throttle heavy allocators more than light allocators but I
didn't want to introduce complexity prematurely.
--
Mel Gorman
SUSE Labs
* Re: [PATCH v4 0/8] Remove dependency on congestion_wait in mm/
2021-10-19 9:01 [PATCH v4 0/8] Remove dependency on congestion_wait in mm/ Mel Gorman
` (8 preceding siblings ...)
2021-10-19 22:00 ` [PATCH v4 0/8] Remove dependency on congestion_wait in mm/ Andrew Morton
@ 2021-10-22 1:15 ` NeilBrown
2021-10-22 8:39 ` Mel Gorman
9 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2021-10-22 1:15 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
Darrick J . Wong, Matthew Wilcox, Michal Hocko, Dave Chinner,
Rik van Riel, Vlastimil Babka, Johannes Weiner, Jonathan Corbet,
Linux-MM, Linux-fsdevel, LKML, Mel Gorman
On Tue, 19 Oct 2021, Mel Gorman wrote:
> Changelog since v3
> o Count writeback completions for NR_THROTTLED_WRITTEN only
> o Use IRQ-safe inc_node_page_state
> o Remove redundant throttling
>
> This series is also available at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaimcongest-v4r2
>
> This series removes all calls to congestion_wait
> in mm/ and deletes wait_iff_congested.
Thanks for this.
I don't have sufficient expertise for a positive review, but it seems to
make sense with one exception which I have commented on separately.
In general, I still don't like the use of wake_up_all(), though it won't
cause incorrect behaviour.
I would prefer that the first patch:
- define NR_VMSCAN_THROTTLE
- make reclaim_wait an array
- spell nr_reclaim_throttled as nr_writeback_throttled
rather than leaving those changes for the second patch. I think that
would make review easier.
But these are minor and I'll not mention them again.
Thanks,
NeilBrown
* Re: [PATCH v4 0/8] Remove dependency on congestion_wait in mm/
2021-10-22 1:15 ` NeilBrown
@ 2021-10-22 8:39 ` Mel Gorman
2021-10-22 11:26 ` NeilBrown
0 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2021-10-22 8:39 UTC (permalink / raw)
To: NeilBrown
Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
Darrick J . Wong, Matthew Wilcox, Michal Hocko, Dave Chinner,
Rik van Riel, Vlastimil Babka, Johannes Weiner, Jonathan Corbet,
Linux-MM, Linux-fsdevel, LKML
On Fri, Oct 22, 2021 at 12:15:10PM +1100, NeilBrown wrote:
> On Tue, 19 Oct 2021, Mel Gorman wrote:
> > Changelog since v3
> > o Count writeback completions for NR_THROTTLED_WRITTEN only
> > o Use IRQ-safe inc_node_page_state
> > o Remove redundant throttling
> >
> > This series is also available at
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaimcongest-v4r2
> >
> > This series removes all calls to congestion_wait
> > in mm/ and deletes wait_iff_congested.
>
> Thanks for this.
> I don't have sufficient expertise for a positive review, but it seems to
> make sense with one exception which I have commented on separately.
>
A test battering NFS would still be nice!
> In general, I still don't like the use of wake_up_all(), though it won't
> cause incorrect behaviour.
>
Removing wake_up_all would be tricky. Ideally it would be prioritised
but, more importantly, some sort of guarantee should exist that enough
wakeup events trigger to wake tasks before the timeout. That would need
careful thinking about each reclaim reason. For example, if N tasks
throttle on NOPROGRESS, there is no guarantee that N tasks are currently
in reclaim that would wake each sleeping task as progress is made. It's
similar for writeback: are enough pages under writeback to trigger each
wakeup? A more subtle issue is whether each reason should be strict when
waking tasks one at a time. For example, a task sleeping on writeback
might make progress for other reasons, such as the working set changing
during reclaim or a large task exiting. Of course, the same concerns
exist for the series as it stands, but the worst-case scenarios are
mitigated by wake_up_all.
> I would prefer the first patch would:
> - define NR_VMSCAN_THROTTLE
> - make reclaim_wait an array
> - spelled nr_reclaim_throttled as nr_writeback_throttled
>
> rather than leaving those changes for the second patch. I think that
> would make review easier.
>
I can do this. Normally I try to structure a series from least to most
controversial so that it can be cut at any point and still make sense,
so the array was defined in the second patch because that's when it is
required. However, I had already defined the enum in patch 1 for the
tracepoint so I might as well make it an array too.
--
Mel Gorman
SUSE Labs
* Re: [PATCH v4 0/8] Remove dependency on congestion_wait in mm/
2021-10-22 8:39 ` Mel Gorman
@ 2021-10-22 11:26 ` NeilBrown
2021-10-22 13:17 ` Mel Gorman
0 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2021-10-22 11:26 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
Darrick J . Wong, Matthew Wilcox, Michal Hocko, Dave Chinner,
Rik van Riel, Vlastimil Babka, Johannes Weiner, Jonathan Corbet,
Linux-MM, Linux-fsdevel, LKML
On Fri, 22 Oct 2021, Mel Gorman wrote:
> On Fri, Oct 22, 2021 at 12:15:10PM +1100, NeilBrown wrote:
>
> > In general, I still don't like the use of wake_up_all(), though it won't
> > cause incorrect behaviour.
> >
>
> Removing wake_up_all would be tricky.
I think there is a misunderstanding. Removing wake_up_all() is as
simple as
s/wake_up_all/wake_up/
If you used prepare_to_wait_exclusive(), then wake_up() would only wake
one waiter, while wake_up_all() would wake all of them.
As you use prepare_to_wait(), wake_up() will wake all waiters - as will
wake_up_all().
When I see "wake_up_all()" I assume it is an exclusive wait, and that
for some reason this particular wake_up needs to wake up all waiters.
That is not the case here.
I suspect it would be clearer if "wake_up" always woke everything, and
"wake_up_one" was the special case - but unfortunately that isn't what
we have.
There are other non-exclusive waiters which use wake_up_all(), but the
vast majority of wakeups use wake_up(), and most of those are for
non-exclusive waiters.
Thanks,
NeilBrown
* Re: [PATCH v4 0/8] Remove dependency on congestion_wait in mm/
2021-10-22 11:26 ` NeilBrown
@ 2021-10-22 13:17 ` Mel Gorman
2021-10-27 0:43 ` NeilBrown
0 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2021-10-22 13:17 UTC (permalink / raw)
To: NeilBrown
Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
Darrick J . Wong, Matthew Wilcox, Michal Hocko, Dave Chinner,
Rik van Riel, Vlastimil Babka, Johannes Weiner, Jonathan Corbet,
Linux-MM, Linux-fsdevel, LKML
On Fri, Oct 22, 2021 at 10:26:30PM +1100, NeilBrown wrote:
> On Fri, 22 Oct 2021, Mel Gorman wrote:
> > On Fri, Oct 22, 2021 at 12:15:10PM +1100, NeilBrown wrote:
> >
> > > In general, I still don't like the use of wake_up_all(), though it won't
> > > cause incorrect behaviour.
> > >
> >
> > Removing wake_up_all would be tricky.
>
> I think there is a misunderstanding. Removing wake_up_all() is as
> simple as
> s/wake_up_all/wake_up/
>
> If you used prepare_to_wait_exclusive(), then wake_up() would only wake
> one waiter, while wake_up_all() would wake all of them.
> As you use prepare_to_wait(), wake_up() will wake all waiters - as will
> wake_up_all().
>
Ok, yes, there was a misunderstanding. I thought you were suggesting a
move to exclusive wakeups. I felt that the wake_up_all was explicit in
terms of intent and that I really meant for all tasks to wake instead of
one at a time.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4 0/8] Remove dependency on congestion_wait in mm/
2021-10-22 13:17 ` Mel Gorman
@ 2021-10-27 0:43 ` NeilBrown
2021-10-27 10:13 ` Mel Gorman
0 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2021-10-27 0:43 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
Darrick J . Wong, Matthew Wilcox, Michal Hocko, Dave Chinner,
Rik van Riel, Vlastimil Babka, Johannes Weiner, Jonathan Corbet,
Linux-MM, Linux-fsdevel, LKML
On Sat, 23 Oct 2021, Mel Gorman wrote:
> On Fri, Oct 22, 2021 at 10:26:30PM +1100, NeilBrown wrote:
> > On Fri, 22 Oct 2021, Mel Gorman wrote:
> > > On Fri, Oct 22, 2021 at 12:15:10PM +1100, NeilBrown wrote:
> > >
> > > > In general, I still don't like the use of wake_up_all(), though it won't
> > > > cause incorrect behaviour.
> > > >
> > >
> > > Removing wake_up_all would be tricky.
> >
> > I think there is a misunderstanding. Removing wake_up_all() is as
> > simple as
> > s/wake_up_all/wake_up/
> >
> > If you used prepare_to_wait_exclusive(), then wake_up() would only wake
> > one waiter, while wake_up_all() would wake all of them.
> > As you use prepare_to_wait(), wake_up() will wake all waiters - as will
> > wake_up_all().
> >
>
> Ok, yes, there was a misunderstanding. I thought you were suggesting a
> move to exclusive wakeups. I felt that the wake_up_all was explicit in
> terms of intent and that I really meant for all tasks to wake instead of
> one at a time.
Fair enough. Thanks for changing it :-)
But this prompts me to wonder if exclusive wakeups would be a good idea
- which is a useful springboard to try to understand the code better.
For VMSCAN_THROTTLE_ISOLATED they probably are.
One pattern for reliable exclusive wakeups is for any thread that
received a wake-up to then consider sending a wake up.
Two places receive VMSCAN_THROTTLE_ISOLATED wakeups and both then call
too_many_isolated() which - on success - sends another wakeup - before
the caller has had a chance to isolate anything. If, instead, the
wakeup was sent sometime later, after pages were isolated but before the
caller (isolate_migratepages_block() or shrink_inactive_list())
returned, then we would get an orderly progression of threads running
through that code.
VMSCAN_THROTTLE_WRITEBACK is a little less straightforward.
There are two different places that wait for the wakeup, and a wake_up
is sent to all waiters after a time proportional to the number of
waiters. It might make sense to wake one thread per time unit?
That might work well for do_writepages - every SWAP_CLUSTER_MAX writes
triggers one wakeup.
I'm less sure that it would work for shrink_node(). Maybe the
shrink_node() waiters could be non-exclusive so they get woken as soon as
SWAP_CLUSTER_MAX writes complete, while do_writepages are exclusive and
get woken one at a time.
For VMSCAN_THROTTLE_NOPROGRESS .... I don't understand.
If one zone isn't making "enough" progress, we throttle before moving on
to the next zone. So we delay processing of the next zone, and only
indirectly delay re-processing of the current congested zone.
Maybe it makes sense, but I don't see it yet. I note that the commit
message says "it's messy". I can't argue with that!
I'll follow up with patches to clarify what I am thinking about the
first two. I'm not proposing the patches, just presenting them as part
of improving my understanding.
Thanks,
NeilBrown
* Re: [PATCH v4 0/8] Remove dependency on congestion_wait in mm/
2021-10-27 0:43 ` NeilBrown
@ 2021-10-27 10:13 ` Mel Gorman
0 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2021-10-27 10:13 UTC (permalink / raw)
To: NeilBrown
Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
Darrick J . Wong, Matthew Wilcox, Michal Hocko, Dave Chinner,
Rik van Riel, Vlastimil Babka, Johannes Weiner, Jonathan Corbet,
Linux-MM, Linux-fsdevel, LKML
On Wed, Oct 27, 2021 at 11:43:22AM +1100, NeilBrown wrote:
> On Sat, 23 Oct 2021, Mel Gorman wrote:
> > On Fri, Oct 22, 2021 at 10:26:30PM +1100, NeilBrown wrote:
> > > On Fri, 22 Oct 2021, Mel Gorman wrote:
> > > > On Fri, Oct 22, 2021 at 12:15:10PM +1100, NeilBrown wrote:
> > > >
> > > > > In general, I still don't like the use of wake_up_all(), though it won't
> > > > > cause incorrect behaviour.
> > > > >
> > > >
> > > > Removing wake_up_all would be tricky.
> > >
> > > I think there is a misunderstanding. Removing wake_up_all() is as
> > > simple as
> > > s/wake_up_all/wake_up/
> > >
> > > If you used prepare_to_wait_exclusive(), then wake_up() would only wake
> > > one waiter, while wake_up_all() would wake all of them.
> > > As you use prepare_to_wait(), wake_up() will wake all waiters - as will
> > > wake_up_all().
> > >
> >
> > Ok, yes, there was a misunderstanding. I thought you were suggesting a
> > move to exclusive wakeups. I felt that the wake_up_all was explicit in
> > terms of intent and that I really meant for all tasks to wake instead of
> > one at a time.
>
> Fair enough. Thanks for changing it :-)
>
> But this prompts me to wonder if exclusive wakeups would be a good idea
> - which is a useful springboard to try to understand the code better.
>
> For VMSCAN_THROTTLE_ISOLATED they probably are.
> One pattern for reliable exclusive wakeups is for any thread that
> received a wake-up to then consider sending a wake up.
>
> Two places receive VMSCAN_THROTTLE_ISOLATED wakeups and both then call
> too_many_isolated() which - on success - sends another wakeup - before
> the caller has had a chance to isolate anything. If, instead, the
> wakeup was sent sometime later, after pages were isolated by before the
> caller (isoloate_migratepages_block() or shrink_inactive_list())
> returned, then we would get an orderly progression of threads running
> through that code.
>
That should work as the throttling condition is straightforward. It
might even reduce a race where all the woken throttled tasks then
trigger the same "too many isolated" condition.
> For VMSCAN_THROTTLE_WRITEBACK is a little less straight forward.
> There are two different places that wait for the wakeup, and a wake_up
> is sent to all waiters after a time proportional to the number of
> waiters. It might make sense to wake one thread per time unit?
I'd avoid time as a wakeup condition other than the timeout which is
there to guarantee forward progress. I assume you mean "one thread per
SWAP_CLUSTER_MAX completions".
> That might work well for do_writepages - every SWAP_CLUSTER_MAX writes
> triggers one wakeup.
> I'm less sure that it would work for shrink_node(). Maybe the
> shrink_node() waiters could be non-exclusive so they get woken as soon a
> SWAP_CLUSTER_MAX writes complete, while do_writepages are exclusive and
> get woken one at a time.
>
It should work for either with the slight caveat that the last waiter
may not see SWAP_CLUSTER_MAX completions.
> For VMSCAN_THROTTLE_NOPROGRESS .... I don't understand.
> If one zone isn't making "enough" progress, we throttle before moving on
> to the next zone. So we delay processing of the next zone, and only
> indirectly delay re-processing of the current congested zone.
> Maybe it make sense, but I don't see it yet. I note that the commit
> message says "it's messy". I can't argue with that!
>
Yes, we delay the processing of the next zone when a given zone cannot
make progress. The thinking is that circumstances that cause one zone to
fail to make progress could spill over to other zones in the absence of
any throttling. Where it might cause problems is where the preferred zone
is very small. If a bug showed up like that, a potential fix would be to
avoid throttling if the preferred zone is very small relative to the total
amount of memory local to the node or total memory (preferably local node).
> I'll follow up with patches to clarify what I am thinking about the
> first two. I'm not proposing the patches, just presenting them as part
> of improving my understanding.
>
If I'm cc'd, I'll review and if I think they're promising, I'll run them
through the same tests and machines.
--
Mel Gorman
SUSE Labs