* [PATCH 0/5] writeback: Fix bandwidth estimates
@ 2021-07-05 16:23 Jan Kara
  2021-07-05 16:23 ` [PATCH 1/5] writeback: Track number of inodes under writeback Jan Kara
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Jan Kara @ 2021-07-05 16:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, Michael Stapelberg, linux-mm, Jan Kara

Hello,

this patch series fixes the estimate of writeback throughput when the device
is not fully busy doing writeback. Michael Stapelberg has reported that such
a workload (e.g. one generated by linking) tends to push the estimated
throughput down to 0 and, as a result, writeback on the device practically
stalls.

The first three patches fix the reported issue, the remaining two patches
are unrelated cleanups of problems I've noticed when reading the code.

Michael, can you try whether these patches fix the problems you've observed?
I cannot trigger the "bandwidth going down" spiral with them anymore.

								Honza

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/5] writeback: Track number of inodes under writeback
  2021-07-05 16:23 [PATCH 0/5] writeback: Fix bandwidth estimates Jan Kara
@ 2021-07-05 16:23 ` Jan Kara
  2021-07-05 16:23 ` [PATCH 2/5] writeback: Reliably update bandwidth estimation Jan Kara
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2021-07-05 16:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, Michael Stapelberg, linux-mm, Jan Kara

Track the number of inodes under writeback for each bdi_writeback
structure. We will use this to decide whether the wb does any IO so that
we can estimate its writeback throughput. In principle we could use the
number of pages under writeback (the WB_WRITEBACK counter) for this;
however, normal percpu counter reads are too inaccurate for our purposes
and summing the counter is too expensive.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c                |  5 +++++
 include/linux/backing-dev-defs.h |  1 +
 mm/backing-dev.c                 |  1 +
 mm/page-writeback.c              | 22 ++++++++++++++++++++--
 4 files changed, 27 insertions(+), 2 deletions(-)
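[Archive note: the accounting rule in this patch — an inode counts as under
writeback while it has at least one page tagged for writeback, so the counter
only changes on first-page/last-page transitions — can be sketched in plain
userspace C. All struct and helper names below are illustrative stand-ins,
not the kernel code.]

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative model: one counter per bdi_writeback of inodes that
 * currently have at least one page under writeback.  Not kernel code. */
struct wb_model {
	atomic_int writeback_inodes;
};

struct inode_model {
	int pages_under_writeback;	/* proxy for PAGECACHE_TAG_WRITEBACK */
};

/* A page enters writeback: only the 0 -> 1 page transition makes the
 * inode itself count as "under writeback". */
static void page_wb_start(struct wb_model *wb, struct inode_model *inode)
{
	if (inode->pages_under_writeback++ == 0)
		atomic_fetch_add(&wb->writeback_inodes, 1);
}

/* A page finishes writeback: only when the last tagged page completes
 * does the inode stop counting. */
static void page_wb_end(struct wb_model *wb, struct inode_model *inode)
{
	if (--inode->pages_under_writeback == 0)
		atomic_fetch_sub(&wb->writeback_inodes, 1);
}
```

This is why the atomic counter is cheap compared to summing the percpu
WB_WRITEBACK counter: it is touched once per inode transition, not per page.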

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e91980f49388..475681362b1c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -416,6 +416,11 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
 		inc_wb_stat(new_wb, WB_WRITEBACK);
 	}
 
+	if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+		atomic_dec(&old_wb->writeback_inodes);
+		atomic_inc(&new_wb->writeback_inodes);
+	}
+
 	wb_get(new_wb);
 
 	/*
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index fff9367a6348..148d889f2f7f 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -116,6 +116,7 @@ struct bdi_writeback {
 	struct list_head b_dirty_time;	/* time stamps are dirty */
 	spinlock_t list_lock;		/* protects the b_* lists */
 
+	atomic_t writeback_inodes;	/* number of inodes under writeback */
 	struct percpu_counter stat[NR_WB_STAT_ITEMS];
 
 	unsigned long congested;	/* WB_[a]sync_congested flags */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 576220acd686..342394ef1e02 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -293,6 +293,7 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	INIT_LIST_HEAD(&wb->b_dirty_time);
 	spin_lock_init(&wb->list_lock);
 
+	atomic_set(&wb->writeback_inodes, 0);
 	wb->bw_time_stamp = jiffies;
 	wb->balanced_dirty_ratelimit = INIT_BW;
 	wb->dirty_ratelimit = INIT_BW;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0062d5c57d41..1560f6626a3b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2719,6 +2719,16 @@ int clear_page_dirty_for_io(struct page *page)
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);
 
+static void wb_inode_writeback_start(struct bdi_writeback *wb)
+{
+	atomic_inc(&wb->writeback_inodes);
+}
+
+static void wb_inode_writeback_end(struct bdi_writeback *wb)
+{
+	atomic_dec(&wb->writeback_inodes);
+}
+
 int test_clear_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
@@ -2740,6 +2750,9 @@ int test_clear_page_writeback(struct page *page)
 
 				dec_wb_stat(wb, WB_WRITEBACK);
 				__wb_writeout_inc(wb);
+				if (!mapping_tagged(mapping,
+						    PAGECACHE_TAG_WRITEBACK))
+					wb_inode_writeback_end(wb);
 			}
 		}
 
@@ -2782,8 +2795,13 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 						   PAGECACHE_TAG_WRITEBACK);
 
 			xas_set_mark(&xas, PAGECACHE_TAG_WRITEBACK);
-			if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT)
-				inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
+			if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) {
+				struct bdi_writeback *wb = inode_to_wb(inode);
+
+				inc_wb_stat(wb, WB_WRITEBACK);
+				if (!on_wblist)
+					wb_inode_writeback_start(wb);
+			}
 
 			/*
 			 * We can come through here when swapping anonymous
-- 
2.26.2



* [PATCH 2/5] writeback: Reliably update bandwidth estimation
  2021-07-05 16:23 [PATCH 0/5] writeback: Fix bandwidth estimates Jan Kara
  2021-07-05 16:23 ` [PATCH 1/5] writeback: Track number of inodes under writeback Jan Kara
@ 2021-07-05 16:23 ` Jan Kara
  2021-07-05 16:23 ` [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload Jan Kara
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2021-07-05 16:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, Michael Stapelberg, linux-mm, Jan Kara

Currently we trigger writeback bandwidth estimation from
balance_dirty_pages() and from wb_writeback(). However, neither of these
runs when the system is relatively idle and writeback is triggered e.g.
by fsync(2). Make sure writeback bandwidth estimates are updated
reliably by triggering them from do_writepages().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c           |  3 ---
 include/linux/backing-dev.h | 19 ++++++++++++++++++
 include/linux/writeback.h   |  1 -
 mm/page-writeback.c         | 39 +++++++++++++++++++++++++------------
 4 files changed, 46 insertions(+), 16 deletions(-)
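[Archive note: the key new check is in wb_bandwidth_estimate_start(): the
stamps are reset only when the wb has been idle for over a second and has no
inodes under writeback. A minimal userspace model of that check follows; the
HZ value and the helper name are stand-ins, and jiffies-style unsigned
subtraction keeps the elapsed-time computation wraparound-safe.]

```c
#include <assert.h>
#include <stdbool.h>

#define HZ 1000UL			/* stand-in tick rate */
#define WB_BANDWIDTH_IDLE_JIF (HZ)	/* 1s idle threshold, as in the patch */

/* Sketch of the idle test: unsigned subtraction of jiffies-style stamps
 * is well defined across counter wraparound, so "now - stamp" yields the
 * true elapsed ticks even when the counter has wrapped. */
static bool wb_is_idle(unsigned long now, unsigned long bw_time_stamp,
		       int writeback_inodes)
{
	unsigned long elapsed = now - bw_time_stamp;	/* wraparound-safe */

	return elapsed > WB_BANDWIDTH_IDLE_JIF && writeback_inodes == 0;
}
```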

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 475681362b1c..b53d1513e510 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1858,7 +1858,6 @@ static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
 static long wb_writeback(struct bdi_writeback *wb,
 			 struct wb_writeback_work *work)
 {
-	unsigned long wb_start = jiffies;
 	long nr_pages = work->nr_pages;
 	unsigned long dirtied_before = jiffies;
 	struct inode *inode;
@@ -1912,8 +1911,6 @@ static long wb_writeback(struct bdi_writeback *wb,
 			progress = __writeback_inodes_wb(wb, work);
 		trace_writeback_written(wb, work);
 
-		wb_update_bandwidth(wb, wb_start);
-
 		/*
 		 * Did we write something? Try for more
 		 *
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 44df4fcef65c..a5d7d625dcc6 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -288,6 +288,17 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
 	return inode->i_wb;
 }
 
+static inline struct bdi_writeback *inode_to_wb_wbc(
+				struct inode *inode,
+				struct writeback_control *wbc)
+{
+	/*
+	 * If wbc does not have inode attached, it means cgroup writeback was
+	 * disabled when wbc started. Just use the default wb in that case.
+	 */
+	return wbc->wb ? wbc->wb : &inode_to_bdi(inode)->wb;
+}
+
 /**
  * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
  * @inode: target inode
@@ -366,6 +377,14 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
 	return &inode_to_bdi(inode)->wb;
 }
 
+static inline struct bdi_writeback *inode_to_wb_wbc(
+				struct inode *inode,
+				struct writeback_control *wbc)
+{
+	return inode_to_wb(inode);
+}
+
+
 static inline struct bdi_writeback *
 unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
 {
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 8e5c5bb16e2d..47cd732e012e 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -379,7 +379,6 @@ int dirty_writeback_centisecs_handler(struct ctl_table *table, int write,
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
 
-void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time);
 void balance_dirty_pages_ratelimited(struct address_space *mapping);
 bool wb_over_bg_thresh(struct bdi_writeback *wb);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 1560f6626a3b..1fecf8ebadb0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1338,7 +1338,6 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
 
 static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
 				  struct dirty_throttle_control *mdtc,
-				  unsigned long start_time,
 				  bool update_ratelimit)
 {
 	struct bdi_writeback *wb = gdtc->wb;
@@ -1358,13 +1357,6 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
 	dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED]);
 	written = percpu_counter_read(&wb->stat[WB_WRITTEN]);
 
-	/*
-	 * Skip quiet periods when disk bandwidth is under-utilized.
-	 * (at least 1s idle time between two flusher runs)
-	 */
-	if (elapsed > HZ && time_before(wb->bw_time_stamp, start_time))
-		goto snapshot;
-
 	if (update_ratelimit) {
 		domain_update_bandwidth(gdtc, now);
 		wb_update_dirty_ratelimit(gdtc, dirtied, elapsed);
@@ -1380,17 +1372,36 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
 	}
 	wb_update_write_bandwidth(wb, elapsed, written);
 
-snapshot:
 	wb->dirtied_stamp = dirtied;
 	wb->written_stamp = written;
 	wb->bw_time_stamp = now;
 }
 
-void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time)
+static void wb_update_bandwidth(struct bdi_writeback *wb)
 {
 	struct dirty_throttle_control gdtc = { GDTC_INIT(wb) };
 
-	__wb_update_bandwidth(&gdtc, NULL, start_time, false);
+	spin_lock(&wb->list_lock);
+	__wb_update_bandwidth(&gdtc, NULL, false);
+	spin_unlock(&wb->list_lock);
+}
+
+/* Interval after which we consider wb idle and don't estimate bandwidth */
+#define WB_BANDWIDTH_IDLE_JIF (HZ)
+
+static void wb_bandwidth_estimate_start(struct bdi_writeback *wb)
+{
+	unsigned long now = jiffies;
+	unsigned long elapsed = now - READ_ONCE(wb->bw_time_stamp);
+
+	if (elapsed > WB_BANDWIDTH_IDLE_JIF &&
+	    !atomic_read(&wb->writeback_inodes)) {
+		spin_lock(&wb->list_lock);
+		wb->dirtied_stamp = wb_stat(wb, WB_DIRTIED);
+		wb->written_stamp = wb_stat(wb, WB_WRITTEN);
+		wb->bw_time_stamp = now;
+		spin_unlock(&wb->list_lock);
+	}
 }
 
 /*
@@ -1719,7 +1730,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
 		if (time_is_before_jiffies(wb->bw_time_stamp +
 					   BANDWIDTH_INTERVAL)) {
 			spin_lock(&wb->list_lock);
-			__wb_update_bandwidth(gdtc, mdtc, start_time, true);
+			__wb_update_bandwidth(gdtc, mdtc, true);
 			spin_unlock(&wb->list_lock);
 		}
 
@@ -2344,9 +2355,12 @@ EXPORT_SYMBOL(generic_writepages);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
 	int ret;
+	struct bdi_writeback *wb;
 
 	if (wbc->nr_to_write <= 0)
 		return 0;
+	wb = inode_to_wb_wbc(mapping->host, wbc);
+	wb_bandwidth_estimate_start(wb);
 	while (1) {
 		if (mapping->a_ops->writepages)
 			ret = mapping->a_ops->writepages(mapping, wbc);
@@ -2357,6 +2371,7 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 		cond_resched();
 		congestion_wait(BLK_RW_ASYNC, HZ/50);
 	}
+	wb_update_bandwidth(wb);
 	return ret;
 }
 
-- 
2.26.2



* [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload
  2021-07-05 16:23 [PATCH 0/5] writeback: Fix bandwidth estimates Jan Kara
  2021-07-05 16:23 ` [PATCH 1/5] writeback: Track number of inodes under writeback Jan Kara
  2021-07-05 16:23 ` [PATCH 2/5] writeback: Reliably update bandwidth estimation Jan Kara
@ 2021-07-05 16:23 ` Jan Kara
  2021-07-07  7:40   ` Hillf Danton
  2021-07-05 16:23 ` [PATCH 4/5] writeback: Rename domain_update_bandwidth() Jan Kara
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2021-07-05 16:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, Michael Stapelberg, linux-mm, Jan Kara

Michael Stapelberg has reported that for a workload with short big spikes
of writes (the GCC linker seems to trigger this frequently) the write
throughput is heavily underestimated and tends to steadily sink until it
reaches zero. This has a rather bad impact on writeback throttling
(causing stalls). The problem is that the writeback throughput estimate
gets updated at most once per 200 ms. One update happens early after we
submit pages for writeback (at that point writeout of only a small
fraction of pages is completed and thus the observed throughput is tiny).
The next update happens only during the next write spike (updates happen
only from the inode writeback and dirty throttling code) and if that is
more than 1s after the previous spike, we decide the system was idle and
just ignore whatever was written until this moment.

Fix the problem by making sure the writeback throughput estimate is also
updated shortly after writeback completes, to get a reasonable estimate
of throughput for spiky workloads.

Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
Reported-by: Michael Stapelberg <stapelberg+linux@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h |  1 +
 include/linux/writeback.h        |  1 +
 mm/backing-dev.c                 | 10 ++++++++++
 mm/page-writeback.c              | 32 ++++++++++++++++----------------
 4 files changed, 28 insertions(+), 16 deletions(-)
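[Archive note: why a completion-time update helps can be seen from the shape
of the estimator: each sample is blended into a roughly 3s weighted average,
so an early sample taken while few pages have completed drags the average
down, and a later sample taken after completion pulls it back up. Below is a
simplified userspace sketch of that blending; the constants follow
mm/page-writeback.c, but the arithmetic is an approximation of
wb_update_write_bandwidth(), not the exact kernel code.]

```c
#include <assert.h>

#define HZ 1000UL	/* stand-in tick rate */

/* Approximation of the write-bandwidth estimator: blend the rate
 * observed over the last interval into a ~3s weighted average.  A single
 * short sample therefore moves the estimate only partially -- it cannot
 * collapse it to zero on its own. */
static unsigned long update_write_bandwidth(unsigned long old_bw,
					    unsigned long pages_written,
					    unsigned long elapsed)
{
	const unsigned long period = 3UL * HZ;	/* averaging window */
	unsigned long cur;

	if (elapsed == 0)
		elapsed = 1;
	cur = pages_written * HZ / elapsed;	/* pages per second */
	if (elapsed > period)
		elapsed = period;		/* old data fully aged out */
	return (cur * elapsed + old_bw * (period - elapsed)) / period;
}
```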

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 148d889f2f7f..57395f7bb192 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -143,6 +143,7 @@ struct bdi_writeback {
 	spinlock_t work_lock;		/* protects work_list & dwork scheduling */
 	struct list_head work_list;
 	struct delayed_work dwork;	/* work item used for writeback */
+	struct delayed_work bw_dwork;	/* work item used for bandwidth estimate */
 
 	unsigned long dirty_sleep;	/* last wait */
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 47cd732e012e..a45e09ed0711 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -379,6 +379,7 @@ int dirty_writeback_centisecs_handler(struct ctl_table *table, int write,
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
 
+void wb_update_bandwidth(struct bdi_writeback *wb);
 void balance_dirty_pages_ratelimited(struct address_space *mapping);
 bool wb_over_bg_thresh(struct bdi_writeback *wb);
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 342394ef1e02..9baa59d68110 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -271,6 +271,14 @@ void wb_wakeup_delayed(struct bdi_writeback *wb)
 	spin_unlock_bh(&wb->work_lock);
 }
 
+static void wb_update_bandwidth_workfn(struct work_struct *work)
+{
+	struct bdi_writeback *wb = container_of(to_delayed_work(work),
+						struct bdi_writeback, bw_dwork);
+
+	wb_update_bandwidth(wb);
+}
+
 /*
  * Initial write bandwidth: 100 MB/s
  */
@@ -303,6 +311,7 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	spin_lock_init(&wb->work_lock);
 	INIT_LIST_HEAD(&wb->work_list);
 	INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
+	INIT_DELAYED_WORK(&wb->bw_dwork, wb_update_bandwidth_workfn);
 	wb->dirty_sleep = jiffies;
 
 	err = fprop_local_init_percpu(&wb->completions, gfp);
@@ -351,6 +360,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 	flush_delayed_work(&wb->dwork);
 	WARN_ON(!list_empty(&wb->work_list));
+	flush_delayed_work(&wb->bw_dwork);
 }
 
 static void wb_exit(struct bdi_writeback *wb)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 1fecf8ebadb0..6a99ddca95c0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1346,14 +1346,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
 	unsigned long dirtied;
 	unsigned long written;
 
-	lockdep_assert_held(&wb->list_lock);
-
-	/*
-	 * rate-limit, only update once every 200ms.
-	 */
-	if (elapsed < BANDWIDTH_INTERVAL)
-		return;
-
+	spin_lock(&wb->list_lock);
 	dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED]);
 	written = percpu_counter_read(&wb->stat[WB_WRITTEN]);
 
@@ -1375,15 +1368,14 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
 	wb->dirtied_stamp = dirtied;
 	wb->written_stamp = written;
 	wb->bw_time_stamp = now;
+	spin_unlock(&wb->list_lock);
 }
 
-static void wb_update_bandwidth(struct bdi_writeback *wb)
+void wb_update_bandwidth(struct bdi_writeback *wb)
 {
 	struct dirty_throttle_control gdtc = { GDTC_INIT(wb) };
 
-	spin_lock(&wb->list_lock);
 	__wb_update_bandwidth(&gdtc, NULL, false);
-	spin_unlock(&wb->list_lock);
 }
 
 /* Interval after which we consider wb idle and don't estimate bandwidth */
@@ -1728,11 +1720,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
 			wb->dirty_exceeded = 1;
 
 		if (time_is_before_jiffies(wb->bw_time_stamp +
-					   BANDWIDTH_INTERVAL)) {
-			spin_lock(&wb->list_lock);
+					   BANDWIDTH_INTERVAL))
 			__wb_update_bandwidth(gdtc, mdtc, true);
-			spin_unlock(&wb->list_lock);
-		}
 
 		/* throttle according to the chosen dtc */
 		dirty_ratelimit = wb->dirty_ratelimit;
@@ -2371,7 +2360,13 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 		cond_resched();
 		congestion_wait(BLK_RW_ASYNC, HZ/50);
 	}
-	wb_update_bandwidth(wb);
+	/*
+	 * Usually few pages are written by now from those we've just submitted
+	 * but if there's constant writeback being submitted, this makes sure
+	 * writeback bandwidth is updated once in a while.
+	 */
+	if (time_is_before_jiffies(wb->bw_time_stamp + BANDWIDTH_INTERVAL))
+		wb_update_bandwidth(wb);
 	return ret;
 }
 
@@ -2742,6 +2737,11 @@ static void wb_inode_writeback_start(struct bdi_writeback *wb)
 static void wb_inode_writeback_end(struct bdi_writeback *wb)
 {
 	atomic_dec(&wb->writeback_inodes);
+	/*
+	 * Make sure estimate of writeback throughput gets
+	 * updated after writeback completed.
+	 */
+	queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL);
 }
 
 int test_clear_page_writeback(struct page *page)
-- 
2.26.2



* [PATCH 4/5] writeback: Rename domain_update_bandwidth()
  2021-07-05 16:23 [PATCH 0/5] writeback: Fix bandwidth estimates Jan Kara
                   ` (2 preceding siblings ...)
  2021-07-05 16:23 ` [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload Jan Kara
@ 2021-07-05 16:23 ` Jan Kara
  2021-07-05 16:23 ` [PATCH 5/5] writeback: Use READ_ONCE for unlocked reads of writeback stats Jan Kara
  2021-07-09 13:19   ` Michael Stapelberg
  5 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2021-07-05 16:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, Michael Stapelberg, linux-mm, Jan Kara

Rename domain_update_bandwidth() to domain_update_dirty_limit(). The
original name is a misnomer: the function has nothing to do with
bandwidth, it updates dirty limits.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/page-writeback.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 6a99ddca95c0..95abae9eecaf 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1153,8 +1153,8 @@ static void update_dirty_limit(struct dirty_throttle_control *dtc)
 	dom->dirty_limit = limit;
 }
 
-static void domain_update_bandwidth(struct dirty_throttle_control *dtc,
-				    unsigned long now)
+static void domain_update_dirty_limit(struct dirty_throttle_control *dtc,
+				      unsigned long now)
 {
 	struct wb_domain *dom = dtc_dom(dtc);
 
@@ -1351,7 +1351,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
 	written = percpu_counter_read(&wb->stat[WB_WRITTEN]);
 
 	if (update_ratelimit) {
-		domain_update_bandwidth(gdtc, now);
+		domain_update_dirty_limit(gdtc, now);
 		wb_update_dirty_ratelimit(gdtc, dirtied, elapsed);
 
 		/*
@@ -1359,7 +1359,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
 		 * compiler has no way to figure that out.  Help it.
 		 */
 		if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) && mdtc) {
-			domain_update_bandwidth(mdtc, now);
+			domain_update_dirty_limit(mdtc, now);
 			wb_update_dirty_ratelimit(mdtc, dirtied, elapsed);
 		}
 	}
-- 
2.26.2



* [PATCH 5/5] writeback: Use READ_ONCE for unlocked reads of writeback stats
  2021-07-05 16:23 [PATCH 0/5] writeback: Fix bandwidth estimates Jan Kara
                   ` (3 preceding siblings ...)
  2021-07-05 16:23 ` [PATCH 4/5] writeback: Rename domain_update_bandwidth() Jan Kara
@ 2021-07-05 16:23 ` Jan Kara
  2021-07-09 13:19   ` Michael Stapelberg
  5 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2021-07-05 16:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, Michael Stapelberg, linux-mm, Jan Kara

We do some unlocked reads of writeback statistics such as
avg_write_bandwidth, dirty_ratelimit, or bw_time_stamp. Generally we are
fine with getting somewhat out-of-date values, but actually getting
different values in different parts of a function because the compiler
decided to reload the value from its original memory location could
confuse calculations. Use READ_ONCE for these unlocked accesses to be on
the safe side.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/page-writeback.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)
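[Archive note: the compiler-reload hazard can be modeled in userspace.
READ_ONCE boils down to an access through a volatile-qualified pointer, which
forces exactly one load. The macro below is a simplified stand-in; the
kernel's real definition lives in include/asm-generic/rwonce.h and handles
more cases.]

```c
#include <assert.h>

/* Simplified READ_ONCE: the volatile cast forbids the compiler from
 * splitting, duplicating, or re-reading the load, so every use of 'v'
 * below sees the same value even if another thread updates the variable
 * concurrently. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

static unsigned long bw_time_stamp;	/* stands in for wb->bw_time_stamp */

static unsigned long snapshot_stamp(void)
{
	unsigned long v = READ_ONCE(bw_time_stamp);	/* single load */

	/* Without READ_ONCE the compiler may legally reload
	 * bw_time_stamp at every use and observe different values. */
	return v;
}
```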

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 95abae9eecaf..736d9e996191 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -189,7 +189,7 @@ static struct fprop_local_percpu *wb_memcg_completions(struct bdi_writeback *wb)
 static void wb_min_max_ratio(struct bdi_writeback *wb,
 			     unsigned long *minp, unsigned long *maxp)
 {
-	unsigned long this_bw = wb->avg_write_bandwidth;
+	unsigned long this_bw = READ_ONCE(wb->avg_write_bandwidth);
 	unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth);
 	unsigned long long min = wb->bdi->min_ratio;
 	unsigned long long max = wb->bdi->max_ratio;
@@ -898,7 +898,7 @@ static long long pos_ratio_polynom(unsigned long setpoint,
 static void wb_position_ratio(struct dirty_throttle_control *dtc)
 {
 	struct bdi_writeback *wb = dtc->wb;
-	unsigned long write_bw = wb->avg_write_bandwidth;
+	unsigned long write_bw = READ_ONCE(wb->avg_write_bandwidth);
 	unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
 	unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
 	unsigned long wb_thresh = dtc->wb_thresh;
@@ -1342,11 +1342,12 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
 {
 	struct bdi_writeback *wb = gdtc->wb;
 	unsigned long now = jiffies;
-	unsigned long elapsed = now - wb->bw_time_stamp;
+	unsigned long elapsed;
 	unsigned long dirtied;
 	unsigned long written;
 
 	spin_lock(&wb->list_lock);
+	elapsed = now - wb->bw_time_stamp;
 	dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED]);
 	written = percpu_counter_read(&wb->stat[WB_WRITTEN]);
 
@@ -1416,7 +1417,7 @@ static unsigned long dirty_poll_interval(unsigned long dirty,
 static unsigned long wb_max_pause(struct bdi_writeback *wb,
 				  unsigned long wb_dirty)
 {
-	unsigned long bw = wb->avg_write_bandwidth;
+	unsigned long bw = READ_ONCE(wb->avg_write_bandwidth);
 	unsigned long t;
 
 	/*
@@ -1438,8 +1439,8 @@ static long wb_min_pause(struct bdi_writeback *wb,
 			 unsigned long dirty_ratelimit,
 			 int *nr_dirtied_pause)
 {
-	long hi = ilog2(wb->avg_write_bandwidth);
-	long lo = ilog2(wb->dirty_ratelimit);
+	long hi = ilog2(READ_ONCE(wb->avg_write_bandwidth));
+	long lo = ilog2(READ_ONCE(wb->dirty_ratelimit));
 	long t;		/* target pause */
 	long pause;	/* estimated next pause */
 	int pages;	/* target nr_dirtied_pause */
@@ -1719,12 +1720,12 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
 		if (dirty_exceeded && !wb->dirty_exceeded)
 			wb->dirty_exceeded = 1;
 
-		if (time_is_before_jiffies(wb->bw_time_stamp +
+		if (time_is_before_jiffies(READ_ONCE(wb->bw_time_stamp) +
 					   BANDWIDTH_INTERVAL))
 			__wb_update_bandwidth(gdtc, mdtc, true);
 
 		/* throttle according to the chosen dtc */
-		dirty_ratelimit = wb->dirty_ratelimit;
+		dirty_ratelimit = READ_ONCE(wb->dirty_ratelimit);
 		task_ratelimit = ((u64)dirty_ratelimit * sdtc->pos_ratio) >>
 							RATELIMIT_CALC_SHIFT;
 		max_pause = wb_max_pause(wb, sdtc->wb_dirty);
@@ -2365,7 +2366,8 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 	 * but if there's constant writeback being submitted, this makes sure
 	 * writeback bandwidth is updated once in a while.
 	 */
-	if (time_is_before_jiffies(wb->bw_time_stamp + BANDWIDTH_INTERVAL))
+	if (time_is_before_jiffies(READ_ONCE(wb->bw_time_stamp) +
+				   BANDWIDTH_INTERVAL))
 		wb_update_bandwidth(wb);
 	return ret;
 }
-- 
2.26.2



* Re: [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload
  2021-07-05 16:23 ` [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload Jan Kara
@ 2021-07-07  7:40   ` Hillf Danton
  2021-07-07  9:51     ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Hillf Danton @ 2021-07-07  7:40 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andrew Morton, linux-fsdevel, Michael Stapelberg, linux-mm

On Mon,  5 Jul 2021 18:23:17 +0200 Jan Kara wrote:
>
>Michael Stapelberg has reported that for workload with short big spikes
>of writes (GCC linker seem to trigger this frequently) the write
>throughput is heavily underestimated and tends to steadily sink until it
>reaches zero. This has rather bad impact on writeback throttling
>(causing stalls). The problem is that writeback throughput estimate gets
>updated at most once per 200 ms. One update happens early after we
>submit pages for writeback (at that point writeout of only small
>fraction of pages is completed and thus observed throughput is tiny).
>Next update happens only during the next write spike (updates happen
>only from inode writeback and dirty throttling code) and if that is
>more than 1s after previous spike, we decide system was idle and just
>ignore whatever was written until this moment.
>
>Fix the problem by making sure writeback throughput estimate is also
>updated shortly after writeback completes to get reasonable estimate of
>throughput for spiky workloads.
>
>Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
>Reported-by: Michael Stapelberg <stapelberg+linux@google.com>
>Signed-off-by: Jan Kara <jack@suse.cz>
>---
> include/linux/backing-dev-defs.h |  1 +
> include/linux/writeback.h        |  1 +
> mm/backing-dev.c                 | 10 ++++++++++
> mm/page-writeback.c              | 32 ++++++++++++++++----------------
> 4 files changed, 28 insertions(+), 16 deletions(-)
>
>diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
>index 148d889f2f7f..57395f7bb192 100644
>--- a/include/linux/backing-dev-defs.h
>+++ b/include/linux/backing-dev-defs.h
>@@ -143,6 +143,7 @@ struct bdi_writeback {
> 	spinlock_t work_lock;		/* protects work_list & dwork scheduling */
> 	struct list_head work_list;
> 	struct delayed_work dwork;	/* work item used for writeback */
>+	struct delayed_work bw_dwork;	/* work item used for bandwidth estimate */
>
> 	unsigned long dirty_sleep;	/* last wait */
>
>diff --git a/include/linux/writeback.h b/include/linux/writeback.h
>index 47cd732e012e..a45e09ed0711 100644
>--- a/include/linux/writeback.h
>+++ b/include/linux/writeback.h
>@@ -379,6 +379,7 @@ int dirty_writeback_centisecs_handler(struct ctl_table *table, int write,
> void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
> unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
>
>+void wb_update_bandwidth(struct bdi_writeback *wb);
> void balance_dirty_pages_ratelimited(struct address_space *mapping);
> bool wb_over_bg_thresh(struct bdi_writeback *wb);
>
>diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>index 342394ef1e02..9baa59d68110 100644
>--- a/mm/backing-dev.c
>+++ b/mm/backing-dev.c
>@@ -271,6 +271,14 @@ void wb_wakeup_delayed(struct bdi_writeback *wb)
> 	spin_unlock_bh(&wb->work_lock);
> }
>
>+static void wb_update_bandwidth_workfn(struct work_struct *work)
>+{
>+	struct bdi_writeback *wb = container_of(to_delayed_work(work),
>+						struct bdi_writeback, bw_dwork);
>+
>+	wb_update_bandwidth(wb);
>+}
>+
> /*
>  * Initial write bandwidth: 100 MB/s
>  */
>@@ -303,6 +311,7 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
> 	spin_lock_init(&wb->work_lock);
> 	INIT_LIST_HEAD(&wb->work_list);
> 	INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
>+	INIT_DELAYED_WORK(&wb->bw_dwork, wb_update_bandwidth_workfn);
> 	wb->dirty_sleep = jiffies;
>
> 	err = fprop_local_init_percpu(&wb->completions, gfp);
>@@ -351,6 +360,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
> 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
> 	flush_delayed_work(&wb->dwork);
> 	WARN_ON(!list_empty(&wb->work_list));
>+	flush_delayed_work(&wb->bw_dwork);
> }
>
> static void wb_exit(struct bdi_writeback *wb)
>diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>index 1fecf8ebadb0..6a99ddca95c0 100644
>--- a/mm/page-writeback.c
>+++ b/mm/page-writeback.c
>@@ -1346,14 +1346,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
> 	unsigned long dirtied;
> 	unsigned long written;
>
>-	lockdep_assert_held(&wb->list_lock);
>-
>-	/*
>-	 * rate-limit, only update once every 200ms.
>-	 */
>-	if (elapsed < BANDWIDTH_INTERVAL)
>-		return;

Please leave it as it is if you are not dumping the 200ms rule.

>-
>+	spin_lock(&wb->list_lock);
> 	dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED]);
> 	written = percpu_counter_read(&wb->stat[WB_WRITTEN]);
>
>@@ -1375,15 +1368,14 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
> 	wb->dirtied_stamp = dirtied;
> 	wb->written_stamp = written;
> 	wb->bw_time_stamp = now;
>+	spin_unlock(&wb->list_lock);
> }
>
>-static void wb_update_bandwidth(struct bdi_writeback *wb)
>+void wb_update_bandwidth(struct bdi_writeback *wb)
> {
> 	struct dirty_throttle_control gdtc = { GDTC_INIT(wb) };
>
>-	spin_lock(&wb->list_lock);
> 	__wb_update_bandwidth(&gdtc, NULL, false);
>-	spin_unlock(&wb->list_lock);
> }
>
> /* Interval after which we consider wb idle and don't estimate bandwidth */
>@@ -1728,11 +1720,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
> 			wb->dirty_exceeded = 1;
>
> 		if (time_is_before_jiffies(wb->bw_time_stamp +
>-					   BANDWIDTH_INTERVAL)) {
>-			spin_lock(&wb->list_lock);
>+					   BANDWIDTH_INTERVAL))
> 			__wb_update_bandwidth(gdtc, mdtc, true);
>-			spin_unlock(&wb->list_lock);
>-		}
>
> 		/* throttle according to the chosen dtc */
> 		dirty_ratelimit = wb->dirty_ratelimit;
>@@ -2371,7 +2360,13 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
> 		cond_resched();
> 		congestion_wait(BLK_RW_ASYNC, HZ/50);
> 	}
>-	wb_update_bandwidth(wb);
>+	/*
>+	 * Usually few pages are written by now from those we've just submitted
>+	 * but if there's constant writeback being submitted, this makes sure
>+	 * writeback bandwidth is updated once in a while.
>+	 */
>+	if (time_is_before_jiffies(wb->bw_time_stamp + BANDWIDTH_INTERVAL))
>+		wb_update_bandwidth(wb);
> 	return ret;
> }
>
>@@ -2742,6 +2737,11 @@ static void wb_inode_writeback_start(struct bdi_writeback *wb)
> static void wb_inode_writeback_end(struct bdi_writeback *wb)
> {
> 	atomic_dec(&wb->writeback_inodes);
>+	/*
>+	 * Make sure estimate of writeback throughput gets
>+	 * updated after writeback completed.
>+	 */
>+	queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL);
> }

This is a bogus estimate - it does not break the 200ms rule but walks around it
without specifying why 300ms is not good.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload
  2021-07-07  7:40   ` Hillf Danton
@ 2021-07-07  9:51     ` Jan Kara
  2021-07-08 12:17       ` Hillf Danton
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2021-07-07  9:51 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Jan Kara, Andrew Morton, linux-fsdevel, Michael Stapelberg, linux-mm

On Wed 07-07-21 15:40:17, Hillf Danton wrote:
> On Mon,  5 Jul 2021 18:23:17 +0200 Jan Kara wrote:
> >
> >Michael Stapelberg has reported that for workloads with short big spikes
> >of writes (the GCC linker seems to trigger this frequently) the write
> >throughput is heavily underestimated and tends to steadily sink until it
> >reaches zero. This has a rather bad impact on writeback throttling
> >(causing stalls). The problem is that writeback throughput estimate gets
> >updated at most once per 200 ms. One update happens early after we
> >submit pages for writeback (at that point writeout of only small
> >fraction of pages is completed and thus observed throughput is tiny).
> >Next update happens only during the next write spike (updates happen
> >only from inode writeback and dirty throttling code) and if that is
> >more than 1s after previous spike, we decide system was idle and just
> >ignore whatever was written until this moment.
> >
> >Fix the problem by making sure writeback throughput estimate is also
> >updated shortly after writeback completes to get reasonable estimate of
> >throughput for spiky workloads.
> >
> >Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
> >Reported-by: Michael Stapelberg <stapelberg+linux@google.com>
> >Signed-off-by: Jan Kara <jack@suse.cz>
...
> >diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> >index 1fecf8ebadb0..6a99ddca95c0 100644
> >--- a/mm/page-writeback.c
> >+++ b/mm/page-writeback.c
> >@@ -1346,14 +1346,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
> > 	unsigned long dirtied;
> > 	unsigned long written;
> >
> >-	lockdep_assert_held(&wb->list_lock);
> >-
> >-	/*
> >-	 * rate-limit, only update once every 200ms.
> >-	 */
> >-	if (elapsed < BANDWIDTH_INTERVAL)
> >-		return;
> 
> Please leave it as it is if you are not dumping the 200ms rule.

Well, that could break the delayed update scheduled after the end of
writeback, and for no good reason. The problematic ordering is like:

end writeback on inode1
  queue_delayed_work() - queues delayed work after BANDWIDTH_INTERVAL

__wb_update_bandwidth() called e.g. from balance_dirty_pages()
  wb->bw_time_stamp = now;

end writeback on inode2
  queue_delayed_work() - does nothing since work is already queued

delayed work calls __wb_update_bandwidth() - nothing is done since elapsed
< BANDWIDTH_INTERVAL and we may thus miss reflecting writeback of inode2 in
our estimates.

> >@@ -2742,6 +2737,11 @@ static void wb_inode_writeback_start(struct bdi_writeback *wb)
> > static void wb_inode_writeback_end(struct bdi_writeback *wb)
> > {
> > 	atomic_dec(&wb->writeback_inodes);
> >+	/*
> >+	 * Make sure estimate of writeback throughput gets
> >+	 * updated after writeback completed.
> >+	 */
> >+	queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL);
> > }
> 
> This is a bogus estimate - it does not break the 200ms rule but walks
> around it without specifying why 300ms is not good.

Well, you're right that BANDWIDTH_INTERVAL is somewhat arbitrary here. We
do want some batching of bandwidth updates after writeback completes for
the cases where lots of inodes end their writeback in a quick succession.
I've picked BANDWIDTH_INTERVAL here as that's the batching of other
bandwidth updates as well so it kind of makes sense. I'll add a comment why
BANDWIDTH_INTERVAL is picked here.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload
  2021-07-07  9:51     ` Jan Kara
@ 2021-07-08 12:17       ` Hillf Danton
  2021-07-08 16:43         ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Hillf Danton @ 2021-07-08 12:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andrew Morton, linux-fsdevel, Michael Stapelberg, linux-mm

On Wed, 7 Jul 2021 11:51:38 +0200 Jan Kara wrote:
>On Wed 07-07-21 15:40:17, Hillf Danton wrote:
>> On Mon,  5 Jul 2021 18:23:17 +0200 Jan Kara wrote:
>> >
>> >Michael Stapelberg has reported that for workloads with short big spikes
>> >of writes (the GCC linker seems to trigger this frequently) the write
>> >throughput is heavily underestimated and tends to steadily sink until it
>> >reaches zero. This has a rather bad impact on writeback throttling
>> >(causing stalls). The problem is that writeback throughput estimate gets
>> >updated at most once per 200 ms. One update happens early after we
>> >submit pages for writeback (at that point writeout of only small
>> >fraction of pages is completed and thus observed throughput is tiny).
>> >Next update happens only during the next write spike (updates happen
>> >only from inode writeback and dirty throttling code) and if that is
>> >more than 1s after previous spike, we decide system was idle and just
>> >ignore whatever was written until this moment.
>> >
>> >Fix the problem by making sure writeback throughput estimate is also
>> >updated shortly after writeback completes to get reasonable estimate of
>> >throughput for spiky workloads.
>> >
>> >Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
>> >Reported-by: Michael Stapelberg <stapelberg+linux@google.com>
>> >Signed-off-by: Jan Kara <jack@suse.cz>
>...
>> >diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> >index 1fecf8ebadb0..6a99ddca95c0 100644
>> >--- a/mm/page-writeback.c
>> >+++ b/mm/page-writeback.c
>> >@@ -1346,14 +1346,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
>> > 	unsigned long dirtied;
>> > 	unsigned long written;
>> >
>> >-	lockdep_assert_held(&wb->list_lock);
>> >-
>> >-	/*
>> >-	 * rate-limit, only update once every 200ms.
>> >-	 */
>> >-	if (elapsed < BANDWIDTH_INTERVAL)
>> >-		return;
>> 
>> Please leave it as it is if you are not dumping the 200ms rule.
>
>Well, that could break the delayed update scheduled after the end of
>writeback, and for no good reason. The problematic ordering is like:

After another look at 2/5, you are cutting the rule, which is worth a
separate patch.
>
>end writeback on inode1
>  queue_delayed_work() - queues delayed work after BANDWIDTH_INTERVAL
>
>__wb_update_bandwidth() called e.g. from balance_dirty_pages()
>  wb->bw_time_stamp = now;
>
>end writeback on inode2
>  queue_delayed_work() - does nothing since work is already queued
>
>delayed work calls __wb_update_bandwidth() - nothing is done since elapsed
>< BANDWIDTH_INTERVAL and we may thus miss reflecting writeback of inode2 in
>our estimates.

Your example says the estimate based on inode2 is torpedoed by a random
update, and you are looking to make that estimate meaningful at the cost
of breaking the rule - how different is it from the current one if the
estimate is derived from a 20ms-elapsed interval at inode2? Is it likely to
see another palpably different result at inode3 from a 50ms-elapsed interval?
>
>> >@@ -2742,6 +2737,11 @@ static void wb_inode_writeback_start(struct bdi_writeback *wb)
>> > static void wb_inode_writeback_end(struct bdi_writeback *wb)
>> > {
>> > 	atomic_dec(&wb->writeback_inodes);
>> >+	/*
>> >+	 * Make sure estimate of writeback throughput gets
>> >+	 * updated after writeback completed.
>> >+	 */
>> >+	queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL);
>> > }
>> 
>> This is a bogus estimate - it does not break the 200ms rule but walks
>> around it without specifying why 300ms is not good.
>
>Well, you're right that BANDWIDTH_INTERVAL is somewhat arbitrary here. We
>do want some batching of bandwidth updates after writeback completes for
>the cases where lots of inodes end their writeback in a quick succession.
>I've picked BANDWIDTH_INTERVAL here as that's the batching of other
>bandwidth updates as well so it kind of makes sense. I'll add a comment why
>BANDWIDTH_INTERVAL is picked here.
>
>								Honza
>-- 
>Jan Kara <jack@suse.com>
>SUSE Labs, CR
>
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload
  2021-07-08 12:17       ` Hillf Danton
@ 2021-07-08 16:43         ` Jan Kara
  2021-07-09  8:01           ` Hillf Danton
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2021-07-08 16:43 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Jan Kara, Andrew Morton, linux-fsdevel, Michael Stapelberg, linux-mm

On Thu 08-07-21 20:17:51, Hillf Danton wrote:
> On Wed, 7 Jul 2021 11:51:38 +0200 Jan Kara wrote:
> >On Wed 07-07-21 15:40:17, Hillf Danton wrote:
> >> On Mon,  5 Jul 2021 18:23:17 +0200 Jan Kara wrote:
> >> >
> >> >Michael Stapelberg has reported that for workloads with short big spikes
> >> >of writes (the GCC linker seems to trigger this frequently) the write
> >> >throughput is heavily underestimated and tends to steadily sink until it
> >> >reaches zero. This has a rather bad impact on writeback throttling
> >> >(causing stalls). The problem is that writeback throughput estimate gets
> >> >updated at most once per 200 ms. One update happens early after we
> >> >submit pages for writeback (at that point writeout of only small
> >> >fraction of pages is completed and thus observed throughput is tiny).
> >> >Next update happens only during the next write spike (updates happen
> >> >only from inode writeback and dirty throttling code) and if that is
> >> >more than 1s after previous spike, we decide system was idle and just
> >> >ignore whatever was written until this moment.
> >> >
> >> >Fix the problem by making sure writeback throughput estimate is also
> >> >updated shortly after writeback completes to get reasonable estimate of
> >> >throughput for spiky workloads.
> >> >
> >> >Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
> >> >Reported-by: Michael Stapelberg <stapelberg+linux@google.com>
> >> >Signed-off-by: Jan Kara <jack@suse.cz>
> >...
> >> >diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> >> >index 1fecf8ebadb0..6a99ddca95c0 100644
> >> >--- a/mm/page-writeback.c
> >> >+++ b/mm/page-writeback.c
> >> >@@ -1346,14 +1346,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
> >> > 	unsigned long dirtied;
> >> > 	unsigned long written;
> >> >
> >> >-	lockdep_assert_held(&wb->list_lock);
> >> >-
> >> >-	/*
> >> >-	 * rate-limit, only update once every 200ms.
> >> >-	 */
> >> >-	if (elapsed < BANDWIDTH_INTERVAL)
> >> >-		return;
> >> 
> >> Please leave it as it is if you are not dumping the 200ms rule.
> >
> >Well, that could break the delayed update scheduled after the end of
> >writeback, and for no good reason. The problematic ordering is like:
> 
> After another look at 2/5, you are cutting the rule, which is worth a
> separate patch.

The only updates that can break the 200ms rule are the ones added in this
patch. I don't think separating the removal of the 200ms check for that one
case really brings much clarity. It would rather bring "what if" questions
to this patch...

> >end writeback on inode1
> >  queue_delayed_work() - queues delayed work after BANDWIDTH_INTERVAL
> >
> >__wb_update_bandwidth() called e.g. from balance_dirty_pages()
> >  wb->bw_time_stamp = now;
> >
> >end writeback on inode2
> >  queue_delayed_work() - does nothing since work is already queued
> >
> >delayed work calls __wb_update_bandwidth() - nothing is done since elapsed
> >< BANDWIDTH_INTERVAL and we may thus miss reflecting writeback of inode2 in
> >our estimates.
> 
> Your example says the estimate based on inode2 is torpedoed by a random
> update, and you are looking to make that estimate meaningful at the cost
> of breaking the rule - how different is it from the current one if the
> estimate is derived from a 20ms-elapsed interval at inode2? Is it likely to
> see another palpably different result at inode3 from a 50ms-elapsed interval?

I'm not sure I understand your question correctly, but updates after intervals
shorter than 200ms should not disturb the estimates much.
wb_update_write_bandwidth() effectively uses formula:

	bandwidth = (written + bandwidth * (period - elapsed)) / period

where 'period' is 3 seconds. So we compute average bandwidth over last 3
seconds where amount written in 'elapsed' interval is 'written' pages. If
'elapsed' is small, the influence of current sample on reducing estimated
bandwidth is going to be small as well.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload
  2021-07-08 16:43         ` Jan Kara
@ 2021-07-09  8:01           ` Hillf Danton
  0 siblings, 0 replies; 15+ messages in thread
From: Hillf Danton @ 2021-07-09  8:01 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andrew Morton, linux-fsdevel, Michael Stapelberg, linux-mm

On Thu, 8 Jul 2021 18:43:01 +0200 Jan Kara wrote:
>On Thu 08-07-21 20:17:51, Hillf Danton wrote:
>> On Wed, 7 Jul 2021 11:51:38 +0200 Jan Kara wrote:
>> >On Wed 07-07-21 15:40:17, Hillf Danton wrote:
>> >> On Mon,  5 Jul 2021 18:23:17 +0200 Jan Kara wrote:
>> >> >
>> >> >Michael Stapelberg has reported that for workloads with short big spikes
>> >> >of writes (the GCC linker seems to trigger this frequently) the write
>> >> >throughput is heavily underestimated and tends to steadily sink until it
>> >> >reaches zero. This has a rather bad impact on writeback throttling
>> >> >(causing stalls). The problem is that writeback throughput estimate gets
>> >> >updated at most once per 200 ms. One update happens early after we
>> >> >submit pages for writeback (at that point writeout of only small
>> >> >fraction of pages is completed and thus observed throughput is tiny).
>> >> >Next update happens only during the next write spike (updates happen
>> >> >only from inode writeback and dirty throttling code) and if that is
>> >> >more than 1s after previous spike, we decide system was idle and just
>> >> >ignore whatever was written until this moment.
>> >> >
>> >> >Fix the problem by making sure writeback throughput estimate is also
>> >> >updated shortly after writeback completes to get reasonable estimate of
>> >> >throughput for spiky workloads.
>> >> >
>> >> >Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
>> >> >Reported-by: Michael Stapelberg <stapelberg+linux@google.com>
>> >> >Signed-off-by: Jan Kara <jack@suse.cz>
>> >...
>> >> >diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> >> >index 1fecf8ebadb0..6a99ddca95c0 100644
>> >> >--- a/mm/page-writeback.c
>> >> >+++ b/mm/page-writeback.c
>> >> >@@ -1346,14 +1346,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
>> >> > 	unsigned long dirtied;
>> >> > 	unsigned long written;
>> >> >
>> >> >-	lockdep_assert_held(&wb->list_lock);
>> >> >-
>> >> >-	/*
>> >> >-	 * rate-limit, only update once every 200ms.
>> >> >-	 */
>> >> >-	if (elapsed < BANDWIDTH_INTERVAL)
>> >> >-		return;
>> >> 
>> >> Please leave it as it is if you are not dumping the 200ms rule.
>> >
>> >Well, that could break the delayed update scheduled after the end of
>> >writeback, and for no good reason. The problematic ordering is like:
>> 
>> After another look at 2/5, you are cutting the rule, which is worth a
>> separate patch.
>
>The only updates that can break the 200ms rule are the ones added in this
>patch. I don't think separating the removal of the 200ms check for that one
>case really brings much clarity. It would rather bring "what if" questions
>to this patch...
>
>> >end writeback on inode1
>> >  queue_delayed_work() - queues delayed work after BANDWIDTH_INTERVAL
>> >
>> >__wb_update_bandwidth() called e.g. from balance_dirty_pages()
>> >  wb->bw_time_stamp = now;
>> >
>> >end writeback on inode2
>> >  queue_delayed_work() - does nothing since work is already queued
>> >
>> >delayed work calls __wb_update_bandwidth() - nothing is done since elapsed
>> >< BANDWIDTH_INTERVAL and we may thus miss reflecting writeback of inode2 in
>> >our estimates.
>> 
>> Your example says the estimate based on inode2 is torpedoed by a random
>> update, and you are looking to make that estimate meaningful at the cost
>> of breaking the rule - how different is it from the current one if the
>> estimate is derived from a 20ms-elapsed interval at inode2? Is it likely to
>> see another palpably different result at inode3 from a 50ms-elapsed interval?
>
>I'm not sure I understand your question correctly, but updates after intervals
>shorter than 200ms should not disturb the estimates much.
>wb_update_write_bandwidth() effectively uses formula:
>
>	bandwidth = (written + bandwidth * (period - elapsed)) / period
>
>where 'period' is 3 seconds. So we compute average bandwidth over last 3
>seconds where amount written in 'elapsed' interval is 'written' pages. If
>'elapsed' is small, the influence of current sample on reducing estimated
>bandwidth is going to be small as well.

Correct. Without the 3s period, which in combination with the filters helps
prevent any single sample in a 20ms or 200ms interval from overturning the
estimated bandwidth, we would find that the estimate built up over an hour is
a mile from what disk vendors claim. And over a day-long interval I am inclined
to ignore the difference between the estimated BW and the bare-metal BW, on the
assumption that no disk will be found idle while dirty pages stay above the
background threshold.

Hillf


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/5] writeback: Fix bandwidth estimates
  2021-07-05 16:23 [PATCH 0/5] writeback: Fix bandwidth estimates Jan Kara
@ 2021-07-09 13:19   ` Michael Stapelberg
  2021-07-05 16:23 ` [PATCH 2/5] writeback: Reliably update bandwidth estimation Jan Kara
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Michael Stapelberg @ 2021-07-09 13:19 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andrew Morton, linux-fsdevel, linux-mm

Thanks for sending this patch series!

I have used the mmap.c reproducer as before, with the following parameters:
* mkdir /tmp/mnt
* fusermount -u /tmp/mnt; /root/fuse-2.9.9/example/fusexmp_fh -f /tmp/mnt
* dd if=/dev/urandom of=/tmp/was bs=1M count=99
* while :; do grep ^Bdi /sys/kernel/debug/bdi/0:44/stats; sleep 0.1; done
* while :; do time WORKAROUND=1 ~/mmap /tmp/was
/tmp/mnt/tmp/stapelberg.1; sleep 5; done

Previously, after a few iterations, the BdiWriteBandwidth measure
would gradually approach 0.

With your patch series applied, the BdiWriteBandwidth is updated much
more quickly, and converges to ≈16000 kBps.
When I start copying more quickly, the bandwidth measure rises quickly.

As far as I understand, this should fix the problem (provided 16000
kBps is an okay value).
Certainly, I don’t see the downward spiral either with your patches :)


On Mon, 5 Jul 2021 at 18:23, Jan Kara <jack@suse.cz> wrote:
>
> Hello,
>
> this patch series fixes the estimate of writeback throughput when the device
> is not fully busy doing writeback. Michael Stapelberg has reported that such
> a workload (e.g. generated by linking) tends to push the estimated throughput
> down to 0 and as a result writeback on the device is practically stalled.
>
> The first three patches fix the reported issue, the remaining two patches
> are unrelated cleanups of problems I've noticed when reading the code.
>
> Michael, can you try whether these patches fix the problems you've observed?
> I cannot trigger the "bandwidth going down" spiral with them anymore.
>
>                                                                 Honza

^ permalink raw reply	[flat|nested] 15+ messages in thread


* Re: [PATCH 0/5] writeback: Fix bandwidth estimates
  2021-07-09 13:19   ` Michael Stapelberg
  (?)
@ 2021-07-12 16:27   ` Jan Kara
  2021-07-13  8:15     ` Michael Stapelberg
  -1 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2021-07-12 16:27 UTC (permalink / raw)
  To: Michael Stapelberg; +Cc: Jan Kara, Andrew Morton, linux-fsdevel, linux-mm

On Fri 09-07-21 15:19:17, Michael Stapelberg wrote:
> Thanks for sending this patch series!
> 
> I have used the mmap.c reproducer as before, with the following parameters:
> * mkdir /tmp/mnt
> * fusermount -u /tmp/mnt; /root/fuse-2.9.9/example/fusexmp_fh -f /tmp/mnt
> * dd if=/dev/urandom of=/tmp/was bs=1M count=99
> * while :; do grep ^Bdi /sys/kernel/debug/bdi/0:44/stats; sleep 0.1; done
> * while :; do time WORKAROUND=1 ~/mmap /tmp/was
> /tmp/mnt/tmp/stapelberg.1; sleep 5; done
> 
> Previously, after a few iterations, the BdiWriteBandwidth measure
> would gradually approach 0.
> 
> With your patch series applied, the BdiWriteBandwidth is updated much
> more quickly, and converges to ≈16000 kBps.
> When I start copying more quickly, the bandwidth measure rises quickly.
> 
> As far as I understand, this should fix the problem (provided 16000
> kBps is an okay value).
> Certainly, I don’t see the downward spiral either with your patches :)

Thanks for testing! Can I add your Tested-by tag?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/5] writeback: Fix bandwidth estimates
  2021-07-12 16:27   ` Jan Kara
@ 2021-07-13  8:15     ` Michael Stapelberg
  0 siblings, 0 replies; 15+ messages in thread
From: Michael Stapelberg @ 2021-07-13  8:15 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andrew Morton, linux-fsdevel, linux-mm


Yes! Thank you.

On Mon, 12 Jul 2021 at 18:27, Jan Kara <jack@suse.cz> wrote:

> On Fri 09-07-21 15:19:17, Michael Stapelberg wrote:
> > Thanks for sending this patch series!
> >
> > I have used the mmap.c reproducer as before, with the following
> > parameters:
> > * mkdir /tmp/mnt
> > * fusermount -u /tmp/mnt; /root/fuse-2.9.9/example/fusexmp_fh -f /tmp/mnt
> > * dd if=/dev/urandom of=/tmp/was bs=1M count=99
> > * while :; do grep ^Bdi /sys/kernel/debug/bdi/0:44/stats; sleep 0.1; done
> > * while :; do time WORKAROUND=1 ~/mmap /tmp/was
> > /tmp/mnt/tmp/stapelberg.1; sleep 5; done
> >
> > Previously, after a few iterations, the BdiWriteBandwidth measure
> > would gradually approach 0.
> >
> > With your patch series applied, the BdiWriteBandwidth is updated much
> > more quickly, and converges to ≈16000 kBps.
> > When I start copying more quickly, the bandwidth measure rises quickly.
> >
> > As far as I understand, this should fix the problem (provided 16000
> > kBps is an okay value).
> > Certainly, I don’t see the downward spiral either with your patches :)
>
> Thanks for testing! Can I add your Tested-by tag?
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-07-13  8:15 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-05 16:23 [PATCH 0/5] writeback: Fix bandwidth estimates Jan Kara
2021-07-05 16:23 ` [PATCH 1/5] writeback: Track number of inodes under writeback Jan Kara
2021-07-05 16:23 ` [PATCH 2/5] writeback: Reliably update bandwidth estimation Jan Kara
2021-07-05 16:23 ` [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload Jan Kara
2021-07-07  7:40   ` Hillf Danton
2021-07-07  9:51     ` Jan Kara
2021-07-08 12:17       ` Hillf Danton
2021-07-08 16:43         ` Jan Kara
2021-07-09  8:01           ` Hillf Danton
2021-07-05 16:23 ` [PATCH 4/5] writeback: Rename domain_update_bandwidth() Jan Kara
2021-07-05 16:23 ` [PATCH 5/5] writeback: Use READ_ONCE for unlocked reads of writeback stats Jan Kara
2021-07-09 13:19 ` [PATCH 0/5] writeback: Fix bandwidth estimates Michael Stapelberg
2021-07-09 13:19   ` Michael Stapelberg
2021-07-12 16:27   ` Jan Kara
2021-07-13  8:15     ` Michael Stapelberg
