linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/6] Improve visibility of writeback
@ 2024-03-20 11:02 Kemeng Shi
  2024-03-20 11:02 ` [PATCH 1/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show Kemeng Shi
                   ` (7 more replies)
  0 siblings, 8 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-03-20 11:02 UTC (permalink / raw)
  To: akpm, tj, linux-mm, linux-fsdevel, linux-kernel
  Cc: willy, bfoster, jack, dsterba, mjguzik, dhowells, peterz

This series tries to improve visilibity of writeback. Patch 1 make
/sys/kernel/debug/bdi/xxx/stats show writeback info of whole bdi
instead of only writeback info in root cgroup. Patch 2 add a new
debug file /sys/kernel/debug/bdi/xxx/wb_stats to show per wb writeback
info. Patch 4 add wb_monitor.py to monitor basic writeback info
of running system, more info could be added on demand. Rest patches
are some random cleanups. More details can be found in respective
patches. Thanks!

Following domain hierarchy is tested:
                global domain (320G)
                /                 \
        cgroup domain1(10G)     cgroup domain2(10G)
                |                 |
bdi            wb1               wb2

/* all writeback info of bdi is successfully collected */
# cat /sys/kernel/debug/bdi/252:16/stats:
BdiWriteback:              448 kB
BdiReclaimable:        1303904 kB
BdiDirtyThresh:      189914124 kB
DirtyThresh:         195337564 kB
BackgroundThresh:     32516508 kB
BdiDirtied:            3591392 kB
BdiWritten:            2287488 kB
BdiWriteBandwidth:      322248 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   2
b_dirty_time:                0
bdi_list:                    1
state:                       1

/* per wb writeback info is collected */
# cat /sys/kernel/debug/bdi/252:16/wb_stats:
cat wb_stats
WbCgIno:                    1
WbWriteback:                0 kB
WbReclaimable:              0 kB
WbDirtyThresh:              0 kB
WbDirtied:                  0 kB
WbWritten:                  0 kB
WbWriteBandwidth:      102400 kBps
b_dirty:                    0
b_io:                       0
b_more_io:                  0
b_dirty_time:               0
state:                      1
WbCgIno:                 4284
WbWriteback:              448 kB
WbReclaimable:         818944 kB
WbDirtyThresh:        3096524 kB
WbDirtied:            2266880 kB
WbWritten:            1447936 kB
WbWriteBandwidth:      214036 kBps
b_dirty:                    0
b_io:                       0
b_more_io:                  1
b_dirty_time:               0
state:                      5
WbCgIno:                 4325
WbWriteback:              224 kB
WbReclaimable:         819392 kB
WbDirtyThresh:        2920088 kB
WbDirtied:            2551808 kB
WbWritten:            1732416 kB
WbWriteBandwidth:      201832 kBps
b_dirty:                    0
b_io:                       0
b_more_io:                  1
b_dirty_time:               0
state:                      5

/* monitor writeback info */
# ./wb_monitor.py 252:16 -c
                  writeback  reclaimable   dirtied   written    avg_bw
252:16_1                  0            0         0         0    102400
252:16_4284             672       820064   9230368   8410304    685612
252:16_4325             896       819840  10491264   9671648    652348
252:16                 1568      1639904  19721632  18081952   1440360


                  writeback  reclaimable   dirtied   written    avg_bw
252:16_1                  0            0         0         0    102400
252:16_4284             672       820064   9230368   8410304    685612
252:16_4325             896       819840  10491264   9671648    652348
252:16                 1568      1639904  19721632  18081952   1440360
...



Kemeng Shi (6):
  writeback: collect stats of all wb of bdi in bdi_debug_stats_show
  writeback: support retrieving per group debug writeback stats of bdi
  workqueue: remove unnecessary import and function in wq_monitor.py
  writeback: add wb_monitor.py script to monitor writeback info on bdi
  writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages
  writeback: remove unneeded GDTC_INIT_NO_WB

 include/linux/writeback.h     |   1 +
 mm/backing-dev.c              | 159 ++++++++++++++++++++++++++-----
 mm/page-writeback.c           |  32 +++++--
 tools/workqueue/wq_monitor.py |   9 +-
 tools/writeback/wb_monitor.py | 172 ++++++++++++++++++++++++++++++++++
 5 files changed, 334 insertions(+), 39 deletions(-)
 create mode 100644 tools/writeback/wb_monitor.py

-- 
2.30.0



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 1/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show
  2024-03-20 11:02 [PATCH 0/6] Improve visibility of writeback Kemeng Shi
@ 2024-03-20 11:02 ` Kemeng Shi
  2024-03-20 13:21   ` Brian Foster
  2024-03-21 18:06   ` Jan Kara
  2024-03-20 11:02 ` [PATCH 2/6] writeback: support retrieving per group debug writeback stats of bdi Kemeng Shi
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-03-20 11:02 UTC (permalink / raw)
  To: akpm, tj, linux-mm, linux-fsdevel, linux-kernel
  Cc: willy, bfoster, jack, dsterba, mjguzik, dhowells, peterz

/sys/kernel/debug/bdi/xxx/stats is supposed to show writeback information
of whole bdi, but only writeback information of bdi in root cgroup is
collected. So writeback information in non-root cgroup are missing now.
To be more specific, considering following case:

/* create writeback cgroup */
cd /sys/fs/cgroup
echo "+memory +io" > cgroup.subtree_control
mkdir group1
cd group1
echo $$ > cgroup.procs
/* do writeback in cgroup */
fio -name test -filename=/dev/vdb ...
/* get writeback info of bdi */
cat /sys/kernel/debug/bdi/xxx/stats
The cat result unexpectedly implies that there is no writeback on target
bdi.

Fix this by collecting stats of all wb in bdi instead of only wb in
root cgroup.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/backing-dev.c | 93 ++++++++++++++++++++++++++++++++++++------------
 1 file changed, 70 insertions(+), 23 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 5f2be8c8df11..788702b6c5dd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -39,6 +39,19 @@ struct workqueue_struct *bdi_wq;
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
 
+struct wb_stats {
+	unsigned long nr_dirty;
+	unsigned long nr_io;
+	unsigned long nr_more_io;
+	unsigned long nr_dirty_time;
+	unsigned long nr_writeback;
+	unsigned long nr_reclaimable;
+	unsigned long nr_dirtied;
+	unsigned long nr_written;
+	unsigned long dirty_thresh;
+	unsigned long wb_thresh;
+};
+
 static struct dentry *bdi_debug_root;
 
 static void bdi_debug_init(void)
@@ -46,31 +59,65 @@ static void bdi_debug_init(void)
 	bdi_debug_root = debugfs_create_dir("bdi", NULL);
 }
 
-static int bdi_debug_stats_show(struct seq_file *m, void *v)
+static void collect_wb_stats(struct wb_stats *stats,
+			     struct bdi_writeback *wb)
 {
-	struct backing_dev_info *bdi = m->private;
-	struct bdi_writeback *wb = &bdi->wb;
-	unsigned long background_thresh;
-	unsigned long dirty_thresh;
-	unsigned long wb_thresh;
-	unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time;
 	struct inode *inode;
 
-	nr_dirty = nr_io = nr_more_io = nr_dirty_time = 0;
 	spin_lock(&wb->list_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_io_list)
-		nr_dirty++;
+		stats->nr_dirty++;
 	list_for_each_entry(inode, &wb->b_io, i_io_list)
-		nr_io++;
+		stats->nr_io++;
 	list_for_each_entry(inode, &wb->b_more_io, i_io_list)
-		nr_more_io++;
+		stats->nr_more_io++;
 	list_for_each_entry(inode, &wb->b_dirty_time, i_io_list)
 		if (inode->i_state & I_DIRTY_TIME)
-			nr_dirty_time++;
+			stats->nr_dirty_time++;
 	spin_unlock(&wb->list_lock);
 
+	stats->nr_writeback += wb_stat(wb, WB_WRITEBACK);
+	stats->nr_reclaimable += wb_stat(wb, WB_RECLAIMABLE);
+	stats->nr_dirtied += wb_stat(wb, WB_DIRTIED);
+	stats->nr_written += wb_stat(wb, WB_WRITTEN);
+	stats->wb_thresh += wb_calc_thresh(wb, stats->dirty_thresh);
+}
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+static void bdi_collect_stats(struct backing_dev_info *bdi,
+			      struct wb_stats *stats)
+{
+	struct bdi_writeback *wb;
+
+	/* protect wb from release */
+	mutex_lock(&bdi->cgwb_release_mutex);
+	list_for_each_entry(wb, &bdi->wb_list, bdi_node)
+		collect_wb_stats(stats, wb);
+	mutex_unlock(&bdi->cgwb_release_mutex);
+}
+#else
+static void bdi_collect_stats(struct backing_dev_info *bdi,
+			      struct wb_stats *stats)
+{
+	collect_wb_stats(stats, &bdi->wb);
+}
+#endif
+
+static int bdi_debug_stats_show(struct seq_file *m, void *v)
+{
+	struct backing_dev_info *bdi = m->private;
+	unsigned long background_thresh;
+	unsigned long dirty_thresh;
+	struct wb_stats stats;
+	unsigned long tot_bw;
+
 	global_dirty_limits(&background_thresh, &dirty_thresh);
-	wb_thresh = wb_calc_thresh(wb, dirty_thresh);
+
+	memset(&stats, 0, sizeof(stats));
+	stats.dirty_thresh = dirty_thresh;
+	bdi_collect_stats(bdi, &stats);
+
+	tot_bw = atomic_long_read(&bdi->tot_write_bandwidth);
 
 	seq_printf(m,
 		   "BdiWriteback:       %10lu kB\n"
@@ -87,18 +134,18 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   "b_dirty_time:       %10lu\n"
 		   "bdi_list:           %10u\n"
 		   "state:              %10lx\n",
-		   (unsigned long) K(wb_stat(wb, WB_WRITEBACK)),
-		   (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)),
-		   K(wb_thresh),
+		   K(stats.nr_writeback),
+		   K(stats.nr_reclaimable),
+		   K(stats.wb_thresh),
 		   K(dirty_thresh),
 		   K(background_thresh),
-		   (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
-		   (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
-		   (unsigned long) K(wb->write_bandwidth),
-		   nr_dirty,
-		   nr_io,
-		   nr_more_io,
-		   nr_dirty_time,
+		   K(stats.nr_dirtied),
+		   K(stats.nr_written),
+		   K(tot_bw),
+		   stats.nr_dirty,
+		   stats.nr_io,
+		   stats.nr_more_io,
+		   stats.nr_dirty_time,
 		   !list_empty(&bdi->bdi_list), bdi->wb.state);
 
 	return 0;
-- 
2.30.0



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 2/6] writeback: support retrieving per group debug writeback stats of bdi
  2024-03-20 11:02 [PATCH 0/6] Improve visibility of writeback Kemeng Shi
  2024-03-20 11:02 ` [PATCH 1/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show Kemeng Shi
@ 2024-03-20 11:02 ` Kemeng Shi
  2024-03-20 15:01   ` Tejun Heo
  2024-03-26 12:24   ` Jan Kara
  2024-03-20 11:02 ` [PATCH 3/6] workqueue: remove unnecessary import and function in wq_monitor.py Kemeng Shi
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-03-20 11:02 UTC (permalink / raw)
  To: akpm, tj, linux-mm, linux-fsdevel, linux-kernel
  Cc: willy, bfoster, jack, dsterba, mjguzik, dhowells, peterz

Add /sys/kernel/debug/bdi/xxx/wb_stats to show per group writeback stats
of bdi.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 include/linux/writeback.h |  1 +
 mm/backing-dev.c          | 66 +++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c       | 19 +++++++++++
 3 files changed, 86 insertions(+)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 9845cb62e40b..bb1ce1123b52 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -355,6 +355,7 @@ int dirtytime_interval_handler(struct ctl_table *table, int write,
 
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
+unsigned long wb_calc_cg_thresh(struct bdi_writeback *wb);
 
 void wb_update_bandwidth(struct bdi_writeback *wb);
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 788702b6c5dd..bfc4079dc7fe 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -95,12 +95,77 @@ static void bdi_collect_stats(struct backing_dev_info *bdi,
 		collect_wb_stats(stats, wb);
 	mutex_unlock(&bdi->cgwb_release_mutex);
 }
+static void wb_stats_show(struct seq_file *m, struct bdi_writeback *wb,
+			  struct wb_stats *stats)
+{
+
+	seq_printf(m,
+		   "WbCgIno:           %10lu\n"
+		   "WbWriteback:       %10lu kB\n"
+		   "WbReclaimable:     %10lu kB\n"
+		   "WbDirtyThresh:     %10lu kB\n"
+		   "WbDirtied:         %10lu kB\n"
+		   "WbWritten:         %10lu kB\n"
+		   "WbWriteBandwidth:  %10lu kBps\n"
+		   "b_dirty:           %10lu\n"
+		   "b_io:              %10lu\n"
+		   "b_more_io:         %10lu\n"
+		   "b_dirty_time:      %10lu\n"
+		   "state:             %10lx\n",
+		   cgroup_ino(wb->memcg_css->cgroup),
+		   K(stats->nr_writeback),
+		   K(stats->nr_reclaimable),
+		   K(stats->wb_thresh),
+		   K(stats->nr_dirtied),
+		   K(stats->nr_written),
+		   K(wb->avg_write_bandwidth),
+		   stats->nr_dirty,
+		   stats->nr_io,
+		   stats->nr_more_io,
+		   stats->nr_dirty_time,
+		   wb->state);
+}
+
+static int cgwb_debug_stats_show(struct seq_file *m, void *v)
+{
+	struct backing_dev_info *bdi = m->private;
+	unsigned long background_thresh;
+	unsigned long dirty_thresh;
+	struct bdi_writeback *wb;
+	struct wb_stats stats;
+
+	global_dirty_limits(&background_thresh, &dirty_thresh);
+
+	mutex_lock(&bdi->cgwb_release_mutex);
+	list_for_each_entry(wb, &bdi->wb_list, bdi_node) {
+		memset(&stats, 0, sizeof(stats));
+		stats.dirty_thresh = dirty_thresh;
+		collect_wb_stats(&stats, wb);
+
+		if (mem_cgroup_wb_domain(wb) != NULL)
+			stats.wb_thresh = min(stats.wb_thresh, wb_calc_cg_thresh(wb));
+
+		wb_stats_show(m, wb, &stats);
+	}
+	mutex_unlock(&bdi->cgwb_release_mutex);
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(cgwb_debug_stats);
+
+static void cgwb_debug_register(struct backing_dev_info *bdi)
+{
+	debugfs_create_file("wb_stats", 0444, bdi->debug_dir, bdi,
+			    &cgwb_debug_stats_fops);
+}
 #else
 static void bdi_collect_stats(struct backing_dev_info *bdi,
 			      struct wb_stats *stats)
 {
 	collect_wb_stats(stats, &bdi->wb);
 }
+
+static inline void cgwb_debug_register(struct backing_dev_info *bdi) { }
 #endif
 
 static int bdi_debug_stats_show(struct seq_file *m, void *v)
@@ -158,6 +223,7 @@ static void bdi_debug_register(struct backing_dev_info *bdi, const char *name)
 
 	debugfs_create_file("stats", 0444, bdi->debug_dir, bdi,
 			    &bdi_debug_stats_fops);
+	cgwb_debug_register(bdi);
 }
 
 static void bdi_debug_unregister(struct backing_dev_info *bdi)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0e20467367fe..ba1b6b5ae5d6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -893,6 +893,25 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
 	return __wb_calc_thresh(&gdtc, thresh);
 }
 
+unsigned long wb_calc_cg_thresh(struct bdi_writeback *wb)
+{
+	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
+	struct dirty_throttle_control mdtc = { MDTC_INIT(wb, &gdtc) };
+	unsigned long filepages, headroom, writeback;
+
+	gdtc.avail = global_dirtyable_memory();
+	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
+		     global_node_page_state(NR_WRITEBACK);
+
+	mem_cgroup_wb_stats(wb, &filepages, &headroom,
+			    &mdtc.dirty, &writeback);
+	mdtc.dirty += writeback;
+	mdtc_calc_avail(&mdtc, filepages, headroom);
+	domain_dirty_limits(&mdtc);
+
+	return __wb_calc_thresh(&mdtc, mdtc.thresh);
+}
+
 /*
  *                           setpoint - dirty 3
  *        f(dirty) := 1.0 + (----------------)
-- 
2.30.0



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 3/6] workqueue: remove unnecessary import and function in wq_monitor.py
  2024-03-20 11:02 [PATCH 0/6] Improve visibility of writeback Kemeng Shi
  2024-03-20 11:02 ` [PATCH 1/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show Kemeng Shi
  2024-03-20 11:02 ` [PATCH 2/6] writeback: support retrieving per group debug writeback stats of bdi Kemeng Shi
@ 2024-03-20 11:02 ` Kemeng Shi
  2024-03-20 15:03   ` Tejun Heo
  2024-03-20 11:02 ` [PATCH 4/6] writeback: add wb_monitor.py script to monitor writeback info on bdi Kemeng Shi
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Kemeng Shi @ 2024-03-20 11:02 UTC (permalink / raw)
  To: akpm, tj, linux-mm, linux-fsdevel, linux-kernel
  Cc: willy, bfoster, jack, dsterba, mjguzik, dhowells, peterz

Remove unnecessary import and function in wq_monitor.py

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 tools/workqueue/wq_monitor.py | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/tools/workqueue/wq_monitor.py b/tools/workqueue/wq_monitor.py
index a8856a9c45dc..9e964c5be40c 100644
--- a/tools/workqueue/wq_monitor.py
+++ b/tools/workqueue/wq_monitor.py
@@ -32,16 +32,13 @@ https://github.com/osandov/drgn.
   rescued  The number of work items executed by the rescuer.
 """
 
-import sys
 import signal
-import os
 import re
 import time
 import json
 
 import drgn
-from drgn.helpers.linux.list import list_for_each_entry,list_empty
-from drgn.helpers.linux.cpumask import for_each_possible_cpu
+from drgn.helpers.linux.list import list_for_each_entry
 
 import argparse
 parser = argparse.ArgumentParser(description=desc,
@@ -54,10 +51,6 @@ parser.add_argument('-j', '--json', action='store_true',
                     help='Output in json')
 args = parser.parse_args()
 
-def err(s):
-    print(s, file=sys.stderr, flush=True)
-    sys.exit(1)
-
 workqueues              = prog['workqueues']
 
 WQ_UNBOUND              = prog['WQ_UNBOUND']
-- 
2.30.0



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 4/6] writeback: add wb_monitor.py script to monitor writeback info on bdi
  2024-03-20 11:02 [PATCH 0/6] Improve visibility of writeback Kemeng Shi
                   ` (2 preceding siblings ...)
  2024-03-20 11:02 ` [PATCH 3/6] workqueue: remove unnecessary import and function in wq_monitor.py Kemeng Shi
@ 2024-03-20 11:02 ` Kemeng Shi
  2024-03-20 15:12   ` Tejun Heo
  2024-03-20 11:02 ` [PATCH 5/6] writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages Kemeng Shi
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Kemeng Shi @ 2024-03-20 11:02 UTC (permalink / raw)
  To: akpm, tj, linux-mm, linux-fsdevel, linux-kernel
  Cc: willy, bfoster, jack, dsterba, mjguzik, dhowells, peterz

Add wb_monitor.py script to monitor writeback information on backing dev
which makes it easier and more convenient to observe writeback behaviors
of running system.

This script is based on copy of wq_monitor.py.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Suggested-by: Tejun Heo <tj@kernel.org>
---
 tools/writeback/wb_monitor.py | 172 ++++++++++++++++++++++++++++++++++
 1 file changed, 172 insertions(+)
 create mode 100644 tools/writeback/wb_monitor.py

diff --git a/tools/writeback/wb_monitor.py b/tools/writeback/wb_monitor.py
new file mode 100644
index 000000000000..5e3591f1f9a9
--- /dev/null
+++ b/tools/writeback/wb_monitor.py
@@ -0,0 +1,172 @@
+#!/usr/bin/env drgn
+#
+# Copyright (C) 2024 Kemeng Shi <shikemeng@huaweicloud.com>
+# Copyright (C) 2024 Huawei Inc
+
+desc = """
+This is a drgn script based on wq_monitor.py to monitor writeback info on
+backing dev. For more info on drgn, visit https://github.com/osandov/drgn.
+
+  writeback(kB)     Amount of dirty pages are currently being written back to
+                    disk.
+
+  reclaimable(kB)   Amount of pages are currently reclaimable.
+
+  dirtied(kB)       Amount of pages have been dirtied.
+
+  wrttien(kB)       Amount of dirty pages have been written back to disk.
+
+  avg_wb(kBps)      Smoothly estimated write bandwidth of writing dirty pages
+                    back to disk.
+"""
+
+import signal
+import re
+import time
+import json
+
+import drgn
+from drgn.helpers.linux.list import list_for_each_entry
+
+import argparse
+parser = argparse.ArgumentParser(description=desc,
+                                 formatter_class=argparse.RawTextHelpFormatter)
+parser.add_argument('bdi', metavar='REGEX', nargs='*',
+                    help='Target backing device name patterns (all if empty)')
+parser.add_argument('-i', '--interval', metavar='SECS', type=float, default=1,
+                    help='Monitoring interval (0 to print once and exit)')
+parser.add_argument('-j', '--json', action='store_true',
+                    help='Output in json')
+parser.add_argument('-c', '--cgroup', action='store_true',
+                    help='show writeback of bdi in cgroup')
+args = parser.parse_args()
+
+bdi_list                = prog['bdi_list']
+
+WB_RECLAIMABLE          = prog['WB_RECLAIMABLE']
+WB_WRITEBACK            = prog['WB_WRITEBACK']
+WB_DIRTIED              = prog['WB_DIRTIED']
+WB_WRITTEN              = prog['WB_WRITTEN']
+NR_WB_STAT_ITEMS        = prog['NR_WB_STAT_ITEMS']
+
+PAGE_SHIFT              = prog['PAGE_SHIFT']
+
+def K(x):
+    return x << (PAGE_SHIFT - 10)
+
+class Stats:
+    def dict(self, now):
+        return { 'timestamp'            : now,
+                 'name'                 : self.name,
+                 'writeback'            : self.stats[WB_WRITEBACK],
+                 'reclaimable'          : self.stats[WB_RECLAIMABLE],
+                 'dirtied'              : self.stats[WB_DIRTIED],
+                 'written'              : self.stats[WB_WRITTEN],
+                 'avg_wb'               : self.avg_bw, }
+
+    def table_header_str():
+        return f'{"":>16} {"writeback":>10} {"reclaimable":>12} ' \
+                f'{"dirtied":>9} {"written":>9} {"avg_bw":>9}'
+
+    def table_row_str(self):
+        out = f'{self.name[-16:]:16} ' \
+              f'{self.stats[WB_WRITEBACK]:10} ' \
+              f'{self.stats[WB_RECLAIMABLE]:12} ' \
+              f'{self.stats[WB_DIRTIED]:9} ' \
+              f'{self.stats[WB_WRITTEN]:9} ' \
+              f'{self.avg_bw:9} '
+        return out
+
+    def show_header():
+        if Stats.table_fmt:
+            print()
+            print(Stats.table_header_str())
+
+    def show_stats(self):
+        if Stats.table_fmt:
+            print(self.table_row_str())
+        else:
+            print(self.dict(Stats.now))
+
+class WbStats(Stats):
+    def __init__(self, wb):
+        bdi_name = wb.bdi.dev_name.string_().decode()
+        # avoid to use bdi.wb.memcg_css which is only defined when
+        # CONFIG_CGROUP_WRITEBACK is enabled
+        if wb == wb.bdi.wb.address_of_():
+            ino = "1"
+        else:
+            ino = str(wb.memcg_css.cgroup.kn.id.value_())
+        self.name = bdi_name + '_' + ino
+
+        self.stats = [0] * NR_WB_STAT_ITEMS
+        for i in range(NR_WB_STAT_ITEMS):
+            if wb.stat[i].count >= 0:
+                self.stats[i] = int(K(wb.stat[i].count))
+            else:
+                self.stats[i] = 0
+
+        self.avg_bw = int(K(wb.avg_write_bandwidth))
+
+class BdiStats(Stats):
+    def __init__(self, bdi):
+        self.name = bdi.dev_name.string_().decode()
+        self.stats = [0] * NR_WB_STAT_ITEMS
+        self.avg_bw = 0
+
+    def collectStats(self, wb_stats):
+        for i in range(NR_WB_STAT_ITEMS):
+            self.stats[i] += wb_stats.stats[i]
+
+        self.avg_bw += wb_stats.avg_bw
+
+exit_req = False
+
+def sigint_handler(signr, frame):
+    global exit_req
+    exit_req = True
+
+def main():
+    # handle args
+    Stats.table_fmt = not args.json
+    interval = args.interval
+    cgroup = args.cgroup
+
+    re_str = None
+    if args.bdi:
+        for r in args.bdi:
+            if re_str is None:
+                re_str = r
+            else:
+                re_str += '|' + r
+
+    filter_re = re.compile(re_str) if re_str else None
+
+    # monitoring loop
+    signal.signal(signal.SIGINT, sigint_handler)
+
+    while not exit_req:
+        Stats.now = time.time()
+
+        Stats.show_header()
+        for bdi in list_for_each_entry('struct backing_dev_info', bdi_list.address_of_(), 'bdi_list'):
+            bdi_stats = BdiStats(bdi)
+            if filter_re and not filter_re.search(bdi_stats.name):
+                continue
+
+            for wb in list_for_each_entry('struct bdi_writeback', bdi.wb_list.address_of_(), 'bdi_node'):
+                wb_stats = WbStats(wb)
+                bdi_stats.collectStats(wb_stats)
+                if cgroup:
+                    wb_stats.show_stats()
+
+            bdi_stats.show_stats()
+            if cgroup and Stats.table_fmt:
+                print()
+
+        if interval == 0:
+            break
+        time.sleep(interval)
+
+if __name__ == "__main__":
+    main()
-- 
2.30.0



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 5/6] writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages
  2024-03-20 11:02 [PATCH 0/6] Improve visibility of writeback Kemeng Shi
                   ` (3 preceding siblings ...)
  2024-03-20 11:02 ` [PATCH 4/6] writeback: add wb_monitor.py script to monitor writeback info on bdi Kemeng Shi
@ 2024-03-20 11:02 ` Kemeng Shi
  2024-03-26 12:27   ` Jan Kara
  2024-03-20 11:02 ` [PATCH 6/6] writeback: remove unneeded GDTC_INIT_NO_WB Kemeng Shi
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Kemeng Shi @ 2024-03-20 11:02 UTC (permalink / raw)
  To: akpm, tj, linux-mm, linux-fsdevel, linux-kernel
  Cc: willy, bfoster, jack, dsterba, mjguzik, dhowells, peterz

Commit 8d92890bd6b85 ("mm/writeback: discard NR_UNSTABLE_NFS, use
NR_WRITEBACK instead") removed NR_UNSTABLE_NFS and nr_reclaimable
only contains dirty page now.
Rename nr_reclaimable to nr_dirty properly.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/page-writeback.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ba1b6b5ae5d6..481b6bf34c21 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1695,7 +1695,7 @@ static int balance_dirty_pages(struct bdi_writeback *wb,
 	struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ?
 						     &mdtc_stor : NULL;
 	struct dirty_throttle_control *sdtc;
-	unsigned long nr_reclaimable;	/* = file_dirty */
+	unsigned long nr_dirty;
 	long period;
 	long pause;
 	long max_pause;
@@ -1716,9 +1716,9 @@ static int balance_dirty_pages(struct bdi_writeback *wb,
 		unsigned long m_thresh = 0;
 		unsigned long m_bg_thresh = 0;
 
-		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY);
+		nr_dirty = global_node_page_state(NR_FILE_DIRTY);
 		gdtc->avail = global_dirtyable_memory();
-		gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);
+		gdtc->dirty = nr_dirty + global_node_page_state(NR_WRITEBACK);
 
 		domain_dirty_limits(gdtc);
 
@@ -1769,7 +1769,7 @@ static int balance_dirty_pages(struct bdi_writeback *wb,
 		 * In normal mode, we start background writeout at the lower
 		 * background_thresh, to keep the amount of dirty memory low.
 		 */
-		if (!laptop_mode && nr_reclaimable > gdtc->bg_thresh &&
+		if (!laptop_mode && nr_dirty > gdtc->bg_thresh &&
 		    !writeback_in_progress(wb))
 			wb_start_background_writeback(wb);
 
-- 
2.30.0



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 6/6] writeback: remove unneeded GDTC_INIT_NO_WB
  2024-03-20 11:02 [PATCH 0/6] Improve visibility of writeback Kemeng Shi
                   ` (4 preceding siblings ...)
  2024-03-20 11:02 ` [PATCH 5/6] writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages Kemeng Shi
@ 2024-03-20 11:02 ` Kemeng Shi
  2024-03-20 15:15   ` Tejun Heo
  2024-03-26 12:35   ` Jan Kara
  2024-03-20 15:20 ` [PATCH 0/6] Improve visibility of writeback Tejun Heo
  2024-03-20 17:22 ` Jan Kara
  7 siblings, 2 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-03-20 11:02 UTC (permalink / raw)
  To: akpm, tj, linux-mm, linux-fsdevel, linux-kernel
  Cc: willy, bfoster, jack, dsterba, mjguzik, dhowells, peterz

We never use gdtc->dom set with GDTC_INIT_NO_WB, just remove unneeded
GDTC_INIT_NO_WB

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/page-writeback.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 481b6bf34c21..09b2b0754cc5 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -154,8 +154,6 @@ struct dirty_throttle_control {
 				.dom = &global_wb_domain,		\
 				.wb_completions = &(__wb)->completions
 
-#define GDTC_INIT_NO_WB		.dom = &global_wb_domain
-
 #define MDTC_INIT(__wb, __gdtc)	.wb = (__wb),				\
 				.dom = mem_cgroup_wb_domain(__wb),	\
 				.wb_completions = &(__wb)->memcg_completions, \
@@ -210,7 +208,6 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
 
 #define GDTC_INIT(__wb)		.wb = (__wb),                           \
 				.wb_completions = &(__wb)->completions
-#define GDTC_INIT_NO_WB
 #define MDTC_INIT(__wb, __gdtc)
 
 static bool mdtc_valid(struct dirty_throttle_control *dtc)
@@ -438,7 +435,7 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
  */
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 {
-	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
+	struct dirty_throttle_control gdtc = { };
 
 	gdtc.avail = global_dirtyable_memory();
 	domain_dirty_limits(&gdtc);
@@ -895,7 +892,7 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
 
 unsigned long wb_calc_cg_thresh(struct bdi_writeback *wb)
 {
-	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
+	struct dirty_throttle_control gdtc = { };
 	struct dirty_throttle_control mdtc = { MDTC_INIT(wb, &gdtc) };
 	unsigned long filepages, headroom, writeback;
 
-- 
2.30.0



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show
  2024-03-20 11:02 ` [PATCH 1/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show Kemeng Shi
@ 2024-03-20 13:21   ` Brian Foster
  2024-03-21  3:44     ` Kemeng Shi
  2024-03-21 18:06   ` Jan Kara
  1 sibling, 1 reply; 38+ messages in thread
From: Brian Foster @ 2024-03-20 13:21 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: akpm, tj, linux-mm, linux-fsdevel, linux-kernel, willy, jack,
	dsterba, mjguzik, dhowells, peterz

On Wed, Mar 20, 2024 at 07:02:17PM +0800, Kemeng Shi wrote:
> /sys/kernel/debug/bdi/xxx/stats is supposed to show writeback information
> of whole bdi, but only writeback information of bdi in root cgroup is
> collected. So writeback information in non-root cgroup are missing now.
> To be more specific, considering following case:
> 
> /* create writeback cgroup */
> cd /sys/fs/cgroup
> echo "+memory +io" > cgroup.subtree_control
> mkdir group1
> cd group1
> echo $$ > cgroup.procs
> /* do writeback in cgroup */
> fio -name test -filename=/dev/vdb ...
> /* get writeback info of bdi */
> cat /sys/kernel/debug/bdi/xxx/stats
> The cat result unexpectedly implies that there is no writeback on target
> bdi.
> 
> Fix this by collecting stats of all wb in bdi instead of only wb in
> root cgroup.
> 
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
>  mm/backing-dev.c | 93 ++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 70 insertions(+), 23 deletions(-)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 5f2be8c8df11..788702b6c5dd 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
...
> @@ -46,31 +59,65 @@ static void bdi_debug_init(void)
>  	bdi_debug_root = debugfs_create_dir("bdi", NULL);
>  }
>  
...
> +#ifdef CONFIG_CGROUP_WRITEBACK
> +static void bdi_collect_stats(struct backing_dev_info *bdi,
> +			      struct wb_stats *stats)
> +{
> +	struct bdi_writeback *wb;
> +
> +	/* protect wb from release */
> +	mutex_lock(&bdi->cgwb_release_mutex);
> +	list_for_each_entry(wb, &bdi->wb_list, bdi_node)
> +		collect_wb_stats(stats, wb);
> +	mutex_unlock(&bdi->cgwb_release_mutex);
> +}
> +#else
> +static void bdi_collect_stats(struct backing_dev_info *bdi,
> +			      struct wb_stats *stats)
> +{
> +	collect_wb_stats(stats, &bdi->wb);
> +}
> +#endif
> +

I'm not familiar enough with the cgwb code to say for sure (and I'd
probably wait for more high level feedback before worrying too much
about this), but do we need the ifdef here just to iterate ->wb_list?
From looking at the code, it appears bdi->wb ends up on the list while
the bdi is registered for both cases, so that distinction seems
unnecessary. WRT to wb release protection, I wonder if this could use a
combination of rcu_read_lock()/list_for_each_safe() and wb_tryget() on
each wb before collecting its stats..? See how bdi_split_work_to_wbs()
works, for example.

Also I see a patch conflict/compile error on patch 2 due to
__wb_calc_thresh() only taking one parameter in my tree. What's the
baseline commit for this series?

Brian

> +static int bdi_debug_stats_show(struct seq_file *m, void *v)
> +{
> +	struct backing_dev_info *bdi = m->private;
> +	unsigned long background_thresh;
> +	unsigned long dirty_thresh;
> +	struct wb_stats stats;
> +	unsigned long tot_bw;
> +
>  	global_dirty_limits(&background_thresh, &dirty_thresh);
> -	wb_thresh = wb_calc_thresh(wb, dirty_thresh);
> +
> +	memset(&stats, 0, sizeof(stats));
> +	stats.dirty_thresh = dirty_thresh;
> +	bdi_collect_stats(bdi, &stats);
> +
> +	tot_bw = atomic_long_read(&bdi->tot_write_bandwidth);
>  
>  	seq_printf(m,
>  		   "BdiWriteback:       %10lu kB\n"
> @@ -87,18 +134,18 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>  		   "b_dirty_time:       %10lu\n"
>  		   "bdi_list:           %10u\n"
>  		   "state:              %10lx\n",
> -		   (unsigned long) K(wb_stat(wb, WB_WRITEBACK)),
> -		   (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)),
> -		   K(wb_thresh),
> +		   K(stats.nr_writeback),
> +		   K(stats.nr_reclaimable),
> +		   K(stats.wb_thresh),
>  		   K(dirty_thresh),
>  		   K(background_thresh),
> -		   (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
> -		   (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
> -		   (unsigned long) K(wb->write_bandwidth),
> -		   nr_dirty,
> -		   nr_io,
> -		   nr_more_io,
> -		   nr_dirty_time,
> +		   K(stats.nr_dirtied),
> +		   K(stats.nr_written),
> +		   K(tot_bw),
> +		   stats.nr_dirty,
> +		   stats.nr_io,
> +		   stats.nr_more_io,
> +		   stats.nr_dirty_time,
>  		   !list_empty(&bdi->bdi_list), bdi->wb.state);
>  
>  	return 0;
> -- 
> 2.30.0
> 
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/6] writeback: support retrieving per group debug writeback stats of bdi
  2024-03-20 11:02 ` [PATCH 2/6] writeback: support retrieving per group debug writeback stats of bdi Kemeng Shi
@ 2024-03-20 15:01   ` Tejun Heo
  2024-03-21  3:45     ` Kemeng Shi
  2024-03-26 12:24   ` Jan Kara
  1 sibling, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2024-03-20 15:01 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: akpm, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	jack, dsterba, mjguzik, dhowells, peterz

On Wed, Mar 20, 2024 at 07:02:18PM +0800, Kemeng Shi wrote:
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 9845cb62e40b..bb1ce1123b52 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -355,6 +355,7 @@ int dirtytime_interval_handler(struct ctl_table *table, int write,
>  
>  void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
>  unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
> +unsigned long wb_calc_cg_thresh(struct bdi_writeback *wb);

Would cgwb_calc_thresh() be an easier name?

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/6] workqueue: remove unnecessary import and function in wq_monitor.py
  2024-03-20 11:02 ` [PATCH 3/6] workqueue: remove unnecessary import and function in wq_monitor.py Kemeng Shi
@ 2024-03-20 15:03   ` Tejun Heo
  2024-03-21  6:08     ` Kemeng Shi
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2024-03-20 15:03 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: akpm, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	jack, dsterba, mjguzik, dhowells, peterz

On Wed, Mar 20, 2024 at 07:02:19PM +0800, Kemeng Shi wrote:
> Remove unnecessary import and function in wq_monitor.py
> 
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>

Ah, this is a workqueue patch. Do you mind sending this separately? I'll
apply it to the wq tree after -rc1.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/6] writeback: add wb_monitor.py script to monitor writeback info on bdi
  2024-03-20 11:02 ` [PATCH 4/6] writeback: add wb_monitor.py script to monitor writeback info on bdi Kemeng Shi
@ 2024-03-20 15:12   ` Tejun Heo
  2024-03-21  6:22     ` Kemeng Shi
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2024-03-20 15:12 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: akpm, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	jack, dsterba, mjguzik, dhowells, peterz

On Wed, Mar 20, 2024 at 07:02:20PM +0800, Kemeng Shi wrote:
> Add wb_monitor.py script to monitor writeback information on backing dev
> which makes it easier and more convenient to observe writeback behaviors
> of running system.
> 
> This script is based on copy of wq_monitor.py.

I don't think I did that when wq_monitor.py was added but would you mind
adding an example output so that people can have a better idea on what to
expect?

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/6] writeback: remove unneeded GDTC_INIT_NO_WB
  2024-03-20 11:02 ` [PATCH 6/6] writeback: remove unneeded GDTC_INIT_NO_WB Kemeng Shi
@ 2024-03-20 15:15   ` Tejun Heo
  2024-03-21  7:12     ` Kemeng Shi
  2024-03-26 12:35   ` Jan Kara
  1 sibling, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2024-03-20 15:15 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: akpm, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	jack, dsterba, mjguzik, dhowells, peterz

Hello,

On Wed, Mar 20, 2024 at 07:02:22PM +0800, Kemeng Shi wrote:
> We never use gdtc->dom set with GDTC_INIT_NO_WB, just remove unneeded
> GDTC_INIT_NO_WB
> 
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
...
>  void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
>  {
> -	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> +	struct dirty_throttle_control gdtc = { };

Even if it's currently not referenced, wouldn't it still be better to always
guarantee that a dtc's dom is always initialized? I'm not sure what we get
by removing this.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/6] Improve visibility of writeback
  2024-03-20 11:02 [PATCH 0/6] Improve visibility of writeback Kemeng Shi
                   ` (5 preceding siblings ...)
  2024-03-20 11:02 ` [PATCH 6/6] writeback: remove unneeded GDTC_INIT_NO_WB Kemeng Shi
@ 2024-03-20 15:20 ` Tejun Heo
  2024-03-20 17:22 ` Jan Kara
  7 siblings, 0 replies; 38+ messages in thread
From: Tejun Heo @ 2024-03-20 15:20 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: akpm, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	jack, dsterba, mjguzik, dhowells, peterz

Hello,

On Wed, Mar 20, 2024 at 07:02:16PM +0800, Kemeng Shi wrote:
> /* all writeback info of bdi is successfully collected */
> # cat /sys/kernel/debug/bdi/252:16/stats:
> BdiWriteback:              448 kB
...
> 
> /* per wb writeback info is collected */
> # cat /sys/kernel/debug/bdi/252:16/wb_stats:
> cat wb_stats
> WbCgIno:                    1
...
> 
> /* monitor writeback info */
> # ./wb_monitor.py 252:16 -c
>                   writeback  reclaimable   dirtied   written    avg_bw
> 252:16_1                  0            0         0         0    102400
> 252:16_4284             672       820064   9230368   8410304    685612
> 252:16_4325             896       819840  10491264   9671648    652348
> 252:16                 1568      1639904  19721632  18081952   1440360
> 
> 
>                   writeback  reclaimable   dirtied   written    avg_bw
> 252:16_1                  0            0         0         0    102400
> 252:16_4284             672       820064   9230368   8410304    685612
> 252:16_4325             896       819840  10491264   9671648    652348
> 252:16                 1568      1639904  19721632  18081952   1440360
> ...

Ah you have the example outputs here. It'd be nice to have these in the
commit messages too.

I'm not sure about the last patch but patchset looks good to me otherwise. I
don't feel particularly enthusiastic about adding more debugfs files
especially given that some distros disable debugfs completely, but no harm
in adding them either.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/6] Improve visibility of writeback
  2024-03-20 11:02 [PATCH 0/6] Improve visibility of writeback Kemeng Shi
                   ` (6 preceding siblings ...)
  2024-03-20 15:20 ` [PATCH 0/6] Improve visibility of writeback Tejun Heo
@ 2024-03-20 17:22 ` Jan Kara
  2024-03-21  8:12   ` Kemeng Shi
  7 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2024-03-20 17:22 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: akpm, tj, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	jack, dsterba, mjguzik, dhowells, peterz

On Wed 20-03-24 19:02:16, Kemeng Shi wrote:
> This series tries to improve visilibity of writeback. Patch 1 make
> /sys/kernel/debug/bdi/xxx/stats show writeback info of whole bdi
> instead of only writeback info in root cgroup. Patch 2 add a new
> debug file /sys/kernel/debug/bdi/xxx/wb_stats to show per wb writeback
> info. Patch 4 add wb_monitor.py to monitor basic writeback info
> of running system, more info could be added on demand. Rest patches
> are some random cleanups. More details can be found in respective
> patches. Thanks!
> 
> Following domain hierarchy is tested:
>                 global domain (320G)
>                 /                 \
>         cgroup domain1(10G)     cgroup domain2(10G)
>                 |                 |
> bdi            wb1               wb2
> 
> /* all writeback info of bdi is successfully collected */
> # cat /sys/kernel/debug/bdi/252:16/stats:
> BdiWriteback:              448 kB
> BdiReclaimable:        1303904 kB
> BdiDirtyThresh:      189914124 kB
> DirtyThresh:         195337564 kB
> BackgroundThresh:     32516508 kB
> BdiDirtied:            3591392 kB
> BdiWritten:            2287488 kB
> BdiWriteBandwidth:      322248 kBps
> b_dirty:                     0
> b_io:                        0
> b_more_io:                   2
> b_dirty_time:                0
> bdi_list:                    1
> state:                       1
> 
> /* per wb writeback info is collected */
> # cat /sys/kernel/debug/bdi/252:16/wb_stats:
> cat wb_stats
> WbCgIno:                    1
> WbWriteback:                0 kB
> WbReclaimable:              0 kB
> WbDirtyThresh:              0 kB
> WbDirtied:                  0 kB
> WbWritten:                  0 kB
> WbWriteBandwidth:      102400 kBps
> b_dirty:                    0
> b_io:                       0
> b_more_io:                  0
> b_dirty_time:               0
> state:                      1
> WbCgIno:                 4284
> WbWriteback:              448 kB
> WbReclaimable:         818944 kB
> WbDirtyThresh:        3096524 kB
> WbDirtied:            2266880 kB
> WbWritten:            1447936 kB
> WbWriteBandwidth:      214036 kBps
> b_dirty:                    0
> b_io:                       0
> b_more_io:                  1
> b_dirty_time:               0
> state:                      5
> WbCgIno:                 4325
> WbWriteback:              224 kB
> WbReclaimable:         819392 kB
> WbDirtyThresh:        2920088 kB
> WbDirtied:            2551808 kB
> WbWritten:            1732416 kB
> WbWriteBandwidth:      201832 kBps
> b_dirty:                    0
> b_io:                       0
> b_more_io:                  1
> b_dirty_time:               0
> state:                      5
> 
> /* monitor writeback info */
> # ./wb_monitor.py 252:16 -c
>                   writeback  reclaimable   dirtied   written    avg_bw
> 252:16_1                  0            0         0         0    102400
> 252:16_4284             672       820064   9230368   8410304    685612
> 252:16_4325             896       819840  10491264   9671648    652348
> 252:16                 1568      1639904  19721632  18081952   1440360
> 
> 
>                   writeback  reclaimable   dirtied   written    avg_bw
> 252:16_1                  0            0         0         0    102400
> 252:16_4284             672       820064   9230368   8410304    685612
> 252:16_4325             896       819840  10491264   9671648    652348
> 252:16                 1568      1639904  19721632  18081952   1440360
> ...

So I'm wondering: Are you implementing this just because this looks
interesting or do you have a real need for the functionality? Why?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show
  2024-03-20 13:21   ` Brian Foster
@ 2024-03-21  3:44     ` Kemeng Shi
  2024-03-21 12:10       ` Brian Foster
  0 siblings, 1 reply; 38+ messages in thread
From: Kemeng Shi @ 2024-03-21  3:44 UTC (permalink / raw)
  To: Brian Foster
  Cc: akpm, tj, linux-mm, linux-fsdevel, linux-kernel, willy, jack,
	dsterba, mjguzik, dhowells, peterz



on 3/20/2024 9:21 PM, Brian Foster wrote:
> On Wed, Mar 20, 2024 at 07:02:17PM +0800, Kemeng Shi wrote:
>> /sys/kernel/debug/bdi/xxx/stats is supposed to show writeback information
>> of whole bdi, but only writeback information of bdi in root cgroup is
>> collected. So writeback information in non-root cgroup are missing now.
>> To be more specific, considering following case:
>>
>> /* create writeback cgroup */
>> cd /sys/fs/cgroup
>> echo "+memory +io" > cgroup.subtree_control
>> mkdir group1
>> cd group1
>> echo $$ > cgroup.procs
>> /* do writeback in cgroup */
>> fio -name test -filename=/dev/vdb ...
>> /* get writeback info of bdi */
>> cat /sys/kernel/debug/bdi/xxx/stats
>> The cat result unexpectedly implies that there is no writeback on target
>> bdi.
>>
>> Fix this by collecting stats of all wb in bdi instead of only wb in
>> root cgroup.
>>
>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>> ---
>>  mm/backing-dev.c | 93 ++++++++++++++++++++++++++++++++++++------------
>>  1 file changed, 70 insertions(+), 23 deletions(-)
>>
>> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>> index 5f2be8c8df11..788702b6c5dd 100644
>> --- a/mm/backing-dev.c
>> +++ b/mm/backing-dev.c
> ...
>> @@ -46,31 +59,65 @@ static void bdi_debug_init(void)
>>  	bdi_debug_root = debugfs_create_dir("bdi", NULL);
>>  }
>>  
> ...
>> +#ifdef CONFIG_CGROUP_WRITEBACK
>> +static void bdi_collect_stats(struct backing_dev_info *bdi,
>> +			      struct wb_stats *stats)
>> +{
>> +	struct bdi_writeback *wb;
>> +
>> +	/* protect wb from release */
>> +	mutex_lock(&bdi->cgwb_release_mutex);
>> +	list_for_each_entry(wb, &bdi->wb_list, bdi_node)
>> +		collect_wb_stats(stats, wb);
>> +	mutex_unlock(&bdi->cgwb_release_mutex);
>> +}
>> +#else
>> +static void bdi_collect_stats(struct backing_dev_info *bdi,
>> +			      struct wb_stats *stats)
>> +{
>> +	collect_wb_stats(stats, &bdi->wb);
>> +}
>> +#endif
>> +
> 
> I'm not familiar enough with the cgwb code to say for sure (and I'd
> probably wait for more high level feedback before worrying too much
> about this), but do we need the ifdef here just to iterate ->wb_list?
>>From looking at the code, it appears bdi->wb ends up on the list while
> the bdi is registered for both cases, so that distinction seems
> unnecessary. WRT to wb release protection, I wonder if this could use a
Currently, we have ifdef trying to remove unnecessary cost when
CONFIG_CGROUP_WRITEBACK is not enabled, see defination of cgwb_bdi_register
and cgwb_remove_from_bdi_list for example. So I try to define bdi_collect_stats
in similar way.
> combination of rcu_read_lock()/list_for_each_safe() and wb_tryget() on
> each wb before collecting its stats..? See how bdi_split_work_to_wbs()
> works, for example.
The combination of rcu_read_lock()/list_for_each_safe() and wb_tryget()
should work fine.
With ifdef, bdi_collect_stats takes no extra cost when CONFIG_CGROUP_WRITEBACK
is not enabled and is consistent with existing code style, so I still prefer
this way. Yes, The extra cost is not a big deal as it only exists in debug mode,
so it's acceptable to use the suggested combination in next version if you are
still strongly aganst this.

> 
> Also I see a patch conflict/compile error on patch 2 due to
> __wb_calc_thresh() only taking one parameter in my tree. What's the
> baseline commit for this series?
> 
Sorry for missing this, this seris is based on another patchset [1] which is still
under review.
Look forward to your reply!

Thansk
Kemeng

[1] https://lore.kernel.org/lkml/20240123183332.876854-1-shikemeng@huaweicloud.com/T/#mc6455784a63d0f8aa1a2f5aff325abcdf9336b76

> Brian
> 
>> +static int bdi_debug_stats_show(struct seq_file *m, void *v)
>> +{
>> +	struct backing_dev_info *bdi = m->private;
>> +	unsigned long background_thresh;
>> +	unsigned long dirty_thresh;
>> +	struct wb_stats stats;
>> +	unsigned long tot_bw;
>> +
>>  	global_dirty_limits(&background_thresh, &dirty_thresh);
>> -	wb_thresh = wb_calc_thresh(wb, dirty_thresh);
>> +
>> +	memset(&stats, 0, sizeof(stats));
>> +	stats.dirty_thresh = dirty_thresh;
>> +	bdi_collect_stats(bdi, &stats);
>> +
>> +	tot_bw = atomic_long_read(&bdi->tot_write_bandwidth);
>>  
>>  	seq_printf(m,
>>  		   "BdiWriteback:       %10lu kB\n"
>> @@ -87,18 +134,18 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>>  		   "b_dirty_time:       %10lu\n"
>>  		   "bdi_list:           %10u\n"
>>  		   "state:              %10lx\n",
>> -		   (unsigned long) K(wb_stat(wb, WB_WRITEBACK)),
>> -		   (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)),
>> -		   K(wb_thresh),
>> +		   K(stats.nr_writeback),
>> +		   K(stats.nr_reclaimable),
>> +		   K(stats.wb_thresh),
>>  		   K(dirty_thresh),
>>  		   K(background_thresh),
>> -		   (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
>> -		   (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
>> -		   (unsigned long) K(wb->write_bandwidth),
>> -		   nr_dirty,
>> -		   nr_io,
>> -		   nr_more_io,
>> -		   nr_dirty_time,
>> +		   K(stats.nr_dirtied),
>> +		   K(stats.nr_written),
>> +		   K(tot_bw),
>> +		   stats.nr_dirty,
>> +		   stats.nr_io,
>> +		   stats.nr_more_io,
>> +		   stats.nr_dirty_time,
>>  		   !list_empty(&bdi->bdi_list), bdi->wb.state);
>>  
>>  	return 0;
>> -- 
>> 2.30.0
>>
>>
> 
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/6] writeback: support retrieving per group debug writeback stats of bdi
  2024-03-20 15:01   ` Tejun Heo
@ 2024-03-21  3:45     ` Kemeng Shi
  0 siblings, 0 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-03-21  3:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: akpm, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	jack, dsterba, mjguzik, dhowells, peterz



on 3/20/2024 11:01 PM, Tejun Heo wrote:
> On Wed, Mar 20, 2024 at 07:02:18PM +0800, Kemeng Shi wrote:
>> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
>> index 9845cb62e40b..bb1ce1123b52 100644
>> --- a/include/linux/writeback.h
>> +++ b/include/linux/writeback.h
>> @@ -355,6 +355,7 @@ int dirtytime_interval_handler(struct ctl_table *table, int write,
>>  
>>  void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
>>  unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
>> +unsigned long wb_calc_cg_thresh(struct bdi_writeback *wb);
> 
> Would cgwb_calc_thresh() be an easier name?
> 
Sure, will rename it in next version. Thansk!
> Thanks.
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/6] workqueue: remove unnecessary import and function in wq_monitor.py
  2024-03-20 15:03   ` Tejun Heo
@ 2024-03-21  6:08     ` Kemeng Shi
  0 siblings, 0 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-03-21  6:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: akpm, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	jack, dsterba, mjguzik, dhowells, peterz



on 3/20/2024 11:03 PM, Tejun Heo wrote:
> On Wed, Mar 20, 2024 at 07:02:19PM +0800, Kemeng Shi wrote:
>> Remove unnecessary import and function in wq_monitor.py
>>
>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> 
> Ah, this is a workqueue patch. Do you mind sending this separately? I'll
> apply it to the wq tree after -rc1.
Sure, I send it separately now. Thanks!
> 
> Thanks.
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/6] writeback: add wb_monitor.py script to monitor writeback info on bdi
  2024-03-20 15:12   ` Tejun Heo
@ 2024-03-21  6:22     ` Kemeng Shi
  0 siblings, 0 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-03-21  6:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: akpm, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	jack, dsterba, mjguzik, dhowells, peterz



on 3/20/2024 11:12 PM, Tejun Heo wrote:
> On Wed, Mar 20, 2024 at 07:02:20PM +0800, Kemeng Shi wrote:
>> Add wb_monitor.py script to monitor writeback information on backing dev
>> which makes it easier and more convenient to observe writeback behaviors
>> of running system.
>>
>> This script is based on copy of wq_monitor.py.
> 
> I don't think I did that when wq_monitor.py was added but would you mind
> adding an example output so that people can have a better idea on what to
> expect?
Sorry for the confusion. What I mean is that wb_monitor.py is initially copy
from wq_monitor.py and then is modified to monitor writeback info. I will
correct the description and add the example outputs.

Thanks

> 
> Thanks.
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/6] writeback: remove unneeded GDTC_INIT_NO_WB
  2024-03-20 15:15   ` Tejun Heo
@ 2024-03-21  7:12     ` Kemeng Shi
  2024-03-25 20:26       ` Tejun Heo
  2024-03-27  9:33       ` Jan Kara
  0 siblings, 2 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-03-21  7:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: akpm, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	jack, dsterba, mjguzik, dhowells, peterz



on 3/20/2024 11:15 PM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Mar 20, 2024 at 07:02:22PM +0800, Kemeng Shi wrote:
>> We never use gdtc->dom set with GDTC_INIT_NO_WB, just remove unneeded
>> GDTC_INIT_NO_WB
>>
>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ...
>>  void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
>>  {
>> -	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
>> +	struct dirty_throttle_control gdtc = { };
> 
> Even if it's currently not referenced, wouldn't it still be better to always
> guarantee that a dtc's dom is always initialized? I'm not sure what we get
> by removing this.
As we explicitly use GDTC_INIT_NO_WB to set global_wb_domain before
calculating dirty limit with domain_dirty_limits, I intuitively think the dirty
limit calculation in domain_dirty_limits is related to global_wb_domain when
CONFIG_WRITEBACK_CGROUP is enabled while the truth is not. So this is a little
confusing to me.
Would it be acceptable to you that we keep useing GDTC_INIT_NO_WB but
define GDTC_INIT_NO_WB to null fow now and redefine GDTC_INIT_NO_WB when some
member of gdtc is really needed.
Of couse I'm not insistent on this. Would like to hear you suggestion. Thanks!

> 
> Thanks.
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/6] Improve visibility of writeback
  2024-03-20 17:22 ` Jan Kara
@ 2024-03-21  8:12   ` Kemeng Shi
  2024-03-21 18:07     ` Jan Kara
  0 siblings, 1 reply; 38+ messages in thread
From: Kemeng Shi @ 2024-03-21  8:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: akpm, tj, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	dsterba, mjguzik, dhowells, peterz



on 3/21/2024 1:22 AM, Jan Kara wrote:
> On Wed 20-03-24 19:02:16, Kemeng Shi wrote:
>> This series tries to improve visilibity of writeback. Patch 1 make
>> /sys/kernel/debug/bdi/xxx/stats show writeback info of whole bdi
>> instead of only writeback info in root cgroup. Patch 2 add a new
>> debug file /sys/kernel/debug/bdi/xxx/wb_stats to show per wb writeback
>> info. Patch 4 add wb_monitor.py to monitor basic writeback info
>> of running system, more info could be added on demand. Rest patches
>> are some random cleanups. More details can be found in respective
>> patches. Thanks!
>>
>> Following domain hierarchy is tested:
>>                 global domain (320G)
>>                 /                 \
>>         cgroup domain1(10G)     cgroup domain2(10G)
>>                 |                 |
>> bdi            wb1               wb2
>>
>> /* all writeback info of bdi is successfully collected */
>> # cat /sys/kernel/debug/bdi/252:16/stats:
>> BdiWriteback:              448 kB
>> BdiReclaimable:        1303904 kB
>> BdiDirtyThresh:      189914124 kB
>> DirtyThresh:         195337564 kB
>> BackgroundThresh:     32516508 kB
>> BdiDirtied:            3591392 kB
>> BdiWritten:            2287488 kB
>> BdiWriteBandwidth:      322248 kBps
>> b_dirty:                     0
>> b_io:                        0
>> b_more_io:                   2
>> b_dirty_time:                0
>> bdi_list:                    1
>> state:                       1
>>
>> /* per wb writeback info is collected */
>> # cat /sys/kernel/debug/bdi/252:16/wb_stats:
>> cat wb_stats
>> WbCgIno:                    1
>> WbWriteback:                0 kB
>> WbReclaimable:              0 kB
>> WbDirtyThresh:              0 kB
>> WbDirtied:                  0 kB
>> WbWritten:                  0 kB
>> WbWriteBandwidth:      102400 kBps
>> b_dirty:                    0
>> b_io:                       0
>> b_more_io:                  0
>> b_dirty_time:               0
>> state:                      1
>> WbCgIno:                 4284
>> WbWriteback:              448 kB
>> WbReclaimable:         818944 kB
>> WbDirtyThresh:        3096524 kB
>> WbDirtied:            2266880 kB
>> WbWritten:            1447936 kB
>> WbWriteBandwidth:      214036 kBps
>> b_dirty:                    0
>> b_io:                       0
>> b_more_io:                  1
>> b_dirty_time:               0
>> state:                      5
>> WbCgIno:                 4325
>> WbWriteback:              224 kB
>> WbReclaimable:         819392 kB
>> WbDirtyThresh:        2920088 kB
>> WbDirtied:            2551808 kB
>> WbWritten:            1732416 kB
>> WbWriteBandwidth:      201832 kBps
>> b_dirty:                    0
>> b_io:                       0
>> b_more_io:                  1
>> b_dirty_time:               0
>> state:                      5
>>
>> /* monitor writeback info */
>> # ./wb_monitor.py 252:16 -c
>>                   writeback  reclaimable   dirtied   written    avg_bw
>> 252:16_1                  0            0         0         0    102400
>> 252:16_4284             672       820064   9230368   8410304    685612
>> 252:16_4325             896       819840  10491264   9671648    652348
>> 252:16                 1568      1639904  19721632  18081952   1440360
>>
>>
>>                   writeback  reclaimable   dirtied   written    avg_bw
>> 252:16_1                  0            0         0         0    102400
>> 252:16_4284             672       820064   9230368   8410304    685612
>> 252:16_4325             896       819840  10491264   9671648    652348
>> 252:16                 1568      1639904  19721632  18081952   1440360
>> ...
> 
> So I'm wondering: Are you implementing this just because this looks
> interesting or do you have a real need for the functionality? Why?
Hi Jan, I added debug files to test change in [1] which changes the way how
dirty background threshold of wb is calculated. Without debug files, we could
only monitor writeback to imply that threshold is corrected.
In current patchset, debug info has not included dirty background threshold yet,
I will add it when discution of calculation of dirty background threshold in [1]
is done.
The wb_monitor.py is suggested by Tejun in [2] to improve visibility of writeback.
The script is more convenient than trace to monitor writeback behavior of the running
system.

Thanks

[1] https://lore.kernel.org/lkml/a747dc7d-f24a-08bd-d969-d3fb35e151b7@huaweicloud.com/
[2] https://lore.kernel.org/lkml/ZcUsOb_fyvYr-zZ-@slm.duckdns.org/
> 
> 								Honza
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show
  2024-03-21  3:44     ` Kemeng Shi
@ 2024-03-21 12:10       ` Brian Foster
  2024-03-22  7:32         ` Kemeng Shi
  0 siblings, 1 reply; 38+ messages in thread
From: Brian Foster @ 2024-03-21 12:10 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: akpm, tj, linux-mm, linux-fsdevel, linux-kernel, willy, jack,
	dsterba, mjguzik, dhowells, peterz

On Thu, Mar 21, 2024 at 11:44:40AM +0800, Kemeng Shi wrote:
> 
> 
> on 3/20/2024 9:21 PM, Brian Foster wrote:
> > On Wed, Mar 20, 2024 at 07:02:17PM +0800, Kemeng Shi wrote:
> >> /sys/kernel/debug/bdi/xxx/stats is supposed to show writeback information
> >> of whole bdi, but only writeback information of bdi in root cgroup is
> >> collected. So writeback information in non-root cgroup are missing now.
> >> To be more specific, considering following case:
> >>
> >> /* create writeback cgroup */
> >> cd /sys/fs/cgroup
> >> echo "+memory +io" > cgroup.subtree_control
> >> mkdir group1
> >> cd group1
> >> echo $$ > cgroup.procs
> >> /* do writeback in cgroup */
> >> fio -name test -filename=/dev/vdb ...
> >> /* get writeback info of bdi */
> >> cat /sys/kernel/debug/bdi/xxx/stats
> >> The cat result unexpectedly implies that there is no writeback on target
> >> bdi.
> >>
> >> Fix this by collecting stats of all wb in bdi instead of only wb in
> >> root cgroup.
> >>
> >> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> >> ---
> >>  mm/backing-dev.c | 93 ++++++++++++++++++++++++++++++++++++------------
> >>  1 file changed, 70 insertions(+), 23 deletions(-)
> >>
> >> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> >> index 5f2be8c8df11..788702b6c5dd 100644
> >> --- a/mm/backing-dev.c
> >> +++ b/mm/backing-dev.c
> > ...
> >> @@ -46,31 +59,65 @@ static void bdi_debug_init(void)
> >>  	bdi_debug_root = debugfs_create_dir("bdi", NULL);
> >>  }
> >>  
> > ...
> >> +#ifdef CONFIG_CGROUP_WRITEBACK
> >> +static void bdi_collect_stats(struct backing_dev_info *bdi,
> >> +			      struct wb_stats *stats)
> >> +{
> >> +	struct bdi_writeback *wb;
> >> +
> >> +	/* protect wb from release */
> >> +	mutex_lock(&bdi->cgwb_release_mutex);
> >> +	list_for_each_entry(wb, &bdi->wb_list, bdi_node)
> >> +		collect_wb_stats(stats, wb);
> >> +	mutex_unlock(&bdi->cgwb_release_mutex);
> >> +}
> >> +#else
> >> +static void bdi_collect_stats(struct backing_dev_info *bdi,
> >> +			      struct wb_stats *stats)
> >> +{
> >> +	collect_wb_stats(stats, &bdi->wb);
> >> +}
> >> +#endif
> >> +
> > 
> > I'm not familiar enough with the cgwb code to say for sure (and I'd
> > probably wait for more high level feedback before worrying too much
> > about this), but do we need the ifdef here just to iterate ->wb_list?
> >>From looking at the code, it appears bdi->wb ends up on the list while
> > the bdi is registered for both cases, so that distinction seems
> > unnecessary. WRT to wb release protection, I wonder if this could use a
> Currently, we have ifdef trying to remove unnecessary cost when
> CONFIG_CGROUP_WRITEBACK is not enabled, see defination of cgwb_bdi_register
> and cgwb_remove_from_bdi_list for example. So I try to define bdi_collect_stats
> in similar way.
> > combination of rcu_read_lock()/list_for_each_safe() and wb_tryget() on
> > each wb before collecting its stats..? See how bdi_split_work_to_wbs()
> > works, for example.
> The combination of rcu_read_lock()/list_for_each_safe() and wb_tryget()
> should work fine.
> With ifdef, bdi_collect_stats takes no extra cost when CONFIG_CGROUP_WRITEBACK
> is not enabled and is consistent with existing code style, so I still prefer
> this way. Yes, The extra cost is not a big deal as it only exists in debug mode,
> so it's acceptable to use the suggested combination in next version if you are
> still strongly aganst this.
> 

Ok. I also previously missed that there are two implementations of
bdi_split_work_to_wbs() based on CGROUP_WRITEBACK. It seems reasonable
enough to me to follow that precedent for the !CGROUP_WRITEBACK case.

It still seems to make more sense to me to walk the list similar to how
bdi_split_work_to_wbs() does for the CGROUP_WRITEBACK enabled case. Do
you agree?

Brian

> > 
> > Also I see a patch conflict/compile error on patch 2 due to
> > __wb_calc_thresh() only taking one parameter in my tree. What's the
> > baseline commit for this series?
> > 
> Sorry for missing this, this seris is based on another patchset [1] which is still
> under review.
> Look forward to your reply!
> 
> Thansk
> Kemeng
> 
> [1] https://lore.kernel.org/lkml/20240123183332.876854-1-shikemeng@huaweicloud.com/T/#mc6455784a63d0f8aa1a2f5aff325abcdf9336b76
> 
> > Brian
> > 
> >> +static int bdi_debug_stats_show(struct seq_file *m, void *v)
> >> +{
> >> +	struct backing_dev_info *bdi = m->private;
> >> +	unsigned long background_thresh;
> >> +	unsigned long dirty_thresh;
> >> +	struct wb_stats stats;
> >> +	unsigned long tot_bw;
> >> +
> >>  	global_dirty_limits(&background_thresh, &dirty_thresh);
> >> -	wb_thresh = wb_calc_thresh(wb, dirty_thresh);
> >> +
> >> +	memset(&stats, 0, sizeof(stats));
> >> +	stats.dirty_thresh = dirty_thresh;
> >> +	bdi_collect_stats(bdi, &stats);
> >> +
> >> +	tot_bw = atomic_long_read(&bdi->tot_write_bandwidth);
> >>  
> >>  	seq_printf(m,
> >>  		   "BdiWriteback:       %10lu kB\n"
> >> @@ -87,18 +134,18 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
> >>  		   "b_dirty_time:       %10lu\n"
> >>  		   "bdi_list:           %10u\n"
> >>  		   "state:              %10lx\n",
> >> -		   (unsigned long) K(wb_stat(wb, WB_WRITEBACK)),
> >> -		   (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)),
> >> -		   K(wb_thresh),
> >> +		   K(stats.nr_writeback),
> >> +		   K(stats.nr_reclaimable),
> >> +		   K(stats.wb_thresh),
> >>  		   K(dirty_thresh),
> >>  		   K(background_thresh),
> >> -		   (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
> >> -		   (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
> >> -		   (unsigned long) K(wb->write_bandwidth),
> >> -		   nr_dirty,
> >> -		   nr_io,
> >> -		   nr_more_io,
> >> -		   nr_dirty_time,
> >> +		   K(stats.nr_dirtied),
> >> +		   K(stats.nr_written),
> >> +		   K(tot_bw),
> >> +		   stats.nr_dirty,
> >> +		   stats.nr_io,
> >> +		   stats.nr_more_io,
> >> +		   stats.nr_dirty_time,
> >>  		   !list_empty(&bdi->bdi_list), bdi->wb.state);
> >>  
> >>  	return 0;
> >> -- 
> >> 2.30.0
> >>
> >>
> > 
> > 
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show
  2024-03-20 11:02 ` [PATCH 1/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show Kemeng Shi
  2024-03-20 13:21   ` Brian Foster
@ 2024-03-21 18:06   ` Jan Kara
  2024-03-22  7:51     ` Kemeng Shi
  1 sibling, 1 reply; 38+ messages in thread
From: Jan Kara @ 2024-03-21 18:06 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: akpm, tj, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	jack, dsterba, mjguzik, dhowells, peterz

On Wed 20-03-24 19:02:17, Kemeng Shi wrote:
> /sys/kernel/debug/bdi/xxx/stats is supposed to show writeback information
> of whole bdi, but only writeback information of bdi in root cgroup is
> collected. So writeback information in non-root cgroup are missing now.
> To be more specific, considering following case:
> 
> /* create writeback cgroup */
> cd /sys/fs/cgroup
> echo "+memory +io" > cgroup.subtree_control
> mkdir group1
> cd group1
> echo $$ > cgroup.procs
> /* do writeback in cgroup */
> fio -name test -filename=/dev/vdb ...
> /* get writeback info of bdi */
> cat /sys/kernel/debug/bdi/xxx/stats
> The cat result unexpectedly implies that there is no writeback on target
> bdi.
> 
> Fix this by collecting stats of all wb in bdi instead of only wb in
> root cgroup.
> 
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>

Looks mostly good, one comment below:

> ---
>  mm/backing-dev.c | 93 ++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 70 insertions(+), 23 deletions(-)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 5f2be8c8df11..788702b6c5dd 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -39,6 +39,19 @@ struct workqueue_struct *bdi_wq;
>  #include <linux/debugfs.h>
>  #include <linux/seq_file.h>
>  
> +struct wb_stats {
> +	unsigned long nr_dirty;
> +	unsigned long nr_io;
> +	unsigned long nr_more_io;
> +	unsigned long nr_dirty_time;
> +	unsigned long nr_writeback;
> +	unsigned long nr_reclaimable;
> +	unsigned long nr_dirtied;
> +	unsigned long nr_written;
> +	unsigned long dirty_thresh;
> +	unsigned long wb_thresh;
> +};
> +
>  static struct dentry *bdi_debug_root;
>  
>  static void bdi_debug_init(void)
> @@ -46,31 +59,65 @@ static void bdi_debug_init(void)
>  	bdi_debug_root = debugfs_create_dir("bdi", NULL);
>  }
>  
> -static int bdi_debug_stats_show(struct seq_file *m, void *v)
> +static void collect_wb_stats(struct wb_stats *stats,
> +			     struct bdi_writeback *wb)
>  {
> -	struct backing_dev_info *bdi = m->private;
> -	struct bdi_writeback *wb = &bdi->wb;
> -	unsigned long background_thresh;
> -	unsigned long dirty_thresh;
> -	unsigned long wb_thresh;
> -	unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time;
>  	struct inode *inode;
>  
> -	nr_dirty = nr_io = nr_more_io = nr_dirty_time = 0;
>  	spin_lock(&wb->list_lock);
>  	list_for_each_entry(inode, &wb->b_dirty, i_io_list)
> -		nr_dirty++;
> +		stats->nr_dirty++;
>  	list_for_each_entry(inode, &wb->b_io, i_io_list)
> -		nr_io++;
> +		stats->nr_io++;
>  	list_for_each_entry(inode, &wb->b_more_io, i_io_list)
> -		nr_more_io++;
> +		stats->nr_more_io++;
>  	list_for_each_entry(inode, &wb->b_dirty_time, i_io_list)
>  		if (inode->i_state & I_DIRTY_TIME)
> -			nr_dirty_time++;
> +			stats->nr_dirty_time++;
>  	spin_unlock(&wb->list_lock);
>  
> +	stats->nr_writeback += wb_stat(wb, WB_WRITEBACK);
> +	stats->nr_reclaimable += wb_stat(wb, WB_RECLAIMABLE);
> +	stats->nr_dirtied += wb_stat(wb, WB_DIRTIED);
> +	stats->nr_written += wb_stat(wb, WB_WRITTEN);
> +	stats->wb_thresh += wb_calc_thresh(wb, stats->dirty_thresh);
> +}
> +
> +#ifdef CONFIG_CGROUP_WRITEBACK
> +static void bdi_collect_stats(struct backing_dev_info *bdi,
> +			      struct wb_stats *stats)
> +{
> +	struct bdi_writeback *wb;
> +
> +	/* protect wb from release */
> +	mutex_lock(&bdi->cgwb_release_mutex);
> +	list_for_each_entry(wb, &bdi->wb_list, bdi_node)
> +		collect_wb_stats(stats, wb);
> +	mutex_unlock(&bdi->cgwb_release_mutex);
> +}

So AFAICT this function can race against
  bdi_unregister() -> wb_shutdown(&bdi->wb)

because that doesn't take the cgwb_release_mutex. So we either need the RCU
protection as Brian suggested or cgwb_lock or something. But given
collect_wb_stats() can take a significant amount of time (traversing all
the lists etc.) I think we'll need something more clever.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/6] Improve visibility of writeback
  2024-03-21  8:12   ` Kemeng Shi
@ 2024-03-21 18:07     ` Jan Kara
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Kara @ 2024-03-21 18:07 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: Jan Kara, akpm, tj, linux-mm, linux-fsdevel, linux-kernel, willy,
	bfoster, dsterba, mjguzik, dhowells, peterz

On Thu 21-03-24 16:12:52, Kemeng Shi wrote:
> on 3/21/2024 1:22 AM, Jan Kara wrote:
> > On Wed 20-03-24 19:02:16, Kemeng Shi wrote:
> >> This series tries to improve visilibity of writeback. Patch 1 make
> >> /sys/kernel/debug/bdi/xxx/stats show writeback info of whole bdi
> >> instead of only writeback info in root cgroup. Patch 2 add a new
> >> debug file /sys/kernel/debug/bdi/xxx/wb_stats to show per wb writeback
> >> info. Patch 4 add wb_monitor.py to monitor basic writeback info
> >> of running system, more info could be added on demand. Rest patches
> >> are some random cleanups. More details can be found in respective
> >> patches. Thanks!
> >>
> >> Following domain hierarchy is tested:
> >>                 global domain (320G)
> >>                 /                 \
> >>         cgroup domain1(10G)     cgroup domain2(10G)
> >>                 |                 |
> >> bdi            wb1               wb2
> >>
> >> /* all writeback info of bdi is successfully collected */
> >> # cat /sys/kernel/debug/bdi/252:16/stats:
> >> BdiWriteback:              448 kB
> >> BdiReclaimable:        1303904 kB
> >> BdiDirtyThresh:      189914124 kB
> >> DirtyThresh:         195337564 kB
> >> BackgroundThresh:     32516508 kB
> >> BdiDirtied:            3591392 kB
> >> BdiWritten:            2287488 kB
> >> BdiWriteBandwidth:      322248 kBps
> >> b_dirty:                     0
> >> b_io:                        0
> >> b_more_io:                   2
> >> b_dirty_time:                0
> >> bdi_list:                    1
> >> state:                       1
> >>
> >> /* per wb writeback info is collected */
> >> # cat /sys/kernel/debug/bdi/252:16/wb_stats:
> >> cat wb_stats
> >> WbCgIno:                    1
> >> WbWriteback:                0 kB
> >> WbReclaimable:              0 kB
> >> WbDirtyThresh:              0 kB
> >> WbDirtied:                  0 kB
> >> WbWritten:                  0 kB
> >> WbWriteBandwidth:      102400 kBps
> >> b_dirty:                    0
> >> b_io:                       0
> >> b_more_io:                  0
> >> b_dirty_time:               0
> >> state:                      1
> >> WbCgIno:                 4284
> >> WbWriteback:              448 kB
> >> WbReclaimable:         818944 kB
> >> WbDirtyThresh:        3096524 kB
> >> WbDirtied:            2266880 kB
> >> WbWritten:            1447936 kB
> >> WbWriteBandwidth:      214036 kBps
> >> b_dirty:                    0
> >> b_io:                       0
> >> b_more_io:                  1
> >> b_dirty_time:               0
> >> state:                      5
> >> WbCgIno:                 4325
> >> WbWriteback:              224 kB
> >> WbReclaimable:         819392 kB
> >> WbDirtyThresh:        2920088 kB
> >> WbDirtied:            2551808 kB
> >> WbWritten:            1732416 kB
> >> WbWriteBandwidth:      201832 kBps
> >> b_dirty:                    0
> >> b_io:                       0
> >> b_more_io:                  1
> >> b_dirty_time:               0
> >> state:                      5
> >>
> >> /* monitor writeback info */
> >> # ./wb_monitor.py 252:16 -c
> >>                   writeback  reclaimable   dirtied   written    avg_bw
> >> 252:16_1                  0            0         0         0    102400
> >> 252:16_4284             672       820064   9230368   8410304    685612
> >> 252:16_4325             896       819840  10491264   9671648    652348
> >> 252:16                 1568      1639904  19721632  18081952   1440360
> >>
> >>
> >>                   writeback  reclaimable   dirtied   written    avg_bw
> >> 252:16_1                  0            0         0         0    102400
> >> 252:16_4284             672       820064   9230368   8410304    685612
> >> 252:16_4325             896       819840  10491264   9671648    652348
> >> 252:16                 1568      1639904  19721632  18081952   1440360
> >> ...
> > 
> > So I'm wondering: Are you implementing this just because this looks
> > interesting or do you have a real need for the functionality? Why?
> Hi Jan, I added debug files to test change in [1] which changes the way how
> dirty background threshold of wb is calculated. Without debug files, we could
> only monitor writeback to imply that threshold is corrected.
> In current patchset, debug info has not included dirty background threshold yet,
> I will add it when discution of calculation of dirty background threshold in [1]
> is done.
> The wb_monitor.py is suggested by Tejun in [2] to improve visibility of writeback.
> The script is more convenient than trace to monitor writeback behavior of the running
> system.

Thanks for the pointer. OK, I agree this is useful so let's have a look
into the code :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show
  2024-03-21 12:10       ` Brian Foster
@ 2024-03-22  7:32         ` Kemeng Shi
  0 siblings, 0 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-03-22  7:32 UTC (permalink / raw)
  To: Brian Foster
  Cc: akpm, tj, linux-mm, linux-fsdevel, linux-kernel, willy, jack,
	dsterba, mjguzik, dhowells, peterz



on 3/21/2024 8:10 PM, Brian Foster wrote:
> On Thu, Mar 21, 2024 at 11:44:40AM +0800, Kemeng Shi wrote:
>>
>>
>> on 3/20/2024 9:21 PM, Brian Foster wrote:
>>> On Wed, Mar 20, 2024 at 07:02:17PM +0800, Kemeng Shi wrote:
>>>> /sys/kernel/debug/bdi/xxx/stats is supposed to show writeback information
>>>> of whole bdi, but only writeback information of bdi in root cgroup is
>>>> collected. So writeback information in non-root cgroup are missing now.
>>>> To be more specific, considering following case:
>>>>
>>>> /* create writeback cgroup */
>>>> cd /sys/fs/cgroup
>>>> echo "+memory +io" > cgroup.subtree_control
>>>> mkdir group1
>>>> cd group1
>>>> echo $$ > cgroup.procs
>>>> /* do writeback in cgroup */
>>>> fio -name test -filename=/dev/vdb ...
>>>> /* get writeback info of bdi */
>>>> cat /sys/kernel/debug/bdi/xxx/stats
>>>> The cat result unexpectedly implies that there is no writeback on target
>>>> bdi.
>>>>
>>>> Fix this by collecting stats of all wb in bdi instead of only wb in
>>>> root cgroup.
>>>>
>>>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>>>> ---
>>>>  mm/backing-dev.c | 93 ++++++++++++++++++++++++++++++++++++------------
>>>>  1 file changed, 70 insertions(+), 23 deletions(-)
>>>>
>>>> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>>>> index 5f2be8c8df11..788702b6c5dd 100644
>>>> --- a/mm/backing-dev.c
>>>> +++ b/mm/backing-dev.c
>>> ...
>>>> @@ -46,31 +59,65 @@ static void bdi_debug_init(void)
>>>>  	bdi_debug_root = debugfs_create_dir("bdi", NULL);
>>>>  }
>>>>  
>>> ...
>>>> +#ifdef CONFIG_CGROUP_WRITEBACK
>>>> +static void bdi_collect_stats(struct backing_dev_info *bdi,
>>>> +			      struct wb_stats *stats)
>>>> +{
>>>> +	struct bdi_writeback *wb;
>>>> +
>>>> +	/* protect wb from release */
>>>> +	mutex_lock(&bdi->cgwb_release_mutex);
>>>> +	list_for_each_entry(wb, &bdi->wb_list, bdi_node)
>>>> +		collect_wb_stats(stats, wb);
>>>> +	mutex_unlock(&bdi->cgwb_release_mutex);
>>>> +}
>>>> +#else
>>>> +static void bdi_collect_stats(struct backing_dev_info *bdi,
>>>> +			      struct wb_stats *stats)
>>>> +{
>>>> +	collect_wb_stats(stats, &bdi->wb);
>>>> +}
>>>> +#endif
>>>> +
>>>
>>> I'm not familiar enough with the cgwb code to say for sure (and I'd
>>> probably wait for more high level feedback before worrying too much
>>> about this), but do we need the ifdef here just to iterate ->wb_list?
>>> >From looking at the code, it appears bdi->wb ends up on the list while
>>> the bdi is registered for both cases, so that distinction seems
>>> unnecessary. WRT to wb release protection, I wonder if this could use a
>> Currently, we have ifdef trying to remove unnecessary cost when
>> CONFIG_CGROUP_WRITEBACK is not enabled, see defination of cgwb_bdi_register
>> and cgwb_remove_from_bdi_list for example. So I try to define bdi_collect_stats
>> in similar way.
>>> combination of rcu_read_lock()/list_for_each_safe() and wb_tryget() on
>>> each wb before collecting its stats..? See how bdi_split_work_to_wbs()
>>> works, for example.
>> The combination of rcu_read_lock()/list_for_each_safe() and wb_tryget()
>> should work fine.
>> With ifdef, bdi_collect_stats takes no extra cost when CONFIG_CGROUP_WRITEBACK
>> is not enabled and is consistent with existing code style, so I still prefer
>> this way. Yes, The extra cost is not a big deal as it only exists in debug mode,
>> so it's acceptable to use the suggested combination in next version if you are
>> still strongly aganst this.
>>
> 
> Ok. I also previously missed that there are two implementations of
> bdi_split_work_to_wbs() based on CGROUP_WRITEBACK. It seems reasonable
> enough to me to follow that precedent for the !CGROUP_WRITEBACK case.
> 
> It still seems to make more sense to me to walk the list similar to how
> bdi_split_work_to_wbs() does for the CGROUP_WRITEBACK enabled case. Do
> you agree?
Sure, I agree with this and will do it in next version. Thansk!

Kemeng
> 
> Brian
> 
>>>
>>> Also I see a patch conflict/compile error on patch 2 due to
>>> __wb_calc_thresh() only taking one parameter in my tree. What's the
>>> baseline commit for this series?
>>>
>> Sorry for missing this, this seris is based on another patchset [1] which is still
>> under review.
>> Look forward to your reply!
>>
>> Thansk
>> Kemeng
>>
>> [1] https://lore.kernel.org/lkml/20240123183332.876854-1-shikemeng@huaweicloud.com/T/#mc6455784a63d0f8aa1a2f5aff325abcdf9336b76
>>
>>> Brian
>>>
>>>> +static int bdi_debug_stats_show(struct seq_file *m, void *v)
>>>> +{
>>>> +	struct backing_dev_info *bdi = m->private;
>>>> +	unsigned long background_thresh;
>>>> +	unsigned long dirty_thresh;
>>>> +	struct wb_stats stats;
>>>> +	unsigned long tot_bw;
>>>> +
>>>>  	global_dirty_limits(&background_thresh, &dirty_thresh);
>>>> -	wb_thresh = wb_calc_thresh(wb, dirty_thresh);
>>>> +
>>>> +	memset(&stats, 0, sizeof(stats));
>>>> +	stats.dirty_thresh = dirty_thresh;
>>>> +	bdi_collect_stats(bdi, &stats);
>>>> +
>>>> +	tot_bw = atomic_long_read(&bdi->tot_write_bandwidth);
>>>>  
>>>>  	seq_printf(m,
>>>>  		   "BdiWriteback:       %10lu kB\n"
>>>> @@ -87,18 +134,18 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>>>>  		   "b_dirty_time:       %10lu\n"
>>>>  		   "bdi_list:           %10u\n"
>>>>  		   "state:              %10lx\n",
>>>> -		   (unsigned long) K(wb_stat(wb, WB_WRITEBACK)),
>>>> -		   (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)),
>>>> -		   K(wb_thresh),
>>>> +		   K(stats.nr_writeback),
>>>> +		   K(stats.nr_reclaimable),
>>>> +		   K(stats.wb_thresh),
>>>>  		   K(dirty_thresh),
>>>>  		   K(background_thresh),
>>>> -		   (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
>>>> -		   (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
>>>> -		   (unsigned long) K(wb->write_bandwidth),
>>>> -		   nr_dirty,
>>>> -		   nr_io,
>>>> -		   nr_more_io,
>>>> -		   nr_dirty_time,
>>>> +		   K(stats.nr_dirtied),
>>>> +		   K(stats.nr_written),
>>>> +		   K(tot_bw),
>>>> +		   stats.nr_dirty,
>>>> +		   stats.nr_io,
>>>> +		   stats.nr_more_io,
>>>> +		   stats.nr_dirty_time,
>>>>  		   !list_empty(&bdi->bdi_list), bdi->wb.state);
>>>>  
>>>>  	return 0;
>>>> -- 
>>>> 2.30.0
>>>>
>>>>
>>>
>>>
>>
> 
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show
  2024-03-21 18:06   ` Jan Kara
@ 2024-03-22  7:51     ` Kemeng Shi
  2024-03-22 11:58       ` Brian Foster
  0 siblings, 1 reply; 38+ messages in thread
From: Kemeng Shi @ 2024-03-22  7:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: akpm, tj, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	dsterba, mjguzik, dhowells, peterz



on 3/22/2024 2:06 AM, Jan Kara wrote:
> On Wed 20-03-24 19:02:17, Kemeng Shi wrote:
>> /sys/kernel/debug/bdi/xxx/stats is supposed to show writeback information
>> of whole bdi, but only writeback information of bdi in root cgroup is
>> collected. So writeback information in non-root cgroup are missing now.
>> To be more specific, considering following case:
>>
>> /* create writeback cgroup */
>> cd /sys/fs/cgroup
>> echo "+memory +io" > cgroup.subtree_control
>> mkdir group1
>> cd group1
>> echo $$ > cgroup.procs
>> /* do writeback in cgroup */
>> fio -name test -filename=/dev/vdb ...
>> /* get writeback info of bdi */
>> cat /sys/kernel/debug/bdi/xxx/stats
>> The cat result unexpectedly implies that there is no writeback on target
>> bdi.
>>
>> Fix this by collecting stats of all wb in bdi instead of only wb in
>> root cgroup.
>>
>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> 
> Looks mostly good, one comment below:
> 
>> ---
>>  mm/backing-dev.c | 93 ++++++++++++++++++++++++++++++++++++------------
>>  1 file changed, 70 insertions(+), 23 deletions(-)
>>
>> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>> index 5f2be8c8df11..788702b6c5dd 100644
>> --- a/mm/backing-dev.c
>> +++ b/mm/backing-dev.c
>> @@ -39,6 +39,19 @@ struct workqueue_struct *bdi_wq;
>>  #include <linux/debugfs.h>
>>  #include <linux/seq_file.h>
>>  
>> +struct wb_stats {
>> +	unsigned long nr_dirty;
>> +	unsigned long nr_io;
>> +	unsigned long nr_more_io;
>> +	unsigned long nr_dirty_time;
>> +	unsigned long nr_writeback;
>> +	unsigned long nr_reclaimable;
>> +	unsigned long nr_dirtied;
>> +	unsigned long nr_written;
>> +	unsigned long dirty_thresh;
>> +	unsigned long wb_thresh;
>> +};
>> +
>>  static struct dentry *bdi_debug_root;
>>  
>>  static void bdi_debug_init(void)
>> @@ -46,31 +59,65 @@ static void bdi_debug_init(void)
>>  	bdi_debug_root = debugfs_create_dir("bdi", NULL);
>>  }
>>  
>> -static int bdi_debug_stats_show(struct seq_file *m, void *v)
>> +static void collect_wb_stats(struct wb_stats *stats,
>> +			     struct bdi_writeback *wb)
>>  {
>> -	struct backing_dev_info *bdi = m->private;
>> -	struct bdi_writeback *wb = &bdi->wb;
>> -	unsigned long background_thresh;
>> -	unsigned long dirty_thresh;
>> -	unsigned long wb_thresh;
>> -	unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time;
>>  	struct inode *inode;
>>  
>> -	nr_dirty = nr_io = nr_more_io = nr_dirty_time = 0;
>>  	spin_lock(&wb->list_lock);
>>  	list_for_each_entry(inode, &wb->b_dirty, i_io_list)
>> -		nr_dirty++;
>> +		stats->nr_dirty++;
>>  	list_for_each_entry(inode, &wb->b_io, i_io_list)
>> -		nr_io++;
>> +		stats->nr_io++;
>>  	list_for_each_entry(inode, &wb->b_more_io, i_io_list)
>> -		nr_more_io++;
>> +		stats->nr_more_io++;
>>  	list_for_each_entry(inode, &wb->b_dirty_time, i_io_list)
>>  		if (inode->i_state & I_DIRTY_TIME)
>> -			nr_dirty_time++;
>> +			stats->nr_dirty_time++;
>>  	spin_unlock(&wb->list_lock);
>>  
>> +	stats->nr_writeback += wb_stat(wb, WB_WRITEBACK);
>> +	stats->nr_reclaimable += wb_stat(wb, WB_RECLAIMABLE);
>> +	stats->nr_dirtied += wb_stat(wb, WB_DIRTIED);
>> +	stats->nr_written += wb_stat(wb, WB_WRITTEN);
>> +	stats->wb_thresh += wb_calc_thresh(wb, stats->dirty_thresh);
>> +}
>> +
>> +#ifdef CONFIG_CGROUP_WRITEBACK
>> +static void bdi_collect_stats(struct backing_dev_info *bdi,
>> +			      struct wb_stats *stats)
>> +{
>> +	struct bdi_writeback *wb;
>> +
>> +	/* protect wb from release */
>> +	mutex_lock(&bdi->cgwb_release_mutex);
>> +	list_for_each_entry(wb, &bdi->wb_list, bdi_node)
>> +		collect_wb_stats(stats, wb);
>> +	mutex_unlock(&bdi->cgwb_release_mutex);
>> +}
> 
> So AFAICT this function can race against
>   bdi_unregister() -> wb_shutdown(&bdi->wb)
> 
> because that doesn't take the cgwb_release_mutex. So we either need the RCU
> protection as Brian suggested or cgwb_lock or something. But given
> collect_wb_stats() can take a significant amount of time (traversing all
> the lists etc.) I think we'll need something more clever.
Sorry for missing this. I only pay attention to wb in cgroup as there is no
much change to how we use bdi->wb.
It seems that there was always a race between orginal bdi_debug_stats_show and
release of bdi as following
cat /.../stats
bdi_debug_stats_show
			bdi_unregister
			bdi_put
			  release_bdi
			    kfree(bdi)
  use after free

I will fix this in next version. Thanks!

> 
> 								Honza
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show
  2024-03-22  7:51     ` Kemeng Shi
@ 2024-03-22 11:58       ` Brian Foster
  2024-03-26 13:16         ` Kemeng Shi
  0 siblings, 1 reply; 38+ messages in thread
From: Brian Foster @ 2024-03-22 11:58 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: Jan Kara, akpm, tj, linux-mm, linux-fsdevel, linux-kernel, willy,
	dsterba, mjguzik, dhowells, peterz

On Fri, Mar 22, 2024 at 03:51:27PM +0800, Kemeng Shi wrote:
> 
> 
> on 3/22/2024 2:06 AM, Jan Kara wrote:
> > On Wed 20-03-24 19:02:17, Kemeng Shi wrote:
> >> /sys/kernel/debug/bdi/xxx/stats is supposed to show writeback information
> >> of whole bdi, but only writeback information of bdi in root cgroup is
> >> collected. So writeback information in non-root cgroup are missing now.
> >> To be more specific, considering following case:
> >>
> >> /* create writeback cgroup */
> >> cd /sys/fs/cgroup
> >> echo "+memory +io" > cgroup.subtree_control
> >> mkdir group1
> >> cd group1
> >> echo $$ > cgroup.procs
> >> /* do writeback in cgroup */
> >> fio -name test -filename=/dev/vdb ...
> >> /* get writeback info of bdi */
> >> cat /sys/kernel/debug/bdi/xxx/stats
> >> The cat result unexpectedly implies that there is no writeback on target
> >> bdi.
> >>
> >> Fix this by collecting stats of all wb in bdi instead of only wb in
> >> root cgroup.
> >>
> >> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> > 
> > Looks mostly good, one comment below:
> > 
> >> ---
> >>  mm/backing-dev.c | 93 ++++++++++++++++++++++++++++++++++++------------
> >>  1 file changed, 70 insertions(+), 23 deletions(-)
> >>
> >> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> >> index 5f2be8c8df11..788702b6c5dd 100644
> >> --- a/mm/backing-dev.c
> >> +++ b/mm/backing-dev.c
> >> @@ -39,6 +39,19 @@ struct workqueue_struct *bdi_wq;
> >>  #include <linux/debugfs.h>
> >>  #include <linux/seq_file.h>
> >>  
> >> +struct wb_stats {
> >> +	unsigned long nr_dirty;
> >> +	unsigned long nr_io;
> >> +	unsigned long nr_more_io;
> >> +	unsigned long nr_dirty_time;
> >> +	unsigned long nr_writeback;
> >> +	unsigned long nr_reclaimable;
> >> +	unsigned long nr_dirtied;
> >> +	unsigned long nr_written;
> >> +	unsigned long dirty_thresh;
> >> +	unsigned long wb_thresh;
> >> +};
> >> +
> >>  static struct dentry *bdi_debug_root;
> >>  
> >>  static void bdi_debug_init(void)
> >> @@ -46,31 +59,65 @@ static void bdi_debug_init(void)
> >>  	bdi_debug_root = debugfs_create_dir("bdi", NULL);
> >>  }
> >>  
> >> -static int bdi_debug_stats_show(struct seq_file *m, void *v)
> >> +static void collect_wb_stats(struct wb_stats *stats,
> >> +			     struct bdi_writeback *wb)
> >>  {
> >> -	struct backing_dev_info *bdi = m->private;
> >> -	struct bdi_writeback *wb = &bdi->wb;
> >> -	unsigned long background_thresh;
> >> -	unsigned long dirty_thresh;
> >> -	unsigned long wb_thresh;
> >> -	unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time;
> >>  	struct inode *inode;
> >>  
> >> -	nr_dirty = nr_io = nr_more_io = nr_dirty_time = 0;
> >>  	spin_lock(&wb->list_lock);
> >>  	list_for_each_entry(inode, &wb->b_dirty, i_io_list)
> >> -		nr_dirty++;
> >> +		stats->nr_dirty++;
> >>  	list_for_each_entry(inode, &wb->b_io, i_io_list)
> >> -		nr_io++;
> >> +		stats->nr_io++;
> >>  	list_for_each_entry(inode, &wb->b_more_io, i_io_list)
> >> -		nr_more_io++;
> >> +		stats->nr_more_io++;
> >>  	list_for_each_entry(inode, &wb->b_dirty_time, i_io_list)
> >>  		if (inode->i_state & I_DIRTY_TIME)
> >> -			nr_dirty_time++;
> >> +			stats->nr_dirty_time++;
> >>  	spin_unlock(&wb->list_lock);
> >>  
> >> +	stats->nr_writeback += wb_stat(wb, WB_WRITEBACK);
> >> +	stats->nr_reclaimable += wb_stat(wb, WB_RECLAIMABLE);
> >> +	stats->nr_dirtied += wb_stat(wb, WB_DIRTIED);
> >> +	stats->nr_written += wb_stat(wb, WB_WRITTEN);
> >> +	stats->wb_thresh += wb_calc_thresh(wb, stats->dirty_thresh);
> >> +}
> >> +
> >> +#ifdef CONFIG_CGROUP_WRITEBACK
> >> +static void bdi_collect_stats(struct backing_dev_info *bdi,
> >> +			      struct wb_stats *stats)
> >> +{
> >> +	struct bdi_writeback *wb;
> >> +
> >> +	/* protect wb from release */
> >> +	mutex_lock(&bdi->cgwb_release_mutex);
> >> +	list_for_each_entry(wb, &bdi->wb_list, bdi_node)
> >> +		collect_wb_stats(stats, wb);
> >> +	mutex_unlock(&bdi->cgwb_release_mutex);
> >> +}
> > 
> > So AFAICT this function can race against
> >   bdi_unregister() -> wb_shutdown(&bdi->wb)
> > 
> > because that doesn't take the cgwb_release_mutex. So we either need the RCU
> > protection as Brian suggested or cgwb_lock or something. But given
> > collect_wb_stats() can take a significant amount of time (traversing all
> > the lists etc.) I think we'll need something more clever.
> Sorry for missing this. I only pay attention to wb in cgroup as there is no
> much change to how we use bdi->wb.
> It seems that there was always a race between orginal bdi_debug_stats_show and
> release of bdi as following
> cat /.../stats
> bdi_debug_stats_show
> 			bdi_unregister
> 			bdi_put
> 			  release_bdi
> 			    kfree(bdi)
>   use after free
> 
> I will fix this in next version. Thanks!
> 

BTW, I thought this was kind of the point of the tryget in the writeback
path. I.e., the higher level loop runs under rcu_read_lock(), but in the
case it needs to cycle the rcu lock it acquires a reference to the wb,
and then can use that wb to continue the loop once the rcu lock is
reacquired. IIUC, this works because the rcu list removal keeps the list
->next pointer sane and then the ref keeps the wb memory from being
freed. A tryget of any wb's that have been shutdown fails because the
percpu ref counter has been killed.

So I _think_ this means that for the stat collection use case, you could
protect the overall walk with rcu as Jan alludes above, but then maybe
use a combination of need_resched()/cond_resched_rcu() and wb_tryget()
to introduce resched points and avoid holding lock(s) for too long.

Personally, I wonder if since this is mainly debug code we could just
get away with doing the simple thing of trying for a ref on each wb
unconditionally rather than only in a need_resched() case, but maybe
there are reasons not to do that... hm?

Brian

> > 
> > 								Honza
> > 
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/6] writeback: remove unneeded GDTC_INIT_NO_WB
  2024-03-21  7:12     ` Kemeng Shi
@ 2024-03-25 20:26       ` Tejun Heo
  2024-03-26 13:17         ` Kemeng Shi
  2024-03-27  9:33       ` Jan Kara
  1 sibling, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2024-03-25 20:26 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: akpm, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	jack, dsterba, mjguzik, dhowells, peterz

On Thu, Mar 21, 2024 at 03:12:21PM +0800, Kemeng Shi wrote:
> 
> 
> on 3/20/2024 11:15 PM, Tejun Heo wrote:
> > Hello,
> > 
> > On Wed, Mar 20, 2024 at 07:02:22PM +0800, Kemeng Shi wrote:
> >> We never use gdtc->dom set with GDTC_INIT_NO_WB, just remove unneeded
> >> GDTC_INIT_NO_WB
> >>
> >> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> > ...
> >>  void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> >>  {
> >> -	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> >> +	struct dirty_throttle_control gdtc = { };
> > 
> > Even if it's currently not referenced, wouldn't it still be better to always
> > guarantee that a dtc's dom is always initialized? I'm not sure what we get
> > by removing this.
> As we explicitly use GDTC_INIT_NO_WB to set global_wb_domain before
> calculating dirty limit with domain_dirty_limits, I intuitively think the dirty
> limit calculation in domain_dirty_limits is related to global_wb_domain when
> CONFIG_WRITEBACK_CGROUP is enabled while the truth is not. So this is a little
> confusing to me.
> Would it be acceptable to you that we keep useing GDTC_INIT_NO_WB but
> define GDTC_INIT_NO_WB to null fow now and redefine GDTC_INIT_NO_WB when some
> member of gdtc is really needed.
> Of couse I'm not insistent on this. Would like to hear you suggestion. Thanks!

Ah, I see. In that case, the proposed change of removing GDTC_INIT_NO_WB
looks good to me.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/6] writeback: support retrieving per group debug writeback stats of bdi
  2024-03-20 11:02 ` [PATCH 2/6] writeback: support retrieving per group debug writeback stats of bdi Kemeng Shi
  2024-03-20 15:01   ` Tejun Heo
@ 2024-03-26 12:24   ` Jan Kara
  2024-03-26 13:26     ` Kemeng Shi
  1 sibling, 1 reply; 38+ messages in thread
From: Jan Kara @ 2024-03-26 12:24 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: akpm, tj, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	jack, dsterba, mjguzik, dhowells, peterz

On Wed 20-03-24 19:02:18, Kemeng Shi wrote:
> Add /sys/kernel/debug/bdi/xxx/wb_stats to show per group writeback stats
> of bdi.
> 
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>

...

> +unsigned long wb_calc_cg_thresh(struct bdi_writeback *wb)
> +{
> +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> +	struct dirty_throttle_control mdtc = { MDTC_INIT(wb, &gdtc) };
> +	unsigned long filepages, headroom, writeback;
> +
> +	gdtc.avail = global_dirtyable_memory();
> +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
> +		     global_node_page_state(NR_WRITEBACK);
> +
> +	mem_cgroup_wb_stats(wb, &filepages, &headroom,
> +			    &mdtc.dirty, &writeback);
> +	mdtc.dirty += writeback;
> +	mdtc_calc_avail(&mdtc, filepages, headroom);
> +	domain_dirty_limits(&mdtc);
> +
> +	return __wb_calc_thresh(&mdtc, mdtc.thresh);
> +}

I kind of wish we didn't replicate this logic from balance_dirty_pages()
and wb_over_bg_thresh() into yet another place. But the refactoring would
be kind of difficult. So OK.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/6] writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages
  2024-03-20 11:02 ` [PATCH 5/6] writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages Kemeng Shi
@ 2024-03-26 12:27   ` Jan Kara
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Kara @ 2024-03-26 12:27 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: akpm, tj, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	jack, dsterba, mjguzik, dhowells, peterz

On Wed 20-03-24 19:02:21, Kemeng Shi wrote:
> Commit 8d92890bd6b85 ("mm/writeback: discard NR_UNSTABLE_NFS, use
> NR_WRITEBACK instead") removed NR_UNSTABLE_NFS and nr_reclaimable
> only contains dirty page now.
> Rename nr_reclaimable to nr_dirty properly.
> 
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/page-writeback.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index ba1b6b5ae5d6..481b6bf34c21 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1695,7 +1695,7 @@ static int balance_dirty_pages(struct bdi_writeback *wb,
>  	struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ?
>  						     &mdtc_stor : NULL;
>  	struct dirty_throttle_control *sdtc;
> -	unsigned long nr_reclaimable;	/* = file_dirty */
> +	unsigned long nr_dirty;
>  	long period;
>  	long pause;
>  	long max_pause;
> @@ -1716,9 +1716,9 @@ static int balance_dirty_pages(struct bdi_writeback *wb,
>  		unsigned long m_thresh = 0;
>  		unsigned long m_bg_thresh = 0;
>  
> -		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY);
> +		nr_dirty = global_node_page_state(NR_FILE_DIRTY);
>  		gdtc->avail = global_dirtyable_memory();
> -		gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);
> +		gdtc->dirty = nr_dirty + global_node_page_state(NR_WRITEBACK);
>  
>  		domain_dirty_limits(gdtc);
>  
> @@ -1769,7 +1769,7 @@ static int balance_dirty_pages(struct bdi_writeback *wb,
>  		 * In normal mode, we start background writeout at the lower
>  		 * background_thresh, to keep the amount of dirty memory low.
>  		 */
> -		if (!laptop_mode && nr_reclaimable > gdtc->bg_thresh &&
> +		if (!laptop_mode && nr_dirty > gdtc->bg_thresh &&
>  		    !writeback_in_progress(wb))
>  			wb_start_background_writeback(wb);
>  
> -- 
> 2.30.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/6] writeback: remove unneeded GDTC_INIT_NO_WB
  2024-03-20 11:02 ` [PATCH 6/6] writeback: remove unneeded GDTC_INIT_NO_WB Kemeng Shi
  2024-03-20 15:15   ` Tejun Heo
@ 2024-03-26 12:35   ` Jan Kara
  2024-03-26 13:30     ` Kemeng Shi
  1 sibling, 1 reply; 38+ messages in thread
From: Jan Kara @ 2024-03-26 12:35 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: akpm, tj, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	jack, dsterba, mjguzik, dhowells, peterz

On Wed 20-03-24 19:02:22, Kemeng Shi wrote:
> We never use gdtc->dom set with GDTC_INIT_NO_WB, just remove unneeded
> GDTC_INIT_NO_WB
> 
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>

Please no, this leaves a trap for the future. If anything, I'd teach
GDTC_INIT() that 'wb' can be NULL and replace GDTC_INIT_NO_WB with
GDTC_INIT(NULL).

								Honza

> ---
>  mm/page-writeback.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 481b6bf34c21..09b2b0754cc5 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -154,8 +154,6 @@ struct dirty_throttle_control {
>  				.dom = &global_wb_domain,		\
>  				.wb_completions = &(__wb)->completions
>  
> -#define GDTC_INIT_NO_WB		.dom = &global_wb_domain
> -
>  #define MDTC_INIT(__wb, __gdtc)	.wb = (__wb),				\
>  				.dom = mem_cgroup_wb_domain(__wb),	\
>  				.wb_completions = &(__wb)->memcg_completions, \
> @@ -210,7 +208,6 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
>  
>  #define GDTC_INIT(__wb)		.wb = (__wb),                           \
>  				.wb_completions = &(__wb)->completions
> -#define GDTC_INIT_NO_WB
>  #define MDTC_INIT(__wb, __gdtc)
>  
>  static bool mdtc_valid(struct dirty_throttle_control *dtc)
> @@ -438,7 +435,7 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
>   */
>  void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
>  {
> -	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> +	struct dirty_throttle_control gdtc = { };
>  
>  	gdtc.avail = global_dirtyable_memory();
>  	domain_dirty_limits(&gdtc);
> @@ -895,7 +892,7 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
>  
>  unsigned long wb_calc_cg_thresh(struct bdi_writeback *wb)
>  {
> -	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> +	struct dirty_throttle_control gdtc = { };
>  	struct dirty_throttle_control mdtc = { MDTC_INIT(wb, &gdtc) };
>  	unsigned long filepages, headroom, writeback;
>  
> -- 
> 2.30.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show
  2024-03-22 11:58       ` Brian Foster
@ 2024-03-26 13:16         ` Kemeng Shi
  0 siblings, 0 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-03-26 13:16 UTC (permalink / raw)
  To: Brian Foster
  Cc: Jan Kara, akpm, tj, linux-mm, linux-fsdevel, linux-kernel, willy,
	dsterba, mjguzik, dhowells, peterz



on 3/22/2024 7:58 PM, Brian Foster wrote:
> On Fri, Mar 22, 2024 at 03:51:27PM +0800, Kemeng Shi wrote:
>>
>>
>> on 3/22/2024 2:06 AM, Jan Kara wrote:
>>> On Wed 20-03-24 19:02:17, Kemeng Shi wrote:
>>>> /sys/kernel/debug/bdi/xxx/stats is supposed to show writeback information
>>>> of whole bdi, but only writeback information of bdi in root cgroup is
>>>> collected. So writeback information in non-root cgroup are missing now.
>>>> To be more specific, considering following case:
>>>>
>>>> /* create writeback cgroup */
>>>> cd /sys/fs/cgroup
>>>> echo "+memory +io" > cgroup.subtree_control
>>>> mkdir group1
>>>> cd group1
>>>> echo $$ > cgroup.procs
>>>> /* do writeback in cgroup */
>>>> fio -name test -filename=/dev/vdb ...
>>>> /* get writeback info of bdi */
>>>> cat /sys/kernel/debug/bdi/xxx/stats
>>>> The cat result unexpectedly implies that there is no writeback on target
>>>> bdi.
>>>>
>>>> Fix this by collecting stats of all wb in bdi instead of only wb in
>>>> root cgroup.
>>>>
>>>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>>>
>>> Looks mostly good, one comment below:
>>>
>>>> ---
>>>>  mm/backing-dev.c | 93 ++++++++++++++++++++++++++++++++++++------------
>>>>  1 file changed, 70 insertions(+), 23 deletions(-)
>>>>
>>>> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>>>> index 5f2be8c8df11..788702b6c5dd 100644
>>>> --- a/mm/backing-dev.c
>>>> +++ b/mm/backing-dev.c
>>>> @@ -39,6 +39,19 @@ struct workqueue_struct *bdi_wq;
>>>>  #include <linux/debugfs.h>
>>>>  #include <linux/seq_file.h>
>>>>  
>>>> +struct wb_stats {
>>>> +	unsigned long nr_dirty;
>>>> +	unsigned long nr_io;
>>>> +	unsigned long nr_more_io;
>>>> +	unsigned long nr_dirty_time;
>>>> +	unsigned long nr_writeback;
>>>> +	unsigned long nr_reclaimable;
>>>> +	unsigned long nr_dirtied;
>>>> +	unsigned long nr_written;
>>>> +	unsigned long dirty_thresh;
>>>> +	unsigned long wb_thresh;
>>>> +};
>>>> +
>>>>  static struct dentry *bdi_debug_root;
>>>>  
>>>>  static void bdi_debug_init(void)
>>>> @@ -46,31 +59,65 @@ static void bdi_debug_init(void)
>>>>  	bdi_debug_root = debugfs_create_dir("bdi", NULL);
>>>>  }
>>>>  
>>>> -static int bdi_debug_stats_show(struct seq_file *m, void *v)
>>>> +static void collect_wb_stats(struct wb_stats *stats,
>>>> +			     struct bdi_writeback *wb)
>>>>  {
>>>> -	struct backing_dev_info *bdi = m->private;
>>>> -	struct bdi_writeback *wb = &bdi->wb;
>>>> -	unsigned long background_thresh;
>>>> -	unsigned long dirty_thresh;
>>>> -	unsigned long wb_thresh;
>>>> -	unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time;
>>>>  	struct inode *inode;
>>>>  
>>>> -	nr_dirty = nr_io = nr_more_io = nr_dirty_time = 0;
>>>>  	spin_lock(&wb->list_lock);
>>>>  	list_for_each_entry(inode, &wb->b_dirty, i_io_list)
>>>> -		nr_dirty++;
>>>> +		stats->nr_dirty++;
>>>>  	list_for_each_entry(inode, &wb->b_io, i_io_list)
>>>> -		nr_io++;
>>>> +		stats->nr_io++;
>>>>  	list_for_each_entry(inode, &wb->b_more_io, i_io_list)
>>>> -		nr_more_io++;
>>>> +		stats->nr_more_io++;
>>>>  	list_for_each_entry(inode, &wb->b_dirty_time, i_io_list)
>>>>  		if (inode->i_state & I_DIRTY_TIME)
>>>> -			nr_dirty_time++;
>>>> +			stats->nr_dirty_time++;
>>>>  	spin_unlock(&wb->list_lock);
>>>>  
>>>> +	stats->nr_writeback += wb_stat(wb, WB_WRITEBACK);
>>>> +	stats->nr_reclaimable += wb_stat(wb, WB_RECLAIMABLE);
>>>> +	stats->nr_dirtied += wb_stat(wb, WB_DIRTIED);
>>>> +	stats->nr_written += wb_stat(wb, WB_WRITTEN);
>>>> +	stats->wb_thresh += wb_calc_thresh(wb, stats->dirty_thresh);
>>>> +}
>>>> +
>>>> +#ifdef CONFIG_CGROUP_WRITEBACK
>>>> +static void bdi_collect_stats(struct backing_dev_info *bdi,
>>>> +			      struct wb_stats *stats)
>>>> +{
>>>> +	struct bdi_writeback *wb;
>>>> +
>>>> +	/* protect wb from release */
>>>> +	mutex_lock(&bdi->cgwb_release_mutex);
>>>> +	list_for_each_entry(wb, &bdi->wb_list, bdi_node)
>>>> +		collect_wb_stats(stats, wb);
>>>> +	mutex_unlock(&bdi->cgwb_release_mutex);
>>>> +}
>>>
>>> So AFAICT this function can race against
>>>   bdi_unregister() -> wb_shutdown(&bdi->wb)
>>>
>>> because that doesn't take the cgwb_release_mutex. So we either need the RCU
>>> protection as Brian suggested or cgwb_lock or something. But given
>>> collect_wb_stats() can take a significant amount of time (traversing all
>>> the lists etc.) I think we'll need something more clever.
>> Sorry for missing this. I only pay attention to wb in cgroup as there is no
>> much change to how we use bdi->wb.
>> It seems that there was always a race between orginal bdi_debug_stats_show and
>> release of bdi as following
>> cat /.../stats
>> bdi_debug_stats_show
>> 			bdi_unregister
>> 			bdi_put
>> 			  release_bdi
>> 			    kfree(bdi)
>>   use after free
>>
>> I will fix this in next version. Thanks!
>>
> 
Hi Brian
> BTW, I thought this was kind of the point of the tryget in the writeback
> path. I.e., the higher level loop runs under rcu_read_lock(), but in the
> case it needs to cycle the rcu lock it acquires a reference to the wb,
> and then can use that wb to continue the loop once the rcu lock is
> reacquired. IIUC, this works because the rcu list removal keeps the list
> ->next pointer sane and then the ref keeps the wb memory from being
> freed. A tryget of any wb's that have been shutdown fails because the
> percpu ref counter has been killedFor bdi->wb, tryget seems not helpful to protect race as wb_tryget does
nothing for bdi->wb. For wb in cgroup, this works fine.
> 
> So I _think_ this means that for the stat collection use case, you could
> protect the overall walk with rcu as Jan alludes above, but then maybe
> use a combination of need_resched()/cond_resched_rcu() and wb_tryget()
> to introduce resched points and avoid holding lock(s) for too long.
Sure, I will protect race with rcu in next version!
> 
> Personally, I wonder if since this is mainly debug code we could just
> get away with doing the simple thing of trying for a ref on each wb
> unconditionally rather than only in a need_resched() case, but maybe
> there are reasons not to do that... hm?
Agreed, I also prefer simple debug code with no need_resched. Will do
it unless someone is against this.

Thansk a lot for the helpful information!
Kemeng
> 
> Brian
> 
>>>
>>> 								Honza
>>>
>>
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/6] writeback: remove unneeded GDTC_INIT_NO_WB
  2024-03-25 20:26       ` Tejun Heo
@ 2024-03-26 13:17         ` Kemeng Shi
  0 siblings, 0 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-03-26 13:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: akpm, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	jack, dsterba, mjguzik, dhowells, peterz



on 3/26/2024 4:26 AM, Tejun Heo wrote:
> On Thu, Mar 21, 2024 at 03:12:21PM +0800, Kemeng Shi wrote:
>>
>>
>> on 3/20/2024 11:15 PM, Tejun Heo wrote:
>>> Hello,
>>>
>>> On Wed, Mar 20, 2024 at 07:02:22PM +0800, Kemeng Shi wrote:
>>>> We never use gdtc->dom set with GDTC_INIT_NO_WB, just remove unneeded
>>>> GDTC_INIT_NO_WB
>>>>
>>>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>>> ...
>>>>  void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
>>>>  {
>>>> -	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
>>>> +	struct dirty_throttle_control gdtc = { };
>>>
>>> Even if it's currently not referenced, wouldn't it still be better to always
>>> guarantee that a dtc's dom is always initialized? I'm not sure what we get
>>> by removing this.
>> As we explicitly use GDTC_INIT_NO_WB to set global_wb_domain before
>> calculating dirty limit with domain_dirty_limits, I intuitively think the dirty
>> limit calculation in domain_dirty_limits is related to global_wb_domain when
>> CONFIG_WRITEBACK_CGROUP is enabled while the truth is not. So this is a little
>> confusing to me.
>> Would it be acceptable to you that we keep useing GDTC_INIT_NO_WB but
>> define GDTC_INIT_NO_WB to null fow now and redefine GDTC_INIT_NO_WB when some
>> member of gdtc is really needed.
>> Of couse I'm not insistent on this. Would like to hear you suggestion. Thanks!
> 
> Ah, I see. In that case, the proposed change of removing GDTC_INIT_NO_WB
> looks good to me.
Sure, will do it in next version. Thanks!
> 
> Thanks.
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/6] writeback: support retrieving per group debug writeback stats of bdi
  2024-03-26 12:24   ` Jan Kara
@ 2024-03-26 13:26     ` Kemeng Shi
  0 siblings, 0 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-03-26 13:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: akpm, tj, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	dsterba, mjguzik, dhowells, peterz



on 3/26/2024 8:24 PM, Jan Kara wrote:
> On Wed 20-03-24 19:02:18, Kemeng Shi wrote:
>> Add /sys/kernel/debug/bdi/xxx/wb_stats to show per group writeback stats
>> of bdi.
>>
>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> 
> ...
> 
>> +unsigned long wb_calc_cg_thresh(struct bdi_writeback *wb)
>> +{
>> +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
>> +	struct dirty_throttle_control mdtc = { MDTC_INIT(wb, &gdtc) };
>> +	unsigned long filepages, headroom, writeback;
>> +
>> +	gdtc.avail = global_dirtyable_memory();
>> +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
>> +		     global_node_page_state(NR_WRITEBACK);
>> +
>> +	mem_cgroup_wb_stats(wb, &filepages, &headroom,
>> +			    &mdtc.dirty, &writeback);
>> +	mdtc.dirty += writeback;
>> +	mdtc_calc_avail(&mdtc, filepages, headroom);
>> +	domain_dirty_limits(&mdtc);
>> +
>> +	return __wb_calc_thresh(&mdtc, mdtc.thresh);
>> +}
> 
> I kind of wish we didn't replicate this logic from balance_dirty_pages()
> and wb_over_bg_thresh() into yet another place. But the refactoring would
> be kind of difficult. So OK.
Thanks for review the code.
I have considered about this. It's difficult and will introduce a lot
change to non-debug code which make this series uneasy to review.
I will submit another patch for refactoring if I could find a way to
clean the code.

Kemeng
> 
> 								Honza
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/6] writeback: remove unneeded GDTC_INIT_NO_WB
  2024-03-26 12:35   ` Jan Kara
@ 2024-03-26 13:30     ` Kemeng Shi
  0 siblings, 0 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-03-26 13:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: akpm, tj, linux-mm, linux-fsdevel, linux-kernel, willy, bfoster,
	dsterba, mjguzik, dhowells, peterz



on 3/26/2024 8:35 PM, Jan Kara wrote:
> On Wed 20-03-24 19:02:22, Kemeng Shi wrote:
>> We never use gdtc->dom set with GDTC_INIT_NO_WB, just remove unneeded
>> GDTC_INIT_NO_WB
>>
>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> 
> Please no, this leaves a trap for the future. If anything, I'd teach
> GDTC_INIT() that 'wb' can be NULL and replace GDTC_INIT_NO_WB with
> GDTC_INIT(NULL).
Would it be acceptable to define GDTC_INIT_NO_WB to null for now as
discussed in [1].

[1] https://lore.kernel.org/lkml/becdb16b-a318-ec05-61d2-d190541ae997@huaweicloud.com/

Thanks,
Kemeng
> 
> 								Honza
> 
>> ---
>>  mm/page-writeback.c | 7 ++-----
>>  1 file changed, 2 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 481b6bf34c21..09b2b0754cc5 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -154,8 +154,6 @@ struct dirty_throttle_control {
>>  				.dom = &global_wb_domain,		\
>>  				.wb_completions = &(__wb)->completions
>>  
>> -#define GDTC_INIT_NO_WB		.dom = &global_wb_domain
>> -
>>  #define MDTC_INIT(__wb, __gdtc)	.wb = (__wb),				\
>>  				.dom = mem_cgroup_wb_domain(__wb),	\
>>  				.wb_completions = &(__wb)->memcg_completions, \
>> @@ -210,7 +208,6 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
>>  
>>  #define GDTC_INIT(__wb)		.wb = (__wb),                           \
>>  				.wb_completions = &(__wb)->completions
>> -#define GDTC_INIT_NO_WB
>>  #define MDTC_INIT(__wb, __gdtc)
>>  
>>  static bool mdtc_valid(struct dirty_throttle_control *dtc)
>> @@ -438,7 +435,7 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
>>   */
>>  void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
>>  {
>> -	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
>> +	struct dirty_throttle_control gdtc = { };
>>  
>>  	gdtc.avail = global_dirtyable_memory();
>>  	domain_dirty_limits(&gdtc);
>> @@ -895,7 +892,7 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
>>  
>>  unsigned long wb_calc_cg_thresh(struct bdi_writeback *wb)
>>  {
>> -	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
>> +	struct dirty_throttle_control gdtc = { };
>>  	struct dirty_throttle_control mdtc = { MDTC_INIT(wb, &gdtc) };
>>  	unsigned long filepages, headroom, writeback;
>>  
>> -- 
>> 2.30.0
>>



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/6] writeback: remove unneeded GDTC_INIT_NO_WB
  2024-03-21  7:12     ` Kemeng Shi
  2024-03-25 20:26       ` Tejun Heo
@ 2024-03-27  9:33       ` Jan Kara
  2024-03-28  1:49         ` Kemeng Shi
  1 sibling, 1 reply; 38+ messages in thread
From: Jan Kara @ 2024-03-27  9:33 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: Tejun Heo, akpm, linux-mm, linux-fsdevel, linux-kernel, willy,
	bfoster, jack, dsterba, mjguzik, dhowells, peterz

On Thu 21-03-24 15:12:21, Kemeng Shi wrote:
> 
> 
> on 3/20/2024 11:15 PM, Tejun Heo wrote:
> > Hello,
> > 
> > On Wed, Mar 20, 2024 at 07:02:22PM +0800, Kemeng Shi wrote:
> >> We never use gdtc->dom set with GDTC_INIT_NO_WB, just remove unneeded
> >> GDTC_INIT_NO_WB
> >>
> >> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> > ...
> >>  void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> >>  {
> >> -	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> >> +	struct dirty_throttle_control gdtc = { };
> > 
> > Even if it's currently not referenced, wouldn't it still be better to always
> > guarantee that a dtc's dom is always initialized? I'm not sure what we get
> > by removing this.
> As we explicitly use GDTC_INIT_NO_WB to set global_wb_domain before
> calculating dirty limit with domain_dirty_limits, I intuitively think the
> dirty limit calculation in domain_dirty_limits is related to
> global_wb_domain when CONFIG_WRITEBACK_CGROUP is enabled while the truth
> is not. So this is a little confusing to me.

I'm not sure I understand your confusion. domain_dirty_limits() calculates
the dirty limit (and background dirty limit) for the dirty_throttle_control
passed in. If you pass dtc initialized with GDTC_INIT[_NO_WB], it will
compute global dirty limits. If the dtc passed in is initialized with
MDTC_INIT() it will compute cgroup specific dirty limits.

Now because domain_dirty_limits() does not scale the limits based on each
device throughput - that is done only later in __wb_calc_thresh() to avoid
relatively expensive computations when we don't need them - and also
because the effective dirty limit (dtc->dom->dirty_limit) is not updated by
domain_dirty_limits(), domain_dirty_limits() does not need dtc->dom at all.
But that is a technical detail of implementation and I don't want this
technical detail to be relied on by even more code.

What might have confused you is that GDTC_INIT_NO_WB is defined to be empty
when CONFIG_CGROUP_WRITEBACK is disabled. But this is only because in that
case dtc_dom() function unconditionally returns global_wb_domain so we
don't bother with initializing (or even having) the 'dom' field anywhere.

Now I agree this whole code is substantially confusing and complex and it
would all deserve some serious thought how to make it more readable. But
even after thinking about it again I don't think removing GDTC_INIT_NO_WB is
the right way to go.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/6] writeback: remove unneeded GDTC_INIT_NO_WB
  2024-03-27  9:33       ` Jan Kara
@ 2024-03-28  1:49         ` Kemeng Shi
  2024-04-02 13:53           ` Jan Kara
  0 siblings, 1 reply; 38+ messages in thread
From: Kemeng Shi @ 2024-03-28  1:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tejun Heo, akpm, linux-mm, linux-fsdevel, linux-kernel, willy,
	bfoster, dsterba, mjguzik, dhowells, peterz



on 3/27/2024 5:33 PM, Jan Kara wrote:
> On Thu 21-03-24 15:12:21, Kemeng Shi wrote:
>>
>>
>> on 3/20/2024 11:15 PM, Tejun Heo wrote:
>>> Hello,
>>>
>>> On Wed, Mar 20, 2024 at 07:02:22PM +0800, Kemeng Shi wrote:
>>>> We never use gdtc->dom set with GDTC_INIT_NO_WB, just remove unneeded
>>>> GDTC_INIT_NO_WB
>>>>
>>>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>>> ...
>>>>  void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
>>>>  {
>>>> -	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
>>>> +	struct dirty_throttle_control gdtc = { };
>>>
>>> Even if it's currently not referenced, wouldn't it still be better to always
>>> guarantee that a dtc's dom is always initialized? I'm not sure what we get
>>> by removing this.
>> As we explicitly use GDTC_INIT_NO_WB to set global_wb_domain before
>> calculating dirty limit with domain_dirty_limits, I intuitively think the
>> dirty limit calculation in domain_dirty_limits is related to
>> global_wb_domain when CONFIG_WRITEBACK_CGROUP is enabled while the truth
>> is not. So this is a little confusing to me.
> 
Hi Jan,
> I'm not sure I understand your confusion. domain_dirty_limits() calculates
> the dirty limit (and background dirty limit) for the dirty_throttle_control
> passed in. If you pass dtc initialized with GDTC_INIT[_NO_WB], it will
> compute global dirty limits. If the dtc passed in is initialized with
> MDTC_INIT() it will compute cgroup specific dirty limits.
No doubt about this.
> 
> Now because domain_dirty_limits() does not scale the limits based on each
> device throughput - that is done only later in __wb_calc_thresh() to avoid> relatively expensive computations when we don't need them - and also
> because the effective dirty limit (dtc->dom->dirty_limit) is not updated by
> domain_dirty_limits(), domain_dirty_limits() does not need dtc->dom at all.
Acutally, here is the thing confusing me. For wb_calc_thresh, we always pass
dtc initialized with a wb (GDTC_INIT(wb) or MDTC_INIT(wb,..). The dtc
initialized with _NO_WB is only passed to domain_dirty_limits. However, The
dom initialized by _NO_WB for domain_dirty_limits is not needed at all.
> But that is a technical detail of implementation and I don't want this
> technical detail to be relied on by even more code.
Yes, I agree with this. So I wonder if it's acceptable to simply define
GDTC_INIT_NO_WB to empty for now instead of remove defination of
GDTC_INIT_NO_WB. When implementation of domain_dirty_limits() or any
other low level function in future using GDTC_INIT(_NO_WB) changes to
need dtc->domain, we re-define GDTC_INIT_NO_WB to proper value.
As this only looks confusing to me. I will drop this one in next version
if you still prefer to keep definatino of GDTC_INIT_NO_WB in the old way.

Thanks,
Kemeng
> 
> What might have confused you is that GDTC_INIT_NO_WB is defined to be empty
> when CONFIG_CGROUP_WRITEBACK is disabled. But this is only because in that
> case dtc_dom() function unconditionally returns global_wb_domain so we
> don't bother with initializing (or even having) the 'dom' field anywhere.
> 
> Now I agree this whole code is substantially confusing and complex and it
> would all deserve some serious thought how to make it more readable. But
> even after thinking about it again I don't think removing GDTC_INIT_NO_WB is
> the right way to go.
> 
> 								Honza
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/6] writeback: remove unneeded GDTC_INIT_NO_WB
  2024-03-28  1:49         ` Kemeng Shi
@ 2024-04-02 13:53           ` Jan Kara
  2024-04-03  8:50             ` Kemeng Shi
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2024-04-02 13:53 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: Jan Kara, Tejun Heo, akpm, linux-mm, linux-fsdevel, linux-kernel,
	willy, bfoster, dsterba, mjguzik, dhowells, peterz

On Thu 28-03-24 09:49:59, Kemeng Shi wrote:
> on 3/27/2024 5:33 PM, Jan Kara wrote:
> > On Thu 21-03-24 15:12:21, Kemeng Shi wrote:
> >>
> >>
> >> on 3/20/2024 11:15 PM, Tejun Heo wrote:
> >>> Hello,
> >>>
> >>> On Wed, Mar 20, 2024 at 07:02:22PM +0800, Kemeng Shi wrote:
> >>>> We never use gdtc->dom set with GDTC_INIT_NO_WB, just remove unneeded
> >>>> GDTC_INIT_NO_WB
> >>>>
> >>>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> >>> ...
> >>>>  void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> >>>>  {
> >>>> -	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> >>>> +	struct dirty_throttle_control gdtc = { };
> >>>
> >>> Even if it's currently not referenced, wouldn't it still be better to always
> >>> guarantee that a dtc's dom is always initialized? I'm not sure what we get
> >>> by removing this.
> >> As we explicitly use GDTC_INIT_NO_WB to set global_wb_domain before
> >> calculating dirty limit with domain_dirty_limits, I intuitively think the
> >> dirty limit calculation in domain_dirty_limits is related to
> >> global_wb_domain when CONFIG_WRITEBACK_CGROUP is enabled while the truth
> >> is not. So this is a little confusing to me.
> > 
> Hi Jan,
> > I'm not sure I understand your confusion. domain_dirty_limits() calculates
> > the dirty limit (and background dirty limit) for the dirty_throttle_control
> > passed in. If you pass dtc initialized with GDTC_INIT[_NO_WB], it will
> > compute global dirty limits. If the dtc passed in is initialized with
> > MDTC_INIT() it will compute cgroup specific dirty limits.
> No doubt about this.
> > 
> > Now because domain_dirty_limits() does not scale the limits based on each
> > device throughput - that is done only later in __wb_calc_thresh() to avoid> relatively expensive computations when we don't need them - and also
> > because the effective dirty limit (dtc->dom->dirty_limit) is not updated by
> > domain_dirty_limits(), domain_dirty_limits() does not need dtc->dom at all.
> Acutally, here is the thing confusing me. For wb_calc_thresh, we always pass
> dtc initialized with a wb (GDTC_INIT(wb) or MDTC_INIT(wb,..). The dtc
> initialized with _NO_WB is only passed to domain_dirty_limits. However, The
> dom initialized by _NO_WB for domain_dirty_limits is not needed at all.
> > But that is a technical detail of implementation and I don't want this
> > technical detail to be relied on by even more code.
> Yes, I agree with this. So I wonder if it's acceptable to simply define
> GDTC_INIT_NO_WB to empty for now instead of remove defination of
> GDTC_INIT_NO_WB. When implementation of domain_dirty_limits() or any
> other low level function in future using GDTC_INIT(_NO_WB) changes to
> need dtc->domain, we re-define GDTC_INIT_NO_WB to proper value.
> As this only looks confusing to me. I will drop this one in next version
> if you still prefer to keep definatino of GDTC_INIT_NO_WB in the old way.

Yeah, please keep the code as is for now. I agree this needs some cleanups
but what you suggest is IMHO not an improvement.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/6] writeback: remove unneeded GDTC_INIT_NO_WB
  2024-04-02 13:53           ` Jan Kara
@ 2024-04-03  8:50             ` Kemeng Shi
  0 siblings, 0 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-04-03  8:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tejun Heo, akpm, linux-mm, linux-fsdevel, linux-kernel, willy,
	bfoster, dsterba, mjguzik, dhowells, peterz



on 4/2/2024 9:53 PM, Jan Kara wrote:
> On Thu 28-03-24 09:49:59, Kemeng Shi wrote:
>> on 3/27/2024 5:33 PM, Jan Kara wrote:
>>> On Thu 21-03-24 15:12:21, Kemeng Shi wrote:
>>>>
>>>>
>>>> on 3/20/2024 11:15 PM, Tejun Heo wrote:
>>>>> Hello,
>>>>>
>>>>> On Wed, Mar 20, 2024 at 07:02:22PM +0800, Kemeng Shi wrote:
>>>>>> We never use gdtc->dom set with GDTC_INIT_NO_WB, just remove unneeded
>>>>>> GDTC_INIT_NO_WB
>>>>>>
>>>>>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>>>>> ...
>>>>>>  void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
>>>>>>  {
>>>>>> -	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
>>>>>> +	struct dirty_throttle_control gdtc = { };
>>>>>
>>>>> Even if it's currently not referenced, wouldn't it still be better to always
>>>>> guarantee that a dtc's dom is always initialized? I'm not sure what we get
>>>>> by removing this.
>>>> As we explicitly use GDTC_INIT_NO_WB to set global_wb_domain before
>>>> calculating dirty limit with domain_dirty_limits, I intuitively think the
>>>> dirty limit calculation in domain_dirty_limits is related to
>>>> global_wb_domain when CONFIG_WRITEBACK_CGROUP is enabled while the truth
>>>> is not. So this is a little confusing to me.
>>>
>> Hi Jan,
>>> I'm not sure I understand your confusion. domain_dirty_limits() calculates
>>> the dirty limit (and background dirty limit) for the dirty_throttle_control
>>> passed in. If you pass dtc initialized with GDTC_INIT[_NO_WB], it will
>>> compute global dirty limits. If the dtc passed in is initialized with
>>> MDTC_INIT() it will compute cgroup specific dirty limits.
>> No doubt about this.
>>>
>>> Now because domain_dirty_limits() does not scale the limits based on each
>>> device throughput - that is done only later in __wb_calc_thresh() to avoid> relatively expensive computations when we don't need them - and also
>>> because the effective dirty limit (dtc->dom->dirty_limit) is not updated by
>>> domain_dirty_limits(), domain_dirty_limits() does not need dtc->dom at all.
>> Acutally, here is the thing confusing me. For wb_calc_thresh, we always pass
>> dtc initialized with a wb (GDTC_INIT(wb) or MDTC_INIT(wb,..). The dtc
>> initialized with _NO_WB is only passed to domain_dirty_limits. However, The
>> dom initialized by _NO_WB for domain_dirty_limits is not needed at all.
>>> But that is a technical detail of implementation and I don't want this
>>> technical detail to be relied on by even more code.
>> Yes, I agree with this. So I wonder if it's acceptable to simply define
>> GDTC_INIT_NO_WB to empty for now instead of remove defination of
>> GDTC_INIT_NO_WB. When implementation of domain_dirty_limits() or any
>> other low level function in future using GDTC_INIT(_NO_WB) changes to
>> need dtc->domain, we re-define GDTC_INIT_NO_WB to proper value.
>> As this only looks confusing to me. I will drop this one in next version
>> if you still prefer to keep definatino of GDTC_INIT_NO_WB in the old way.
> 
> Yeah, please keep the code as is for now. I agree this needs some cleanups
> but what you suggest is IMHO not an improvement.
Sure, will drop this in next version.

Thanks,
Kemeng
> 
> 								Honza
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2024-04-03  8:50 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-20 11:02 [PATCH 0/6] Improve visibility of writeback Kemeng Shi
2024-03-20 11:02 ` [PATCH 1/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show Kemeng Shi
2024-03-20 13:21   ` Brian Foster
2024-03-21  3:44     ` Kemeng Shi
2024-03-21 12:10       ` Brian Foster
2024-03-22  7:32         ` Kemeng Shi
2024-03-21 18:06   ` Jan Kara
2024-03-22  7:51     ` Kemeng Shi
2024-03-22 11:58       ` Brian Foster
2024-03-26 13:16         ` Kemeng Shi
2024-03-20 11:02 ` [PATCH 2/6] writeback: support retrieving per group debug writeback stats of bdi Kemeng Shi
2024-03-20 15:01   ` Tejun Heo
2024-03-21  3:45     ` Kemeng Shi
2024-03-26 12:24   ` Jan Kara
2024-03-26 13:26     ` Kemeng Shi
2024-03-20 11:02 ` [PATCH 3/6] workqueue: remove unnecessary import and function in wq_monitor.py Kemeng Shi
2024-03-20 15:03   ` Tejun Heo
2024-03-21  6:08     ` Kemeng Shi
2024-03-20 11:02 ` [PATCH 4/6] writeback: add wb_monitor.py script to monitor writeback info on bdi Kemeng Shi
2024-03-20 15:12   ` Tejun Heo
2024-03-21  6:22     ` Kemeng Shi
2024-03-20 11:02 ` [PATCH 5/6] writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages Kemeng Shi
2024-03-26 12:27   ` Jan Kara
2024-03-20 11:02 ` [PATCH 6/6] writeback: remove unneeded GDTC_INIT_NO_WB Kemeng Shi
2024-03-20 15:15   ` Tejun Heo
2024-03-21  7:12     ` Kemeng Shi
2024-03-25 20:26       ` Tejun Heo
2024-03-26 13:17         ` Kemeng Shi
2024-03-27  9:33       ` Jan Kara
2024-03-28  1:49         ` Kemeng Shi
2024-04-02 13:53           ` Jan Kara
2024-04-03  8:50             ` Kemeng Shi
2024-03-26 12:35   ` Jan Kara
2024-03-26 13:30     ` Kemeng Shi
2024-03-20 15:20 ` [PATCH 0/6] Improve visibility of writeback Tejun Heo
2024-03-20 17:22 ` Jan Kara
2024-03-21  8:12   ` Kemeng Shi
2024-03-21 18:07     ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).