* [PATCH v2 0/6] Improve visibility of writeback
@ 2024-03-27 15:57 Kemeng Shi
  2024-03-27 15:57 ` [PATCH v2 1/6] writeback: protect race between bdi release and bdi_debug_stats_show Kemeng Shi
                   ` (7 more replies)
  0 siblings, 8 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-03-27 15:57 UTC (permalink / raw)
  To: akpm, willy, jack, bfoster, tj
  Cc: dsterba, mjguzik, dhowells, linux-kernel, linux-mm, linux-fsdevel

v1->v2:
-Send cleanup to wq_monitor.py separately.
-Add patch to avoid use after free of bdi.
-Rename wb_calc_cg_thresh to cgwb_calc_thresh as Tejun suggested.
-Use rcu walk to avoid use after free.
-Add debug output to each related patch.

This series tries to improve visibility of writeback. Patch 2 makes
/sys/kernel/debug/bdi/xxx/stats show writeback info of the whole bdi
instead of only writeback info in the root cgroup. Patch 3 adds a new
debug file /sys/kernel/debug/bdi/xxx/wb_stats to show per-wb writeback
info. Patch 4 adds wb_monitor.py to monitor basic writeback info
of a running system; more info could be added on demand. The remaining
patches are miscellaneous fixes and cleanups. More details can be found
in the respective patches. Thanks!
This series is on top of patchset [1].

[1] https://lore.kernel.org/lkml/20240123183332.876854-1-shikemeng@huaweicloud.com/T/#mc6455784a63d0f8aa1a2f5aff325abcdf9336b76

Following domain hierarchy is tested:
                global domain (320G)
                /                 \
        cgroup domain1(10G)     cgroup domain2(10G)
                |                 |
bdi            wb1               wb2

/* all writeback info of bdi is successfully collected */
# cat /sys/kernel/debug/bdi/252:16/stats:
BdiWriteback:              448 kB
BdiReclaimable:        1303904 kB
BdiDirtyThresh:      189914124 kB
DirtyThresh:         195337564 kB
BackgroundThresh:     32516508 kB
BdiDirtied:            3591392 kB
BdiWritten:            2287488 kB
BdiWriteBandwidth:      322248 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   2
b_dirty_time:                0
bdi_list:                    1
state:                       1

/* per wb writeback info is collected */
# cat /sys/kernel/debug/bdi/252:16/wb_stats:
WbCgIno:                    1
WbWriteback:                0 kB
WbReclaimable:              0 kB
WbDirtyThresh:              0 kB
WbDirtied:                  0 kB
WbWritten:                  0 kB
WbWriteBandwidth:      102400 kBps
b_dirty:                    0
b_io:                       0
b_more_io:                  0
b_dirty_time:               0
state:                      1
WbCgIno:                 4284
WbWriteback:              448 kB
WbReclaimable:         818944 kB
WbDirtyThresh:        3096524 kB
WbDirtied:            2266880 kB
WbWritten:            1447936 kB
WbWriteBandwidth:      214036 kBps
b_dirty:                    0
b_io:                       0
b_more_io:                  1
b_dirty_time:               0
state:                      5
WbCgIno:                 4325
WbWriteback:              224 kB
WbReclaimable:         819392 kB
WbDirtyThresh:        2920088 kB
WbDirtied:            2551808 kB
WbWritten:            1732416 kB
WbWriteBandwidth:      201832 kBps
b_dirty:                    0
b_io:                       0
b_more_io:                  1
b_dirty_time:               0
state:                      5

/* monitor writeback info */
# ./wb_monitor.py 252:16 -c
                  writeback  reclaimable   dirtied   written    avg_bw
252:16_1                  0            0         0         0    102400
252:16_4284             672       820064   9230368   8410304    685612
252:16_4325             896       819840  10491264   9671648    652348
252:16                 1568      1639904  19721632  18081952   1440360


                  writeback  reclaimable   dirtied   written    avg_bw
252:16_1                  0            0         0         0    102400
252:16_4284             672       820064   9230368   8410304    685612
252:16_4325             896       819840  10491264   9671648    652348
252:16                 1568      1639904  19721632  18081952   1440360
...

Kemeng Shi (6):
  writeback: protect race between bdi release and bdi_debug_stats_show
  writeback: collect stats of all wb of bdi in bdi_debug_stats_show
  writeback: support retrieving per group debug writeback stats of bdi
  writeback: add wb_monitor.py script to monitor writeback info on bdi
  writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages
  writeback: define GDTC_INIT_NO_WB to null

 include/linux/writeback.h     |   1 +
 mm/backing-dev.c              | 203 ++++++++++++++++++++++++++++++----
 mm/page-writeback.c           |  31 ++++--
 tools/writeback/wb_monitor.py | 172 ++++++++++++++++++++++++++++
 4 files changed, 378 insertions(+), 29 deletions(-)
 create mode 100644 tools/writeback/wb_monitor.py

-- 
2.30.0


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2 1/6] writeback: protect race between bdi release and bdi_debug_stats_show
  2024-03-27 15:57 [PATCH v2 0/6] Improve visibility of writeback Kemeng Shi
@ 2024-03-27 15:57 ` Kemeng Shi
  2024-03-28 17:53   ` Brian Foster
  2024-03-27 15:57 ` [PATCH v2 2/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show Kemeng Shi
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Kemeng Shi @ 2024-03-27 15:57 UTC (permalink / raw)
  To: akpm, willy, jack, bfoster, tj
  Cc: dsterba, mjguzik, dhowells, linux-kernel, linux-mm, linux-fsdevel

There is a race between bdi release and bdi_debug_stats_show:
/* get debug info */		/* bdi release */
bdi_debug_stats_show
  bdi = m->private;
  wb = &bdi->wb;
				bdi_unregister
				bdi_put
				  release_bdi
				    kfree(bdi)
  /* use after free */
  spin_lock(&wb->list_lock);

To fix the issue, search for the bdi on bdi_list under the RCU read lock in
bdi_debug_stats_show to ensure the bdi is not freed.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/backing-dev.c | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 5f2be8c8df11..70f02959f3bd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -46,16 +46,44 @@ static void bdi_debug_init(void)
 	bdi_debug_root = debugfs_create_dir("bdi", NULL);
 }
 
-static int bdi_debug_stats_show(struct seq_file *m, void *v)
+static struct backing_dev_info *lookup_bdi(struct seq_file *m)
 {
+	const struct file *file = m->file;
 	struct backing_dev_info *bdi = m->private;
-	struct bdi_writeback *wb = &bdi->wb;
+	struct backing_dev_info *exist;
+
+	list_for_each_entry_rcu(exist, &bdi_list, bdi_list) {
+		if (exist != bdi)
+			continue;
+
+		if (exist->debug_dir == file->f_path.dentry->d_parent)
+			return bdi;
+		else
+			return NULL;
+	}
+
+	return NULL;
+}
+
+
+static int bdi_debug_stats_show(struct seq_file *m, void *v)
+{
+	struct backing_dev_info *bdi;
+	struct bdi_writeback *wb;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long wb_thresh;
 	unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time;
 	struct inode *inode;
 
+	rcu_read_lock();
+	bdi = lookup_bdi(m);
+	if (!bdi) {
+		rcu_read_unlock();
+		return -EEXIST;
+	}
+
+	wb = &bdi->wb;
 	nr_dirty = nr_io = nr_more_io = nr_dirty_time = 0;
 	spin_lock(&wb->list_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_io_list)
@@ -101,6 +129,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   nr_dirty_time,
 		   !list_empty(&bdi->bdi_list), bdi->wb.state);
 
+	rcu_read_unlock();
 	return 0;
 }
 DEFINE_SHOW_ATTRIBUTE(bdi_debug_stats);
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 2/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show
  2024-03-27 15:57 [PATCH v2 0/6] Improve visibility of writeback Kemeng Shi
  2024-03-27 15:57 ` [PATCH v2 1/6] writeback: protect race between bdi release and bdi_debug_stats_show Kemeng Shi
@ 2024-03-27 15:57 ` Kemeng Shi
  2024-03-29 13:04   ` Brian Foster
  2024-03-27 15:57 ` [PATCH v2 3/6] writeback: support retrieving per group debug writeback stats of bdi Kemeng Shi
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Kemeng Shi @ 2024-03-27 15:57 UTC (permalink / raw)
  To: akpm, willy, jack, bfoster, tj
  Cc: dsterba, mjguzik, dhowells, linux-kernel, linux-mm, linux-fsdevel

/sys/kernel/debug/bdi/xxx/stats is supposed to show writeback information
of the whole bdi, but only writeback information of the bdi in the root
cgroup is collected, so writeback information in non-root cgroups is
missing. To be more specific, consider the following case:

/* create writeback cgroup */
cd /sys/fs/cgroup
echo "+memory +io" > cgroup.subtree_control
mkdir group1
cd group1
echo $$ > cgroup.procs
/* do writeback in cgroup */
fio -name test -filename=/dev/vdb ...
/* get writeback info of bdi */
cat /sys/kernel/debug/bdi/xxx/stats
The cat result unexpectedly implies that there is no writeback on the
target bdi.

Fix this by collecting stats of all wbs in the bdi instead of only the wb
in the root cgroup.

Following domain hierarchy is tested:
                global domain (320G)
                /                 \
        cgroup domain1(10G)     cgroup domain2(10G)
                |                 |
bdi            wb1               wb2

/* all writeback info of bdi is successfully collected */
cat stats
BdiWriteback:             2912 kB
BdiReclaimable:        1598464 kB
BdiDirtyThresh:      167479028 kB
DirtyThresh:         195038532 kB
BackgroundThresh:     32466728 kB
BdiDirtied:           19141696 kB
BdiWritten:           17543456 kB
BdiWriteBandwidth:     1136172 kBps
b_dirty:                     2
b_io:                        0
b_more_io:                   1
b_dirty_time:                0
bdi_list:                    1
state:                       1

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/backing-dev.c | 100 +++++++++++++++++++++++++++++++++--------------
 1 file changed, 71 insertions(+), 29 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 70f02959f3bd..8daf950e6855 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -39,6 +39,19 @@ struct workqueue_struct *bdi_wq;
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
 
+struct wb_stats {
+	unsigned long nr_dirty;
+	unsigned long nr_io;
+	unsigned long nr_more_io;
+	unsigned long nr_dirty_time;
+	unsigned long nr_writeback;
+	unsigned long nr_reclaimable;
+	unsigned long nr_dirtied;
+	unsigned long nr_written;
+	unsigned long dirty_thresh;
+	unsigned long wb_thresh;
+};
+
 static struct dentry *bdi_debug_root;
 
 static void bdi_debug_init(void)
@@ -65,16 +78,54 @@ static struct backing_dev_info *lookup_bdi(struct seq_file *m)
 	return NULL;
 }
 
+static void collect_wb_stats(struct wb_stats *stats,
+			     struct bdi_writeback *wb)
+{
+	struct inode *inode;
+
+	spin_lock(&wb->list_lock);
+	list_for_each_entry(inode, &wb->b_dirty, i_io_list)
+		stats->nr_dirty++;
+	list_for_each_entry(inode, &wb->b_io, i_io_list)
+		stats->nr_io++;
+	list_for_each_entry(inode, &wb->b_more_io, i_io_list)
+		stats->nr_more_io++;
+	list_for_each_entry(inode, &wb->b_dirty_time, i_io_list)
+		if (inode->i_state & I_DIRTY_TIME)
+			stats->nr_dirty_time++;
+	spin_unlock(&wb->list_lock);
+
+	stats->nr_writeback += wb_stat(wb, WB_WRITEBACK);
+	stats->nr_reclaimable += wb_stat(wb, WB_RECLAIMABLE);
+	stats->nr_dirtied += wb_stat(wb, WB_DIRTIED);
+	stats->nr_written += wb_stat(wb, WB_WRITTEN);
+	stats->wb_thresh += wb_calc_thresh(wb, stats->dirty_thresh);
+}
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+static void bdi_collect_stats(struct backing_dev_info *bdi,
+			      struct wb_stats *stats)
+{
+	struct bdi_writeback *wb;
+
+	list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
+		collect_wb_stats(stats, wb);
+}
+#else
+static void bdi_collect_stats(struct backing_dev_info *bdi,
+			      struct wb_stats *stats)
+{
+	collect_wb_stats(stats, &bdi->wb);
+}
+#endif
 
 static int bdi_debug_stats_show(struct seq_file *m, void *v)
 {
 	struct backing_dev_info *bdi;
-	struct bdi_writeback *wb;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
-	unsigned long wb_thresh;
-	unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time;
-	struct inode *inode;
+	struct wb_stats stats;
+	unsigned long tot_bw;
 
 	rcu_read_lock();
 	bdi = lookup_bdi(m);
@@ -83,22 +134,13 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		return -EEXIST;
 	}
 
-	wb = &bdi->wb;
-	nr_dirty = nr_io = nr_more_io = nr_dirty_time = 0;
-	spin_lock(&wb->list_lock);
-	list_for_each_entry(inode, &wb->b_dirty, i_io_list)
-		nr_dirty++;
-	list_for_each_entry(inode, &wb->b_io, i_io_list)
-		nr_io++;
-	list_for_each_entry(inode, &wb->b_more_io, i_io_list)
-		nr_more_io++;
-	list_for_each_entry(inode, &wb->b_dirty_time, i_io_list)
-		if (inode->i_state & I_DIRTY_TIME)
-			nr_dirty_time++;
-	spin_unlock(&wb->list_lock);
-
 	global_dirty_limits(&background_thresh, &dirty_thresh);
-	wb_thresh = wb_calc_thresh(wb, dirty_thresh);
+
+	memset(&stats, 0, sizeof(stats));
+	stats.dirty_thresh = dirty_thresh;
+	bdi_collect_stats(bdi, &stats);
+
+	tot_bw = atomic_long_read(&bdi->tot_write_bandwidth);
 
 	seq_printf(m,
 		   "BdiWriteback:       %10lu kB\n"
@@ -115,18 +157,18 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   "b_dirty_time:       %10lu\n"
 		   "bdi_list:           %10u\n"
 		   "state:              %10lx\n",
-		   (unsigned long) K(wb_stat(wb, WB_WRITEBACK)),
-		   (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)),
-		   K(wb_thresh),
+		   K(stats.nr_writeback),
+		   K(stats.nr_reclaimable),
+		   K(stats.wb_thresh),
 		   K(dirty_thresh),
 		   K(background_thresh),
-		   (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
-		   (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
-		   (unsigned long) K(wb->write_bandwidth),
-		   nr_dirty,
-		   nr_io,
-		   nr_more_io,
-		   nr_dirty_time,
+		   K(stats.nr_dirtied),
+		   K(stats.nr_written),
+		   K(tot_bw),
+		   stats.nr_dirty,
+		   stats.nr_io,
+		   stats.nr_more_io,
+		   stats.nr_dirty_time,
 		   !list_empty(&bdi->bdi_list), bdi->wb.state);
 
 	rcu_read_unlock();
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 3/6] writeback: support retrieving per group debug writeback stats of bdi
  2024-03-27 15:57 [PATCH v2 0/6] Improve visibility of writeback Kemeng Shi
  2024-03-27 15:57 ` [PATCH v2 1/6] writeback: protect race between bdi release and bdi_debug_stats_show Kemeng Shi
  2024-03-27 15:57 ` [PATCH v2 2/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show Kemeng Shi
@ 2024-03-27 15:57 ` Kemeng Shi
  2024-03-29 13:10   ` Brian Foster
  2024-03-27 15:57 ` [PATCH v2 4/6] writeback: add wb_monitor.py script to monitor writeback info on bdi Kemeng Shi
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Kemeng Shi @ 2024-03-27 15:57 UTC (permalink / raw)
  To: akpm, willy, jack, bfoster, tj
  Cc: dsterba, mjguzik, dhowells, linux-kernel, linux-mm, linux-fsdevel

Add /sys/kernel/debug/bdi/xxx/wb_stats to show per group writeback stats
of the bdi.

Following domain hierarchy is tested:
                global domain (320G)
                /                 \
        cgroup domain1(10G)     cgroup domain2(10G)
                |                 |
bdi            wb1               wb2

/* per wb writeback info of bdi is collected */
cat /sys/kernel/debug/bdi/252:16/wb_stats
WbCgIno:                    1
WbWriteback:                0 kB
WbReclaimable:              0 kB
WbDirtyThresh:              0 kB
WbDirtied:                  0 kB
WbWritten:                  0 kB
WbWriteBandwidth:      102400 kBps
b_dirty:                    0
b_io:                       0
b_more_io:                  0
b_dirty_time:               0
state:                      1
WbCgIno:                 4094
WbWriteback:            54432 kB
WbReclaimable:         766080 kB
WbDirtyThresh:        3094760 kB
WbDirtied:            1656480 kB
WbWritten:             837088 kB
WbWriteBandwidth:      132772 kBps
b_dirty:                    1
b_io:                       1
b_more_io:                  0
b_dirty_time:               0
state:                      7
WbCgIno:                 4135
WbWriteback:            15232 kB
WbReclaimable:         786688 kB
WbDirtyThresh:        2909984 kB
WbDirtied:            1482656 kB
WbWritten:             681408 kB
WbWriteBandwidth:      124848 kBps
b_dirty:                    0
b_io:                       1
b_more_io:                  0
b_dirty_time:               0
state:                      7

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 include/linux/writeback.h |  1 +
 mm/backing-dev.c          | 88 +++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c       | 19 +++++++++
 3 files changed, 108 insertions(+)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 9845cb62e40b..112d806ddbe4 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -355,6 +355,7 @@ int dirtytime_interval_handler(struct ctl_table *table, int write,
 
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
+unsigned long cgwb_calc_thresh(struct bdi_writeback *wb);
 
 void wb_update_bandwidth(struct bdi_writeback *wb);
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8daf950e6855..e3953db7d88d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -103,6 +103,91 @@ static void collect_wb_stats(struct wb_stats *stats,
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
+static void wb_stats_show(struct seq_file *m, struct bdi_writeback *wb,
+			  struct wb_stats *stats)
+{
+
+	seq_printf(m,
+		   "WbCgIno:           %10lu\n"
+		   "WbWriteback:       %10lu kB\n"
+		   "WbReclaimable:     %10lu kB\n"
+		   "WbDirtyThresh:     %10lu kB\n"
+		   "WbDirtied:         %10lu kB\n"
+		   "WbWritten:         %10lu kB\n"
+		   "WbWriteBandwidth:  %10lu kBps\n"
+		   "b_dirty:           %10lu\n"
+		   "b_io:              %10lu\n"
+		   "b_more_io:         %10lu\n"
+		   "b_dirty_time:      %10lu\n"
+		   "state:             %10lx\n",
+		   cgroup_ino(wb->memcg_css->cgroup),
+		   K(stats->nr_writeback),
+		   K(stats->nr_reclaimable),
+		   K(stats->wb_thresh),
+		   K(stats->nr_dirtied),
+		   K(stats->nr_written),
+		   K(wb->avg_write_bandwidth),
+		   stats->nr_dirty,
+		   stats->nr_io,
+		   stats->nr_more_io,
+		   stats->nr_dirty_time,
+		   wb->state);
+}
+
+static int cgwb_debug_stats_show(struct seq_file *m, void *v)
+{
+	struct backing_dev_info *bdi;
+	unsigned long background_thresh;
+	unsigned long dirty_thresh;
+	struct bdi_writeback *wb;
+	struct wb_stats stats;
+
+	rcu_read_lock();
+	bdi = lookup_bdi(m);
+	if (!bdi) {
+		rcu_read_unlock();
+		return -EEXIST;
+	}
+
+	global_dirty_limits(&background_thresh, &dirty_thresh);
+
+	list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) {
+		memset(&stats, 0, sizeof(stats));
+		stats.dirty_thresh = dirty_thresh;
+		collect_wb_stats(&stats, wb);
+
+		if (mem_cgroup_wb_domain(wb) == NULL) {
+			wb_stats_show(m, wb, &stats);
+			continue;
+		}
+
+		/*
+		 * cgwb_release will destroy wb->memcg_completions which
+		 * will be used in cgwb_calc_thresh. Use wb_tryget to prevent
+		 * cgwb_release from destroying memcg_completions under us.
+		 */
+		if (!wb_tryget(wb))
+			continue;
+
+		rcu_read_unlock();
+		/* cgwb_calc_thresh may sleep in cgroup_rstat_flush */
+		stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb));
+		wb_stats_show(m, wb, &stats);
+		rcu_read_lock();
+		wb_put(wb);
+	}
+	rcu_read_unlock();
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(cgwb_debug_stats);
+
+static void cgwb_debug_register(struct backing_dev_info *bdi)
+{
+	debugfs_create_file("wb_stats", 0444, bdi->debug_dir, bdi,
+			    &cgwb_debug_stats_fops);
+}
+
 static void bdi_collect_stats(struct backing_dev_info *bdi,
 			      struct wb_stats *stats)
 {
@@ -117,6 +202,8 @@ static void bdi_collect_stats(struct backing_dev_info *bdi,
 {
 	collect_wb_stats(stats, &bdi->wb);
 }
+
+static inline void cgwb_debug_register(struct backing_dev_info *bdi) { }
 #endif
 
 static int bdi_debug_stats_show(struct seq_file *m, void *v)
@@ -182,6 +269,7 @@ static void bdi_debug_register(struct backing_dev_info *bdi, const char *name)
 
 	debugfs_create_file("stats", 0444, bdi->debug_dir, bdi,
 			    &bdi_debug_stats_fops);
+	cgwb_debug_register(bdi);
 }
 
 static void bdi_debug_unregister(struct backing_dev_info *bdi)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0e20467367fe..3724c7525316 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -893,6 +893,25 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
 	return __wb_calc_thresh(&gdtc, thresh);
 }
 
+unsigned long cgwb_calc_thresh(struct bdi_writeback *wb)
+{
+	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
+	struct dirty_throttle_control mdtc = { MDTC_INIT(wb, &gdtc) };
+	unsigned long filepages, headroom, writeback;
+
+	gdtc.avail = global_dirtyable_memory();
+	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
+		     global_node_page_state(NR_WRITEBACK);
+
+	mem_cgroup_wb_stats(wb, &filepages, &headroom,
+			    &mdtc.dirty, &writeback);
+	mdtc.dirty += writeback;
+	mdtc_calc_avail(&mdtc, filepages, headroom);
+	domain_dirty_limits(&mdtc);
+
+	return __wb_calc_thresh(&mdtc, mdtc.thresh);
+}
+
 /*
  *                           setpoint - dirty 3
  *        f(dirty) := 1.0 + (----------------)
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 4/6] writeback: add wb_monitor.py script to monitor writeback info on bdi
  2024-03-27 15:57 [PATCH v2 0/6] Improve visibility of writeback Kemeng Shi
                   ` (2 preceding siblings ...)
  2024-03-27 15:57 ` [PATCH v2 3/6] writeback: support retrieving per group debug writeback stats of bdi Kemeng Shi
@ 2024-03-27 15:57 ` Kemeng Shi
  2024-03-27 15:57 ` [PATCH v2 5/6] writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages Kemeng Shi
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-03-27 15:57 UTC (permalink / raw)
  To: akpm, willy, jack, bfoster, tj
  Cc: dsterba, mjguzik, dhowells, linux-kernel, linux-mm, linux-fsdevel

Add a wb_monitor.py script to monitor writeback information on a backing
device, which makes it easier and more convenient to observe the writeback
behavior of a running system.

The wb_monitor.py script is written based on wq_monitor.py.

Following domain hierarchy is tested:
                global domain (320G)
                /                 \
        cgroup domain1(10G)     cgroup domain2(10G)
                |                 |
bdi            wb1               wb2

The wb_monitor.py script output is as follows:
./wb_monitor.py 252:16 -c
                  writeback  reclaimable   dirtied   written    avg_bw
252:16_1                  0            0         0         0    102400
252:16_4284             672       820064   9230368   8410304    685612
252:16_4325             896       819840  10491264   9671648    652348
252:16                 1568      1639904  19721632  18081952   1440360

                  writeback  reclaimable   dirtied   written    avg_bw
252:16_1                  0            0         0         0    102400
252:16_4284             672       820064   9230368   8410304    685612
252:16_4325             896       819840  10491264   9671648    652348
252:16                 1568      1639904  19721632  18081952   1440360
...

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Suggested-by: Tejun Heo <tj@kernel.org>
---
 tools/writeback/wb_monitor.py | 172 ++++++++++++++++++++++++++++++++++
 1 file changed, 172 insertions(+)
 create mode 100644 tools/writeback/wb_monitor.py

diff --git a/tools/writeback/wb_monitor.py b/tools/writeback/wb_monitor.py
new file mode 100644
index 000000000000..5e3591f1f9a9
--- /dev/null
+++ b/tools/writeback/wb_monitor.py
@@ -0,0 +1,172 @@
+#!/usr/bin/env drgn
+#
+# Copyright (C) 2024 Kemeng Shi <shikemeng@huaweicloud.com>
+# Copyright (C) 2024 Huawei Inc
+
+desc = """
+This is a drgn script based on wq_monitor.py to monitor writeback info on
+backing dev. For more info on drgn, visit https://github.com/osandov/drgn.
+
+  writeback(kB)     Amount of dirty pages currently being written back to
+                    disk.
+
+  reclaimable(kB)   Amount of pages currently reclaimable.
+
+  dirtied(kB)       Amount of pages that have been dirtied.
+
+  written(kB)       Amount of dirty pages that have been written back to disk.
+
+  avg_wb(kBps)      Smoothly estimated write bandwidth of writing dirty pages
+                    back to disk.
+"""
+
+import signal
+import re
+import time
+import json
+
+import drgn
+from drgn.helpers.linux.list import list_for_each_entry
+
+import argparse
+parser = argparse.ArgumentParser(description=desc,
+                                 formatter_class=argparse.RawTextHelpFormatter)
+parser.add_argument('bdi', metavar='REGEX', nargs='*',
+                    help='Target backing device name patterns (all if empty)')
+parser.add_argument('-i', '--interval', metavar='SECS', type=float, default=1,
+                    help='Monitoring interval (0 to print once and exit)')
+parser.add_argument('-j', '--json', action='store_true',
+                    help='Output in json')
+parser.add_argument('-c', '--cgroup', action='store_true',
+                    help='show writeback of bdi in cgroup')
+args = parser.parse_args()
+
+bdi_list                = prog['bdi_list']
+
+WB_RECLAIMABLE          = prog['WB_RECLAIMABLE']
+WB_WRITEBACK            = prog['WB_WRITEBACK']
+WB_DIRTIED              = prog['WB_DIRTIED']
+WB_WRITTEN              = prog['WB_WRITTEN']
+NR_WB_STAT_ITEMS        = prog['NR_WB_STAT_ITEMS']
+
+PAGE_SHIFT              = prog['PAGE_SHIFT']
+
+def K(x):
+    return x << (PAGE_SHIFT - 10)
+
+class Stats:
+    def dict(self, now):
+        return { 'timestamp'            : now,
+                 'name'                 : self.name,
+                 'writeback'            : self.stats[WB_WRITEBACK],
+                 'reclaimable'          : self.stats[WB_RECLAIMABLE],
+                 'dirtied'              : self.stats[WB_DIRTIED],
+                 'written'              : self.stats[WB_WRITTEN],
+                 'avg_wb'               : self.avg_bw, }
+
+    def table_header_str():
+        return f'{"":>16} {"writeback":>10} {"reclaimable":>12} ' \
+                f'{"dirtied":>9} {"written":>9} {"avg_bw":>9}'
+
+    def table_row_str(self):
+        out = f'{self.name[-16:]:16} ' \
+              f'{self.stats[WB_WRITEBACK]:10} ' \
+              f'{self.stats[WB_RECLAIMABLE]:12} ' \
+              f'{self.stats[WB_DIRTIED]:9} ' \
+              f'{self.stats[WB_WRITTEN]:9} ' \
+              f'{self.avg_bw:9} '
+        return out
+
+    def show_header():
+        if Stats.table_fmt:
+            print()
+            print(Stats.table_header_str())
+
+    def show_stats(self):
+        if Stats.table_fmt:
+            print(self.table_row_str())
+        else:
+            print(self.dict(Stats.now))
+
+class WbStats(Stats):
+    def __init__(self, wb):
+        bdi_name = wb.bdi.dev_name.string_().decode()
+        # avoid using bdi.wb.memcg_css, which is only defined when
+        # CONFIG_CGROUP_WRITEBACK is enabled
+        if wb == wb.bdi.wb.address_of_():
+            ino = "1"
+        else:
+            ino = str(wb.memcg_css.cgroup.kn.id.value_())
+        self.name = bdi_name + '_' + ino
+
+        self.stats = [0] * NR_WB_STAT_ITEMS
+        for i in range(NR_WB_STAT_ITEMS):
+            if wb.stat[i].count >= 0:
+                self.stats[i] = int(K(wb.stat[i].count))
+            else:
+                self.stats[i] = 0
+
+        self.avg_bw = int(K(wb.avg_write_bandwidth))
+
+class BdiStats(Stats):
+    def __init__(self, bdi):
+        self.name = bdi.dev_name.string_().decode()
+        self.stats = [0] * NR_WB_STAT_ITEMS
+        self.avg_bw = 0
+
+    def collectStats(self, wb_stats):
+        for i in range(NR_WB_STAT_ITEMS):
+            self.stats[i] += wb_stats.stats[i]
+
+        self.avg_bw += wb_stats.avg_bw
+
+exit_req = False
+
+def sigint_handler(signr, frame):
+    global exit_req
+    exit_req = True
+
+def main():
+    # handle args
+    Stats.table_fmt = not args.json
+    interval = args.interval
+    cgroup = args.cgroup
+
+    re_str = None
+    if args.bdi:
+        for r in args.bdi:
+            if re_str is None:
+                re_str = r
+            else:
+                re_str += '|' + r
+
+    filter_re = re.compile(re_str) if re_str else None
+
+    # monitoring loop
+    signal.signal(signal.SIGINT, sigint_handler)
+
+    while not exit_req:
+        Stats.now = time.time()
+
+        Stats.show_header()
+        for bdi in list_for_each_entry('struct backing_dev_info', bdi_list.address_of_(), 'bdi_list'):
+            bdi_stats = BdiStats(bdi)
+            if filter_re and not filter_re.search(bdi_stats.name):
+                continue
+
+            for wb in list_for_each_entry('struct bdi_writeback', bdi.wb_list.address_of_(), 'bdi_node'):
+                wb_stats = WbStats(wb)
+                bdi_stats.collectStats(wb_stats)
+                if cgroup:
+                    wb_stats.show_stats()
+
+            bdi_stats.show_stats()
+            if cgroup and Stats.table_fmt:
+                print()
+
+        if interval == 0:
+            break
+        time.sleep(interval)
+
+if __name__ == "__main__":
+    main()
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 5/6] writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages
  2024-03-27 15:57 [PATCH v2 0/6] Improve visibility of writeback Kemeng Shi
                   ` (3 preceding siblings ...)
  2024-03-27 15:57 ` [PATCH v2 4/6] writeback: add wb_monitor.py script to monitor writeback info on bdi Kemeng Shi
@ 2024-03-27 15:57 ` Kemeng Shi
  2024-03-27 15:57 ` [PATCH v2 6/6] writeback: define GDTC_INIT_NO_WB to null Kemeng Shi
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-03-27 15:57 UTC (permalink / raw)
  To: akpm, willy, jack, bfoster, tj
  Cc: dsterba, mjguzik, dhowells, linux-kernel, linux-mm, linux-fsdevel

Commit 8d92890bd6b85 ("mm/writeback: discard NR_UNSTABLE_NFS, use
NR_WRITEBACK instead") removed NR_UNSTABLE_NFS, so nr_reclaimable
now only contains dirty pages.
Rename nr_reclaimable to nr_dirty accordingly.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 mm/page-writeback.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3724c7525316..211565d01600 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1695,7 +1695,7 @@ static int balance_dirty_pages(struct bdi_writeback *wb,
 	struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ?
 						     &mdtc_stor : NULL;
 	struct dirty_throttle_control *sdtc;
-	unsigned long nr_reclaimable;	/* = file_dirty */
+	unsigned long nr_dirty;
 	long period;
 	long pause;
 	long max_pause;
@@ -1716,9 +1716,9 @@ static int balance_dirty_pages(struct bdi_writeback *wb,
 		unsigned long m_thresh = 0;
 		unsigned long m_bg_thresh = 0;
 
-		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY);
+		nr_dirty = global_node_page_state(NR_FILE_DIRTY);
 		gdtc->avail = global_dirtyable_memory();
-		gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);
+		gdtc->dirty = nr_dirty + global_node_page_state(NR_WRITEBACK);
 
 		domain_dirty_limits(gdtc);
 
@@ -1769,7 +1769,7 @@ static int balance_dirty_pages(struct bdi_writeback *wb,
 		 * In normal mode, we start background writeout at the lower
 		 * background_thresh, to keep the amount of dirty memory low.
 		 */
-		if (!laptop_mode && nr_reclaimable > gdtc->bg_thresh &&
+		if (!laptop_mode && nr_dirty > gdtc->bg_thresh &&
 		    !writeback_in_progress(wb))
 			wb_start_background_writeback(wb);
 
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 6/6] writeback: define GDTC_INIT_NO_WB to null
  2024-03-27 15:57 [PATCH v2 0/6] Improve visibility of writeback Kemeng Shi
                   ` (4 preceding siblings ...)
  2024-03-27 15:57 ` [PATCH v2 5/6] writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages Kemeng Shi
@ 2024-03-27 15:57 ` Kemeng Shi
  2024-03-27 17:40 ` [PATCH v2 0/6] Improve visibility of writeback Andrew Morton
  2024-03-28 19:24 ` Kent Overstreet
  7 siblings, 0 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-03-27 15:57 UTC (permalink / raw)
  To: akpm, willy, jack, bfoster, tj
  Cc: dsterba, mjguzik, dhowells, linux-kernel, linux-mm, linux-fsdevel

We never use gdtc->dom when it is set up with GDTC_INIT_NO_WB, so just
remove the unneeded initialization of gdtc->dom.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/page-writeback.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 211565d01600..2fd2fd2e1932 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -147,6 +147,7 @@ struct dirty_throttle_control {
  * reflect changes in current writeout rate.
  */
 #define VM_COMPLETIONS_PERIOD_LEN (3*HZ)
+#define GDTC_INIT_NO_WB
 
 #ifdef CONFIG_CGROUP_WRITEBACK
 
@@ -154,8 +155,6 @@ struct dirty_throttle_control {
 				.dom = &global_wb_domain,		\
 				.wb_completions = &(__wb)->completions
 
-#define GDTC_INIT_NO_WB		.dom = &global_wb_domain
-
 #define MDTC_INIT(__wb, __gdtc)	.wb = (__wb),				\
 				.dom = mem_cgroup_wb_domain(__wb),	\
 				.wb_completions = &(__wb)->memcg_completions, \
@@ -210,7 +209,6 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
 
 #define GDTC_INIT(__wb)		.wb = (__wb),                           \
 				.wb_completions = &(__wb)->completions
-#define GDTC_INIT_NO_WB
 #define MDTC_INIT(__wb, __gdtc)
 
 static bool mdtc_valid(struct dirty_throttle_control *dtc)
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-03-27 15:57 [PATCH v2 0/6] Improve visibility of writeback Kemeng Shi
                   ` (5 preceding siblings ...)
  2024-03-27 15:57 ` [PATCH v2 6/6] writeback: define GDTC_INIT_NO_WB to null Kemeng Shi
@ 2024-03-27 17:40 ` Andrew Morton
  2024-03-28  1:59   ` Kemeng Shi
  2024-03-28 19:15   ` Kent Overstreet
  2024-03-28 19:24 ` Kent Overstreet
  7 siblings, 2 replies; 38+ messages in thread
From: Andrew Morton @ 2024-03-27 17:40 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: willy, jack, bfoster, tj, dsterba, mjguzik, dhowells,
	linux-kernel, linux-mm, linux-fsdevel

On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:

> This series tries to improve visibility of writeback.

Well...  why?  Is anyone usefully using the existing instrumentation? 
What is to be gained by expanding it further?  What is the case for
adding this code?

I don't recall hearing of anyone using the existing debug
instrumentation so perhaps we should remove it!

Also, I hit a build error and a pile of warnings with an arm
allnoconfig build.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-03-27 17:40 ` [PATCH v2 0/6] Improve visibility of writeback Andrew Morton
@ 2024-03-28  1:59   ` Kemeng Shi
  2024-03-28  8:23     ` Kemeng Shi
  2024-03-28 19:15   ` Kent Overstreet
  1 sibling, 1 reply; 38+ messages in thread
From: Kemeng Shi @ 2024-03-28  1:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: willy, jack, bfoster, tj, dsterba, mjguzik, dhowells,
	linux-kernel, linux-mm, linux-fsdevel


on 3/28/2024 1:40 AM, Andrew Morton wrote:
> On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> 
>> This series tries to improve visibility of writeback.
> 
> Well...  why?  Is anyone usefully using the existing instrumentation? 
> What is to be gained by expanding it further?  What is the case for
> adding this code?
> 
> I don't recall hearing of anyone using the existing debug
> instrumentation so perhaps we should remove it!
Hi Andrew, this was discussed in [1]. In short, I used the
debug files to test the changes in patchset [1]. The
wb_monitor.py script was suggested by Tejun in [2] to improve
visibility of writeback.
> 
> Also, I hit a build error and a pile of warnings with an arm
> allnoconfig build.
> 
Sorry for this, I only tested on x86. I will look into this and
fix the build problem in the next version.

[1] https://lore.kernel.org/lkml/44e3b910-8b52-5583-f8a9-37105bf5e5b6@huaweicloud.com/
[2] https://lore.kernel.org/lkml/a747dc7d-f24a-08bd-d969-d3fb35e151b7@huaweicloud.com/
[3] https://lore.kernel.org/lkml/ZcUsOb_fyvYr-zZ-@slm.duckdns.org/


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-03-28  1:59   ` Kemeng Shi
@ 2024-03-28  8:23     ` Kemeng Shi
  0 siblings, 0 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-03-28  8:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: willy, jack, bfoster, tj, dsterba, mjguzik, dhowells,
	linux-kernel, linux-mm, linux-fsdevel



on 3/28/2024 9:59 AM, Kemeng Shi wrote:
> 
> on 3/28/2024 1:40 AM, Andrew Morton wrote:
>> On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>>
>>> This series tries to improve visibility of writeback.
>>
>> Well...  why?  Is anyone usefully using the existing instrumentation? 
>> What is to be gained by expanding it further?  What is the case for
>> adding this code?
>>
>> I don't recall hearing of anyone using the existing debug
>> instrumentation so perhaps we should remove it!
> Hi Andrew, this was discussed in [1]. In short, I used the
> debug files to test the changes in patchset [1]. The
> wb_monitor.py script was suggested by Tejun in [2] to improve
> visibility of writeback.
>>
>> Also, I hit a build error and a pile of warnings with an arm
>> allnoconfig build.
With an arm allnoconfig build on the up-to-date mm-unstable branch, I don't
hit any build error, only some warnings like the following:
...
mm/page-writeback.c: In function ‘cgwb_calc_thresh’:
mm/page-writeback.c:906:13: warning: ‘writeback’ is used uninitialized in this function [-Wuninitialized]
  906 |  mdtc.dirty += writeback;
      |  ~~~~~~~~~~~^~~~~~~~~~~~
In file included from ./include/linux/kernel.h:28,
                 from mm/page-writeback.c:15:
./include/linux/minmax.h:46:54: warning: ‘filepages’ is used uninitialized in this function [-Wuninitialized]
   46 | #define __cmp(op, x, y) ((x) __cmp_op_##op (y) ? (x) : (y))
      |                                                      ^
mm/page-writeback.c:898:16: note: ‘filepages’ was declared here
  898 |  unsigned long filepages, headroom, writeback;
      |                ^~~~~~~~~
In file included from ./include/linux/kernel.h:28,
                 from mm/page-writeback.c:15:
./include/linux/minmax.h:46:54: warning: ‘headroom’ is used uninitialized in this function [-Wuninitialized]
   46 | #define __cmp(op, x, y) ((x) __cmp_op_##op (y) ? (x) : (y))
      |                                                      ^
mm/page-writeback.c:898:27: note: ‘headroom’ was declared here
  898 |  unsigned long filepages, headroom, writeback;
      |                           ^~~~~~~~
...

The only reason I can think of is that I also applied patchset [1]
for the build. I mentioned patchset [1] in the cover letter but forgot
to note the dependency on that patchset.
If this is the cause of the build error, I will send a new
series based on mm-unstable in the next version.

Thanks,
Kemeng

[1] https://lore.kernel.org/lkml/20240123183332.876854-1-shikemeng@huaweicloud.com/T/#mc6455784a63d0f8aa1a2f5aff325abcdf9336b76
>>
> Sorry for this, I only tested on x86. I will look into this and
> fix the build problem in the next version.
> 
> [1] https://lore.kernel.org/lkml/44e3b910-8b52-5583-f8a9-37105bf5e5b6@huaweicloud.com/
> [2] https://lore.kernel.org/lkml/a747dc7d-f24a-08bd-d969-d3fb35e151b7@huaweicloud.com/
> [3] https://lore.kernel.org/lkml/ZcUsOb_fyvYr-zZ-@slm.duckdns.org/
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/6] writeback: protect race between bdi release and bdi_debug_stats_show
  2024-03-27 15:57 ` [PATCH v2 1/6] writeback: protect race between bdi release and bdi_debug_stats_show Kemeng Shi
@ 2024-03-28 17:53   ` Brian Foster
  2024-04-03  2:16     ` Kemeng Shi
  0 siblings, 1 reply; 38+ messages in thread
From: Brian Foster @ 2024-03-28 17:53 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: akpm, willy, jack, tj, dsterba, mjguzik, dhowells, linux-kernel,
	linux-mm, linux-fsdevel

On Wed, Mar 27, 2024 at 11:57:46PM +0800, Kemeng Shi wrote:
> There is a race between bdi release and bdi_debug_stats_show:
> /* get debug info */		/* bdi release */
> bdi_debug_stats_show
>   bdi = m->private;
>   wb = &bdi->wb;
> 				bdi_unregister
> 				bdi_put
> 				  release_bdi
> 				    kfree(bdi)
>   /* use after free */
>   spin_lock(&wb->list_lock);
> 

Maybe I'm missing something, but it looks to me like
bdi_debug_unregister() can't complete until active readers of the
associated debugfs files have completed. For example, see
__debugfs_file_removed() and the use of ->active_users[_drained]. Once
the dentry is unlinked, further reads fail (I think) via
debugfs_file_get(). Hm?
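
For reference, the pattern I'm thinking of looks roughly like this
(paraphrased from memory and simplified, not the literal fs/debugfs/file.c
code):

/*
 * Simplified sketch of the debugfs "full proxy" read path; the helpers
 * are real, the surrounding structure is paraphrased.
 */
static ssize_t proxy_read_sketch(struct file *filp, char __user *buf,
				 size_t count, loff_t *ppos)
{
	struct dentry *dentry = filp->f_path.dentry;
	ssize_t ret;

	/* fails once the file has been removed */
	ret = debugfs_file_get(dentry);
	if (unlikely(ret))
		return ret;

	/* otherwise, removal waits until this reference is dropped */
	ret = debugfs_real_fops(filp)->read(filp, buf, count, ppos);
	debugfs_file_put(dentry);
	return ret;
}

If that's accurate, the seq_file ->show callback can't still be running
once the debugfs dir removal in bdi_debug_unregister() has returned, which
is why I'm questioning whether the bdi_list walk is needed.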

Brian

> To fix the issue, search for the bdi on bdi_list under the RCU read lock in
> bdi_debug_stats_show to ensure the bdi is not freed.
> 
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
>  mm/backing-dev.c | 33 +++++++++++++++++++++++++++++++--
>  1 file changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 5f2be8c8df11..70f02959f3bd 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -46,16 +46,44 @@ static void bdi_debug_init(void)
>  	bdi_debug_root = debugfs_create_dir("bdi", NULL);
>  }
>  
> -static int bdi_debug_stats_show(struct seq_file *m, void *v)
> +static struct backing_dev_info *lookup_bdi(struct seq_file *m)
>  {
> +	const struct file *file = m->file;
>  	struct backing_dev_info *bdi = m->private;
> -	struct bdi_writeback *wb = &bdi->wb;
> +	struct backing_dev_info *exist;
> +
> +	list_for_each_entry_rcu(exist, &bdi_list, bdi_list) {
> +		if (exist != bdi)
> +			continue;
> +
> +		if (exist->debug_dir == file->f_path.dentry->d_parent)
> +			return bdi;
> +		else
> +			return NULL;
> +	}
> +
> +	return NULL;
> +}
> +
> +
> +static int bdi_debug_stats_show(struct seq_file *m, void *v)
> +{
> +	struct backing_dev_info *bdi;
> +	struct bdi_writeback *wb;
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>  	unsigned long wb_thresh;
>  	unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time;
>  	struct inode *inode;
>  
> +	rcu_read_lock();
> +	bdi = lookup_bdi(m);
> +	if (!bdi) {
> +		rcu_read_unlock();
> +		return -EEXIST;
> +	}
> +
> +	wb = &bdi->wb;
>  	nr_dirty = nr_io = nr_more_io = nr_dirty_time = 0;
>  	spin_lock(&wb->list_lock);
>  	list_for_each_entry(inode, &wb->b_dirty, i_io_list)
> @@ -101,6 +129,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>  		   nr_dirty_time,
>  		   !list_empty(&bdi->bdi_list), bdi->wb.state);
>  
> +	rcu_read_unlock();
>  	return 0;
>  }
>  DEFINE_SHOW_ATTRIBUTE(bdi_debug_stats);
> -- 
> 2.30.0
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-03-27 17:40 ` [PATCH v2 0/6] Improve visibility of writeback Andrew Morton
  2024-03-28  1:59   ` Kemeng Shi
@ 2024-03-28 19:15   ` Kent Overstreet
  2024-03-28 19:23     ` Andrew Morton
  1 sibling, 1 reply; 38+ messages in thread
From: Kent Overstreet @ 2024-03-28 19:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kemeng Shi, willy, jack, bfoster, tj, dsterba, mjguzik, dhowells,
	linux-kernel, linux-mm, linux-fsdevel

On Wed, Mar 27, 2024 at 10:40:10AM -0700, Andrew Morton wrote:
> On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> 
> > This series tries to improve visibility of writeback.
> 
> Well...  why?  Is anyone usefully using the existing instrumentation? 
> What is to be gained by expanding it further?  What is the case for
> adding this code?
> 
> I don't recall hearing of anyone using the existing debug
> instrumentation so perhaps we should remove it!

Remove debug instrumentation!? Surely you jest?

I generally don't hear from users of my code when things are working, I
only expect to hear from people when it's not.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-03-28 19:15   ` Kent Overstreet
@ 2024-03-28 19:23     ` Andrew Morton
  2024-03-28 19:36       ` Kent Overstreet
  0 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2024-03-28 19:23 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Kemeng Shi, willy, jack, bfoster, tj, dsterba, mjguzik, dhowells,
	linux-kernel, linux-mm, linux-fsdevel

On Thu, 28 Mar 2024 15:15:03 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote:

> On Wed, Mar 27, 2024 at 10:40:10AM -0700, Andrew Morton wrote:
> > On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> > 
> > > This series tries to improve visibility of writeback.
> > 
> > Well...  why?  Is anyone usefully using the existing instrumentation? 
> > What is to be gained by expanding it further?  What is the case for
> > adding this code?
> > 
> > I don't recall hearing of anyone using the existing debug
> > instrumentation so perhaps we should remove it!
> 
> Remove debug instrumentation!? Surely you jest?

Absolutely not.  Any code in the kernel should have ongoing
justification for remaining there.  If no such justification exists,
out it goes.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-03-27 15:57 [PATCH v2 0/6] Improve visibility of writeback Kemeng Shi
                   ` (6 preceding siblings ...)
  2024-03-27 17:40 ` [PATCH v2 0/6] Improve visibility of writeback Andrew Morton
@ 2024-03-28 19:24 ` Kent Overstreet
  2024-03-28 19:31   ` Tejun Heo
  2024-04-03  6:56   ` Kemeng Shi
  7 siblings, 2 replies; 38+ messages in thread
From: Kent Overstreet @ 2024-03-28 19:24 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: akpm, willy, jack, bfoster, tj, dsterba, mjguzik, dhowells,
	linux-kernel, linux-mm, linux-fsdevel

On Wed, Mar 27, 2024 at 11:57:45PM +0800, Kemeng Shi wrote:
> v1->v2:
> -Send cleanup to wq_monitor.py separately.
> -Add patch to avoid use after free of bdi.
> -Rename wb_calc_cg_thresh to cgwb_calc_thresh as Tejun suggested.
> -Use rcu walk to avoid use after free.
> -Add debug output to each related patches.
> 
> This series tries to improve visibility of writeback. Patch 2 makes
> /sys/kernel/debug/bdi/xxx/stats show writeback info of the whole bdi
> instead of only writeback info in the root cgroup. Patch 3 adds a new
> debug file /sys/kernel/debug/bdi/xxx/wb_stats to show per-wb writeback
> info. Patch 4 adds wb_monitor.py to monitor basic writeback info
> of a running system; more info could be added on demand. The remaining
> patches are miscellaneous fixes and cleanups. More details can be found
> in the respective patches. Thanks!
> This series is on top of patchset [1].
> 
> [1] https://lore.kernel.org/lkml/20240123183332.876854-1-shikemeng@huaweicloud.com/T/#mc6455784a63d0f8aa1a2f5aff325abcdf9336b76

Not bad

I've been trying to improve our ability to debug latency issues - stalls
of all sorts. While you're looking at all this code, do you think you
could find some places to collect useful latency numbers?

fs/bcachefs/time_stats.c has some code that's going to be moving out to
lib/ at some point, after I switch it to MAD; if you could hook that up
as well to a few points we could see at a glance if there are stalls
happening in the writeback path.
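
Just to illustrate the sort of hook I mean - a rough sketch, where
time_stats_update() and the ->writeback_lat field are stand-ins for
whatever the lib/ version of the bcachefs code ends up being called, not a
real API:

/*
 * Sketch only: a made-up helper wrapping one writeback pass with a
 * latency measurement that feeds an always-on accumulator.
 */
static long wb_writeback_timed(struct bdi_writeback *wb,
			       struct wb_writeback_work *work)
{
	u64 start = ktime_get_ns();
	long wrote = wb_writeback(wb, work);	/* existing function */

	time_stats_update(&wb->bdi->writeback_lat, start);
	return wrote;
}

A handful of points like that (per writeback pass, per inode, around the
throttling in balance_dirty_pages()) would be enough to spot stalls at a
glance.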

> 
> Following domain hierarchy is tested:
>                 global domain (320G)
>                 /                 \
>         cgroup domain1(10G)     cgroup domain2(10G)
>                 |                 |
> bdi            wb1               wb2
> 
> /* all writeback info of bdi is successfully collected */
> # cat /sys/kernel/debug/bdi/252:16/stats:
> BdiWriteback:              448 kB
> BdiReclaimable:        1303904 kB
> BdiDirtyThresh:      189914124 kB
> DirtyThresh:         195337564 kB
> BackgroundThresh:     32516508 kB
> BdiDirtied:            3591392 kB
> BdiWritten:            2287488 kB
> BdiWriteBandwidth:      322248 kBps
> b_dirty:                     0
> b_io:                        0
> b_more_io:                   2
> b_dirty_time:                0
> bdi_list:                    1
> state:                       1
> 
> /* per wb writeback info is collected */
> # cat /sys/kernel/debug/bdi/252:16/wb_stats:
> WbCgIno:                    1
> WbWriteback:                0 kB
> WbReclaimable:              0 kB
> WbDirtyThresh:              0 kB
> WbDirtied:                  0 kB
> WbWritten:                  0 kB
> WbWriteBandwidth:      102400 kBps
> b_dirty:                    0
> b_io:                       0
> b_more_io:                  0
> b_dirty_time:               0
> state:                      1
> WbCgIno:                 4284
> WbWriteback:              448 kB
> WbReclaimable:         818944 kB
> WbDirtyThresh:        3096524 kB
> WbDirtied:            2266880 kB
> WbWritten:            1447936 kB
> WbWriteBandwidth:      214036 kBps
> b_dirty:                    0
> b_io:                       0
> b_more_io:                  1
> b_dirty_time:               0
> state:                      5
> WbCgIno:                 4325
> WbWriteback:              224 kB
> WbReclaimable:         819392 kB
> WbDirtyThresh:        2920088 kB
> WbDirtied:            2551808 kB
> WbWritten:            1732416 kB
> WbWriteBandwidth:      201832 kBps
> b_dirty:                    0
> b_io:                       0
> b_more_io:                  1
> b_dirty_time:               0
> state:                      5
> 
> /* monitor writeback info */
> # ./wb_monitor.py 252:16 -c
>                   writeback  reclaimable   dirtied   written    avg_bw
> 252:16_1                  0            0         0         0    102400
> 252:16_4284             672       820064   9230368   8410304    685612
> 252:16_4325             896       819840  10491264   9671648    652348
> 252:16                 1568      1639904  19721632  18081952   1440360
> 
> 
>                   writeback  reclaimable   dirtied   written    avg_bw
> 252:16_1                  0            0         0         0    102400
> 252:16_4284             672       820064   9230368   8410304    685612
> 252:16_4325             896       819840  10491264   9671648    652348
> 252:16                 1568      1639904  19721632  18081952   1440360
> ...
> 
> Kemeng Shi (6):
>   writeback: protect race between bdi release and bdi_debug_stats_show
>   writeback: collect stats of all wb of bdi in bdi_debug_stats_show
>   writeback: support retrieving per group debug writeback stats of bdi
>   writeback: add wb_monitor.py script to monitor writeback info on bdi
>   writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages
>   writeback: define GDTC_INIT_NO_WB to null
> 
>  include/linux/writeback.h     |   1 +
>  mm/backing-dev.c              | 203 ++++++++++++++++++++++++++++++----
>  mm/page-writeback.c           |  31 ++++--
>  tools/writeback/wb_monitor.py | 172 ++++++++++++++++++++++++++++
>  4 files changed, 378 insertions(+), 29 deletions(-)
>  create mode 100644 tools/writeback/wb_monitor.py
> 
> -- 
> 2.30.0
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-03-28 19:24 ` Kent Overstreet
@ 2024-03-28 19:31   ` Tejun Heo
  2024-03-28 19:40     ` Kent Overstreet
  2024-04-03  6:56   ` Kemeng Shi
  1 sibling, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2024-03-28 19:31 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Kemeng Shi, akpm, willy, jack, bfoster, dsterba, mjguzik,
	dhowells, linux-kernel, linux-mm, linux-fsdevel

Hello, Kent.

On Thu, Mar 28, 2024 at 03:24:35PM -0400, Kent Overstreet wrote:
> fs/bcachefs/time_stats.c has some code that's going to be moving out to
> lib/ at some point, after I switch it to MAD; if you could hook that up
> as well to a few points we could see at a glance if there are stalls
> happening in the writeback path.

Using BPF (whether through bcc or bpftrace) is likely a better approach for
this sort of detailed instrumentation. Fixed debug information is useful, but
it's also a common occurrence that it doesn't quite reveal the full picture
of what one's trying to understand, and one needs to dig a bit deeper or
wider, aggregate data in a different way, or whatever.

So, rather than adding more fixed infrastructure, I'd suggest adding places
which can easily be instrumented using the existing tools (they are really
great once you get used to them) whether that's tracepoints or just
strategically placed noinline functions.
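
E.g. something like the below (untested sketch, names made up) is enough of
an anchor for bpftrace/bcc to build latency or size distributions, filtered
by whatever criteria you care about, without baking a fixed format into
debugfs:

/*
 * Untested sketch, names made up: an empty noinline function whose only
 * job is to be a stable kprobe/fentry attach point for BPF tools.
 */
noinline void wb_writeback_done_hook(struct bdi_writeback *wb,
				     long pages_written)
{
	/* intentionally empty */
}

Then a one-liner like

  bpftrace -e 'kprobe:wb_writeback_done_hook { @pages = hist(arg1); }'

gets you a distribution on a live system, and it costs essentially nothing
when nobody is attached.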

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-03-28 19:23     ` Andrew Morton
@ 2024-03-28 19:36       ` Kent Overstreet
  0 siblings, 0 replies; 38+ messages in thread
From: Kent Overstreet @ 2024-03-28 19:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kemeng Shi, willy, jack, bfoster, tj, dsterba, mjguzik, dhowells,
	linux-kernel, linux-mm, linux-fsdevel

On Thu, Mar 28, 2024 at 12:23:52PM -0700, Andrew Morton wrote:
> On Thu, 28 Mar 2024 15:15:03 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> > On Wed, Mar 27, 2024 at 10:40:10AM -0700, Andrew Morton wrote:
> > > On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> > > 
> > > > This series tries to improve visibility of writeback.
> > > 
> > > Well...  why?  Is anyone usefully using the existing instrumentation? 
> > > What is to be gained by expanding it further?  What is the case for
> > > adding this code?
> > > 
> > > I don't recall hearing of anyone using the existing debug
> > > instrumentation so perhaps we should remove it!
> > 
> > Remove debug instrumentation!? Surely you jest?
> 
> Absolutely not.  Any code in the kernel should have ongoing
> justification for remaining there.  If no such justification exists,
> out it goes.

Certainly, but this isn't remotely a case where I'd expect to be getting
that kind of feedback. Debugging instrumentation is very much a case
where no one notices it 99% of the time, but when you need it you
_really_ need it.

Not having it can turn a 10 minute "oh, that thing is acting wonky -
it's because your system is overloaded/your drive is wonky/x subsystem
sucks, we know about it and we're working on it" into a weeklong
bughunt, burning up expensive engineer time pointlessly.

To debug complex systems efficiently, in production, in the wild, we
need to be able to see what's going on - we need more of this stuff.

Not to say that this couldn't use more work - perhaps additional focus
on what kinds of issues we expect to need to debug with this, what the
numbers mean and are useful for, documentation on how this relates to
writeback internals, etc.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-03-28 19:31   ` Tejun Heo
@ 2024-03-28 19:40     ` Kent Overstreet
  2024-03-28 19:46       ` Tejun Heo
  0 siblings, 1 reply; 38+ messages in thread
From: Kent Overstreet @ 2024-03-28 19:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kemeng Shi, akpm, willy, jack, bfoster, dsterba, mjguzik,
	dhowells, linux-kernel, linux-mm, linux-fsdevel

On Thu, Mar 28, 2024 at 09:31:57AM -1000, Tejun Heo wrote:
> Hello, Kent.
> 
> On Thu, Mar 28, 2024 at 03:24:35PM -0400, Kent Overstreet wrote:
> > fs/bcachefs/time_stats.c has some code that's going to be moving out to
> > lib/ at some point, after I switch it to MAD; if you could hook that up
> > as well to a few points we could see at a glance if there are stalls
> > happening in the writeback path.
> 
> Using BPF (whether through bcc or bpftrace) is likely a better approach for
> this sort of detailed instrumentation. Fixed debug information is useful, but
> it's also a common occurrence that it doesn't quite reveal the full picture
> of what one's trying to understand, and one needs to dig a bit deeper, wider,
> aggregate data in a different way, or whatever.
> 
> So, rather than adding more fixed infrastructure, I'd suggest adding places
> which can easily be instrumented using the existing tools (they are really
> great once you get used to them) whether that's tracepoints or just
> strategically placed noinline functions.

Collecting latency numbers at various key places is _enormously_ useful.
The hard part is deciding where it's useful to collect; that requires
intimate knowledge of the code. Once you're defining those collection
points statically, doing it with BPF is just another useless layer of
indirection.

The time stats stuff I wrote is _really_ cheap, and you really want this
stuff always on so that you've actually got the data you need when
you're bughunting.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-03-28 19:40     ` Kent Overstreet
@ 2024-03-28 19:46       ` Tejun Heo
  2024-03-28 19:55         ` Kent Overstreet
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2024-03-28 19:46 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Kemeng Shi, akpm, willy, jack, bfoster, dsterba, mjguzik,
	dhowells, linux-kernel, linux-mm, linux-fsdevel

Hello,

On Thu, Mar 28, 2024 at 03:40:02PM -0400, Kent Overstreet wrote:
> Collecting latency numbers at various key places is _enormously_ useful.
> The hard part is deciding where it's useful to collect; that requires
> intimate knowledge of the code. Once you're defining those collection
> > > points statically, doing it with BPF is just another useless layer of
> indirection.

Given how much flexibility helps with debugging, claiming it useless is a
stretch.

> The time stats stuff I wrote is _really_ cheap, and you really want this
> stuff always on so that you've actually got the data you need when
> you're bughunting.

For some stats and some use cases, always being available is useful and
building fixed infra for them makes sense. For other stats and other use
cases, flexibility is pretty useful too (e.g. what if you want percentile
distribution which is filtered by some criteria?). They aren't mutually
exclusive and I'm not sure bdi wb instrumentation is on top of enough
people's minds.

As for overhead, BPF instrumentation can be _really_ cheap too. We often run
these programs per packet.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-03-28 19:46       ` Tejun Heo
@ 2024-03-28 19:55         ` Kent Overstreet
  2024-03-28 20:13           ` Tejun Heo
  0 siblings, 1 reply; 38+ messages in thread
From: Kent Overstreet @ 2024-03-28 19:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kemeng Shi, akpm, willy, jack, bfoster, dsterba, mjguzik,
	dhowells, linux-kernel, linux-mm, linux-fsdevel

On Thu, Mar 28, 2024 at 09:46:39AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Thu, Mar 28, 2024 at 03:40:02PM -0400, Kent Overstreet wrote:
> > Collecting latency numbers at various key places is _enormously_ useful.
> > The hard part is deciding where it's useful to collect; that requires
> > intimate knowledge of the code. Once you're defining those collection
> > > points statically, doing it with BPF is just another useless layer of
> > indirection.
> 
> Given how much flexibility helps with debugging, claiming it useless is a
> stretch.

Well, what would it add?

> > The time stats stuff I wrote is _really_ cheap, and you really want this
> > stuff always on so that you've actually got the data you need when
> > you're bughunting.
> 
> For some stats and some use cases, always being available is useful and
> building fixed infra for them makes sense. For other stats and other use
> cases, flexibility is pretty useful too (e.g. what if you want percentile
> distribution which is filtered by some criteria?). They aren't mutually
> exclusive and I'm not sure bdi wb instrumentation is on top of enough
> people's minds.
> 
> As for overhead, BPF instrumentation can be _really_ cheap too. We often run
> these programs per packet.

The main things I want are just
 - elapsed time since last writeback IO completed, so we can see at a
   glance if it's stalled
 - time stats on writeback io initiation to completion

The main value of this one will be tracking down tail latency issues and
finding out where in the stack they originate.
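
A minimal sketch of the first item, scraped from the existing writeback
tracepoints (assuming bcc; completion of a writeback work item via the
writeback:writeback_written tracepoint is used as a crude stand-in for the
last writeback IO completing):

from bcc import BPF
import ctypes
import time

prog = r"""
BPF_ARRAY(last_done, u64, 1);

/* remember when a writeback work item last finished */
TRACEPOINT_PROBE(writeback, writeback_written)
{
	int idx = 0;
	u64 *ts = last_done.lookup(&idx);

	if (ts)
		*ts = bpf_ktime_get_ns();
	return 0;
}
"""

b = BPF(text=prog)
last_done = b["last_done"]

while True:
    time.sleep(2)
    ts = last_done[ctypes.c_int(0)].value
    if ts == 0:
        print("no writeback work has completed since tracing started")
        continue
    # bpf_ktime_get_ns() is CLOCK_MONOTONIC based, so compare against that
    now = int(time.clock_gettime(time.CLOCK_MONOTONIC) * 1e9)
    print("last writeback work completed %.1fs ago" % ((now - ts) / 1e9))

The second item would pair the matching start and finish events with a hash
keyed on whatever identifies the work; same pattern.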

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-03-28 19:55         ` Kent Overstreet
@ 2024-03-28 20:13           ` Tejun Heo
  2024-03-28 20:22             ` Kent Overstreet
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2024-03-28 20:13 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Kemeng Shi, akpm, willy, jack, bfoster, dsterba, mjguzik,
	dhowells, linux-kernel, linux-mm, linux-fsdevel

Hello,

On Thu, Mar 28, 2024 at 03:55:32PM -0400, Kent Overstreet wrote:
> > On Thu, Mar 28, 2024 at 03:40:02PM -0400, Kent Overstreet wrote:
> > > Collecting latency numbers at various key places is _enormously_ useful.
> > > The hard part is deciding where it's useful to collect; that requires
> > > intimate knowledge of the code. Once you're defining those collection
> > > points statically, doing it with BPF is just another useless layer of
> > > indirection.
> > 
> > Given how much flexibility helps with debugging, claiming it useless is a
> > stretch.
> 
> Well, what would it add?

It depends on the case but here's an example. If I'm seeing occasional tail
latency spikes, I'd want to know whether there's any correlation with
specific types or sizes of IOs and if so who's issuing them and why. With
BPF, you can detect those conditions to tag and capture where exactly those
IOs are coming from and aggregate the result however you like across
thousands of machines in production without anyone noticing. That's useful,
no?

Also, an actual percentile distribution is almost always a lot more insightful
than more coarsely aggregated numbers. We can't add all that to fixed infra.
In most cases not because the runtime overhead would be too high but because
the added interface and code complexity and maintenance overhead isn't
justifiable given how niche, ad hoc and varied these use cases get.
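
As a minimal sketch of that kind of condition-filtered, attributed collection
(assuming bcc; the 10ms threshold and the balance_dirty_pages_ratelimited()
probe point are arbitrary choices for illustration, and the cross-machine
aggregation is left out), tagging which processes stall in the writeback
throttling path looks roughly like:

from bcc import BPF
from time import sleep

prog = r"""
#include <uapi/linux/ptrace.h>

struct key_t {
	char comm[16];
};

BPF_HASH(start, u32, u64);
BPF_HASH(slow, struct key_t, u64);

int trace_entry(struct pt_regs *ctx)
{
	u32 tid = bpf_get_current_pid_tgid();
	u64 ts = bpf_ktime_get_ns();

	start.update(&tid, &ts);
	return 0;
}

int trace_return(struct pt_regs *ctx)
{
	u32 tid = bpf_get_current_pid_tgid();
	u64 *tsp = start.lookup(&tid);
	u64 delta, zero = 0, *cnt;

	if (tsp == 0)
		return 0;
	delta = bpf_ktime_get_ns() - *tsp;
	start.delete(&tid);

	/* only tag callers that stalled for more than 10ms */
	if (delta > 10 * 1000 * 1000) {
		struct key_t key = {};

		bpf_get_current_comm(&key.comm, sizeof(key.comm));
		cnt = slow.lookup_or_try_init(&key, &zero);
		if (cnt)
			(*cnt)++;
	}
	return 0;
}
"""

b = BPF(text=prog)
# the probe point is only an example of "where the stalls are felt"
b.attach_kprobe(event="balance_dirty_pages_ratelimited", fn_name="trace_entry")
b.attach_kretprobe(event="balance_dirty_pages_ratelimited", fn_name="trace_return")

sleep(10)   # sample for ten seconds
for k, v in sorted(b["slow"].items(), key=lambda kv: kv[1].value):
    print("%-16s %8d slow calls" % (k.comm.decode("utf-8", "replace"), v.value))

From there the per-comm counts can be shipped to whatever aggregation the
fleet already has.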

> > > The time stats stuff I wrote is _really_ cheap, and you really want this
> > > stuff always on so that you've actually got the data you need when
> > > you're bughunting.
> > 
> > For some stats and some use cases, always being available is useful and
> > building fixed infra for them makes sense. For other stats and other use
> > cases, flexibility is pretty useful too (e.g. what if you want percentile
> > distribution which is filtered by some criteria?). They aren't mutually
> > exclusive and I'm not sure bdi wb instrumentation is on top of enough
> > people's minds.
> > 
> > As for overhead, BPF instrumentation can be _really_ cheap too. We often run
> > these programs per packet.
> 
> The main things I want are just
>  - elapsed time since last writeback IO completed, so we can see at a
>    glance if it's stalled
>  - time stats on writeback io initiation to completion
> 
> The main value of this one will be tracking down tail latency issues and
> finding out where in the stack they originate.

Yeah, I mean, if always keeping those numbers around is useful for wide
enough number of users and cases, sure, go ahead and add fixed infra. I'm
not quite sure bdi wb stats fall in that bucket given how little attention
it usually gets.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-03-28 20:13           ` Tejun Heo
@ 2024-03-28 20:22             ` Kent Overstreet
  2024-03-28 20:46               ` Tejun Heo
  0 siblings, 1 reply; 38+ messages in thread
From: Kent Overstreet @ 2024-03-28 20:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kemeng Shi, akpm, willy, jack, bfoster, dsterba, mjguzik,
	dhowells, linux-kernel, linux-mm, linux-fsdevel

On Thu, Mar 28, 2024 at 10:13:27AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Thu, Mar 28, 2024 at 03:55:32PM -0400, Kent Overstreet wrote:
> > > On Thu, Mar 28, 2024 at 03:40:02PM -0400, Kent Overstreet wrote:
> > > > Collecting latency numbers at various key places is _enormously_ useful.
> > > > The hard part is deciding where it's useful to collect; that requires
> > > > intimate knowledge of the code. Once you're defining those collection
> > > > points statically, doing it with BPF is just another useless layer of
> > > > indirection.
> > > 
> > > Given how much flexibility helps with debugging, claiming it useless is a
> > > stretch.
> > 
> > Well, what would it add?
> 
> It depends on the case but here's an example. If I'm seeing occasional tail
> latency spikes, I'd want to know whether there's any correlation with
> specific types or sizes of IOs and if so who's issuing them and why. With
> BPF, you can detect those conditions to tag and capture where exactly those
> IOs are coming from and aggregate the result however you like across
> thousands of machines in production without anyone noticing. That's useful,
> no?

That's cool, but really esoteric. We need to be able to answer basic
questions and build an overall picture of what the system is doing
without having to reach for the big stuff.

Most users are never going to touch tracing, let alone BPF; that's too
much setup. But I can and do regularly tell users "check this, this and
this" and debug things on that basis without ever touching their
machine.

And basic latency numbers are really easy for users to understand, which
makes them doubly worthwhile to collect and make visible.

> Also, an actual percentile distribution is almost always a lot more insightful
> than more coarsely aggregated numbers. We can't add all that to fixed infra.
> In most cases not because the runtime overhead would be too high but because
> the added interface and code complexity and maintenance overhead isn't
> justifiable given how niche, ad hoc and varied these use cases get.

You can't calculate percentiles accurately and robustly in one pass -
that only works if your input data obeys a nice statistical
distribution, and the cases we care about are the ones where it doesn't.

> 
> > > > The time stats stuff I wrote is _really_ cheap, and you really want this
> > > > stuff always on so that you've actually got the data you need when
> > > > you're bughunting.
> > > 
> > > For some stats and some use cases, always being available is useful and
> > > building fixed infra for them makes sense. For other stats and other use
> > > cases, flexibility is pretty useful too (e.g. what if you want percentile
> > > distribution which is filtered by some criteria?). They aren't mutually
> > > exclusive and I'm not sure bdi wb instrumentation is on top of enough
> > > people's minds.
> > > 
> > > As for overhead, BPF instrumentation can be _really_ cheap too. We often run
> > > these programs per packet.
> > 
> > The main things I want are just
> >  - elapsed time since last writeback IO completed, so we can see at a
> >    glance if it's stalled
> >  - time stats on writeback io initiation to completion
> > 
> > The main value of this one will be tracking down tail latency issues and
> > finding out where in the stack they originate.
> 
> Yeah, I mean, if always keeping those numbers around is useful for wide
> enough number of users and cases, sure, go ahead and add fixed infra. I'm
> not quite sure bdi wb stats fall in that bucket given how little attention
> it usually gets.

I think it should be getting a lot more attention given that memory
reclaim and writeback are generally implicated whenever a user complains
about their system going out to lunch.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-03-28 20:22             ` Kent Overstreet
@ 2024-03-28 20:46               ` Tejun Heo
  2024-03-28 20:53                 ` Kent Overstreet
  2024-04-03 16:27                 ` Jan Kara
  0 siblings, 2 replies; 38+ messages in thread
From: Tejun Heo @ 2024-03-28 20:46 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Kemeng Shi, akpm, willy, jack, bfoster, dsterba, mjguzik,
	dhowells, linux-kernel, linux-mm, linux-fsdevel

Hello, Kent.

On Thu, Mar 28, 2024 at 04:22:13PM -0400, Kent Overstreet wrote:
> Most users are never going to touch tracing, let alone BPF; that's too
> much setup. But I can and do regularly tell users "check this, this and
> this" and debug things on that basis without ever touching their
> machine.

I think this is where the disconnect is. It's not difficult to set up at
all. Nowadays, in most distros, it comes down to something like run "pacman
-S bcc" and then run "/usr/share/bcc/tools/biolatpcts" with these params or
run this script I'm attaching. It is a significant boost when debugging many
different kernel issues. I strongly suggest giving it a try and getting used
to it rather than resisting it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-03-28 20:46               ` Tejun Heo
@ 2024-03-28 20:53                 ` Kent Overstreet
  2024-04-03 16:27                 ` Jan Kara
  1 sibling, 0 replies; 38+ messages in thread
From: Kent Overstreet @ 2024-03-28 20:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kemeng Shi, akpm, willy, jack, bfoster, dsterba, mjguzik,
	dhowells, linux-kernel, linux-mm, linux-fsdevel

On Thu, Mar 28, 2024 at 10:46:33AM -1000, Tejun Heo wrote:
> Hello, Kent.
> 
> On Thu, Mar 28, 2024 at 04:22:13PM -0400, Kent Overstreet wrote:
> > Most users are never going to touch tracing, let alone BPF; that's too
> > much setup. But I can and do regularly tell users "check this, this and
> > this" and debug things on that basis without ever touching their
> > machine.
> 
> I think this is where the disconnect is. It's not difficult to set up at
> all. Nowadays, in most distros, it comes down to something like run "pacman
> -S bcc" and then run "/usr/share/bcc/tools/biolatpcts" with these params or
> run this script I'm attaching. It is a significant boost when debugging many
> different kernel issues. I strongly suggest giving it a try and getting used
> to it rather than resisting it.

The disconnect is with you, Tejun. I use tracing all the time, and it's
not going to be much of a discussion if you're not going to formulate a
response.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 2/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show
  2024-03-27 15:57 ` [PATCH v2 2/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show Kemeng Shi
@ 2024-03-29 13:04   ` Brian Foster
  2024-04-03  7:49     ` Kemeng Shi
  0 siblings, 1 reply; 38+ messages in thread
From: Brian Foster @ 2024-03-29 13:04 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: akpm, willy, jack, tj, dsterba, mjguzik, dhowells, linux-kernel,
	linux-mm, linux-fsdevel

On Wed, Mar 27, 2024 at 11:57:47PM +0800, Kemeng Shi wrote:
> /sys/kernel/debug/bdi/xxx/stats is supposed to show writeback information
> of whole bdi, but only writeback information of bdi in root cgroup is
> collected. So writeback information in non-root cgroup are missing now.
> To be more specific, considering following case:
> 
> /* create writeback cgroup */
> cd /sys/fs/cgroup
> echo "+memory +io" > cgroup.subtree_control
> mkdir group1
> cd group1
> echo $$ > cgroup.procs
> /* do writeback in cgroup */
> fio -name test -filename=/dev/vdb ...
> /* get writeback info of bdi */
> cat /sys/kernel/debug/bdi/xxx/stats
> The cat result unexpectedly implies that there is no writeback on target
> bdi.
> 
> Fix this by collecting stats of all wb in bdi instead of only wb in
> root cgroup.
> 
> Following domain hierarchy is tested:
>                 global domain (320G)
>                 /                 \
>         cgroup domain1(10G)     cgroup domain2(10G)
>                 |                 |
> bdi            wb1               wb2
> 
> /* all writeback info of bdi is successfully collected */
> cat stats
> BdiWriteback:             2912 kB
> BdiReclaimable:        1598464 kB
> BdiDirtyThresh:      167479028 kB
> DirtyThresh:         195038532 kB
> BackgroundThresh:     32466728 kB
> BdiDirtied:           19141696 kB
> BdiWritten:           17543456 kB
> BdiWriteBandwidth:     1136172 kBps
> b_dirty:                     2
> b_io:                        0
> b_more_io:                   1
> b_dirty_time:                0
> bdi_list:                    1
> state:                       1
> 
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
>  mm/backing-dev.c | 100 +++++++++++++++++++++++++++++++++--------------
>  1 file changed, 71 insertions(+), 29 deletions(-)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 70f02959f3bd..8daf950e6855 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
...
> @@ -65,16 +78,54 @@ static struct backing_dev_info *lookup_bdi(struct seq_file *m)
>  	return NULL;
>  }
>  
> +static void collect_wb_stats(struct wb_stats *stats,
> +			     struct bdi_writeback *wb)
> +{
> +	struct inode *inode;
> +
> +	spin_lock(&wb->list_lock);
> +	list_for_each_entry(inode, &wb->b_dirty, i_io_list)
> +		stats->nr_dirty++;
> +	list_for_each_entry(inode, &wb->b_io, i_io_list)
> +		stats->nr_io++;
> +	list_for_each_entry(inode, &wb->b_more_io, i_io_list)
> +		stats->nr_more_io++;
> +	list_for_each_entry(inode, &wb->b_dirty_time, i_io_list)
> +		if (inode->i_state & I_DIRTY_TIME)
> +			stats->nr_dirty_time++;
> +	spin_unlock(&wb->list_lock);
> +
> +	stats->nr_writeback += wb_stat(wb, WB_WRITEBACK);
> +	stats->nr_reclaimable += wb_stat(wb, WB_RECLAIMABLE);
> +	stats->nr_dirtied += wb_stat(wb, WB_DIRTIED);
> +	stats->nr_written += wb_stat(wb, WB_WRITTEN);
> +	stats->wb_thresh += wb_calc_thresh(wb, stats->dirty_thresh);

Kinda nitty question, but is this a sum of per-wb writeback thresholds?
If so, do you consider that useful information vs. the per-wb threshold
data presumably exposed in the next patch?

I'm not really that worried about what debug data we expose, it just
seems a little odd. How would you document this value in a sentence or
two, for example?

> +}
> +
> +#ifdef CONFIG_CGROUP_WRITEBACK
> +static void bdi_collect_stats(struct backing_dev_info *bdi,
> +			      struct wb_stats *stats)
> +{
> +	struct bdi_writeback *wb;
> +
> +	list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
> +		collect_wb_stats(stats, wb);

Depending on discussion on the previous patch and whether the higher
level rcu protection in bdi_debug_stats_show() is really necessary, it
might make more sense to move it here.

I'm also wondering if you'd want to check the state of the individual wb
(i.e. WB_registered?) before reading it..?

> +}
> +#else
> +static void bdi_collect_stats(struct backing_dev_info *bdi,
> +			      struct wb_stats *stats)
> +{
> +	collect_wb_stats(stats, &bdi->wb);
> +}
> +#endif
...
> @@ -115,18 +157,18 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>  		   "b_dirty_time:       %10lu\n"
>  		   "bdi_list:           %10u\n"
>  		   "state:              %10lx\n",
> -		   (unsigned long) K(wb_stat(wb, WB_WRITEBACK)),
> -		   (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)),
> -		   K(wb_thresh),
> +		   K(stats.nr_writeback),
> +		   K(stats.nr_reclaimable),
> +		   K(stats.wb_thresh),
>  		   K(dirty_thresh),
>  		   K(background_thresh),
> -		   (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
> -		   (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
> -		   (unsigned long) K(wb->write_bandwidth),
> -		   nr_dirty,
> -		   nr_io,
> -		   nr_more_io,
> -		   nr_dirty_time,
> +		   K(stats.nr_dirtied),
> +		   K(stats.nr_written),
> +		   K(tot_bw),
> +		   stats.nr_dirty,
> +		   stats.nr_io,
> +		   stats.nr_more_io,
> +		   stats.nr_dirty_time,
>  		   !list_empty(&bdi->bdi_list), bdi->wb.state);

Is it worth showing a list count here rather than list_empty() state?

Brian

>  
>  	rcu_read_unlock();
> -- 
> 2.30.0
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 3/6] writeback: support retrieving per group debug writeback stats of bdi
  2024-03-27 15:57 ` [PATCH v2 3/6] writeback: support retrieving per group debug writeback stats of bdi Kemeng Shi
@ 2024-03-29 13:10   ` Brian Foster
  2024-04-03  8:49     ` Kemeng Shi
  0 siblings, 1 reply; 38+ messages in thread
From: Brian Foster @ 2024-03-29 13:10 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: akpm, willy, jack, tj, dsterba, mjguzik, dhowells, linux-kernel,
	linux-mm, linux-fsdevel

On Wed, Mar 27, 2024 at 11:57:48PM +0800, Kemeng Shi wrote:
> Add /sys/kernel/debug/bdi/xxx/wb_stats to show per group writeback stats
> of bdi.
> 

Hi Kemeng,

Just a few random thoughts/comments..

> Following domain hierarchy is tested:
>                 global domain (320G)
>                 /                 \
>         cgroup domain1(10G)     cgroup domain2(10G)
>                 |                 |
> bdi            wb1               wb2
> 
> /* per wb writeback info of bdi is collected */
> cat /sys/kernel/debug/bdi/252:16/wb_stats
> WbCgIno:                    1
> WbWriteback:                0 kB
> WbReclaimable:              0 kB
> WbDirtyThresh:              0 kB
> WbDirtied:                  0 kB
> WbWritten:                  0 kB
> WbWriteBandwidth:      102400 kBps
> b_dirty:                    0
> b_io:                       0
> b_more_io:                  0
> b_dirty_time:               0
> state:                      1

Maybe some whitespace or something between entries would improve
readability?

> WbCgIno:                 4094
> WbWriteback:            54432 kB
> WbReclaimable:         766080 kB
> WbDirtyThresh:        3094760 kB
> WbDirtied:            1656480 kB
> WbWritten:             837088 kB
> WbWriteBandwidth:      132772 kBps
> b_dirty:                    1
> b_io:                       1
> b_more_io:                  0
> b_dirty_time:               0
> state:                      7
> WbCgIno:                 4135
> WbWriteback:            15232 kB
> WbReclaimable:         786688 kB
> WbDirtyThresh:        2909984 kB
> WbDirtied:            1482656 kB
> WbWritten:             681408 kB
> WbWriteBandwidth:      124848 kBps
> b_dirty:                    0
> b_io:                       1
> b_more_io:                  0
> b_dirty_time:               0
> state:                      7
> 
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
>  include/linux/writeback.h |  1 +
>  mm/backing-dev.c          | 88 +++++++++++++++++++++++++++++++++++++++
>  mm/page-writeback.c       | 19 +++++++++
>  3 files changed, 108 insertions(+)
> 
...
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 8daf950e6855..e3953db7d88d 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -103,6 +103,91 @@ static void collect_wb_stats(struct wb_stats *stats,
>  }
>  
>  #ifdef CONFIG_CGROUP_WRITEBACK
...
> +static int cgwb_debug_stats_show(struct seq_file *m, void *v)
> +{
> +	struct backing_dev_info *bdi;
> +	unsigned long background_thresh;
> +	unsigned long dirty_thresh;
> +	struct bdi_writeback *wb;
> +	struct wb_stats stats;
> +
> +	rcu_read_lock();
> +	bdi = lookup_bdi(m);
> +	if (!bdi) {
> +		rcu_read_unlock();
> +		return -EEXIST;
> +	}
> +
> +	global_dirty_limits(&background_thresh, &dirty_thresh);
> +
> +	list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) {
> +		memset(&stats, 0, sizeof(stats));
> +		stats.dirty_thresh = dirty_thresh;

If you did something like the following here, wouldn't that also zero
the rest of the structure?

		struct wb_stats stats = { .dirty_thresh = dirty_thresh };

> +		collect_wb_stats(&stats, wb);
> +

Also, similar question as before on whether you'd want to check
WB_registered or something here..

> +		if (mem_cgroup_wb_domain(wb) == NULL) {
> +			wb_stats_show(m, wb, &stats);
> +			continue;
> +		}

Can you explain what this logic is about? Is the cgwb_calc_thresh()
thing not needed in this case? A comment might help for those less
familiar with the implementation details.

BTW, I'm also wondering if something like the following is correct
and/or roughly equivalent:
	
	list_for_each_*(wb, ...) {
		struct wb_stats stats = ...;

		if (!wb_tryget(wb))
			continue;

		collect_wb_stats(&stats, wb);

		/*
		 * Extra wb_thresh magic. Drop rcu lock because ... . We
		 * can do so here because we have a ref.
		 */
		if (mem_cgroup_wb_domain(wb)) {
			rcu_read_unlock();
			stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb));
			rcu_read_lock();
		}

		wb_stats_show(m, wb, &stats)
		wb_put(wb);
	}

> +
> +		/*
> +		 * cgwb_release will destroy wb->memcg_completions which
> +		 * will be used in cgwb_calc_thresh. Use wb_tryget to prevent
> +		 * memcg_completions destruction from cgwb_release.
> +		 */
> +		if (!wb_tryget(wb))
> +			continue;
> +
> +		rcu_read_unlock();
> +		/* cgwb_calc_thresh may sleep in cgroup_rstat_flush */
> +		stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb));
> +		wb_stats_show(m, wb, &stats);
> +		rcu_read_lock();
> +		wb_put(wb);
> +	}
> +	rcu_read_unlock();
> +
> +	return 0;
> +}
> +DEFINE_SHOW_ATTRIBUTE(cgwb_debug_stats);
> +
> +static void cgwb_debug_register(struct backing_dev_info *bdi)
> +{
> +	debugfs_create_file("wb_stats", 0444, bdi->debug_dir, bdi,
> +			    &cgwb_debug_stats_fops);
> +}
> +
>  static void bdi_collect_stats(struct backing_dev_info *bdi,
>  			      struct wb_stats *stats)
>  {
> @@ -117,6 +202,8 @@ static void bdi_collect_stats(struct backing_dev_info *bdi,
>  {
>  	collect_wb_stats(stats, &bdi->wb);
>  }
> +
> +static inline void cgwb_debug_register(struct backing_dev_info *bdi) { }

Could we just create the wb_stats file regardless of whether cgwb is
enabled? Obviously there's only one wb in the !CGWB case and it's
somewhat duplicative with the bdi stats file, but that seems harmless if
the same code can be reused..? Maybe there's also a small argument for
dropping the state info from the bdi stats file and moving it to
wb_stats.

Brian

>  #endif
>  
>  static int bdi_debug_stats_show(struct seq_file *m, void *v)
> @@ -182,6 +269,7 @@ static void bdi_debug_register(struct backing_dev_info *bdi, const char *name)
>  
>  	debugfs_create_file("stats", 0444, bdi->debug_dir, bdi,
>  			    &bdi_debug_stats_fops);
> +	cgwb_debug_register(bdi);
>  }
>  
>  static void bdi_debug_unregister(struct backing_dev_info *bdi)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 0e20467367fe..3724c7525316 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -893,6 +893,25 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
>  	return __wb_calc_thresh(&gdtc, thresh);
>  }
>  
> +unsigned long cgwb_calc_thresh(struct bdi_writeback *wb)
> +{
> +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> +	struct dirty_throttle_control mdtc = { MDTC_INIT(wb, &gdtc) };
> +	unsigned long filepages, headroom, writeback;
> +
> +	gdtc.avail = global_dirtyable_memory();
> +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
> +		     global_node_page_state(NR_WRITEBACK);
> +
> +	mem_cgroup_wb_stats(wb, &filepages, &headroom,
> +			    &mdtc.dirty, &writeback);
> +	mdtc.dirty += writeback;
> +	mdtc_calc_avail(&mdtc, filepages, headroom);
> +	domain_dirty_limits(&mdtc);
> +
> +	return __wb_calc_thresh(&mdtc, mdtc.thresh);
> +}
> +
>  /*
>   *                           setpoint - dirty 3
>   *        f(dirty) := 1.0 + (----------------)
> -- 
> 2.30.0
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/6] writeback: protect race between bdi release and bdi_debug_stats_show
  2024-03-28 17:53   ` Brian Foster
@ 2024-04-03  2:16     ` Kemeng Shi
  0 siblings, 0 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-04-03  2:16 UTC (permalink / raw)
  To: Brian Foster
  Cc: akpm, willy, jack, tj, dsterba, mjguzik, dhowells, linux-kernel,
	linux-mm, linux-fsdevel



on 3/29/2024 1:53 AM, Brian Foster wrote:
> On Wed, Mar 27, 2024 at 11:57:46PM +0800, Kemeng Shi wrote:
>> There is a race between bdi release and bdi_debug_stats_show:
>> /* get debug info */		/* bdi release */
>> bdi_debug_stats_show
>>   bdi = m->private;
>>   wb = &bdi->wb;
>> 				bdi_unregister
>> 				bdi_put
>> 				  release_bdi
>> 				    kfree(bdi)
>>   /* use after free */
>>   spin_lock(&wb->list_lock);
>>
> 
> Maybe I'm missing something, but it looks to me that
> bdi_debug_unregister() can't complete until active readers of associated
> debugfs files have completed. For example, see __debugfs_file_removed()
> and use of ->active_users[_drained]. Once the dentry is unlinked,
> further reads fail (I think) via debugfs_file_get(). Hm?
Sorry for missing this. The race does not seem possible since debugfs
prevents it. Thanks for the information. I will drop this in the next version.

Kemeng

> 
> Brian
> 
>> Search bdi on bdi_list under rcu lock in bdi_debug_stats_show to ensure
>> the bdi is not freed to fix the issue.
>>
>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>> ---
>>  mm/backing-dev.c | 33 +++++++++++++++++++++++++++++++--
>>  1 file changed, 31 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>> index 5f2be8c8df11..70f02959f3bd 100644
>> --- a/mm/backing-dev.c
>> +++ b/mm/backing-dev.c
>> @@ -46,16 +46,44 @@ static void bdi_debug_init(void)
>>  	bdi_debug_root = debugfs_create_dir("bdi", NULL);
>>  }
>>  
>> -static int bdi_debug_stats_show(struct seq_file *m, void *v)
>> +static struct backing_dev_info *lookup_bdi(struct seq_file *m)
>>  {
>> +	const struct file *file = m->file;
>>  	struct backing_dev_info *bdi = m->private;
>> -	struct bdi_writeback *wb = &bdi->wb;
>> +	struct backing_dev_info *exist;
>> +
>> +	list_for_each_entry_rcu(exist, &bdi_list, bdi_list) {
>> +		if (exist != bdi)
>> +			continue;
>> +
>> +		if (exist->debug_dir == file->f_path.dentry->d_parent)
>> +			return bdi;
>> +		else
>> +			return NULL;
>> +	}
>> +
>> +	return NULL;
>> +}
>> +
>> +
>> +static int bdi_debug_stats_show(struct seq_file *m, void *v)
>> +{
>> +	struct backing_dev_info *bdi;
>> +	struct bdi_writeback *wb;
>>  	unsigned long background_thresh;
>>  	unsigned long dirty_thresh;
>>  	unsigned long wb_thresh;
>>  	unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time;
>>  	struct inode *inode;
>>  
>> +	rcu_read_lock();
>> +	bdi = lookup_bdi(m);
>> +	if (!bdi) {
>> +		rcu_read_unlock();
>> +		return -EEXIST;
>> +	}
>> +
>> +	wb = &bdi->wb;
>>  	nr_dirty = nr_io = nr_more_io = nr_dirty_time = 0;
>>  	spin_lock(&wb->list_lock);
>>  	list_for_each_entry(inode, &wb->b_dirty, i_io_list)
>> @@ -101,6 +129,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>>  		   nr_dirty_time,
>>  		   !list_empty(&bdi->bdi_list), bdi->wb.state);
>>  
>> +	rcu_read_unlock();
>>  	return 0;
>>  }
>>  DEFINE_SHOW_ATTRIBUTE(bdi_debug_stats);
>> -- 
>> 2.30.0
>>
> 
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-03-28 19:24 ` Kent Overstreet
  2024-03-28 19:31   ` Tejun Heo
@ 2024-04-03  6:56   ` Kemeng Shi
  1 sibling, 0 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-04-03  6:56 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: akpm, willy, jack, bfoster, tj, dsterba, mjguzik, dhowells,
	linux-kernel, linux-mm, linux-fsdevel



on 3/29/2024 3:24 AM, Kent Overstreet wrote:
> On Wed, Mar 27, 2024 at 11:57:45PM +0800, Kemeng Shi wrote:
>> v1->v2:
>> -Send cleanup to wq_monitor.py separately.
>> -Add patch to avoid use after free of bdi.
>> -Rename wb_calc_cg_thresh to cgwb_calc_thresh as Tejun suggested.
>> -Use rcu walk to avoid use after free.
>> -Add debug output to each related patches.
>>
>> This series tries to improve visibility of writeback. Patch 1 make
>> /sys/kernel/debug/bdi/xxx/stats show writeback info of whole bdi
>> instead of only writeback info in root cgroup. Patch 2 add a new
>> debug file /sys/kernel/debug/bdi/xxx/wb_stats to show per wb writeback
>> info. Patch 4 add wb_monitor.py to monitor basic writeback info
>> of running system, more info could be added on demand. Rest patches
>> are some random cleanups. More details can be found in respective
>> patches. Thanks!
>> This series is on top of patchset [1].
>>
>> [1] https://lore.kernel.org/lkml/20240123183332.876854-1-shikemeng@huaweicloud.com/T/#mc6455784a63d0f8aa1a2f5aff325abcdf9336b76
> 
> Not bad
> 
Hi Kent,
> I've been trying to improve our ability to debug latency issues - stalls
> of all sorts. While you're looking at all this code, do you think you
> could find some places to collect useful latency numbers?
I would like to do it to collect more useful info for writeback.
> 
> fs/bcachefs/time_stats.c has some code that's going to be moving out to
> lib/ at some point, after I switch it to MAD; if you could hook that up
> as well to a few points we could see at a glance if there are stalls
> happening in the writeback path.
I see that Tejun recommends using bpf. I don't know much about bpf or the
new approach in time_stats.c. Personally, I think it's better to use the
new approach after the work of moving the code to lib/ is merged. Then I
would like to submit a patchset to discuss using it in writeback. Would
this make sense to you? Looking forward to your reply!

Thanks,
Kemeng
> 
>>
>> Following domain hierarchy is tested:
>>                 global domain (320G)
>>                 /                 \
>>         cgroup domain1(10G)     cgroup domain2(10G)
>>                 |                 |
>> bdi            wb1               wb2
>>
>> /* all writeback info of bdi is successfully collected */
>> # cat /sys/kernel/debug/bdi/252:16/stats:
>> BdiWriteback:              448 kB
>> BdiReclaimable:        1303904 kB
>> BdiDirtyThresh:      189914124 kB
>> DirtyThresh:         195337564 kB
>> BackgroundThresh:     32516508 kB
>> BdiDirtied:            3591392 kB
>> BdiWritten:            2287488 kB
>> BdiWriteBandwidth:      322248 kBps
>> b_dirty:                     0
>> b_io:                        0
>> b_more_io:                   2
>> b_dirty_time:                0
>> bdi_list:                    1
>> state:                       1
>>
>> /* per wb writeback info is collected */
>> # cat /sys/kernel/debug/bdi/252:16/wb_stats:
>> cat wb_stats
>> WbCgIno:                    1
>> WbWriteback:                0 kB
>> WbReclaimable:              0 kB
>> WbDirtyThresh:              0 kB
>> WbDirtied:                  0 kB
>> WbWritten:                  0 kB
>> WbWriteBandwidth:      102400 kBps
>> b_dirty:                    0
>> b_io:                       0
>> b_more_io:                  0
>> b_dirty_time:               0
>> state:                      1
>> WbCgIno:                 4284
>> WbWriteback:              448 kB
>> WbReclaimable:         818944 kB
>> WbDirtyThresh:        3096524 kB
>> WbDirtied:            2266880 kB
>> WbWritten:            1447936 kB
>> WbWriteBandwidth:      214036 kBps
>> b_dirty:                    0
>> b_io:                       0
>> b_more_io:                  1
>> b_dirty_time:               0
>> state:                      5
>> WbCgIno:                 4325
>> WbWriteback:              224 kB
>> WbReclaimable:         819392 kB
>> WbDirtyThresh:        2920088 kB
>> WbDirtied:            2551808 kB
>> WbWritten:            1732416 kB
>> WbWriteBandwidth:      201832 kBps
>> b_dirty:                    0
>> b_io:                       0
>> b_more_io:                  1
>> b_dirty_time:               0
>> state:                      5
>>
>> /* monitor writeback info */
>> # ./wb_monitor.py 252:16 -c
>>                   writeback  reclaimable   dirtied   written    avg_bw
>> 252:16_1                  0            0         0         0    102400
>> 252:16_4284             672       820064   9230368   8410304    685612
>> 252:16_4325             896       819840  10491264   9671648    652348
>> 252:16                 1568      1639904  19721632  18081952   1440360
>>
>>
>>                   writeback  reclaimable   dirtied   written    avg_bw
>> 252:16_1                  0            0         0         0    102400
>> 252:16_4284             672       820064   9230368   8410304    685612
>> 252:16_4325             896       819840  10491264   9671648    652348
>> 252:16                 1568      1639904  19721632  18081952   1440360
>> ...
>>
>> Kemeng Shi (6):
>>   writeback: protect race between bdi release and bdi_debug_stats_show
>>   writeback: collect stats of all wb of bdi in bdi_debug_stats_show
>>   writeback: support retrieving per group debug writeback stats of bdi
>>   writeback: add wb_monitor.py script to monitor writeback info on bdi
>>   writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages
>>   writeback: define GDTC_INIT_NO_WB to null
>>
>>  include/linux/writeback.h     |   1 +
>>  mm/backing-dev.c              | 203 ++++++++++++++++++++++++++++++----
>>  mm/page-writeback.c           |  31 ++++--
>>  tools/writeback/wb_monitor.py | 172 ++++++++++++++++++++++++++++
>>  4 files changed, 378 insertions(+), 29 deletions(-)
>>  create mode 100644 tools/writeback/wb_monitor.py
>>
>> -- 
>> 2.30.0
>>
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 2/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show
  2024-03-29 13:04   ` Brian Foster
@ 2024-04-03  7:49     ` Kemeng Shi
  0 siblings, 0 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-04-03  7:49 UTC (permalink / raw)
  To: Brian Foster
  Cc: akpm, willy, jack, tj, dsterba, mjguzik, dhowells, linux-kernel,
	linux-mm, linux-fsdevel



on 3/29/2024 9:04 PM, Brian Foster wrote:
> On Wed, Mar 27, 2024 at 11:57:47PM +0800, Kemeng Shi wrote:
>> /sys/kernel/debug/bdi/xxx/stats is supposed to show writeback information
>> of whole bdi, but only writeback information of bdi in root cgroup is
>> collected. So writeback information in non-root cgroup are missing now.
>> To be more specific, considering following case:
>>
>> /* create writeback cgroup */
>> cd /sys/fs/cgroup
>> echo "+memory +io" > cgroup.subtree_control
>> mkdir group1
>> cd group1
>> echo $$ > cgroup.procs
>> /* do writeback in cgroup */
>> fio -name test -filename=/dev/vdb ...
>> /* get writeback info of bdi */
>> cat /sys/kernel/debug/bdi/xxx/stats
>> The cat result unexpectedly implies that there is no writeback on target
>> bdi.
>>
>> Fix this by collecting stats of all wb in bdi instead of only wb in
>> root cgroup.
>>
>> Following domain hierarchy is tested:
>>                 global domain (320G)
>>                 /                 \
>>         cgroup domain1(10G)     cgroup domain2(10G)
>>                 |                 |
>> bdi            wb1               wb2
>>
>> /* all writeback info of bdi is successfully collected */
>> cat stats
>> BdiWriteback:             2912 kB
>> BdiReclaimable:        1598464 kB
>> BdiDirtyThresh:      167479028 kB
>> DirtyThresh:         195038532 kB
>> BackgroundThresh:     32466728 kB
>> BdiDirtied:           19141696 kB
>> BdiWritten:           17543456 kB
>> BdiWriteBandwidth:     1136172 kBps
>> b_dirty:                     2
>> b_io:                        0
>> b_more_io:                   1
>> b_dirty_time:                0
>> bdi_list:                    1
>> state:                       1
>>
>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>> ---
>>  mm/backing-dev.c | 100 +++++++++++++++++++++++++++++++++--------------
>>  1 file changed, 71 insertions(+), 29 deletions(-)
>>
>> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>> index 70f02959f3bd..8daf950e6855 100644
>> --- a/mm/backing-dev.c
>> +++ b/mm/backing-dev.c
> ...
>> @@ -65,16 +78,54 @@ static struct backing_dev_info *lookup_bdi(struct seq_file *m)
>>  	return NULL;
>>  }
>>  
>> +static void collect_wb_stats(struct wb_stats *stats,
>> +			     struct bdi_writeback *wb)
>> +{
>> +	struct inode *inode;
>> +
>> +	spin_lock(&wb->list_lock);
>> +	list_for_each_entry(inode, &wb->b_dirty, i_io_list)
>> +		stats->nr_dirty++;
>> +	list_for_each_entry(inode, &wb->b_io, i_io_list)
>> +		stats->nr_io++;
>> +	list_for_each_entry(inode, &wb->b_more_io, i_io_list)
>> +		stats->nr_more_io++;
>> +	list_for_each_entry(inode, &wb->b_dirty_time, i_io_list)
>> +		if (inode->i_state & I_DIRTY_TIME)
>> +			stats->nr_dirty_time++;
>> +	spin_unlock(&wb->list_lock);
>> +
>> +	stats->nr_writeback += wb_stat(wb, WB_WRITEBACK);
>> +	stats->nr_reclaimable += wb_stat(wb, WB_RECLAIMABLE);
>> +	stats->nr_dirtied += wb_stat(wb, WB_DIRTIED);
>> +	stats->nr_written += wb_stat(wb, WB_WRITTEN);
>> +	stats->wb_thresh += wb_calc_thresh(wb, stats->dirty_thresh);
> 
> Kinda nitty question, but is this a sum of per-wb writeback thresholds?
> If so, do you consider that useful information vs. the per-wb threshold
> data presumably exposed in the next patch?
It's the sum of the per-wb writeback thresholds in the global domain. For each
wb, its threshold is the min of the threshold in the global domain and the
threshold in the cgroup domain (if any). As the debug data of the bdi existed
before writeback cgroups were introduced, it would be better to show the bdi
threshold in the global domain, which is more compatible.
> 
> I'm not really that worried about what debug data we expose, it just
> seems a little odd. How would you document this value in a sentence or
> two, for example?
I think it could simply be "threshold of bdi in global domain".
> 
>> +}
>> +
>> +#ifdef CONFIG_CGROUP_WRITEBACK
>> +static void bdi_collect_stats(struct backing_dev_info *bdi,
>> +			      struct wb_stats *stats)
>> +{
>> +	struct bdi_writeback *wb;
>> +
>> +	list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
>> +		collect_wb_stats(stats, wb);
> 
> Depending on discussion on the previous patch and whether the higher
> level rcu protection in bdi_debug_stats_show() is really necessary, it
> might make more sense to move it here.
Sure, will do it in the next version.
> 
> I'm also wondering if you'd want to check the state of the individual wb
> (i.e. WB_registered?) before reading it..?
I think it's better to keep the full debug info. The user could filter it out
with the state in the debug info anyway.
> 
>> +}
>> +#else
>> +static void bdi_collect_stats(struct backing_dev_info *bdi,
>> +			      struct wb_stats *stats)
>> +{
>> +	collect_wb_stats(stats, &bdi->wb);
>> +}
>> +#endif
> ...
>> @@ -115,18 +157,18 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>>  		   "b_dirty_time:       %10lu\n"
>>  		   "bdi_list:           %10u\n"
>>  		   "state:              %10lx\n",
>> -		   (unsigned long) K(wb_stat(wb, WB_WRITEBACK)),
>> -		   (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)),
>> -		   K(wb_thresh),
>> +		   K(stats.nr_writeback),
>> +		   K(stats.nr_reclaimable),
>> +		   K(stats.wb_thresh),
>>  		   K(dirty_thresh),
>>  		   K(background_thresh),
>> -		   (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
>> -		   (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
>> -		   (unsigned long) K(wb->write_bandwidth),
>> -		   nr_dirty,
>> -		   nr_io,
>> -		   nr_more_io,
>> -		   nr_dirty_time,
>> +		   K(stats.nr_dirtied),
>> +		   K(stats.nr_written),
>> +		   K(tot_bw),
>> +		   stats.nr_dirty,
>> +		   stats.nr_io,
>> +		   stats.nr_more_io,
>> +		   stats.nr_dirty_time,
>>  		   !list_empty(&bdi->bdi_list), bdi->wb.state);
> 
> Is it worth showing a list count here rather than list_empty() state?
Actually, I don't know how this info was supposed to be used, so I kept it
the old way for compatibility...
As for the bdi count, it would be easy to retrieve by counting the number of
bdis under /sys/kernel/debug/bdi/.
As for the wb count, it would be easy to count with wb_stats.
So I still prefer to keep it the old way for compatibility, or just simply
remove it if the list_empty() state is not needed.
Thanks for the review and all the advice. Looking forward to your reply.

Kemeng

> 
> Brian
> 
>>  
>>  	rcu_read_unlock();
>> -- 
>> 2.30.0
>>
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 3/6] writeback: support retrieving per group debug writeback stats of bdi
  2024-03-29 13:10   ` Brian Foster
@ 2024-04-03  8:49     ` Kemeng Shi
  2024-04-03 15:04       ` Brian Foster
  0 siblings, 1 reply; 38+ messages in thread
From: Kemeng Shi @ 2024-04-03  8:49 UTC (permalink / raw)
  To: Brian Foster
  Cc: akpm, willy, jack, tj, dsterba, mjguzik, dhowells, linux-kernel,
	linux-mm, linux-fsdevel



on 3/29/2024 9:10 PM, Brian Foster wrote:
> On Wed, Mar 27, 2024 at 11:57:48PM +0800, Kemeng Shi wrote:
>> Add /sys/kernel/debug/bdi/xxx/wb_stats to show per group writeback stats
>> of bdi.
>>
> 
> Hi Kemeng,
Hello Brian,
> 
> Just a few random thoughts/comments..
> 
>> Following domain hierarchy is tested:
>>                 global domain (320G)
>>                 /                 \
>>         cgroup domain1(10G)     cgroup domain2(10G)
>>                 |                 |
>> bdi            wb1               wb2
>>
>> /* per wb writeback info of bdi is collected */
>> cat /sys/kernel/debug/bdi/252:16/wb_stats
>> WbCgIno:                    1
>> WbWriteback:                0 kB
>> WbReclaimable:              0 kB
>> WbDirtyThresh:              0 kB
>> WbDirtied:                  0 kB
>> WbWritten:                  0 kB
>> WbWriteBandwidth:      102400 kBps
>> b_dirty:                    0
>> b_io:                       0
>> b_more_io:                  0
>> b_dirty_time:               0
>> state:                      1
> 
> Maybe some whitespace or something between entries would improve
> readability?
Sure, I will add some whitespace in the next version.
> 
>> WbCgIno:                 4094
>> WbWriteback:            54432 kB
>> WbReclaimable:         766080 kB
>> WbDirtyThresh:        3094760 kB
>> WbDirtied:            1656480 kB
>> WbWritten:             837088 kB
>> WbWriteBandwidth:      132772 kBps
>> b_dirty:                    1
>> b_io:                       1
>> b_more_io:                  0
>> b_dirty_time:               0
>> state:                      7
>> WbCgIno:                 4135
>> WbWriteback:            15232 kB
>> WbReclaimable:         786688 kB
>> WbDirtyThresh:        2909984 kB
>> WbDirtied:            1482656 kB
>> WbWritten:             681408 kB
>> WbWriteBandwidth:      124848 kBps
>> b_dirty:                    0
>> b_io:                       1
>> b_more_io:                  0
>> b_dirty_time:               0
>> state:                      7
>>
>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>> ---
>>  include/linux/writeback.h |  1 +
>>  mm/backing-dev.c          | 88 +++++++++++++++++++++++++++++++++++++++
>>  mm/page-writeback.c       | 19 +++++++++
>>  3 files changed, 108 insertions(+)
>>
> ...
>> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>> index 8daf950e6855..e3953db7d88d 100644
>> --- a/mm/backing-dev.c
>> +++ b/mm/backing-dev.c
>> @@ -103,6 +103,91 @@ static void collect_wb_stats(struct wb_stats *stats,
>>  }
>>  
>>  #ifdef CONFIG_CGROUP_WRITEBACK
> ...
>> +static int cgwb_debug_stats_show(struct seq_file *m, void *v)
>> +{
>> +	struct backing_dev_info *bdi;
>> +	unsigned long background_thresh;
>> +	unsigned long dirty_thresh;
>> +	struct bdi_writeback *wb;
>> +	struct wb_stats stats;
>> +
>> +	rcu_read_lock();
>> +	bdi = lookup_bdi(m);
>> +	if (!bdi) {
>> +		rcu_read_unlock();
>> +		return -EEXIST;
>> +	}
>> +
>> +	global_dirty_limits(&background_thresh, &dirty_thresh);
>> +
>> +	list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) {
>> +		memset(&stats, 0, sizeof(stats));
>> +		stats.dirty_thresh = dirty_thresh;
> 
> If you did something like the following here, wouldn't that also zero
> the rest of the structure?
> 
> 		struct wb_stats stats = { .dirty_thresh = dirty_thresh };
> 
Sure, will do it in the next version.
>> +		collect_wb_stats(&stats, wb);
>> +
> 
> Also, similar question as before on whether you'd want to check
> WB_registered or something here..
Still prefer to keep the full debug info; the user could filter it out on
demand.
> 
>> +		if (mem_cgroup_wb_domain(wb) == NULL) {
>> +			wb_stats_show(m, wb, &stats);
>> +			continue;
>> +		}
> 
> Can you explain what this logic is about? Is the cgwb_calc_thresh()
> thing not needed in this case? A comment might help for those less
> familiar with the implementation details.
If mem_cgroup_wb_domain(wb) is NULL, then it's bdi->wb; otherwise,
it's a wb in a cgroup. For bdi->wb, there is no need to do wb_tryget
and cgwb_calc_thresh. Will add a comment in the next version.
> 
> BTW, I'm also wondering if something like the following is correct
> and/or roughly equivalent:
> 	
> 	list_for_each_*(wb, ...) {
> 		struct wb_stats stats = ...;
> 
> 		if (!wb_tryget(wb))
> 			continue;
> 
> 		collect_wb_stats(&stats, wb);
> 
> 		/*
> 		 * Extra wb_thresh magic. Drop rcu lock because ... . We
> 		 * can do so here because we have a ref.
> 		 */
> 		if (mem_cgroup_wb_domain(wb)) {
> 			rcu_read_unlock();
> 			stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb));
> 			rcu_read_lock();
> 		}
> 
> 		wb_stats_show(m, wb, &stats)
> 		wb_put(wb);
> 	}
It's correct, as wb_tryget on bdi->wb does no harm. I had considered
doing it that way, but I changed my mind and did it the new way for
two reasons:
1. It keeps the code handling wbs in cgroups more tightly grouped, which
could be easier to maintain.
2. It removes the extra wb_tryget/wb_put for the wb of the bdi.
Would this make sense to you?
> 
>> +
>> +		/*
>> +		 * cgwb_release will destroy wb->memcg_completions which
>> +		 * will be used in cgwb_calc_thresh. Use wb_tryget to prevent
>> +		 * memcg_completions destruction from cgwb_release.
>> +		 */
>> +		if (!wb_tryget(wb))
>> +			continue;
>> +
>> +		rcu_read_unlock();
>> +		/* cgwb_calc_thresh may sleep in cgroup_rstat_flush */
>> +		stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb));
>> +		wb_stats_show(m, wb, &stats);
>> +		rcu_read_lock();
>> +		wb_put(wb);
>> +	}
>> +	rcu_read_unlock();
>> +
>> +	return 0;
>> +}
>> +DEFINE_SHOW_ATTRIBUTE(cgwb_debug_stats);
>> +
>> +static void cgwb_debug_register(struct backing_dev_info *bdi)
>> +{
>> +	debugfs_create_file("wb_stats", 0444, bdi->debug_dir, bdi,
>> +			    &cgwb_debug_stats_fops);
>> +}
>> +
>>  static void bdi_collect_stats(struct backing_dev_info *bdi,
>>  			      struct wb_stats *stats)
>>  {
>> @@ -117,6 +202,8 @@ static void bdi_collect_stats(struct backing_dev_info *bdi,
>>  {
>>  	collect_wb_stats(stats, &bdi->wb);
>>  }
>> +
>> +static inline void cgwb_debug_register(struct backing_dev_info *bdi) { }
> 
> Could we just create the wb_stats file regardless of whether cgwb is
> enabled? Obviously theres only one wb in the !CGWB case and it's
> somewhat duplicative with the bdi stats file, but that seems harmless if
> the same code can be reused..? Maybe there's also a small argument for
> dropping the state info from the bdi stats file and moving it to
> wb_stats.

In backing-dev.c, there are a lot of "#ifdef CGWB .. #else .. #endif" blocks to
avoid unneeded extra cost when CGWB is not enabled.
I think it's better to avoid the extra cost from wb_stats when CGWB is not
enabled. For now, we only save the cpu cost to create and destroy wb_stats
and the memory cost to record the debugfs file; we could save more in the
future when wb_stats records more debug info.
Moving the state info from bdi stats to wb_stats makes sense to me. The only
concern would be a compatibility problem. I will add a new patch for this
to make it more noticeable and easier to revert.
Thanks a lot for the review!

Kemeng
> 
> Brian
> 
>>  #endif
>>  
>>  static int bdi_debug_stats_show(struct seq_file *m, void *v)
>> @@ -182,6 +269,7 @@ static void bdi_debug_register(struct backing_dev_info *bdi, const char *name)
>>  
>>  	debugfs_create_file("stats", 0444, bdi->debug_dir, bdi,
>>  			    &bdi_debug_stats_fops);
>> +	cgwb_debug_register(bdi);
>>  }
>>  
>>  static void bdi_debug_unregister(struct backing_dev_info *bdi)
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 0e20467367fe..3724c7525316 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -893,6 +893,25 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
>>  	return __wb_calc_thresh(&gdtc, thresh);
>>  }
>>  
>> +unsigned long cgwb_calc_thresh(struct bdi_writeback *wb)
>> +{
>> +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
>> +	struct dirty_throttle_control mdtc = { MDTC_INIT(wb, &gdtc) };
>> +	unsigned long filepages, headroom, writeback;
>> +
>> +	gdtc.avail = global_dirtyable_memory();
>> +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
>> +		     global_node_page_state(NR_WRITEBACK);
>> +
>> +	mem_cgroup_wb_stats(wb, &filepages, &headroom,
>> +			    &mdtc.dirty, &writeback);
>> +	mdtc.dirty += writeback;
>> +	mdtc_calc_avail(&mdtc, filepages, headroom);
>> +	domain_dirty_limits(&mdtc);
>> +
>> +	return __wb_calc_thresh(&mdtc, mdtc.thresh);
>> +}
>> +
>>  /*
>>   *                           setpoint - dirty 3
>>   *        f(dirty) := 1.0 + (----------------)
>> -- 
>> 2.30.0
>>
> 
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 3/6] writeback: support retrieving per group debug writeback stats of bdi
  2024-04-03  8:49     ` Kemeng Shi
@ 2024-04-03 15:04       ` Brian Foster
  2024-04-04  9:07         ` Jan Kara
  2024-04-07  2:48         ` Kemeng Shi
  0 siblings, 2 replies; 38+ messages in thread
From: Brian Foster @ 2024-04-03 15:04 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: akpm, willy, jack, tj, dsterba, mjguzik, dhowells, linux-kernel,
	linux-mm, linux-fsdevel

On Wed, Apr 03, 2024 at 04:49:42PM +0800, Kemeng Shi wrote:
> 
> 
> on 3/29/2024 9:10 PM, Brian Foster wrote:
> > On Wed, Mar 27, 2024 at 11:57:48PM +0800, Kemeng Shi wrote:
> >> Add /sys/kernel/debug/bdi/xxx/wb_stats to show per group writeback stats
> >> of bdi.
> >>
> > 
> > Hi Kemeng,
> Hello Brian,
> > 
> > Just a few random thoughts/comments..
> > 
> >> Following domain hierarchy is tested:
> >>                 global domain (320G)
> >>                 /                 \
> >>         cgroup domain1(10G)     cgroup domain2(10G)
> >>                 |                 |
> >> bdi            wb1               wb2
> >>
> >> /* per wb writeback info of bdi is collected */
> >> cat /sys/kernel/debug/bdi/252:16/wb_stats
> >> WbCgIno:                    1
> >> WbWriteback:                0 kB
> >> WbReclaimable:              0 kB
> >> WbDirtyThresh:              0 kB
> >> WbDirtied:                  0 kB
> >> WbWritten:                  0 kB
> >> WbWriteBandwidth:      102400 kBps
> >> b_dirty:                    0
> >> b_io:                       0
> >> b_more_io:                  0
> >> b_dirty_time:               0
> >> state:                      1
> > 
> > Maybe some whitespace or something between entries would improve
> > readability?
> Sure, I will add some whitespace in the next version.
> > 
> >> WbCgIno:                 4094
> >> WbWriteback:            54432 kB
> >> WbReclaimable:         766080 kB
> >> WbDirtyThresh:        3094760 kB
> >> WbDirtied:            1656480 kB
> >> WbWritten:             837088 kB
> >> WbWriteBandwidth:      132772 kBps
> >> b_dirty:                    1
> >> b_io:                       1
> >> b_more_io:                  0
> >> b_dirty_time:               0
> >> state:                      7
> >> WbCgIno:                 4135
> >> WbWriteback:            15232 kB
> >> WbReclaimable:         786688 kB
> >> WbDirtyThresh:        2909984 kB
> >> WbDirtied:            1482656 kB
> >> WbWritten:             681408 kB
> >> WbWriteBandwidth:      124848 kBps
> >> b_dirty:                    0
> >> b_io:                       1
> >> b_more_io:                  0
> >> b_dirty_time:               0
> >> state:                      7
> >>
> >> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> >> ---
> >>  include/linux/writeback.h |  1 +
> >>  mm/backing-dev.c          | 88 +++++++++++++++++++++++++++++++++++++++
> >>  mm/page-writeback.c       | 19 +++++++++
> >>  3 files changed, 108 insertions(+)
> >>
> > ...
> >> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> >> index 8daf950e6855..e3953db7d88d 100644
> >> --- a/mm/backing-dev.c
> >> +++ b/mm/backing-dev.c
> >> @@ -103,6 +103,91 @@ static void collect_wb_stats(struct wb_stats *stats,
> >>  }
> >>  
> >>  #ifdef CONFIG_CGROUP_WRITEBACK
> > ...
> >> +static int cgwb_debug_stats_show(struct seq_file *m, void *v)
> >> +{
> >> +	struct backing_dev_info *bdi;
> >> +	unsigned long background_thresh;
> >> +	unsigned long dirty_thresh;
> >> +	struct bdi_writeback *wb;
> >> +	struct wb_stats stats;
> >> +
> >> +	rcu_read_lock();
> >> +	bdi = lookup_bdi(m);
> >> +	if (!bdi) {
> >> +		rcu_read_unlock();
> >> +		return -EEXIST;
> >> +	}
> >> +
> >> +	global_dirty_limits(&background_thresh, &dirty_thresh);
> >> +
> >> +	list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) {
> >> +		memset(&stats, 0, sizeof(stats));
> >> +		stats.dirty_thresh = dirty_thresh;
> > 
> > If you did something like the following here, wouldn't that also zero
> > the rest of the structure?
> > 
> > 		struct wb_stats stats = { .dirty_thresh = dirty_thresh };
> > 
> Sure, will do it in next version.
> >> +		collect_wb_stats(&stats, wb);
> >> +
> > 
> > Also, similar question as before on whether you'd want to check
> > WB_registered or something here..
> I still prefer to keep the full debug info; users can filter it out on
> demand.

Ok. I was more wondering if that was needed for correctness. If not,
then that seems fair enough to me.

> > 
> >> +		if (mem_cgroup_wb_domain(wb) == NULL) {
> >> +			wb_stats_show(m, wb, &stats);
> >> +			continue;
> >> +		}
> > 
> > Can you explain what this logic is about? Is the cgwb_calc_thresh()
> > thing not needed in this case? A comment might help for those less
> > familiar with the implementation details.
> If mem_cgroup_wb_domain(wb) is NULL, then it's bdi->wb, otherwise,
> it's wb in cgroup. For bdi->wb, there is no need to do wb_tryget
> and cgwb_calc_thresh. Will add some comment in next version.
> > 
> > BTW, I'm also wondering if something like the following is correct
> > and/or roughly equivalent:
> > 	
> > 	list_for_each_*(wb, ...) {
> > 		struct wb_stats stats = ...;
> > 
> > 		if (!wb_tryget(wb))
> > 			continue;
> > 
> > 		collect_wb_stats(&stats, wb);
> > 
> > 		/*
> > 		 * Extra wb_thresh magic. Drop rcu lock because ... . We
> > 		 * can do so here because we have a ref.
> > 		 */
> > 		if (mem_cgroup_wb_domain(wb)) {
> > 			rcu_read_unlock();
> > 			stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb));
> > 			rcu_read_lock();
> > 		}
> > 
> > 		wb_stats_show(m, wb, &stats)
> > 		wb_put(wb);
> > 	}
> It's correct, as wb_tryget on bdi->wb does no harm. I considered doing
> it this way but changed my mind for two reasons:
> 1. Keep the code handling cgroup wbs together, which is easier to
> maintain.
> 2. Remove the extra wb_tryget/wb_put for the bdi's own wb.
> Would this make sense to you?

Ok, well, assuming it is correct, the above logic is a bit simpler and
more readable to me. I think you'd just need to fill in the comment around
the wb_thresh thing rather than, say, having to explain that we don't need
to ref bdi->wb even though it doesn't seem to matter.
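
Just to illustrate where I'd put it (rough and untested, only reusing
names from this patch and from the snippet above), the comment would sit
right on the wb_thresh special case:

	list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) {
		struct wb_stats stats = { .dirty_thresh = dirty_thresh };

		if (!wb_tryget(wb))
			continue;

		collect_wb_stats(&stats, wb);

		/*
		 * Only cgroup wbs have a memcg domain and hence a
		 * memcg-aware threshold. cgwb_calc_thresh() may sleep in
		 * cgroup_rstat_flush(), so drop the RCU lock around it;
		 * the reference taken above keeps the wb (and its
		 * memcg_completions) alive across the unlock.
		 */
		if (mem_cgroup_wb_domain(wb)) {
			rcu_read_unlock();
			stats.wb_thresh = min(stats.wb_thresh,
					      cgwb_calc_thresh(wb));
			rcu_read_lock();
		}

		wb_stats_show(m, wb, &stats);
		wb_put(wb);
	}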

I kind of feel the same on the wb_stats file thing below just because it
seems more consistent and available if wb_stats eventually grows more
wb-specific data.

That said, this is subjective and not hugely important so I don't insist
on either point. Maybe wait a bit and see if Jan or Tejun or somebody
has any thoughts..? If nobody else expresses explicit preference then
I'm good with it either way.

Brian

> > 
> >> +
> >> +		/*
> >> +		 * cgwb_release will destroy wb->memcg_completions which
> >> +		 * will be used in cgwb_calc_thresh. Use wb_tryget to prevent
> >> +		 * memcg_completions destruction from cgwb_release.
> >> +		 */
> >> +		if (!wb_tryget(wb))
> >> +			continue;
> >> +
> >> +		rcu_read_unlock();
> >> +		/* cgwb_calc_thresh may sleep in cgroup_rstat_flush */
> >> +		stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb));
> >> +		wb_stats_show(m, wb, &stats);
> >> +		rcu_read_lock();
> >> +		wb_put(wb);
> >> +	}
> >> +	rcu_read_unlock();
> >> +
> >> +	return 0;
> >> +}
> >> +DEFINE_SHOW_ATTRIBUTE(cgwb_debug_stats);
> >> +
> >> +static void cgwb_debug_register(struct backing_dev_info *bdi)
> >> +{
> >> +	debugfs_create_file("wb_stats", 0444, bdi->debug_dir, bdi,
> >> +			    &cgwb_debug_stats_fops);
> >> +}
> >> +
> >>  static void bdi_collect_stats(struct backing_dev_info *bdi,
> >>  			      struct wb_stats *stats)
> >>  {
> >> @@ -117,6 +202,8 @@ static void bdi_collect_stats(struct backing_dev_info *bdi,
> >>  {
> >>  	collect_wb_stats(stats, &bdi->wb);
> >>  }
> >> +
> >> +static inline void cgwb_debug_register(struct backing_dev_info *bdi) { }
> > 
> > Could we just create the wb_stats file regardless of whether cgwb is
> > enabled? Obviously there's only one wb in the !CGWB case and it's
> > somewhat duplicative with the bdi stats file, but that seems harmless if
> > the same code can be reused..? Maybe there's also a small argument for
> > dropping the state info from the bdi stats file and moving it to
> > wb_stats.
> In backing-dev.c, there are a lot of "#ifdef CGWB .. #else .. #endif"
> blocks to avoid unneeded extra cost when CGWB is not enabled.
> I think it's better to avoid the extra cost of wb_stats when CGWB is not
> enabled. For now we only save the cpu cost of creating and destroying
> wb_stats and the memory cost of the debugfs file, but we could save more
> in the future when wb_stats records more debug info.
> Moving the state info from bdi stats to wb_stats makes sense to me. The
> only concern would be a compatibility problem. I will add a new patch for
> this to make it more noticeable and easier to revert.
> Thanks a lot for the review!
> 
> Kemeng
> > 
> > Brian
> > 
> >>  #endif
> >>  
> >>  static int bdi_debug_stats_show(struct seq_file *m, void *v)
> >> @@ -182,6 +269,7 @@ static void bdi_debug_register(struct backing_dev_info *bdi, const char *name)
> >>  
> >>  	debugfs_create_file("stats", 0444, bdi->debug_dir, bdi,
> >>  			    &bdi_debug_stats_fops);
> >> +	cgwb_debug_register(bdi);
> >>  }
> >>  
> >>  static void bdi_debug_unregister(struct backing_dev_info *bdi)
> >> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> >> index 0e20467367fe..3724c7525316 100644
> >> --- a/mm/page-writeback.c
> >> +++ b/mm/page-writeback.c
> >> @@ -893,6 +893,25 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
> >>  	return __wb_calc_thresh(&gdtc, thresh);
> >>  }
> >>  
> >> +unsigned long cgwb_calc_thresh(struct bdi_writeback *wb)
> >> +{
> >> +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> >> +	struct dirty_throttle_control mdtc = { MDTC_INIT(wb, &gdtc) };
> >> +	unsigned long filepages, headroom, writeback;
> >> +
> >> +	gdtc.avail = global_dirtyable_memory();
> >> +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
> >> +		     global_node_page_state(NR_WRITEBACK);
> >> +
> >> +	mem_cgroup_wb_stats(wb, &filepages, &headroom,
> >> +			    &mdtc.dirty, &writeback);
> >> +	mdtc.dirty += writeback;
> >> +	mdtc_calc_avail(&mdtc, filepages, headroom);
> >> +	domain_dirty_limits(&mdtc);
> >> +
> >> +	return __wb_calc_thresh(&mdtc, mdtc.thresh);
> >> +}
> >> +
> >>  /*
> >>   *                           setpoint - dirty 3
> >>   *        f(dirty) := 1.0 + (----------------)
> >> -- 
> >> 2.30.0
> >>
> > 
> > 
> 



* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-03-28 20:46               ` Tejun Heo
  2024-03-28 20:53                 ` Kent Overstreet
@ 2024-04-03 16:27                 ` Jan Kara
  2024-04-03 18:44                   ` Tejun Heo
  1 sibling, 1 reply; 38+ messages in thread
From: Jan Kara @ 2024-04-03 16:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kent Overstreet, Kemeng Shi, akpm, willy, jack, bfoster, dsterba,
	mjguzik, dhowells, linux-kernel, linux-mm, linux-fsdevel

On Thu 28-03-24 10:46:33, Tejun Heo wrote:
> Hello, Kent.
> 
> On Thu, Mar 28, 2024 at 04:22:13PM -0400, Kent Overstreet wrote:
> > Most users are never going to touch tracing, let alone BPF; that's too
> > much setup. But I can and do regularly tell users "check this, this and
> > this" and debug things on that basis without ever touching their
> > machine.
> 
> I think this is where the disconnect is. It's not difficult to set up at
> all. Nowadays, in most distros, it comes down to something like run "pacman
> -S bcc" and then run "/usr/share/bcc/tools/biolatpcts" with these params or
> run this script I'm attaching. It is a significant boost when debugging many
> different kernel issues. I strongly suggest giving it a try and getting used
> to it rather than resisting it.

Yeah, BPF is great and I use it but to fill in some cases from practice,
there are sysadmins refusing to install bcc or run your BPF scripts on
their systems due to company regulations, their personal fear, or whatever.
So debugging with what you can achieve from a shell is still the thing
quite often.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-04-03 16:27                 ` Jan Kara
@ 2024-04-03 18:44                   ` Tejun Heo
  2024-04-03 19:06                     ` Kent Overstreet
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2024-04-03 18:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kent Overstreet, Kemeng Shi, akpm, willy, bfoster, dsterba,
	mjguzik, dhowells, linux-kernel, linux-mm, linux-fsdevel

Hello,

On Wed, Apr 03, 2024 at 06:27:16PM +0200, Jan Kara wrote:
> Yeah, BPF is great and I use it but to fill in some cases from practice,
> there are sysadmins refusing to install bcc or run your BPF scripts on
> their systems due to company regulations, their personal fear, or whatever.
> So debugging with what you can achieve from a shell is still the thing
> quite often.

Yeah, I mean, this happens with anything new. Tracing itself took quite a
while to be adopted widely. BPF, bcc, bpftrace are all still pretty new and
it's likely that the adoption line will keep shifting for quite a while.
Besides, even with all the new gizmos there definitely are cases where good
ol' cat interface makes sense.

So, if the static interface makes sense, we add it but we should keep in
mind that the trade-offs for adding such static infrastructure, especially
for the ones which aren't *widely* useful, are rather quickly shifting in
the less favorable direction.

Thanks.

-- 
tejun


* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-04-03 18:44                   ` Tejun Heo
@ 2024-04-03 19:06                     ` Kent Overstreet
  2024-04-03 19:21                       ` Tejun Heo
  0 siblings, 1 reply; 38+ messages in thread
From: Kent Overstreet @ 2024-04-03 19:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Kemeng Shi, akpm, willy, bfoster, dsterba, mjguzik,
	dhowells, linux-kernel, linux-mm, linux-fsdevel

On Wed, Apr 03, 2024 at 08:44:05AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Apr 03, 2024 at 06:27:16PM +0200, Jan Kara wrote:
> > Yeah, BPF is great and I use it but to fill in some cases from practice,
> > there are sysadmins refusing to install bcc or run your BPF scripts on
> > their systems due to company regulations, their personal fear, or whatever.
> > So debugging with what you can achieve from a shell is still the thing
> > quite often.
> 
> Yeah, I mean, this happens with anything new. Tracing itself took quite a
> while to be adopted widely. BPF, bcc, bpftrace are all still pretty new and
> it's likely that the adoption line will keep shifting for quite a while.
> Besides, even with all the new gizmos there definitely are cases where good
> ol' cat interface makes sense.
> 
> So, if the static interface makes sense, we add it but we should keep in
> mind that the trade-offs for adding such static infrastructure, especially
> for the ones which aren't *widely* useful, are rather quickly shifting in
> the less favorable direction.

A lot of our static debug infrastructure isn't that useful because it
just sucks.

Every time I hit a sysfs or procfs file that's just a single integer,
and nothing else, when clearly there's internal structure and
description that needs to be there I die a little inside. It's lazy and
amateurish.

I regularly debug things in bcachefs over IRC in about 5-10 minutes of
asking to check various files and pastebin them - this is my normal
process, I pretty much never have to ssh and touch the actual machines.

That's how it should be if you just make a point of making your internal
state easy to view and introspect, but when I'm debugging issues that
run into the wider block layer, or memory reclaim, we often hit a wall.

Writeback throttling was buggy for _months_, no visibility or
introspection or concerns for debugging, and that's a small chunk of
code. io_uring - had to disable it. I _still_ have people bringing
issues to me that are clearly memory reclaim related but I don't have
the tools.

It's not like any of this code exports much in the way of useful
tracepoints either, but tracepoints often just aren't what you want;
what you want is just to be able to see internal state (_without_ having to
use a debugger, because that's completely impractical outside highly
controlled environments) - and tracing is also never the first thing you
want to reach for when you have a user asking you "hey, this thing went
wonky, what's it doing?" - tracing automatically turns it into a multi
step process of decide what you want to look at, run the workload more
to collect data, iterate.

Think more about "what would make code easier to debug" and less about
"how do I shove this round peg through the square tracing/BPF slot".
There's _way_ more we could be doing that would just make our lives
easier.


* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-04-03 19:06                     ` Kent Overstreet
@ 2024-04-03 19:21                       ` Tejun Heo
  2024-04-03 22:24                         ` Kent Overstreet
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2024-04-03 19:21 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jan Kara, Kemeng Shi, akpm, willy, bfoster, dsterba, mjguzik,
	dhowells, linux-kernel, linux-mm, linux-fsdevel

Hello,

On Wed, Apr 03, 2024 at 03:06:56PM -0400, Kent Overstreet wrote:
...
> That's how it should be if you just make a point of making your internal
> state easy to view and introspect, but when I'm debugging issues that
> run into the wider block layer, or memory reclaim, we often hit a wall.

Try drgn:

  https://drgn.readthedocs.io/en/latest/

I've been adding drgn scripts under tools/ directory for introspection.
They're easy to write, deploy and ask users to run.

> Writeback throttling was buggy for _months_, no visibility or
> introspection or concerns for debugging, and that's a small chunk of
> code. io_uring - had to disable it. I _still_ have people bringing
> issues to me that are clearly memory reclaim related but I don't have
> the tools.
> 
> It's not like any of this code exports much in the way of useful
> tracepoints either, but tracepoints often just aren't what you want;
> > what you want is just to be able to see internal state (_without_ having to
> use a debugger, because that's completely impractical outside highly
> controlled environments) - and tracing is also never the first thing you
> want to reach for when you have a user asking you "hey, this thing went
> wonky, what's it doing?" - tracing automatically turns it into a multi
> step process of decide what you want to look at, run the workload more
> to collect data, iterate.
> 
> Think more about "what would make code easier to debug" and less about
> "how do I shove this round peg through the square tracing/BPF slot".
> There's _way_ more we could be doing that would just make our lives
> easier.

Maybe it'd help to classify visibility into the following categories:

1. Current state introspection.
2. Dynamic behavior tracing.
3. Accumulative behavior profiling.

drgn is great for #1. Tracing and BPF stuff is great for #2 especially when
things get complicated. #3 is the trickiest. Static stuff is useful in a lot
of cases but BPF can also be useful in other cases.

I agree that it's all about using the right tool for the problem.

-- 
tejun


* Re: [PATCH v2 0/6] Improve visibility of writeback
  2024-04-03 19:21                       ` Tejun Heo
@ 2024-04-03 22:24                         ` Kent Overstreet
  0 siblings, 0 replies; 38+ messages in thread
From: Kent Overstreet @ 2024-04-03 22:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Kemeng Shi, akpm, willy, bfoster, dsterba, mjguzik,
	dhowells, linux-kernel, linux-mm, linux-fsdevel

On Wed, Apr 03, 2024 at 09:21:37AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Apr 03, 2024 at 03:06:56PM -0400, Kent Overstreet wrote:
> ...
> > That's how it should be if you just make a point of making your internal
> > state easy to view and introspect, but when I'm debugging issues that
> > run into the wider block layer, or memory reclaim, we often hit a wall.
> 
> Try drgn:
> 
>   https://drgn.readthedocs.io/en/latest/
> 
> I've been adding drgn scripts under tools/ directory for introspection.
> They're easy to write, deploy and ask users to run.

Which is still inferior to simply writing to_text() functions for all
your objects and exposing them under sysfs/debugfs.

Plus, it's a whole new language/system for both devs and users to
learn.

And having to_text() functions makes your log and error messages way
better.
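
Concretely, the pattern is just one formatter per object, shared by
debugfs, your log messages and your error reports. A rough, untested
sketch (not actual bcachefs code; the wb fields here are an arbitrary
example):

	static void wb_to_text(struct seq_file *out, struct bdi_writeback *wb)
	{
		/* the same helper backs a debugfs show() and any log path */
		seq_printf(out, "writeback:\t%lld pages\n",
			   wb_stat(wb, WB_WRITEBACK));
		seq_printf(out, "dirtied:\t%lld pages\n",
			   wb_stat(wb, WB_DIRTIED));
		seq_printf(out, "written:\t%lld pages\n",
			   wb_stat(wb, WB_WRITTEN));
		seq_printf(out, "state:\t\t%lx\n", wb->state);
	}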

"But what about code size/overhead?" - bullshit, we're talking about a
couple percent of .text for the code itself; we blow more memory on
permanent dentries/inodes due to the way our virtual filesystems work
but that's more of a problem for tracefs.

> > Writeback throttling was buggy for _months_, no visibility or
> > introspection or concerns for debugging, and that's a small chunk of
> > code. io_uring - had to disable it. I _still_ have people bringing
> > issues to me that are clearly memory reclaim related but I don't have
> > the tools.
> > 
> > It's not like any of this code exports much in the way of useful
> > tracepoints either, but tracepoints often just aren't what you want;
> > > what you want is just to be able to see internal state (_without_ having to
> > use a debugger, because that's completely impractical outside highly
> > controlled environments) - and tracing is also never the first thing you
> > want to reach for when you have a user asking you "hey, this thing went
> > wonky, what's it doing?" - tracing automatically turns it into a multi
> > step process of decide what you want to look at, run the workload more
> > to collect data, iterate.
> > 
> > Think more about "what would make code easier to debug" and less about
> > "how do I shove this round peg through the square tracing/BPF slot".
> > There's _way_ more we could be doing that would just make our lives
> > easier.
> 
> Maybe it'd help to classify visibility into the following categories:
> 
> 1. Current state introspection.
> 2. Dynamic behavior tracing.
> 3. Accumulative behavior profiling.
> 
> drgn is great for #1. Tracing and BPF stuff is great for #2 especially when
> things get complicated. #3 is the trickiest. Static stuff is useful in a lot
> of cases but BPF can also be useful in other cases.
> 
> I agree that it's all about using the right tool for the problem.

Yeah, and you guys are all about the nerdiest and most overengineered
tools and ignoring the basics. Get the simple stuff right, /then/ if
there's stuff you still can't do, that's when you start looking at the
more complicated stuff.


* Re: [PATCH v2 3/6] writeback: support retrieving per group debug writeback stats of bdi
  2024-04-03 15:04       ` Brian Foster
@ 2024-04-04  9:07         ` Jan Kara
  2024-04-07  3:13           ` Kemeng Shi
  2024-04-07  2:48         ` Kemeng Shi
  1 sibling, 1 reply; 38+ messages in thread
From: Jan Kara @ 2024-04-04  9:07 UTC (permalink / raw)
  To: Brian Foster
  Cc: Kemeng Shi, akpm, willy, jack, tj, dsterba, mjguzik, dhowells,
	linux-kernel, linux-mm, linux-fsdevel

On Wed 03-04-24 11:04:58, Brian Foster wrote:
> On Wed, Apr 03, 2024 at 04:49:42PM +0800, Kemeng Shi wrote:
> > on 3/29/2024 9:10 PM, Brian Foster wrote:
> > > On Wed, Mar 27, 2024 at 11:57:48PM +0800, Kemeng Shi wrote:
> > >> +		collect_wb_stats(&stats, wb);
> > >> +
> > > 
> > > Also, similar question as before on whether you'd want to check
> > > WB_registered or something here..
> > I still prefer to keep the full debug info; users can filter it out on
> > demand.
> 
> Ok. I was more wondering if that was needed for correctness. If not,
> then that seems fair enough to me.
> 
> > >> +		if (mem_cgroup_wb_domain(wb) == NULL) {
> > >> +			wb_stats_show(m, wb, &stats);
> > >> +			continue;
> > >> +		}
> > > 
> > > Can you explain what this logic is about? Is the cgwb_calc_thresh()
> > > thing not needed in this case? A comment might help for those less
> > > familiar with the implementation details.
> > If mem_cgroup_wb_domain(wb) is NULL, then it's bdi->wb, otherwise,
> > it's wb in cgroup. For bdi->wb, there is no need to do wb_tryget
> > and cgwb_calc_thresh. Will add some comment in next version.
> > > 
> > > BTW, I'm also wondering if something like the following is correct
> > > and/or roughly equivalent:
> > > 	
> > > 	list_for_each_*(wb, ...) {
> > > 		struct wb_stats stats = ...;
> > > 
> > > 		if (!wb_tryget(wb))
> > > 			continue;
> > > 
> > > 		collect_wb_stats(&stats, wb);
> > > 
> > > 		/*
> > > 		 * Extra wb_thresh magic. Drop rcu lock because ... . We
> > > 		 * can do so here because we have a ref.
> > > 		 */
> > > 		if (mem_cgroup_wb_domain(wb)) {
> > > 			rcu_read_unlock();
> > > 			stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb));
> > > 			rcu_read_lock();
> > > 		}
> > > 
> > > 		wb_stats_show(m, wb, &stats)
> > > 		wb_put(wb);
> > > 	}
> > It's correct, as wb_tryget on bdi->wb does no harm. I considered doing
> > it this way but changed my mind for two reasons:
> > 1. Keep the code handling cgroup wbs together, which is easier to
> > maintain.
> > 2. Remove the extra wb_tryget/wb_put for the bdi's own wb.
> > Would this make sense to you?
> 
> Ok, well, assuming it is correct, the above logic is a bit simpler and
> more readable to me. I think you'd just need to fill in the comment around
> the wb_thresh thing rather than, say, having to explain that we don't need
> to ref bdi->wb even though it doesn't seem to matter.
> 
> I kind of feel the same on the wb_stats file thing below just because it
> seems more consistent and available if wb_stats eventually grows more
> wb-specific data.
> 
> That said, this is subjective and not hugely important so I don't insist
> on either point. Maybe wait a bit and see if Jan or Tejun or somebody
> has any thoughts..? If nobody else expresses explicit preference then
> I'm good with it either way.

No strong opinion from me really.

> > >> +static void cgwb_debug_register(struct backing_dev_info *bdi)
> > >> +{
> > >> +	debugfs_create_file("wb_stats", 0444, bdi->debug_dir, bdi,
> > >> +			    &cgwb_debug_stats_fops);
> > >> +}
> > >> +
> > >>  static void bdi_collect_stats(struct backing_dev_info *bdi,
> > >>  			      struct wb_stats *stats)
> > >>  {
> > >> @@ -117,6 +202,8 @@ static void bdi_collect_stats(struct backing_dev_info *bdi,
> > >>  {
> > >>  	collect_wb_stats(stats, &bdi->wb);
> > >>  }
> > >> +
> > >> +static inline void cgwb_debug_register(struct backing_dev_info *bdi) { }
> > > 
> > > Could we just create the wb_stats file regardless of whether cgwb is
> > > enabled? Obviously there's only one wb in the !CGWB case and it's
> > > somewhat duplicative with the bdi stats file, but that seems harmless if
> > > the same code can be reused..? Maybe there's also a small argument for
> > > dropping the state info from the bdi stats file and moving it to
> > > wb_stats.
> > In backing-dev.c, there are a lot of "#ifdef CGWB .. #else .. #endif"
> > blocks to avoid unneeded extra cost when CGWB is not enabled.
> > I think it's better to avoid the extra cost of wb_stats when CGWB is not
> > enabled. For now we only save the cpu cost of creating and destroying
> > wb_stats and the memory cost of the debugfs file, but we could save more
> > in the future when wb_stats records more debug info.

Well, there's the other side that you don't have to think whether the
kernel has CGWB enabled or not when asking a customer to gather the
writeback debug info - you can always ask for wb_stats. Also if you move
the wb->state to wb_stats only it will become inaccessible with CGWB
disabled. So I agree with Brian that it is better to provide wb_stats also
with CGWB disabled (and we can just implement wb_stats for !CGWB case with
the same function as bdi_stats).
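
To spell that out (an untested sketch; it assumes the bdi_debug_stats_fops
from earlier in this series is in scope, or gets moved so that it is), the
!CGWB stub could simply register the same handler under the wb_stats name
instead of doing nothing:

	#else	/* !CONFIG_CGROUP_WRITEBACK */

	/* Only bdi->wb exists, so the bdi-wide stats are the per-wb stats. */
	static void cgwb_debug_register(struct backing_dev_info *bdi)
	{
		debugfs_create_file("wb_stats", 0444, bdi->debug_dir, bdi,
				    &bdi_debug_stats_fops);
	}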

That being said all production kernels I have seen do have CGWB enabled so
I don't care that much about this...

> > Moving the state info from bdi stats to wb_stats makes sense to me. The
> > only concern would be a compatibility problem. I will add a new patch for
> > this to make it more noticeable and easier to revert.

Yeah, I don't think we care much about debugfs compatibility but I think
removing state from bdi_stats is not worth the inconsistency between
wb_stats and bdi_stats in the !CGWB case.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 3/6] writeback: support retrieving per group debug writeback stats of bdi
  2024-04-03 15:04       ` Brian Foster
  2024-04-04  9:07         ` Jan Kara
@ 2024-04-07  2:48         ` Kemeng Shi
  1 sibling, 0 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-04-07  2:48 UTC (permalink / raw)
  To: Brian Foster
  Cc: akpm, willy, jack, tj, dsterba, mjguzik, dhowells, linux-kernel,
	linux-mm, linux-fsdevel



on 4/3/2024 11:04 PM, Brian Foster wrote:
> On Wed, Apr 03, 2024 at 04:49:42PM +0800, Kemeng Shi wrote:
>>
>>
>> on 3/29/2024 9:10 PM, Brian Foster wrote:
>>> On Wed, Mar 27, 2024 at 11:57:48PM +0800, Kemeng Shi wrote:
>>>> Add /sys/kernel/debug/bdi/xxx/wb_stats to show per group writeback stats
>>>> of bdi.
>>>>
>>>
>>> Hi Kemeng,
>> Hello Brian,
>>>
>>> Just a few random thoughts/comments..
>>>
>>>> Following domain hierarchy is tested:
>>>>                 global domain (320G)
>>>>                 /                 \
>>>>         cgroup domain1(10G)     cgroup domain2(10G)
>>>>                 |                 |
>>>> bdi            wb1               wb2
>>>>
>>>> /* per wb writeback info of bdi is collected */
>>>> cat /sys/kernel/debug/bdi/252:16/wb_stats
>>>> WbCgIno:                    1
>>>> WbWriteback:                0 kB
>>>> WbReclaimable:              0 kB
>>>> WbDirtyThresh:              0 kB
>>>> WbDirtied:                  0 kB
>>>> WbWritten:                  0 kB
>>>> WbWriteBandwidth:      102400 kBps
>>>> b_dirty:                    0
>>>> b_io:                       0
>>>> b_more_io:                  0
>>>> b_dirty_time:               0
>>>> state:                      1
>>>
>>> Maybe some whitespace or something between entries would improve
>>> readability?
>> Sure, I will add a whitespace in next version.
>>>
>>>> WbCgIno:                 4094
>>>> WbWriteback:            54432 kB
>>>> WbReclaimable:         766080 kB
>>>> WbDirtyThresh:        3094760 kB
>>>> WbDirtied:            1656480 kB
>>>> WbWritten:             837088 kB
>>>> WbWriteBandwidth:      132772 kBps
>>>> b_dirty:                    1
>>>> b_io:                       1
>>>> b_more_io:                  0
>>>> b_dirty_time:               0
>>>> state:                      7
>>>> WbCgIno:                 4135
>>>> WbWriteback:            15232 kB
>>>> WbReclaimable:         786688 kB
>>>> WbDirtyThresh:        2909984 kB
>>>> WbDirtied:            1482656 kB
>>>> WbWritten:             681408 kB
>>>> WbWriteBandwidth:      124848 kBps
>>>> b_dirty:                    0
>>>> b_io:                       1
>>>> b_more_io:                  0
>>>> b_dirty_time:               0
>>>> state:                      7
>>>>
>>>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>>>> ---
>>>>  include/linux/writeback.h |  1 +
>>>>  mm/backing-dev.c          | 88 +++++++++++++++++++++++++++++++++++++++
>>>>  mm/page-writeback.c       | 19 +++++++++
>>>>  3 files changed, 108 insertions(+)
>>>>
>>> ...
>>>> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>>>> index 8daf950e6855..e3953db7d88d 100644
>>>> --- a/mm/backing-dev.c
>>>> +++ b/mm/backing-dev.c
>>>> @@ -103,6 +103,91 @@ static void collect_wb_stats(struct wb_stats *stats,
>>>>  }
>>>>  
>>>>  #ifdef CONFIG_CGROUP_WRITEBACK
>>> ...
>>>> +static int cgwb_debug_stats_show(struct seq_file *m, void *v)
>>>> +{
>>>> +	struct backing_dev_info *bdi;
>>>> +	unsigned long background_thresh;
>>>> +	unsigned long dirty_thresh;
>>>> +	struct bdi_writeback *wb;
>>>> +	struct wb_stats stats;
>>>> +
>>>> +	rcu_read_lock();
>>>> +	bdi = lookup_bdi(m);
>>>> +	if (!bdi) {
>>>> +		rcu_read_unlock();
>>>> +		return -EEXIST;
>>>> +	}
>>>> +
>>>> +	global_dirty_limits(&background_thresh, &dirty_thresh);
>>>> +
>>>> +	list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) {
>>>> +		memset(&stats, 0, sizeof(stats));
>>>> +		stats.dirty_thresh = dirty_thresh;
>>>
>>> If you did something like the following here, wouldn't that also zero
>>> the rest of the structure?
>>>
>>> 		struct wb_stats stats = { .dirty_thresh = dirty_thresh };
>>>
>> Sure, will do it in next version.
>>>> +		collect_wb_stats(&stats, wb);
>>>> +
>>>
>>> Also, similar question as before on whether you'd want to check
>>> WB_registered or something here..
>> I still prefer to keep the full debug info; users can filter it out on
>> demand.
> 
> Ok. I was more wondering if that was needed for correctness. If not,
> then that seems fair enough to me.
For bdi->wb, it's unavailable after release_bdi. As bdi_debug_unregister
will block bdi_unregister, release_bdi cannot have been reached yet, so it's
safe to collect bdi->wb info.
For wb in cgroup, it's unavailable after cgwb_release_workfn; we can prevent
this with wb_tryget before collection.
So it's correct for the per-wb stats here, but we need an extra wb_tryget in
the bdi stats of patch 2, and I will do that in the next version.
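
Roughly something like this for the bdi_collect_stats() of patch 2 (an
untested sketch reusing collect_wb_stats()/struct wb_stats from this
series; the real patch may look a bit different):

	static void bdi_collect_stats(struct backing_dev_info *bdi,
				      struct wb_stats *stats)
	{
		struct bdi_writeback *wb;

		rcu_read_lock();
		list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) {
			/*
			 * A cgroup wb can be torn down by cgwb_release_workfn()
			 * once its last reference is dropped; hold one while
			 * reading its counters.
			 */
			if (!wb_tryget(wb))
				continue;
			collect_wb_stats(stats, wb);
			wb_put(wb);
		}
		rcu_read_unlock();
	}
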
> 
>>>
>>>> +		if (mem_cgroup_wb_domain(wb) == NULL) {
>>>> +			wb_stats_show(m, wb, &stats);
>>>> +			continue;
>>>> +		}
>>>
>>> Can you explain what this logic is about? Is the cgwb_calc_thresh()
>>> thing not needed in this case? A comment might help for those less
>>> familiar with the implementation details.
>> If mem_cgroup_wb_domain(wb) is NULL, then it's bdi->wb, otherwise,
>> it's wb in cgroup. For bdi->wb, there is no need to do wb_tryget
>> and cgwb_calc_thresh. Will add some comment in next version.
>>>
>>> BTW, I'm also wondering if something like the following is correct
>>> and/or roughly equivalent:
>>> 	
>>> 	list_for_each_*(wb, ...) {
>>> 		struct wb_stats stats = ...;
>>>
>>> 		if (!wb_tryget(wb))
>>> 			continue;
>>>
>>> 		collect_wb_stats(&stats, wb);
>>>
>>> 		/*
>>> 		 * Extra wb_thresh magic. Drop rcu lock because ... . We
>>> 		 * can do so here because we have a ref.
>>> 		 */
>>> 		if (mem_cgroup_wb_domain(wb)) {
>>> 			rcu_read_unlock();
>>> 			stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb));
>>> 			rcu_read_lock();
>>> 		}
>>>
>>> 		wb_stats_show(m, wb, &stats)
>>> 		wb_put(wb);
>>> 	}
>> It's correct, as wb_tryget on bdi->wb does no harm. I considered doing
>> it this way but changed my mind for two reasons:
>> 1. Keep the code handling cgroup wbs together, which is easier to
>> maintain.
>> 2. Remove the extra wb_tryget/wb_put for the bdi's own wb.
>> Would this make sense to you?
> 
> Ok, well, assuming it is correct, the above logic is a bit simpler and
> more readable to me. I think you'd just need to fill in the comment around
> the wb_thresh thing rather than, say, having to explain that we don't need
> to ref bdi->wb even though it doesn't seem to matter.
> 
> I kind of feel the same on the wb_stats file thing below just because it
> seems more consistent and available if wb_stats eventually grows more
> wb-specific data.
> 
> That said, this is subjective and not hugely important so I don't insist
> on either point. Maybe wait a bit and see if Jan or Tejun or somebody
> has any thoughts..? If nobody else expresses explicit preference then
> I'm good with it either way.
Sure, I will wait a few days and decide which way to use in the next version.

Thanks so much for all the advice.

Kemeng
> 
> Brian
> 
>>>
>>>> +
>>>> +		/*
>>>> +		 * cgwb_release will destroy wb->memcg_completions which
>>>> +		 * will be used in cgwb_calc_thresh. Use wb_tryget to prevent
>>>> +		 * memcg_completions destruction from cgwb_release.
>>>> +		 */
>>>> +		if (!wb_tryget(wb))
>>>> +			continue;
>>>> +
>>>> +		rcu_read_unlock();
>>>> +		/* cgwb_calc_thresh may sleep in cgroup_rstat_flush */
>>>> +		stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb));
>>>> +		wb_stats_show(m, wb, &stats);
>>>> +		rcu_read_lock();
>>>> +		wb_put(wb);
>>>> +	}
>>>> +	rcu_read_unlock();
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +DEFINE_SHOW_ATTRIBUTE(cgwb_debug_stats);
>>>> +
>>>> +static void cgwb_debug_register(struct backing_dev_info *bdi)
>>>> +{
>>>> +	debugfs_create_file("wb_stats", 0444, bdi->debug_dir, bdi,
>>>> +			    &cgwb_debug_stats_fops);
>>>> +}
>>>> +
>>>>  static void bdi_collect_stats(struct backing_dev_info *bdi,
>>>>  			      struct wb_stats *stats)
>>>>  {
>>>> @@ -117,6 +202,8 @@ static void bdi_collect_stats(struct backing_dev_info *bdi,
>>>>  {
>>>>  	collect_wb_stats(stats, &bdi->wb);
>>>>  }
>>>> +
>>>> +static inline void cgwb_debug_register(struct backing_dev_info *bdi) { }
>>>
>>> Could we just create the wb_stats file regardless of whether cgwb is
>>> enabled? Obviously there's only one wb in the !CGWB case and it's
>>> somewhat duplicative with the bdi stats file, but that seems harmless if
>>> the same code can be reused..? Maybe there's also a small argument for
>>> dropping the state info from the bdi stats file and moving it to
>>> wb_stats.
>> In backing-dev.c, there are a lot of "#ifdef CGWB .. #else .. #endif"
>> blocks to avoid unneeded extra cost when CGWB is not enabled.
>> I think it's better to avoid the extra cost of wb_stats when CGWB is not
>> enabled. For now we only save the cpu cost of creating and destroying
>> wb_stats and the memory cost of the debugfs file, but we could save more
>> in the future when wb_stats records more debug info.
>> Moving the state info from bdi stats to wb_stats makes sense to me. The
>> only concern would be a compatibility problem. I will add a new patch for
>> this to make it more noticeable and easier to revert.
>> Thanks a lot for the review!
>>
>> Kemeng
>>>
>>> Brian
>>>
>>>>  #endif
>>>>  
>>>>  static int bdi_debug_stats_show(struct seq_file *m, void *v)
>>>> @@ -182,6 +269,7 @@ static void bdi_debug_register(struct backing_dev_info *bdi, const char *name)
>>>>  
>>>>  	debugfs_create_file("stats", 0444, bdi->debug_dir, bdi,
>>>>  			    &bdi_debug_stats_fops);
>>>> +	cgwb_debug_register(bdi);
>>>>  }
>>>>  
>>>>  static void bdi_debug_unregister(struct backing_dev_info *bdi)
>>>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>>>> index 0e20467367fe..3724c7525316 100644
>>>> --- a/mm/page-writeback.c
>>>> +++ b/mm/page-writeback.c
>>>> @@ -893,6 +893,25 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
>>>>  	return __wb_calc_thresh(&gdtc, thresh);
>>>>  }
>>>>  
>>>> +unsigned long cgwb_calc_thresh(struct bdi_writeback *wb)
>>>> +{
>>>> +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
>>>> +	struct dirty_throttle_control mdtc = { MDTC_INIT(wb, &gdtc) };
>>>> +	unsigned long filepages, headroom, writeback;
>>>> +
>>>> +	gdtc.avail = global_dirtyable_memory();
>>>> +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
>>>> +		     global_node_page_state(NR_WRITEBACK);
>>>> +
>>>> +	mem_cgroup_wb_stats(wb, &filepages, &headroom,
>>>> +			    &mdtc.dirty, &writeback);
>>>> +	mdtc.dirty += writeback;
>>>> +	mdtc_calc_avail(&mdtc, filepages, headroom);
>>>> +	domain_dirty_limits(&mdtc);
>>>> +
>>>> +	return __wb_calc_thresh(&mdtc, mdtc.thresh);
>>>> +}
>>>> +
>>>>  /*
>>>>   *                           setpoint - dirty 3
>>>>   *        f(dirty) := 1.0 + (----------------)
>>>> -- 
>>>> 2.30.0
>>>>
>>>
>>>
>>
> 
> 



* Re: [PATCH v2 3/6] writeback: support retrieving per group debug writeback stats of bdi
  2024-04-04  9:07         ` Jan Kara
@ 2024-04-07  3:13           ` Kemeng Shi
  0 siblings, 0 replies; 38+ messages in thread
From: Kemeng Shi @ 2024-04-07  3:13 UTC (permalink / raw)
  To: Jan Kara, Brian Foster
  Cc: akpm, willy, tj, dsterba, mjguzik, dhowells, linux-kernel,
	linux-mm, linux-fsdevel



on 4/4/2024 5:07 PM, Jan Kara wrote:
> On Wed 03-04-24 11:04:58, Brian Foster wrote:
>> On Wed, Apr 03, 2024 at 04:49:42PM +0800, Kemeng Shi wrote:
>>> on 3/29/2024 9:10 PM, Brian Foster wrote:
>>>> On Wed, Mar 27, 2024 at 11:57:48PM +0800, Kemeng Shi wrote:
>>>>> +		collect_wb_stats(&stats, wb);
>>>>> +
>>>>
>>>> Also, similar question as before on whether you'd want to check
>>>> WB_registered or something here..
>>> I still prefer to keep the full debug info; users can filter it out on
>>> demand.
>>
>> Ok. I was more wondering if that was needed for correctness. If not,
>> then that seems fair enough to me.
>>
>>>>> +		if (mem_cgroup_wb_domain(wb) == NULL) {
>>>>> +			wb_stats_show(m, wb, &stats);
>>>>> +			continue;
>>>>> +		}
>>>>
>>>> Can you explain what this logic is about? Is the cgwb_calc_thresh()
>>>> thing not needed in this case? A comment might help for those less
>>>> familiar with the implementation details.
>>> If mem_cgroup_wb_domain(wb) is NULL, then it's bdi->wb, otherwise,
>>> it's wb in cgroup. For bdi->wb, there is no need to do wb_tryget
>>> and cgwb_calc_thresh. Will add some comment in next version.
>>>>
>>>> BTW, I'm also wondering if something like the following is correct
>>>> and/or roughly equivalent:
>>>> 	
>>>> 	list_for_each_*(wb, ...) {
>>>> 		struct wb_stats stats = ...;
>>>>
>>>> 		if (!wb_tryget(wb))
>>>> 			continue;
>>>>
>>>> 		collect_wb_stats(&stats, wb);
>>>>
>>>> 		/*
>>>> 		 * Extra wb_thresh magic. Drop rcu lock because ... . We
>>>> 		 * can do so here because we have a ref.
>>>> 		 */
>>>> 		if (mem_cgroup_wb_domain(wb)) {
>>>> 			rcu_read_unlock();
>>>> 			stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb));
>>>> 			rcu_read_lock();
>>>> 		}
>>>>
>>>> 		wb_stats_show(m, wb, &stats)
>>>> 		wb_put(wb);
>>>> 	}
>>> It's correct, as wb_tryget on bdi->wb does no harm. I considered doing
>>> it this way but changed my mind for two reasons:
>>> 1. Keep the code handling cgroup wbs together, which is easier to
>>> maintain.
>>> 2. Remove the extra wb_tryget/wb_put for the bdi's own wb.
>>> Would this make sense to you?
>>
>> Ok, well, assuming it is correct, the above logic is a bit simpler and
>> more readable to me. I think you'd just need to fill in the comment around
>> the wb_thresh thing rather than, say, having to explain that we don't need
>> to ref bdi->wb even though it doesn't seem to matter.
>>
>> I kind of feel the same on the wb_stats file thing below just because it
>> seems more consistent and available if wb_stats eventually grows more
>> wb-specific data.
>>
>> That said, this is subjective and not hugely important so I don't insist
>> on either point. Maybe wait a bit and see if Jan or Tejun or somebody
>> has any thoughts..? If nobody else expresses explicit preference then
>> I'm good with it either way.
> 
> No strong opinion from me really.
> 
>>>>> +static void cgwb_debug_register(struct backing_dev_info *bdi)
>>>>> +{
>>>>> +	debugfs_create_file("wb_stats", 0444, bdi->debug_dir, bdi,
>>>>> +			    &cgwb_debug_stats_fops);
>>>>> +}
>>>>> +
>>>>>  static void bdi_collect_stats(struct backing_dev_info *bdi,
>>>>>  			      struct wb_stats *stats)
>>>>>  {
>>>>> @@ -117,6 +202,8 @@ static void bdi_collect_stats(struct backing_dev_info *bdi,
>>>>>  {
>>>>>  	collect_wb_stats(stats, &bdi->wb);
>>>>>  }
>>>>> +
>>>>> +static inline void cgwb_debug_register(struct backing_dev_info *bdi) { }
>>>>
>>>> Could we just create the wb_stats file regardless of whether cgwb is
>>>> enabled? Obviously there's only one wb in the !CGWB case and it's
>>>> somewhat duplicative with the bdi stats file, but that seems harmless if
>>>> the same code can be reused..? Maybe there's also a small argument for
>>>> dropping the state info from the bdi stats file and moving it to
>>>> wb_stats.
>>> In backing-dev.c, there are a lot of "#ifdef CGWB .. #else .. #endif"
>>> blocks to avoid unneeded extra cost when CGWB is not enabled.
>>> I think it's better to avoid the extra cost of wb_stats when CGWB is not
>>> enabled. For now we only save the cpu cost of creating and destroying
>>> wb_stats and the memory cost of the debugfs file, but we could save more
>>> in the future when wb_stats records more debug info.
> 
> Well, there's the other side that you don't have to think whether the
> kernel has CGWB enabled or not when asking a customer to gather the
> writeback debug info - you can always ask for wb_stats. Also if you move
> the wb->state to wb_stats only it will become inaccessible with CGWB
> disabled. So I agree with Brian that it is better to provide wb_stats also
> with CGWB disabled (and we can just implement wb_stats for !CGWB case with
> the same function as bdi_stats).
> 
> That being said all production kernels I have seen do have CGWB enabled so
> I don't care that much about this...
It's acceptable to me if the extra cost is tolerable.
> 
>>> Moving the state info from bdi stats to wb_stats makes sense to me. The
>>> only concern would be a compatibility problem. I will add a new patch for
>>> this to make it more noticeable and easier to revert.
> 
> Yeah, I don't think we care much about debugfs compatibility but I think
> removing state from bdi_stats is not worth the inconsistency between
> wb_stats and bdi_stats in the !CGWB case.
OK, I will simply keep wb_stats even when CGWB is not enabled, and keep the
state in both bdi_stats and wb_stats, if Brian doesn't object in the coming
days.

Kemeng
> 
> 								Honza
> 



end of thread

Thread overview: 38+ messages
2024-03-27 15:57 [PATCH v2 0/6] Improve visibility of writeback Kemeng Shi
2024-03-27 15:57 ` [PATCH v2 1/6] writeback: protect race between bdi release and bdi_debug_stats_show Kemeng Shi
2024-03-28 17:53   ` Brian Foster
2024-04-03  2:16     ` Kemeng Shi
2024-03-27 15:57 ` [PATCH v2 2/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show Kemeng Shi
2024-03-29 13:04   ` Brian Foster
2024-04-03  7:49     ` Kemeng Shi
2024-03-27 15:57 ` [PATCH v2 3/6] writeback: support retrieving per group debug writeback stats of bdi Kemeng Shi
2024-03-29 13:10   ` Brian Foster
2024-04-03  8:49     ` Kemeng Shi
2024-04-03 15:04       ` Brian Foster
2024-04-04  9:07         ` Jan Kara
2024-04-07  3:13           ` Kemeng Shi
2024-04-07  2:48         ` Kemeng Shi
2024-03-27 15:57 ` [PATCH v2 4/6] writeback: add wb_monitor.py script to monitor writeback info on bdi Kemeng Shi
2024-03-27 15:57 ` [PATCH v2 5/6] writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages Kemeng Shi
2024-03-27 15:57 ` [PATCH v2 6/6] writeback: define GDTC_INIT_NO_WB to null Kemeng Shi
2024-03-27 17:40 ` [PATCH v2 0/6] Improve visibility of writeback Andrew Morton
2024-03-28  1:59   ` Kemeng Shi
2024-03-28  8:23     ` Kemeng Shi
2024-03-28 19:15   ` Kent Overstreet
2024-03-28 19:23     ` Andrew Morton
2024-03-28 19:36       ` Kent Overstreet
2024-03-28 19:24 ` Kent Overstreet
2024-03-28 19:31   ` Tejun Heo
2024-03-28 19:40     ` Kent Overstreet
2024-03-28 19:46       ` Tejun Heo
2024-03-28 19:55         ` Kent Overstreet
2024-03-28 20:13           ` Tejun Heo
2024-03-28 20:22             ` Kent Overstreet
2024-03-28 20:46               ` Tejun Heo
2024-03-28 20:53                 ` Kent Overstreet
2024-04-03 16:27                 ` Jan Kara
2024-04-03 18:44                   ` Tejun Heo
2024-04-03 19:06                     ` Kent Overstreet
2024-04-03 19:21                       ` Tejun Heo
2024-04-03 22:24                         ` Kent Overstreet
2024-04-03  6:56   ` Kemeng Shi
