Linux-Block Archive on lore.kernel.org
 help / color / Atom feed
* [PATCHSET] writeback, memcg: Implement foreign inode flushing
@ 2019-08-03 14:01 Tejun Heo
  2019-08-03 14:01 ` [PATCH 1/4] writeback: Generalize and expose wb_completion Tejun Heo
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Tejun Heo @ 2019-08-03 14:01 UTC (permalink / raw)
  To: axboe, jack, hannes, mhocko, vdavydov.dev
  Cc: cgroups, linux-mm, linux-block, linux-kernel, kernel-team, guro, akpm

Hello,

There's an inherent mismatch between memcg and writeback.  The former
trackes ownership per-page while the latter per-inode.  This was a
deliberate design decision because honoring per-page ownership in the
writeback path is complicated, may lead to higher CPU and IO overheads
and deemed unnecessary given that write-sharing an inode across
different cgroups isn't a common use-case.

Combined with inode majority-writer ownership switching, this works
well enough in most cases but there are some pathological cases.  For
example, let's say there are two cgroups A and B which keep writing to
different but confined parts of the same inode.  B owns the inode and
A's memory is limited far below B's.  A's dirty ratio can rise enough
to trigger balance_dirty_pages() sleeps but B's can be low enough to
avoid triggering background writeback.  A will be slowed down without
a way to make writeback of the dirty pages happen.

This patchset implements foreign dirty recording and foreign mechanism
so that when a memcg encounters a condition as above it can trigger
flushes on bdi_writebacks which can clean its pages.  Please see the
last patch for more details.

This patchset contains the following four patches.

 0001-writeback-Generalize-and-expose-wb_completion.patch
 0002-bdi-Add-bdi-id.patch
 0003-writeback-memcg-Implement-cgroup_writeback_by_id.patch
 0004-writeback-memcg-Implement-foreign-dirty-flushing.patch

0001-0003 are prep patches which expose wb_completion and implement
bdi->id and flushing by bdi and memcg IDs.

0004 implement foreign inode flushing.

Thanks.  diffstat follows.

 fs/fs-writeback.c                |  111 ++++++++++++++++++++++++----------
 include/linux/backing-dev-defs.h |   23 +++++++
 include/linux/backing-dev.h      |    3 
 include/linux/memcontrol.h       |   35 ++++++++++
 include/linux/writeback.h        |    4 +
 mm/backing-dev.c                 |   65 +++++++++++++++++++-
 mm/memcontrol.c                  |  125 +++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c              |    4 +
 8 files changed, 335 insertions(+), 35 deletions(-)

--
tejun


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/4] writeback: Generalize and expose wb_completion
  2019-08-03 14:01 [PATCHSET] writeback, memcg: Implement foreign inode flushing Tejun Heo
@ 2019-08-03 14:01 ` Tejun Heo
  2019-08-15 14:41   ` Jan Kara
  2019-08-03 14:01 ` [PATCH 2/4] bdi: Add bdi->id Tejun Heo
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2019-08-03 14:01 UTC (permalink / raw)
  To: axboe, jack, hannes, mhocko, vdavydov.dev
  Cc: cgroups, linux-mm, linux-block, linux-kernel, kernel-team, guro,
	akpm, Tejun Heo

wb_completion is used to track writeback completions.  We want to use
it from memcg side for foreign inode flushes.  This patch updates it
to remember the target waitq instead of assuming bdi->wb_waitq and
expose it outside of fs-writeback.c.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 fs/fs-writeback.c                | 47 ++++++++++----------------------
 include/linux/backing-dev-defs.h | 20 ++++++++++++++
 include/linux/backing-dev.h      |  2 ++
 3 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 542b02d170f8..6129debdc938 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -36,10 +36,6 @@
  */
 #define MIN_WRITEBACK_PAGES	(4096UL >> (PAGE_SHIFT - 10))
 
-struct wb_completion {
-	atomic_t		cnt;
-};
-
 /*
  * Passed into wb_writeback(), essentially a subset of writeback_control
  */
@@ -60,19 +56,6 @@ struct wb_writeback_work {
 	struct wb_completion *done;	/* set if the caller waits */
 };
 
-/*
- * If one wants to wait for one or more wb_writeback_works, each work's
- * ->done should be set to a wb_completion defined using the following
- * macro.  Once all work items are issued with wb_queue_work(), the caller
- * can wait for the completion of all using wb_wait_for_completion().  Work
- * items which are waited upon aren't freed automatically on completion.
- */
-#define DEFINE_WB_COMPLETION_ONSTACK(cmpl)				\
-	struct wb_completion cmpl = {					\
-		.cnt		= ATOMIC_INIT(1),			\
-	}
-
-
 /*
  * If an inode is constantly having its pages dirtied, but then the
  * updates stop dirtytime_expire_interval seconds in the past, it's
@@ -182,7 +165,7 @@ static void finish_writeback_work(struct bdi_writeback *wb,
 	if (work->auto_free)
 		kfree(work);
 	if (done && atomic_dec_and_test(&done->cnt))
-		wake_up_all(&wb->bdi->wb_waitq);
+		wake_up_all(done->waitq);
 }
 
 static void wb_queue_work(struct bdi_writeback *wb,
@@ -206,20 +189,18 @@ static void wb_queue_work(struct bdi_writeback *wb,
 
 /**
  * wb_wait_for_completion - wait for completion of bdi_writeback_works
- * @bdi: bdi work items were issued to
  * @done: target wb_completion
  *
  * Wait for one or more work items issued to @bdi with their ->done field
- * set to @done, which should have been defined with
- * DEFINE_WB_COMPLETION_ONSTACK().  This function returns after all such
- * work items are completed.  Work items which are waited upon aren't freed
+ * set to @done, which should have been initialized with
+ * DEFINE_WB_COMPLETION().  This function returns after all such work items
+ * are completed.  Work items which are waited upon aren't freed
  * automatically on completion.
  */
-static void wb_wait_for_completion(struct backing_dev_info *bdi,
-				   struct wb_completion *done)
+void wb_wait_for_completion(struct wb_completion *done)
 {
 	atomic_dec(&done->cnt);		/* put down the initial count */
-	wait_event(bdi->wb_waitq, !atomic_read(&done->cnt));
+	wait_event(*done->waitq, !atomic_read(&done->cnt));
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
@@ -843,7 +824,7 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
 restart:
 	rcu_read_lock();
 	list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) {
-		DEFINE_WB_COMPLETION_ONSTACK(fallback_work_done);
+		DEFINE_WB_COMPLETION(fallback_work_done, bdi);
 		struct wb_writeback_work fallback_work;
 		struct wb_writeback_work *work;
 		long nr_pages;
@@ -890,7 +871,7 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
 		last_wb = wb;
 
 		rcu_read_unlock();
-		wb_wait_for_completion(bdi, &fallback_work_done);
+		wb_wait_for_completion(&fallback_work_done);
 		goto restart;
 	}
 	rcu_read_unlock();
@@ -2362,7 +2343,8 @@ static void wait_sb_inodes(struct super_block *sb)
 static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
 				     enum wb_reason reason, bool skip_if_busy)
 {
-	DEFINE_WB_COMPLETION_ONSTACK(done);
+	struct backing_dev_info *bdi = sb->s_bdi;
+	DEFINE_WB_COMPLETION(done, bdi);
 	struct wb_writeback_work work = {
 		.sb			= sb,
 		.sync_mode		= WB_SYNC_NONE,
@@ -2371,14 +2353,13 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
 		.nr_pages		= nr,
 		.reason			= reason,
 	};
-	struct backing_dev_info *bdi = sb->s_bdi;
 
 	if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy);
-	wb_wait_for_completion(bdi, &done);
+	wb_wait_for_completion(&done);
 }
 
 /**
@@ -2440,7 +2421,8 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb);
  */
 void sync_inodes_sb(struct super_block *sb)
 {
-	DEFINE_WB_COMPLETION_ONSTACK(done);
+	struct backing_dev_info *bdi = sb->s_bdi;
+	DEFINE_WB_COMPLETION(done, bdi);
 	struct wb_writeback_work work = {
 		.sb		= sb,
 		.sync_mode	= WB_SYNC_ALL,
@@ -2450,7 +2432,6 @@ void sync_inodes_sb(struct super_block *sb)
 		.reason		= WB_REASON_SYNC,
 		.for_sync	= 1,
 	};
-	struct backing_dev_info *bdi = sb->s_bdi;
 
 	/*
 	 * Can't skip on !bdi_has_dirty() because we should wait for !dirty
@@ -2464,7 +2445,7 @@ void sync_inodes_sb(struct super_block *sb)
 	/* protect against inode wb switch, see inode_switch_wbs_work_fn() */
 	bdi_down_write_wb_switch_rwsem(bdi);
 	bdi_split_work_to_wbs(bdi, &work, false);
-	wb_wait_for_completion(bdi, &done);
+	wb_wait_for_completion(&done);
 	bdi_up_write_wb_switch_rwsem(bdi);
 
 	wait_sb_inodes(sb);
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 6a1a8a314d85..8fb740178d5d 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -67,6 +67,26 @@ enum wb_reason {
 	WB_REASON_MAX,
 };
 
+struct wb_completion {
+	atomic_t		cnt;
+	wait_queue_head_t	*waitq;
+};
+
+#define __WB_COMPLETION_INIT(_waitq)	\
+	(struct wb_completion){ .cnt = ATOMIC_INIT(1), .waitq = (_waitq) }
+
+/*
+ * If one wants to wait for one or more wb_writeback_works, each work's
+ * ->done should be set to a wb_completion defined using the following
+ * macro.  Once all work items are issued with wb_queue_work(), the caller
+ * can wait for the completion of all using wb_wait_for_completion().  Work
+ * items which are waited upon aren't freed automatically on completion.
+ */
+#define WB_COMPLETION_INIT(bdi)		__WB_COMPLETION_INIT(&(bdi)->wb_waitq)
+
+#define DEFINE_WB_COMPLETION(cmpl, bdi)	\
+	struct wb_completion cmpl = WB_COMPLETION_INIT(bdi)
+
 /*
  * For cgroup writeback, multiple wb's may map to the same blkcg.  Those
  * wb's can operate mostly independently but should share the congested
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 35b31d176f74..02650b1253a2 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -44,6 +44,8 @@ void wb_start_background_writeback(struct bdi_writeback *wb);
 void wb_workfn(struct work_struct *work);
 void wb_wakeup_delayed(struct bdi_writeback *wb);
 
+void wb_wait_for_completion(struct wb_completion *done);
+
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 2/4] bdi: Add bdi->id
  2019-08-03 14:01 [PATCHSET] writeback, memcg: Implement foreign inode flushing Tejun Heo
  2019-08-03 14:01 ` [PATCH 1/4] writeback: Generalize and expose wb_completion Tejun Heo
@ 2019-08-03 14:01 ` Tejun Heo
  2019-08-03 15:39   ` Matthew Wilcox
                     ` (2 more replies)
  2019-08-03 14:01 ` [PATCH 3/4] writeback, memcg: Implement cgroup_writeback_by_id() Tejun Heo
  2019-08-03 14:01 ` [PATCH 4/4] writeback, memcg: Implement foreign dirty flushing Tejun Heo
  3 siblings, 3 replies; 24+ messages in thread
From: Tejun Heo @ 2019-08-03 14:01 UTC (permalink / raw)
  To: axboe, jack, hannes, mhocko, vdavydov.dev
  Cc: cgroups, linux-mm, linux-block, linux-kernel, kernel-team, guro,
	akpm, Tejun Heo

There currently is no way to universally identify and lookup a bdi
without holding a reference and pointer to it.  This patch adds an
non-recycling bdi->id and implements bdi_get_by_id() which looks up
bdis by their ids.  This will be used by memcg foreign inode flushing.

I left bdi_list alone for simplicity and because while rb_tree does
support rcu assignment it doesn't seem to guarantee lossless walk when
walk is racing aginst tree rebalance operations.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/backing-dev-defs.h |  2 +
 include/linux/backing-dev.h      |  1 +
 mm/backing-dev.c                 | 65 +++++++++++++++++++++++++++++++-
 3 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8fb740178d5d..1075f2552cfc 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -185,6 +185,8 @@ struct bdi_writeback {
 };
 
 struct backing_dev_info {
+	u64 id;
+	struct rb_node rb_node; /* keyed by ->id */
 	struct list_head bdi_list;
 	unsigned long ra_pages;	/* max readahead in PAGE_SIZE units */
 	unsigned long io_pages;	/* max allowed IO size */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 02650b1253a2..84cdcfbc763f 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -24,6 +24,7 @@ static inline struct backing_dev_info *bdi_get(struct backing_dev_info *bdi)
 	return bdi;
 }
 
+struct backing_dev_info *bdi_get_by_id(u64 id);
 void bdi_put(struct backing_dev_info *bdi);
 
 __printf(2, 3)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e8e89158adec..4a8816e0b8d4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
 #include <linux/wait.h>
+#include <linux/rbtree.h>
 #include <linux/backing-dev.h>
 #include <linux/kthread.h>
 #include <linux/freezer.h>
@@ -22,10 +23,12 @@ EXPORT_SYMBOL_GPL(noop_backing_dev_info);
 static struct class *bdi_class;
 
 /*
- * bdi_lock protects updates to bdi_list. bdi_list has RCU reader side
- * locking.
+ * bdi_lock protects bdi_tree and updates to bdi_list. bdi_list has RCU
+ * reader side locking.
  */
 DEFINE_SPINLOCK(bdi_lock);
+static u64 bdi_id_cursor;
+static struct rb_root bdi_tree = RB_ROOT;
 LIST_HEAD(bdi_list);
 
 /* bdi_wq serves all asynchronous writeback tasks */
@@ -859,9 +862,58 @@ struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id)
 }
 EXPORT_SYMBOL(bdi_alloc_node);
 
+struct rb_node **bdi_lookup_rb_node(u64 id, struct rb_node **parentp)
+{
+	struct rb_node **p = &bdi_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct backing_dev_info *bdi;
+
+	lockdep_assert_held(&bdi_lock);
+
+	while (*p) {
+		parent = *p;
+		bdi = rb_entry(parent, struct backing_dev_info, rb_node);
+
+		if (bdi->id > id)
+			p = &(*p)->rb_left;
+		else if (bdi->id < id)
+			p = &(*p)->rb_right;
+		else
+			break;
+	}
+
+	if (parentp)
+		*parentp = parent;
+	return p;
+}
+
+/**
+ * bdi_get_by_id - lookup and get bdi from its id
+ * @id: bdi id to lookup
+ *
+ * Find bdi matching @id and get it.  Returns NULL if the matching bdi
+ * doesn't exist or is already unregistered.
+ */
+struct backing_dev_info *bdi_get_by_id(u64 id)
+{
+	struct backing_dev_info *bdi = NULL;
+	struct rb_node **p;
+
+	spin_lock_irq(&bdi_lock);
+	p = bdi_lookup_rb_node(id, NULL);
+	if (*p) {
+		bdi = rb_entry(*p, struct backing_dev_info, rb_node);
+		bdi_get(bdi);
+	}
+	spin_unlock_irq(&bdi_lock);
+
+	return bdi;
+}
+
 int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
 {
 	struct device *dev;
+	struct rb_node *parent, **p;
 
 	if (bdi->dev)	/* The driver needs to use separate queues per device */
 		return 0;
@@ -877,7 +929,15 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
 	set_bit(WB_registered, &bdi->wb.state);
 
 	spin_lock_bh(&bdi_lock);
+
+	bdi->id = ++bdi_id_cursor;
+
+	p = bdi_lookup_rb_node(bdi->id, &parent);
+	rb_link_node(&bdi->rb_node, parent, p);
+	rb_insert_color(&bdi->rb_node, &bdi_tree);
+
 	list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
+
 	spin_unlock_bh(&bdi_lock);
 
 	trace_writeback_bdi_register(bdi);
@@ -918,6 +978,7 @@ EXPORT_SYMBOL(bdi_register_owner);
 static void bdi_remove_from_list(struct backing_dev_info *bdi)
 {
 	spin_lock_bh(&bdi_lock);
+	rb_erase(&bdi->rb_node, &bdi_tree);
 	list_del_rcu(&bdi->bdi_list);
 	spin_unlock_bh(&bdi_lock);
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 3/4] writeback, memcg: Implement cgroup_writeback_by_id()
  2019-08-03 14:01 [PATCHSET] writeback, memcg: Implement foreign inode flushing Tejun Heo
  2019-08-03 14:01 ` [PATCH 1/4] writeback: Generalize and expose wb_completion Tejun Heo
  2019-08-03 14:01 ` [PATCH 2/4] bdi: Add bdi->id Tejun Heo
@ 2019-08-03 14:01 ` Tejun Heo
  2019-08-15 14:05   ` Jan Kara
  2019-08-15 14:54   ` Jan Kara
  2019-08-03 14:01 ` [PATCH 4/4] writeback, memcg: Implement foreign dirty flushing Tejun Heo
  3 siblings, 2 replies; 24+ messages in thread
From: Tejun Heo @ 2019-08-03 14:01 UTC (permalink / raw)
  To: axboe, jack, hannes, mhocko, vdavydov.dev
  Cc: cgroups, linux-mm, linux-block, linux-kernel, kernel-team, guro,
	akpm, Tejun Heo

Implement cgroup_writeback_by_id() which initiates cgroup writeback
from bdi and memcg IDs.  This will be used by memcg foreign inode
flushing.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 fs/fs-writeback.c         | 64 +++++++++++++++++++++++++++++++++++++++
 include/linux/writeback.h |  4 +++
 2 files changed, 68 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6129debdc938..5c79d7acefdb 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -880,6 +880,70 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
 		wb_put(last_wb);
 }
 
+/**
+ * cgroup_writeback_by_id - initiate cgroup writeback from bdi and memcg IDs
+ * @bdi_id: target bdi id
+ * @memcg_id: target memcg css id
+ * @nr_pages: number of pages to write
+ * @reason: reason why some writeback work initiated
+ * @done: target wb_completion
+ *
+ * Initiate flush of the bdi_writeback identified by @bdi_id and @memcg_id
+ * with the specified parameters.
+ */
+int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr,
+			   enum wb_reason reason, struct wb_completion *done)
+{
+	struct backing_dev_info *bdi;
+	struct cgroup_subsys_state *memcg_css;
+	struct bdi_writeback *wb;
+	struct wb_writeback_work *work;
+	int ret;
+
+	/* lookup bdi and memcg */
+	bdi = bdi_get_by_id(bdi_id);
+	if (!bdi)
+		return -ENOENT;
+
+	rcu_read_lock();
+	memcg_css = css_from_id(memcg_id, &memory_cgrp_subsys);
+	if (memcg_css && !css_tryget(memcg_css))
+		memcg_css = NULL;
+	rcu_read_unlock();
+	if (!memcg_css) {
+		ret = -ENOENT;
+		goto out_bdi_put;
+	}
+
+	/* and find the associated wb */
+	wb = wb_get_create(bdi, memcg_css, GFP_NOWAIT | __GFP_NOWARN);
+	if (!wb) {
+		ret = -ENOMEM;
+		goto out_css_put;
+	}
+
+	/* issue the writeback work */
+	work = kzalloc(sizeof(*work), GFP_NOWAIT | __GFP_NOWARN);
+	if (work) {
+		work->nr_pages = nr;
+		work->sync_mode = WB_SYNC_NONE;
+		work->reason = reason;
+		work->done = done;
+		work->auto_free = 1;
+		wb_queue_work(wb, work);
+		ret = 0;
+	} else {
+		ret = -ENOMEM;
+	}
+
+	wb_put(wb);
+out_css_put:
+	css_put(memcg_css);
+out_bdi_put:
+	bdi_put(bdi);
+	return ret;
+}
+
 /**
  * cgroup_writeback_umount - flush inode wb switches for umount
  *
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 8945aac31392..ad794f2a7d42 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -217,6 +217,10 @@ void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 void wbc_detach_inode(struct writeback_control *wbc);
 void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
 			      size_t bytes);
+int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages,
+			   enum wb_reason reason, struct wb_completion *done);
+int writeback_by_id(int id, unsigned long nr, enum wb_reason reason,
+		    struct wb_completion *done);
 void cgroup_writeback_umount(void);
 
 /**
-- 
2.17.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 4/4] writeback, memcg: Implement foreign dirty flushing
  2019-08-03 14:01 [PATCHSET] writeback, memcg: Implement foreign inode flushing Tejun Heo
                   ` (2 preceding siblings ...)
  2019-08-03 14:01 ` [PATCH 3/4] writeback, memcg: Implement cgroup_writeback_by_id() Tejun Heo
@ 2019-08-03 14:01 ` Tejun Heo
  2019-08-06 23:03   ` Andrew Morton
  2019-08-15 14:34   ` Jan Kara
  3 siblings, 2 replies; 24+ messages in thread
From: Tejun Heo @ 2019-08-03 14:01 UTC (permalink / raw)
  To: axboe, jack, hannes, mhocko, vdavydov.dev
  Cc: cgroups, linux-mm, linux-block, linux-kernel, kernel-team, guro,
	akpm, Tejun Heo

There's an inherent mismatch between memcg and writeback.  The former
trackes ownership per-page while the latter per-inode.  This was a
deliberate design decision because honoring per-page ownership in the
writeback path is complicated, may lead to higher CPU and IO overheads
and deemed unnecessary given that write-sharing an inode across
different cgroups isn't a common use-case.

Combined with inode majority-writer ownership switching, this works
well enough in most cases but there are some pathological cases.  For
example, let's say there are two cgroups A and B which keep writing to
different but confined parts of the same inode.  B owns the inode and
A's memory is limited far below B's.  A's dirty ratio can rise enough
to trigger balance_dirty_pages() sleeps but B's can be low enough to
avoid triggering background writeback.  A will be slowed down without
a way to make writeback of the dirty pages happen.

This patch implements foreign dirty recording and foreign mechanism so
that when a memcg encounters a condition as above it can trigger
flushes on bdi_writebacks which can clean its pages.  Please see the
comment on top of mem_cgroup_track_foreign_dirty_slowpath() for
details.

A reproducer follows.

write-range.c::

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <sys/types.h>

  static const char *usage = "write-range FILE START SIZE\n";

  int main(int argc, char **argv)
  {
	  int fd;
	  unsigned long start, size, end, pos;
	  char *endp;
	  char buf[4096];

	  if (argc < 4) {
		  fprintf(stderr, usage);
		  return 1;
	  }

	  fd = open(argv[1], O_WRONLY);
	  if (fd < 0) {
		  perror("open");
		  return 1;
	  }

	  start = strtoul(argv[2], &endp, 0);
	  if (*endp != '\0') {
		  fprintf(stderr, usage);
		  return 1;
	  }

	  size = strtoul(argv[3], &endp, 0);
	  if (*endp != '\0') {
		  fprintf(stderr, usage);
		  return 1;
	  }

	  end = start + size;

	  while (1) {
		  for (pos = start; pos < end; ) {
			  long bread, bwritten = 0;

			  if (lseek(fd, pos, SEEK_SET) < 0) {
				  perror("lseek");
				  return 1;
			  }

			  bread = read(0, buf, sizeof(buf) < end - pos ?
					       sizeof(buf) : end - pos);
			  if (bread < 0) {
				  perror("read");
				  return 1;
			  }
			  if (bread == 0)
				  return 0;

			  while (bwritten < bread) {
				  long this;

				  this = write(fd, buf + bwritten,
					       bread - bwritten);
				  if (this < 0) {
					  perror("write");
					  return 1;
				  }

				  bwritten += this;
				  pos += bwritten;
			  }
		  }
	  }
  }

repro.sh::

  #!/bin/bash

  set -e
  set -x

  sysctl -w vm.dirty_expire_centisecs=300000
  sysctl -w vm.dirty_writeback_centisecs=300000
  sysctl -w vm.dirtytime_expire_seconds=300000
  echo 3 > /proc/sys/vm/drop_caches

  TEST=/sys/fs/cgroup/test
  A=$TEST/A
  B=$TEST/B

  mkdir -p $A $B
  echo "+memory +io" > $TEST/cgroup.subtree_control
  echo $((1<<30)) > $A/memory.high
  echo $((32<<30)) > $B/memory.high

  rm -f testfile
  touch testfile
  fallocate -l 4G testfile

  echo "Starting B"

  (echo $BASHPID > $B/cgroup.procs
   pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) &

  echo "Waiting 10s to ensure B claims the testfile inode"
  sleep 5
  sync
  sleep 5
  sync
  echo "Starting A"

  (echo $BASHPID > $A/cgroup.procs
   pv < /dev/urandom | ./write-range testfile 0 $((2<<30)))

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/backing-dev-defs.h |   1 +
 include/linux/memcontrol.h       |  35 +++++++++
 mm/memcontrol.c                  | 125 +++++++++++++++++++++++++++++++
 mm/page-writeback.c              |   4 +
 4 files changed, 165 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 1075f2552cfc..4fc87dee005a 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -63,6 +63,7 @@ enum wb_reason {
 	 * so it has a mismatch name.
 	 */
 	WB_REASON_FORKER_THREAD,
+	WB_REASON_FOREIGN_FLUSH,
 
 	WB_REASON_MAX,
 };
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 44c41462be33..7422e4a4dbeb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -183,6 +183,19 @@ struct memcg_padding {
 #define MEMCG_PADDING(name)
 #endif
 
+/*
+ * Remember four most recent foreign inodes with dirty pages on this
+ * cgroup.  See mem_cgroup_track_foreign_dirty_slowpath() for details.
+ */
+#define MEMCG_CGWB_FRN_CNT	4
+
+struct memcg_cgwb_frn {
+	u64 bdi_id;			/* bdi->id of the foreign inode */
+	int memcg_id;			/* memcg->css.id of foreign inode */
+	u64 at;				/* jiffies_64 at the time of dirtying */
+	struct wb_completion done;	/* tracks in-flight foreign writebacks */
+};
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -307,6 +320,7 @@ struct mem_cgroup {
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct list_head cgwb_list;
 	struct wb_domain cgwb_domain;
+	struct memcg_cgwb_frn cgwb_frn[MEMCG_CGWB_FRN_CNT];
 #endif
 
 	/* List of events which userspace want to receive */
@@ -1218,6 +1232,18 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 			 unsigned long *pheadroom, unsigned long *pdirty,
 			 unsigned long *pwriteback);
 
+void mem_cgroup_track_foreign_dirty_slowpath(struct page *page,
+					     struct bdi_writeback *wb);
+
+static inline void mem_cgroup_track_foreign_dirty(struct page *page,
+						  struct bdi_writeback *wb)
+{
+	if (unlikely(&page->mem_cgroup->css != wb->memcg_css))
+		mem_cgroup_track_foreign_dirty_slowpath(page, wb);
+}
+
+void mem_cgroup_flush_foreign(struct bdi_writeback *wb);
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static inline struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
@@ -1233,6 +1259,15 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
 {
 }
 
+static inline void mem_cgroup_track_foreign_dirty(struct page *page,
+						  struct bdi_writeback *wb)
+{
+}
+
+static inline void mem_cgroup_flush_foreign(struct bdi_writeback *wb)
+{
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 struct sock;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cdbb7a84cb6e..e642fbacb504 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -87,6 +87,10 @@ int do_swap_account __read_mostly;
 #define do_swap_account		0
 #endif
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
+#endif
+
 /* Whether legacy memory+swap accounting is active */
 static bool do_memsw_account(void)
 {
@@ -4145,6 +4149,118 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	}
 }
 
+/*
+ * Foreign dirty flushing
+ *
+ * There's an inherent mismatch between memcg and writeback.  The former
+ * trackes ownership per-page while the latter per-inode.  This was a
+ * deliberate design decision because honoring per-page ownership in the
+ * writeback path is complicated, may lead to higher CPU and IO overheads
+ * and deemed unnecessary given that write-sharing an inode across
+ * different cgroups isn't a common use-case.
+ *
+ * Combined with inode majority-writer ownership switching, this works well
+ * enough in most cases but there are some pathological cases.  For
+ * example, let's say there are two cgroups A and B which keep writing to
+ * different but confined parts of the same inode.  B owns the inode and
+ * A's memory is limited far below B's.  A's dirty ratio can rise enough to
+ * trigger balance_dirty_pages() sleeps but B's can be low enough to avoid
+ * triggering background writeback.  A will be slowed down without a way to
+ * make writeback of the dirty pages happen.
+ *
+ * Conditions like the above can lead to a cgroup getting repatedly and
+ * severely throttled after making some progress after each
+ * dirty_expire_interval while the underyling IO device is almost
+ * completely idle.
+ *
+ * Solving this problem completely requires matching the ownership tracking
+ * granularities between memcg and writeback in either direction.  However,
+ * the more egregious behaviors can be avoided by simply remembering the
+ * most recent foreign dirtying events and initiating remote flushes on
+ * them when local writeback isn't enough to keep the memory clean enough.
+ *
+ * The following two functions implement such mechanism.  When a foreign
+ * page - a page whose memcg and writeback ownerships don't match - is
+ * dirtied, mem_cgroup_track_foreign_dirty() records the inode owning
+ * bdi_writeback on the page owning memcg.  When balance_dirty_pages()
+ * decides that the memcg needs to sleep due to high dirty ratio, it calls
+ * mem_cgroup_flush_foreign() which queues writeback on the recorded
+ * foreign bdi_writebacks which haven't expired.  Both the numbers of
+ * recorded bdi_writebacks and concurrent in-flight foreign writebacks are
+ * limited to MEMCG_CGWB_FRN_CNT.
+ *
+ * The mechanism only remembers IDs and doesn't hold any object references.
+ * As being wrong occasionally doesn't matter, updates and accesses to the
+ * records are lockless and racy.
+ */
+void mem_cgroup_track_foreign_dirty_slowpath(struct page *page,
+					     struct bdi_writeback *wb)
+{
+	struct mem_cgroup *memcg = page->mem_cgroup;
+	struct memcg_cgwb_frn *frn;
+	u64 now = jiffies_64;
+	u64 oldest_at = now;
+	int oldest = -1;
+	int i;
+
+	/*
+	 * Pick the slot to use.  If there is already a slot for @wb, keep
+	 * using it.  If not replace the oldest one which isn't being
+	 * written out.
+	 */
+	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
+		frn = &memcg->cgwb_frn[i];
+		if (frn->bdi_id == wb->bdi->id &&
+		    frn->memcg_id == wb->memcg_css->id)
+			break;
+		if (frn->at < oldest_at && atomic_read(&frn->done.cnt) == 1) {
+			oldest = i;
+			oldest_at = frn->at;
+		}
+	}
+
+	if (i < MEMCG_CGWB_FRN_CNT) {
+		unsigned long update_intv =
+			min_t(unsigned long, HZ,
+			      msecs_to_jiffies(dirty_expire_interval * 10) / 8);
+		/*
+		 * Re-using an existing one.  Let's update timestamp lazily
+		 * to avoid making the cacheline hot.
+		 */
+		if (frn->at < now - update_intv)
+			frn->at = now;
+	} else if (oldest >= 0) {
+		/* replace the oldest free one */
+		frn = &memcg->cgwb_frn[oldest];
+		frn->bdi_id = wb->bdi->id;
+		frn->memcg_id = wb->memcg_css->id;
+		frn->at = now;
+	}
+}
+
+/*
+ * Issue foreign writeback flushes for recorded foreign dirtying events
+ * which haven't expired yet and aren't already being written out.
+ */
+void mem_cgroup_flush_foreign(struct bdi_writeback *wb)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
+	unsigned long intv = msecs_to_jiffies(dirty_expire_interval * 10);
+	u64 now = jiffies_64;
+	int i;
+
+	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
+		struct memcg_cgwb_frn *frn = &memcg->cgwb_frn[i];
+
+		if (frn->at > now - intv && atomic_read(&frn->done.cnt) == 1) {
+			frn->at = 0;
+			cgroup_writeback_by_id(frn->bdi_id, frn->memcg_id,
+					       LONG_MAX, WB_REASON_FOREIGN_FLUSH,
+					       &frn->done);
+		}
+	}
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp)
@@ -4661,6 +4777,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	struct mem_cgroup *memcg;
 	unsigned int size;
 	int node;
+	int __maybe_unused i;
 
 	size = sizeof(struct mem_cgroup);
 	size += nr_node_ids * sizeof(struct mem_cgroup_per_node *);
@@ -4704,6 +4821,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 #endif
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&memcg->cgwb_list);
+	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
+		memcg->cgwb_frn[i].done =
+			__WB_COMPLETION_INIT(&memcg_cgwb_frn_waitq);
 #endif
 	idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
 	return memcg;
@@ -4833,7 +4953,12 @@ static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
 static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	int __maybe_unused i;
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
+		wb_wait_for_completion(&memcg->cgwb_frn[i].done);
+#endif
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
 		static_branch_dec(&memcg_sockets_enabled_key);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 1804f64ff43c..50055d2e4ea8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1667,6 +1667,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
 		if (unlikely(!writeback_in_progress(wb)))
 			wb_start_background_writeback(wb);
 
+		mem_cgroup_flush_foreign(wb);
+
 		/*
 		 * Calculate global domain's pos_ratio and select the
 		 * global dtc by default.
@@ -2427,6 +2429,8 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 		task_io_account_write(PAGE_SIZE);
 		current->nr_dirtied++;
 		this_cpu_inc(bdp_ratelimits);
+
+		mem_cgroup_track_foreign_dirty(page, wb);
 	}
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] bdi: Add bdi->id
  2019-08-03 14:01 ` [PATCH 2/4] bdi: Add bdi->id Tejun Heo
@ 2019-08-03 15:39   ` Matthew Wilcox
  2019-08-03 15:53     ` Tejun Heo
  2019-08-06 23:01   ` Andrew Morton
  2019-08-15 14:46   ` Jan Kara
  2 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2019-08-03 15:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, jack, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, kernel-team, guro, akpm

On Sat, Aug 03, 2019 at 07:01:53AM -0700, Tejun Heo wrote:
> There currently is no way to universally identify and lookup a bdi
> without holding a reference and pointer to it.  This patch adds an
> non-recycling bdi->id and implements bdi_get_by_id() which looks up
> bdis by their ids.  This will be used by memcg foreign inode flushing.
> 
> I left bdi_list alone for simplicity and because while rb_tree does
> support rcu assignment it doesn't seem to guarantee lossless walk when
> walk is racing aginst tree rebalance operations.

This would seem like the perfect use for an allocating xarray.  That
does guarantee lossless walk under the RCU lock.  You could get rid of the
bdi_list too.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] bdi: Add bdi->id
  2019-08-03 15:39   ` Matthew Wilcox
@ 2019-08-03 15:53     ` Tejun Heo
  2019-08-03 16:17       ` Matthew Wilcox
  0 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2019-08-03 15:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: axboe, jack, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, kernel-team, guro, akpm

Hey, Matthew.

On Sat, Aug 03, 2019 at 08:39:08AM -0700, Matthew Wilcox wrote:
> On Sat, Aug 03, 2019 at 07:01:53AM -0700, Tejun Heo wrote:
> > There currently is no way to universally identify and lookup a bdi
> > without holding a reference and pointer to it.  This patch adds an
> > non-recycling bdi->id and implements bdi_get_by_id() which looks up
> > bdis by their ids.  This will be used by memcg foreign inode flushing.
> > 
> > I left bdi_list alone for simplicity and because while rb_tree does
> > support rcu assignment it doesn't seem to guarantee lossless walk when
> > walk is racing aginst tree rebalance operations.
> 
> This would seem like the perfect use for an allocating xarray.  That
> does guarantee lossless walk under the RCU lock.  You could get rid of the
> bdi_list too.

It definitely came to mind but there's a bunch of downsides to
recycling IDs or using radix tree for non-compacting allocations.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] bdi: Add bdi->id
  2019-08-03 15:53     ` Tejun Heo
@ 2019-08-03 16:17       ` Matthew Wilcox
  0 siblings, 0 replies; 24+ messages in thread
From: Matthew Wilcox @ 2019-08-03 16:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, jack, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, kernel-team, guro, akpm

On Sat, Aug 03, 2019 at 08:53:49AM -0700, Tejun Heo wrote:
> Hey, Matthew.
> 
> On Sat, Aug 03, 2019 at 08:39:08AM -0700, Matthew Wilcox wrote:
> > On Sat, Aug 03, 2019 at 07:01:53AM -0700, Tejun Heo wrote:
> > > There currently is no way to universally identify and lookup a bdi
> > > without holding a reference and pointer to it.  This patch adds an
> > > non-recycling bdi->id and implements bdi_get_by_id() which looks up
> > > bdis by their ids.  This will be used by memcg foreign inode flushing.
> > > 
> > > I left bdi_list alone for simplicity and because while rb_tree does
> > > support rcu assignment it doesn't seem to guarantee lossless walk when
> > > walk is racing aginst tree rebalance operations.
> > 
> > This would seem like the perfect use for an allocating xarray.  That
> > does guarantee lossless walk under the RCU lock.  You could get rid of the
> > bdi_list too.
> 
> It definitely came to mind but there's a bunch of downsides to
> recycling IDs or using radix tree for non-compacting allocations.

Ah, I wasn't sure what would happen if you recycled an ID.  I agree, the
radix tree is pretty horrid for monotonically increasing IDs.  I'm still
working on the maple tree to replace it, but that's going slower than
I would like, so I can't in good conscience ask you to wait for it to
be ready.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] bdi: Add bdi->id
  2019-08-03 14:01 ` [PATCH 2/4] bdi: Add bdi->id Tejun Heo
  2019-08-03 15:39   ` Matthew Wilcox
@ 2019-08-06 23:01   ` Andrew Morton
  2019-08-07 18:31     ` Tejun Heo
  2019-08-15 14:46   ` Jan Kara
  2 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2019-08-06 23:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, jack, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, kernel-team, guro

On Sat,  3 Aug 2019 07:01:53 -0700 Tejun Heo <tj@kernel.org> wrote:

> There currently is no way to universally identify and lookup a bdi
> without holding a reference and pointer to it.  This patch adds an
> non-recycling bdi->id and implements bdi_get_by_id() which looks up
> bdis by their ids.  This will be used by memcg foreign inode flushing.

Why is the id non-recycling?  Presumably to address some
lifetime/lookup issues, but what are they?

Why was the IDR code not used?

> I left bdi_list alone for simplicity and because while rb_tree does
> support rcu assignment it doesn't seem to guarantee lossless walk when
> walk is racing aginst tree rebalance operations.
> 
> ...
>
> +/**
> + * bdi_get_by_id - lookup and get bdi from its id
> + * @id: bdi id to lookup
> + *
> + * Find bdi matching @id and get it.  Returns NULL if the matching bdi
> + * doesn't exist or is already unregistered.
> + */
> +struct backing_dev_info *bdi_get_by_id(u64 id)
> +{
> +	struct backing_dev_info *bdi = NULL;
> +	struct rb_node **p;
> +
> +	spin_lock_irq(&bdi_lock);

Why irq-safe?  Everywhere else uses spin_lock_bh(&bdi_lock).

> +	p = bdi_lookup_rb_node(id, NULL);
> +	if (*p) {
> +		bdi = rb_entry(*p, struct backing_dev_info, rb_node);
> +		bdi_get(bdi);
> +	}
> +	spin_unlock_irq(&bdi_lock);
> +
> +	return bdi;
> +}
> +
>
> ...
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] writeback, memcg: Implement foreign dirty flushing
  2019-08-03 14:01 ` [PATCH 4/4] writeback, memcg: Implement foreign dirty flushing Tejun Heo
@ 2019-08-06 23:03   ` Andrew Morton
  2019-08-07 18:34     ` Tejun Heo
  2019-08-15 14:34   ` Jan Kara
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2019-08-06 23:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, jack, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, kernel-team, guro

On Sat,  3 Aug 2019 07:01:55 -0700 Tejun Heo <tj@kernel.org> wrote:

> There's an inherent mismatch between memcg and writeback.  The former
> trackes ownership per-page while the latter per-inode.  This was a
> deliberate design decision because honoring per-page ownership in the
> writeback path is complicated, may lead to higher CPU and IO overheads
> and deemed unnecessary given that write-sharing an inode across
> different cgroups isn't a common use-case.
> 
> Combined with inode majority-writer ownership switching, this works
> well enough in most cases but there are some pathological cases.  For
> example, let's say there are two cgroups A and B which keep writing to
> different but confined parts of the same inode.  B owns the inode and
> A's memory is limited far below B's.  A's dirty ratio can rise enough
> to trigger balance_dirty_pages() sleeps but B's can be low enough to
> avoid triggering background writeback.  A will be slowed down without
> a way to make writeback of the dirty pages happen.
> 
> This patch implements foreign dirty recording and foreign mechanism so
> that when a memcg encounters a condition as above it can trigger
> flushes on bdi_writebacks which can clean its pages.  Please see the
> comment on top of mem_cgroup_track_foreign_dirty_slowpath() for
> details.
> 
> ...
>
> +void mem_cgroup_track_foreign_dirty_slowpath(struct page *page,
> +					     struct bdi_writeback *wb)
> +{
> +	struct mem_cgroup *memcg = page->mem_cgroup;
> +	struct memcg_cgwb_frn *frn;
> +	u64 now = jiffies_64;
> +	u64 oldest_at = now;
> +	int oldest = -1;
> +	int i;
> +
> +	/*
> +	 * Pick the slot to use.  If there is already a slot for @wb, keep
> +	 * using it.  If not replace the oldest one which isn't being
> +	 * written out.
> +	 */
> +	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
> +		frn = &memcg->cgwb_frn[i];
> +		if (frn->bdi_id == wb->bdi->id &&
> +		    frn->memcg_id == wb->memcg_css->id)
> +			break;
> +		if (frn->at < oldest_at && atomic_read(&frn->done.cnt) == 1) {
> +			oldest = i;
> +			oldest_at = frn->at;
> +		}
> +	}
> +
> +	if (i < MEMCG_CGWB_FRN_CNT) {
> +		unsigned long update_intv =
> +			min_t(unsigned long, HZ,
> +			      msecs_to_jiffies(dirty_expire_interval * 10) / 8);

An explanation of what's going on here would be helpful.

Why "* 1.25" and not, umm "* 1.24"?

> +		/*
> +		 * Re-using an existing one.  Let's update timestamp lazily
> +		 * to avoid making the cacheline hot.
> +		 */
> +		if (frn->at < now - update_intv)
> +			frn->at = now;
> +	} else if (oldest >= 0) {
> +		/* replace the oldest free one */
> +		frn = &memcg->cgwb_frn[oldest];
> +		frn->bdi_id = wb->bdi->id;
> +		frn->memcg_id = wb->memcg_css->id;
> +		frn->at = now;
> +	}
> +}
> +
> +/*
> + * Issue foreign writeback flushes for recorded foreign dirtying events
> + * which haven't expired yet and aren't already being written out.
> + */
> +void mem_cgroup_flush_foreign(struct bdi_writeback *wb)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
> +	unsigned long intv = msecs_to_jiffies(dirty_expire_interval * 10);

Ditto.

> +	u64 now = jiffies_64;
> +	int i;
> +
> +	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
> +		struct memcg_cgwb_frn *frn = &memcg->cgwb_frn[i];
> +
> +		if (frn->at > now - intv && atomic_read(&frn->done.cnt) == 1) {
> +			frn->at = 0;
> +			cgroup_writeback_by_id(frn->bdi_id, frn->memcg_id,
> +					       LONG_MAX, WB_REASON_FOREIGN_FLUSH,
> +					       &frn->done);
> +		}
> +	}
> +}
> +


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] bdi: Add bdi->id
  2019-08-06 23:01   ` Andrew Morton
@ 2019-08-07 18:31     ` Tejun Heo
  2019-08-07 19:00       ` Andrew Morton
  0 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2019-08-07 18:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: axboe, jack, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, kernel-team, guro

Hello,

On Tue, Aug 06, 2019 at 04:01:02PM -0700, Andrew Morton wrote:
> On Sat,  3 Aug 2019 07:01:53 -0700 Tejun Heo <tj@kernel.org> wrote:
> > There currently is no way to universally identify and lookup a bdi
> > without holding a reference and pointer to it.  This patch adds an
> > non-recycling bdi->id and implements bdi_get_by_id() which looks up
> > bdis by their ids.  This will be used by memcg foreign inode flushing.
> 
> Why is the id non-recycling?  Presumably to address some
> lifetime/lookup issues, but what are they?

The ID by itself is used to point to the bdi from cgroup and idr
recycles really aggressively.  Combined with, for example, loop device
based containers, stale pointing can become pretty common.  We're
having similar issues with cgroup IDs.

> Why was the IDR code not used?

Because of the rapid recycling.  In the longer term, I think we want
IDR to be able to support non-recycling IDs, or at least less
agressive recycling.

> > +struct backing_dev_info *bdi_get_by_id(u64 id)
> > +{
> > +	struct backing_dev_info *bdi = NULL;
> > +	struct rb_node **p;
> > +
> > +	spin_lock_irq(&bdi_lock);
> 
> Why irq-safe?  Everywhere else uses spin_lock_bh(&bdi_lock).

By mistake, I'll change them to bh.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] writeback, memcg: Implement foreign dirty flushing
  2019-08-06 23:03   ` Andrew Morton
@ 2019-08-07 18:34     ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2019-08-07 18:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: axboe, jack, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, kernel-team, guro

Hello,

On Tue, Aug 06, 2019 at 04:03:06PM -0700, Andrew Morton wrote:
> > +	if (i < MEMCG_CGWB_FRN_CNT) {
> > +		unsigned long update_intv =
> > +			min_t(unsigned long, HZ,
> > +			      msecs_to_jiffies(dirty_expire_interval * 10) / 8);
> 
> An explanation of what's going on here would be helpful.
> 
> Why "* 1.25" and not, umm "* 1.24"?

Just because /8 is cheaper.  It's likely that a fairly wide range of
numbers are okay for the above.  I'll add some comment to explain that
and why the specific constants are picked.

> > +void mem_cgroup_flush_foreign(struct bdi_writeback *wb)
> > +{
> > +	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
> > +	unsigned long intv = msecs_to_jiffies(dirty_expire_interval * 10);
> 
> Ditto.

This is just dirty expiration.  If the dirty data has expired,
writeback must already be in progress by its bdi_wb, so there's no
reason to scheduler foreign writeback.  Will add a comment.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] bdi: Add bdi->id
  2019-08-07 18:31     ` Tejun Heo
@ 2019-08-07 19:00       ` Andrew Morton
  2019-08-07 20:34         ` Tejun Heo
  2019-08-09  0:57         ` Rik van Riel
  0 siblings, 2 replies; 24+ messages in thread
From: Andrew Morton @ 2019-08-07 19:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, jack, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, kernel-team, guro

On Wed, 7 Aug 2019 11:31:51 -0700 Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On Tue, Aug 06, 2019 at 04:01:02PM -0700, Andrew Morton wrote:
> > On Sat,  3 Aug 2019 07:01:53 -0700 Tejun Heo <tj@kernel.org> wrote:
> > > There currently is no way to universally identify and lookup a bdi
> > > without holding a reference and pointer to it.  This patch adds an
> > > non-recycling bdi->id and implements bdi_get_by_id() which looks up
> > > bdis by their ids.  This will be used by memcg foreign inode flushing.
> > 
> > Why is the id non-recycling?  Presumably to address some
> > lifetime/lookup issues, but what are they?
> 
> The ID by itself is used to point to the bdi from cgroup and idr
> recycles really aggressively.  Combined with, for example, loop device
> based containers, stale pointing can become pretty common.  We're
> having similar issues with cgroup IDs.

OK, but why is recycling a problem?  For example, file descriptors
recycle as aggressively as is possible, and that doesn't cause any
trouble.  Presumably recycling is a problem with cgroups because of
some sort of stale reference problem?



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] bdi: Add bdi->id
  2019-08-07 19:00       ` Andrew Morton
@ 2019-08-07 20:34         ` Tejun Heo
  2019-08-09  0:57         ` Rik van Riel
  1 sibling, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2019-08-07 20:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: axboe, jack, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, kernel-team, guro

Hello,

On Wed, Aug 07, 2019 at 12:00:37PM -0700, Andrew Morton wrote:
> OK, but why is recycling a problem?  For example, file descriptors
> recycle as aggressively as is possible, and that doesn't cause any
> trouble.  Presumably recycling is a problem with cgroups because of
> some sort of stale reference problem?

Oh, because there are use cases where the consumers are detached from
the lifetime synchronization.  In this example, the benefit of using
IDs is that memcgs don't need to pin foreign bdi_wb's and just look up
and verify when it wants to flush them.  If we use pointers, we have
to pin the objects which then requires either shooting down those
references with timers or somehow doing reverse lookup to shoot them
down when bdi_wb is taken offline.  If we use IDs which can be
recycling aggressively, there can be pathological cases where remote
flushes are issued on the wrong target possibly repeatedly, which may
or may not be a real problem.

For cgroup ID, the problem is more immediate.  We give out the IDs to
userspace and there is no way to shoot those references down when the
cgroup goes away and the ID gets recycled, so when the user comes back
and asks "I want to attach this bpf program to the cgroup identified
with this ID", we can end up attaching it to the wrong cgroup if we
recycle IDs.  cgroup ended up with two separate IDs, which is kinda
dumb.

tl;dr is that it's either cumbersome or impossible to synchronize the
users of these IDs, so if they get recycled, they end up identifying
the wrong thing.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] bdi: Add bdi->id
  2019-08-07 19:00       ` Andrew Morton
  2019-08-07 20:34         ` Tejun Heo
@ 2019-08-09  0:57         ` Rik van Riel
  1 sibling, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2019-08-09  0:57 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: axboe, jack, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, Kernel Team, Roman Gushchin

On Wed, 2019-08-07 at 12:00 -0700, Andrew Morton wrote:
> On Wed, 7 Aug 2019 11:31:51 -0700 Tejun Heo <tj@kernel.org> wrote:
> 
> > Hello,
> > 
> > On Tue, Aug 06, 2019 at 04:01:02PM -0700, Andrew Morton wrote:
> > > On Sat,  3 Aug 2019 07:01:53 -0700 Tejun Heo <tj@kernel.org>
> > > wrote:
> > > > There currently is no way to universally identify and lookup a
> > > > bdi
> > > > without holding a reference and pointer to it.  This patch adds
> > > > an
> > > > non-recycling bdi->id and implements bdi_get_by_id() which
> > > > looks up
> > > > bdis by their ids.  This will be used by memcg foreign inode
> > > > flushing.
> > > 
> > > Why is the id non-recycling?  Presumably to address some
> > > lifetime/lookup issues, but what are they?
> > 
> > The ID by itself is used to point to the bdi from cgroup and idr
> > recycles really aggressively.  Combined with, for example, loop
> > device
> > based containers, stale pointing can become pretty common.  We're
> > having similar issues with cgroup IDs.
> 
> OK, but why is recycling a problem?  For example, file descriptors
> recycle as aggressively as is possible, and that doesn't cause any
> trouble.  Presumably recycling is a problem with cgroups because of
> some sort of stale reference problem?

PIDs, on the other hand, we recycle as slowly as
possible...


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] writeback, memcg: Implement cgroup_writeback_by_id()
  2019-08-03 14:01 ` [PATCH 3/4] writeback, memcg: Implement cgroup_writeback_by_id() Tejun Heo
@ 2019-08-15 14:05   ` Jan Kara
  2019-08-15 15:43     ` Tejun Heo
  2019-08-15 14:54   ` Jan Kara
  1 sibling, 1 reply; 24+ messages in thread
From: Jan Kara @ 2019-08-15 14:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, jack, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, kernel-team, guro, akpm

On Sat 03-08-19 07:01:54, Tejun Heo wrote:
> Implement cgroup_writeback_by_id() which initiates cgroup writeback
> from bdi and memcg IDs.  This will be used by memcg foreign inode
> flushing.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  fs/fs-writeback.c         | 64 +++++++++++++++++++++++++++++++++++++++
>  include/linux/writeback.h |  4 +++
>  2 files changed, 68 insertions(+)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 6129debdc938..5c79d7acefdb 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -880,6 +880,70 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
>  		wb_put(last_wb);
>  }
>  
> +/**
> + * cgroup_writeback_by_id - initiate cgroup writeback from bdi and memcg IDs
> + * @bdi_id: target bdi id
> + * @memcg_id: target memcg css id
> + * @nr_pages: number of pages to write
> + * @reason: reason why some writeback work initiated
> + * @done: target wb_completion
> + *
> + * Initiate flush of the bdi_writeback identified by @bdi_id and @memcg_id
> + * with the specified parameters.
> + */
> +int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr,
> +			   enum wb_reason reason, struct wb_completion *done)
> +{
> +	struct backing_dev_info *bdi;
> +	struct cgroup_subsys_state *memcg_css;
> +	struct bdi_writeback *wb;
> +	struct wb_writeback_work *work;
> +	int ret;
> +
> +	/* lookup bdi and memcg */
> +	bdi = bdi_get_by_id(bdi_id);
> +	if (!bdi)
> +		return -ENOENT;
> +
> +	rcu_read_lock();
> +	memcg_css = css_from_id(memcg_id, &memory_cgrp_subsys);
> +	if (memcg_css && !css_tryget(memcg_css))
> +		memcg_css = NULL;
> +	rcu_read_unlock();
> +	if (!memcg_css) {
> +		ret = -ENOENT;
> +		goto out_bdi_put;
> +	}
> +
> +	/* and find the associated wb */
> +	wb = wb_get_create(bdi, memcg_css, GFP_NOWAIT | __GFP_NOWARN);
> +	if (!wb) {
> +		ret = -ENOMEM;
> +		goto out_css_put;
> +	}
> +
> +	/* issue the writeback work */
> +	work = kzalloc(sizeof(*work), GFP_NOWAIT | __GFP_NOWARN);
> +	if (work) {
> +		work->nr_pages = nr;
> +		work->sync_mode = WB_SYNC_NONE;
> +		work->reason = reason;
> +		work->done = done;
> +		work->auto_free = 1;
> +		wb_queue_work(wb, work);
> +		ret = 0;
> +	} else {
> +		ret = -ENOMEM;
> +	}
> +
> +	wb_put(wb);
> +out_css_put:
> +	css_put(memcg_css);
> +out_bdi_put:
> +	bdi_put(bdi);
> +	return ret;
> +}
> +
>  /**
>   * cgroup_writeback_umount - flush inode wb switches for umount
>   *
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 8945aac31392..ad794f2a7d42 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -217,6 +217,10 @@ void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
>  void wbc_detach_inode(struct writeback_control *wbc);
>  void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
>  			      size_t bytes);
> +int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages,
> +			   enum wb_reason reason, struct wb_completion *done);
> +int writeback_by_id(int id, unsigned long nr, enum wb_reason reason,
> +		    struct wb_completion *done);

I guess this writeback_by_id() is stale? I didn't find it anywhere else...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] writeback, memcg: Implement foreign dirty flushing
  2019-08-03 14:01 ` [PATCH 4/4] writeback, memcg: Implement foreign dirty flushing Tejun Heo
  2019-08-06 23:03   ` Andrew Morton
@ 2019-08-15 14:34   ` Jan Kara
  2019-08-15 17:31     ` Tejun Heo
  1 sibling, 1 reply; 24+ messages in thread
From: Jan Kara @ 2019-08-15 14:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, jack, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, kernel-team, guro, akpm

On Sat 03-08-19 07:01:55, Tejun Heo wrote:
> +void mem_cgroup_track_foreign_dirty_slowpath(struct page *page,
> +					     struct bdi_writeback *wb)
> +{
> +	struct mem_cgroup *memcg = page->mem_cgroup;
> +	struct memcg_cgwb_frn *frn;
> +	u64 now = jiffies_64;
> +	u64 oldest_at = now;
> +	int oldest = -1;
> +	int i;
> +
> +	/*
> +	 * Pick the slot to use.  If there is already a slot for @wb, keep
> +	 * using it.  If not replace the oldest one which isn't being
> +	 * written out.
> +	 */
> +	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
> +		frn = &memcg->cgwb_frn[i];
> +		if (frn->bdi_id == wb->bdi->id &&
> +		    frn->memcg_id == wb->memcg_css->id)
> +			break;
> +		if (frn->at < oldest_at && atomic_read(&frn->done.cnt) == 1) {
> +			oldest = i;
> +			oldest_at = frn->at;
> +		}
> +	}
> +
> +	if (i < MEMCG_CGWB_FRN_CNT) {
> +		unsigned long update_intv =
> +			min_t(unsigned long, HZ,
> +			      msecs_to_jiffies(dirty_expire_interval * 10) / 8);
> +		/*
> +		 * Re-using an existing one.  Let's update timestamp lazily
> +		 * to avoid making the cacheline hot.
> +		 */
> +		if (frn->at < now - update_intv)
> +			frn->at = now;
> +	} else if (oldest >= 0) {
> +		/* replace the oldest free one */
> +		frn = &memcg->cgwb_frn[oldest];
> +		frn->bdi_id = wb->bdi->id;
> +		frn->memcg_id = wb->memcg_css->id;
> +		frn->at = now;
> +	}

I have to say I'm a bit nervous about the completely lockless handling
here. I understand that garbage in the cgwb_frn will just result in this
mechanism not working and possibly flushing wrong wb's but still it seems a
bit fragile. But I don't see any cheap way of synchronizing this so I guess
let's try how this will work in practice.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] writeback: Generalize and expose wb_completion
  2019-08-03 14:01 ` [PATCH 1/4] writeback: Generalize and expose wb_completion Tejun Heo
@ 2019-08-15 14:41   ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2019-08-15 14:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, jack, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, kernel-team, guro, akpm

On Sat 03-08-19 07:01:52, Tejun Heo wrote:
> wb_completion is used to track writeback completions.  We want to use
> it from memcg side for foreign inode flushes.  This patch updates it
> to remember the target waitq instead of assuming bdi->wb_waitq and
> expose it outside of fs-writeback.c.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/fs-writeback.c                | 47 ++++++++++----------------------
>  include/linux/backing-dev-defs.h | 20 ++++++++++++++
>  include/linux/backing-dev.h      |  2 ++
>  3 files changed, 36 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 542b02d170f8..6129debdc938 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -36,10 +36,6 @@
>   */
>  #define MIN_WRITEBACK_PAGES	(4096UL >> (PAGE_SHIFT - 10))
>  
> -struct wb_completion {
> -	atomic_t		cnt;
> -};
> -
>  /*
>   * Passed into wb_writeback(), essentially a subset of writeback_control
>   */
> @@ -60,19 +56,6 @@ struct wb_writeback_work {
>  	struct wb_completion *done;	/* set if the caller waits */
>  };
>  
> -/*
> - * If one wants to wait for one or more wb_writeback_works, each work's
> - * ->done should be set to a wb_completion defined using the following
> - * macro.  Once all work items are issued with wb_queue_work(), the caller
> - * can wait for the completion of all using wb_wait_for_completion().  Work
> - * items which are waited upon aren't freed automatically on completion.
> - */
> -#define DEFINE_WB_COMPLETION_ONSTACK(cmpl)				\
> -	struct wb_completion cmpl = {					\
> -		.cnt		= ATOMIC_INIT(1),			\
> -	}
> -
> -
>  /*
>   * If an inode is constantly having its pages dirtied, but then the
>   * updates stop dirtytime_expire_interval seconds in the past, it's
> @@ -182,7 +165,7 @@ static void finish_writeback_work(struct bdi_writeback *wb,
>  	if (work->auto_free)
>  		kfree(work);
>  	if (done && atomic_dec_and_test(&done->cnt))
> -		wake_up_all(&wb->bdi->wb_waitq);
> +		wake_up_all(done->waitq);
>  }
>  
>  static void wb_queue_work(struct bdi_writeback *wb,
> @@ -206,20 +189,18 @@ static void wb_queue_work(struct bdi_writeback *wb,
>  
>  /**
>   * wb_wait_for_completion - wait for completion of bdi_writeback_works
> - * @bdi: bdi work items were issued to
>   * @done: target wb_completion
>   *
>   * Wait for one or more work items issued to @bdi with their ->done field
> - * set to @done, which should have been defined with
> - * DEFINE_WB_COMPLETION_ONSTACK().  This function returns after all such
> - * work items are completed.  Work items which are waited upon aren't freed
> + * set to @done, which should have been initialized with
> + * DEFINE_WB_COMPLETION().  This function returns after all such work items
> + * are completed.  Work items which are waited upon aren't freed
>   * automatically on completion.
>   */
> -static void wb_wait_for_completion(struct backing_dev_info *bdi,
> -				   struct wb_completion *done)
> +void wb_wait_for_completion(struct wb_completion *done)
>  {
>  	atomic_dec(&done->cnt);		/* put down the initial count */
> -	wait_event(bdi->wb_waitq, !atomic_read(&done->cnt));
> +	wait_event(*done->waitq, !atomic_read(&done->cnt));
>  }
>  
>  #ifdef CONFIG_CGROUP_WRITEBACK
> @@ -843,7 +824,7 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
>  restart:
>  	rcu_read_lock();
>  	list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) {
> -		DEFINE_WB_COMPLETION_ONSTACK(fallback_work_done);
> +		DEFINE_WB_COMPLETION(fallback_work_done, bdi);
>  		struct wb_writeback_work fallback_work;
>  		struct wb_writeback_work *work;
>  		long nr_pages;
> @@ -890,7 +871,7 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
>  		last_wb = wb;
>  
>  		rcu_read_unlock();
> -		wb_wait_for_completion(bdi, &fallback_work_done);
> +		wb_wait_for_completion(&fallback_work_done);
>  		goto restart;
>  	}
>  	rcu_read_unlock();
> @@ -2362,7 +2343,8 @@ static void wait_sb_inodes(struct super_block *sb)
>  static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
>  				     enum wb_reason reason, bool skip_if_busy)
>  {
> -	DEFINE_WB_COMPLETION_ONSTACK(done);
> +	struct backing_dev_info *bdi = sb->s_bdi;
> +	DEFINE_WB_COMPLETION(done, bdi);
>  	struct wb_writeback_work work = {
>  		.sb			= sb,
>  		.sync_mode		= WB_SYNC_NONE,
> @@ -2371,14 +2353,13 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
>  		.nr_pages		= nr,
>  		.reason			= reason,
>  	};
> -	struct backing_dev_info *bdi = sb->s_bdi;
>  
>  	if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
>  		return;
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
>  	bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy);
> -	wb_wait_for_completion(bdi, &done);
> +	wb_wait_for_completion(&done);
>  }
>  
>  /**
> @@ -2440,7 +2421,8 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb);
>   */
>  void sync_inodes_sb(struct super_block *sb)
>  {
> -	DEFINE_WB_COMPLETION_ONSTACK(done);
> +	struct backing_dev_info *bdi = sb->s_bdi;
> +	DEFINE_WB_COMPLETION(done, bdi);
>  	struct wb_writeback_work work = {
>  		.sb		= sb,
>  		.sync_mode	= WB_SYNC_ALL,
> @@ -2450,7 +2432,6 @@ void sync_inodes_sb(struct super_block *sb)
>  		.reason		= WB_REASON_SYNC,
>  		.for_sync	= 1,
>  	};
> -	struct backing_dev_info *bdi = sb->s_bdi;
>  
>  	/*
>  	 * Can't skip on !bdi_has_dirty() because we should wait for !dirty
> @@ -2464,7 +2445,7 @@ void sync_inodes_sb(struct super_block *sb)
>  	/* protect against inode wb switch, see inode_switch_wbs_work_fn() */
>  	bdi_down_write_wb_switch_rwsem(bdi);
>  	bdi_split_work_to_wbs(bdi, &work, false);
> -	wb_wait_for_completion(bdi, &done);
> +	wb_wait_for_completion(&done);
>  	bdi_up_write_wb_switch_rwsem(bdi);
>  
>  	wait_sb_inodes(sb);
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index 6a1a8a314d85..8fb740178d5d 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -67,6 +67,26 @@ enum wb_reason {
>  	WB_REASON_MAX,
>  };
>  
> +struct wb_completion {
> +	atomic_t		cnt;
> +	wait_queue_head_t	*waitq;
> +};
> +
> +#define __WB_COMPLETION_INIT(_waitq)	\
> +	(struct wb_completion){ .cnt = ATOMIC_INIT(1), .waitq = (_waitq) }
> +
> +/*
> + * If one wants to wait for one or more wb_writeback_works, each work's
> + * ->done should be set to a wb_completion defined using the following
> + * macro.  Once all work items are issued with wb_queue_work(), the caller
> + * can wait for the completion of all using wb_wait_for_completion().  Work
> + * items which are waited upon aren't freed automatically on completion.
> + */
> +#define WB_COMPLETION_INIT(bdi)		__WB_COMPLETION_INIT(&(bdi)->wb_waitq)
> +
> +#define DEFINE_WB_COMPLETION(cmpl, bdi)	\
> +	struct wb_completion cmpl = WB_COMPLETION_INIT(bdi)
> +
>  /*
>   * For cgroup writeback, multiple wb's may map to the same blkcg.  Those
>   * wb's can operate mostly independently but should share the congested
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 35b31d176f74..02650b1253a2 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -44,6 +44,8 @@ void wb_start_background_writeback(struct bdi_writeback *wb);
>  void wb_workfn(struct work_struct *work);
>  void wb_wakeup_delayed(struct bdi_writeback *wb);
>  
> +void wb_wait_for_completion(struct wb_completion *done);
> +
>  extern spinlock_t bdi_lock;
>  extern struct list_head bdi_list;
>  
> -- 
> 2.17.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] bdi: Add bdi->id
  2019-08-03 14:01 ` [PATCH 2/4] bdi: Add bdi->id Tejun Heo
  2019-08-03 15:39   ` Matthew Wilcox
  2019-08-06 23:01   ` Andrew Morton
@ 2019-08-15 14:46   ` Jan Kara
  2019-08-15 17:34     ` Tejun Heo
  2 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2019-08-15 14:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, jack, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, kernel-team, guro, akpm

On Sat 03-08-19 07:01:53, Tejun Heo wrote:
> There currently is no way to universally identify and lookup a bdi
> without holding a reference and pointer to it.  This patch adds an
> non-recycling bdi->id and implements bdi_get_by_id() which looks up
> bdis by their ids.  This will be used by memcg foreign inode flushing.
> 
> I left bdi_list alone for simplicity and because while rb_tree does
> support rcu assignment it doesn't seem to guarantee lossless walk when
> walk is racing aginst tree rebalance operations.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

Although I would note that here you take effort not to recycle bdi->id so
that you don't flush wrong devices while in patch 4 you take pretty lax
approach to feeding garbage into the writeback system. So these two don't
quite match to me...

								Honza

> ---
>  include/linux/backing-dev-defs.h |  2 +
>  include/linux/backing-dev.h      |  1 +
>  mm/backing-dev.c                 | 65 +++++++++++++++++++++++++++++++-
>  3 files changed, 66 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index 8fb740178d5d..1075f2552cfc 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -185,6 +185,8 @@ struct bdi_writeback {
>  };
>  
>  struct backing_dev_info {
> +	u64 id;
> +	struct rb_node rb_node; /* keyed by ->id */
>  	struct list_head bdi_list;
>  	unsigned long ra_pages;	/* max readahead in PAGE_SIZE units */
>  	unsigned long io_pages;	/* max allowed IO size */
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 02650b1253a2..84cdcfbc763f 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -24,6 +24,7 @@ static inline struct backing_dev_info *bdi_get(struct backing_dev_info *bdi)
>  	return bdi;
>  }
>  
> +struct backing_dev_info *bdi_get_by_id(u64 id);
>  void bdi_put(struct backing_dev_info *bdi);
>  
>  __printf(2, 3)
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index e8e89158adec..4a8816e0b8d4 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -1,6 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  
>  #include <linux/wait.h>
> +#include <linux/rbtree.h>
>  #include <linux/backing-dev.h>
>  #include <linux/kthread.h>
>  #include <linux/freezer.h>
> @@ -22,10 +23,12 @@ EXPORT_SYMBOL_GPL(noop_backing_dev_info);
>  static struct class *bdi_class;
>  
>  /*
> - * bdi_lock protects updates to bdi_list. bdi_list has RCU reader side
> - * locking.
> + * bdi_lock protects bdi_tree and updates to bdi_list. bdi_list has RCU
> + * reader side locking.
>   */
>  DEFINE_SPINLOCK(bdi_lock);
> +static u64 bdi_id_cursor;
> +static struct rb_root bdi_tree = RB_ROOT;
>  LIST_HEAD(bdi_list);
>  
>  /* bdi_wq serves all asynchronous writeback tasks */
> @@ -859,9 +862,58 @@ struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id)
>  }
>  EXPORT_SYMBOL(bdi_alloc_node);
>  
> +struct rb_node **bdi_lookup_rb_node(u64 id, struct rb_node **parentp)
> +{
> +	struct rb_node **p = &bdi_tree.rb_node;
> +	struct rb_node *parent = NULL;
> +	struct backing_dev_info *bdi;
> +
> +	lockdep_assert_held(&bdi_lock);
> +
> +	while (*p) {
> +		parent = *p;
> +		bdi = rb_entry(parent, struct backing_dev_info, rb_node);
> +
> +		if (bdi->id > id)
> +			p = &(*p)->rb_left;
> +		else if (bdi->id < id)
> +			p = &(*p)->rb_right;
> +		else
> +			break;
> +	}
> +
> +	if (parentp)
> +		*parentp = parent;
> +	return p;
> +}
> +
> +/**
> + * bdi_get_by_id - lookup and get bdi from its id
> + * @id: bdi id to lookup
> + *
> + * Find bdi matching @id and get it.  Returns NULL if the matching bdi
> + * doesn't exist or is already unregistered.
> + */
> +struct backing_dev_info *bdi_get_by_id(u64 id)
> +{
> +	struct backing_dev_info *bdi = NULL;
> +	struct rb_node **p;
> +
> +	spin_lock_irq(&bdi_lock);
> +	p = bdi_lookup_rb_node(id, NULL);
> +	if (*p) {
> +		bdi = rb_entry(*p, struct backing_dev_info, rb_node);
> +		bdi_get(bdi);
> +	}
> +	spin_unlock_irq(&bdi_lock);
> +
> +	return bdi;
> +}
> +
>  int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
>  {
>  	struct device *dev;
> +	struct rb_node *parent, **p;
>  
>  	if (bdi->dev)	/* The driver needs to use separate queues per device */
>  		return 0;
> @@ -877,7 +929,15 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
>  	set_bit(WB_registered, &bdi->wb.state);
>  
>  	spin_lock_bh(&bdi_lock);
> +
> +	bdi->id = ++bdi_id_cursor;
> +
> +	p = bdi_lookup_rb_node(bdi->id, &parent);
> +	rb_link_node(&bdi->rb_node, parent, p);
> +	rb_insert_color(&bdi->rb_node, &bdi_tree);
> +
>  	list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
> +
>  	spin_unlock_bh(&bdi_lock);
>  
>  	trace_writeback_bdi_register(bdi);
> @@ -918,6 +978,7 @@ EXPORT_SYMBOL(bdi_register_owner);
>  static void bdi_remove_from_list(struct backing_dev_info *bdi)
>  {
>  	spin_lock_bh(&bdi_lock);
> +	rb_erase(&bdi->rb_node, &bdi_tree);
>  	list_del_rcu(&bdi->bdi_list);
>  	spin_unlock_bh(&bdi_lock);
>  
> -- 
> 2.17.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] writeback, memcg: Implement cgroup_writeback_by_id()
  2019-08-03 14:01 ` [PATCH 3/4] writeback, memcg: Implement cgroup_writeback_by_id() Tejun Heo
  2019-08-15 14:05   ` Jan Kara
@ 2019-08-15 14:54   ` Jan Kara
  2019-08-15 16:12     ` Tejun Heo
  1 sibling, 1 reply; 24+ messages in thread
From: Jan Kara @ 2019-08-15 14:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, jack, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, kernel-team, guro, akpm

On Sat 03-08-19 07:01:54, Tejun Heo wrote:
> Implement cgroup_writeback_by_id() which initiates cgroup writeback
> from bdi and memcg IDs.  This will be used by memcg foreign inode
> flushing.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  fs/fs-writeback.c         | 64 +++++++++++++++++++++++++++++++++++++++
>  include/linux/writeback.h |  4 +++
>  2 files changed, 68 insertions(+)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 6129debdc938..5c79d7acefdb 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -880,6 +880,70 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
>  		wb_put(last_wb);
>  }
>  
> +/**
> + * cgroup_writeback_by_id - initiate cgroup writeback from bdi and memcg IDs
> + * @bdi_id: target bdi id
> + * @memcg_id: target memcg css id
> + * @nr_pages: number of pages to write
> + * @reason: reason why some writeback work initiated
> + * @done: target wb_completion
> + *
> + * Initiate flush of the bdi_writeback identified by @bdi_id and @memcg_id
> + * with the specified parameters.
> + */
> +int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr,
> +			   enum wb_reason reason, struct wb_completion *done)
> +{
> +	struct backing_dev_info *bdi;
> +	struct cgroup_subsys_state *memcg_css;
> +	struct bdi_writeback *wb;
> +	struct wb_writeback_work *work;
> +	int ret;
> +
> +	/* lookup bdi and memcg */
> +	bdi = bdi_get_by_id(bdi_id);
> +	if (!bdi)
> +		return -ENOENT;
> +
> +	rcu_read_lock();
> +	memcg_css = css_from_id(memcg_id, &memory_cgrp_subsys);
> +	if (memcg_css && !css_tryget(memcg_css))
> +		memcg_css = NULL;
> +	rcu_read_unlock();
> +	if (!memcg_css) {
> +		ret = -ENOENT;
> +		goto out_bdi_put;
> +	}
> +
> +	/* and find the associated wb */
> +	wb = wb_get_create(bdi, memcg_css, GFP_NOWAIT | __GFP_NOWARN);
> +	if (!wb) {
> +		ret = -ENOMEM;
> +		goto out_css_put;
> +	}

One more thought: You don't want the "_create" part here, do you? If
there's any point in writing back using this wb, it must be attached to
some inode and thus it must exist. In the normal case wb_get_create() will
just fetch the reference and be done with it but when you feed garbage into
this function due to id going stale or frn structures getting corrupted due
to concurrent access, you can be creating bogus wb structures in bdi...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] writeback, memcg: Implement cgroup_writeback_by_id()
  2019-08-15 14:05   ` Jan Kara
@ 2019-08-15 15:43     ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2019-08-15 15:43 UTC (permalink / raw)
  To: Jan Kara
  Cc: axboe, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, kernel-team, guro, akpm

On Thu, Aug 15, 2019 at 04:05:35PM +0200, Jan Kara wrote:
> > +int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages,
> > +			   enum wb_reason reason, struct wb_completion *done);
> > +int writeback_by_id(int id, unsigned long nr, enum wb_reason reason,
> > +		    struct wb_completion *done);
> 
> I guess this writeback_by_id() is stale? I didn't find it anywhere else...

Yes, removed.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] writeback, memcg: Implement cgroup_writeback_by_id()
  2019-08-15 14:54   ` Jan Kara
@ 2019-08-15 16:12     ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2019-08-15 16:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: axboe, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, kernel-team, guro, akpm

On Thu, Aug 15, 2019 at 04:54:21PM +0200, Jan Kara wrote:
> > +	/* and find the associated wb */
> > +	wb = wb_get_create(bdi, memcg_css, GFP_NOWAIT | __GFP_NOWARN);
> > +	if (!wb) {
> > +		ret = -ENOMEM;
> > +		goto out_css_put;
> > +	}
> 
> One more thought: You don't want the "_create" part here, do you? If
> there's any point in writing back using this wb, it must be attached to
> some inode and thus it must exist. In the normal case wb_get_create() will
> just fetch the reference and be done with it but when you feed garbage into
> this function due to id going stale or frn structures getting corrupted due
> to concurrent access, you can be creating bogus wb structures in bdi...

Yeah, it can create wbs unnecessarily which isn't critical but also is
easy to fix.  Will update.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] writeback, memcg: Implement foreign dirty flushing
  2019-08-15 14:34   ` Jan Kara
@ 2019-08-15 17:31     ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2019-08-15 17:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: axboe, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, kernel-team, guro, akpm

Hello, Jan.

On Thu, Aug 15, 2019 at 04:34:04PM +0200, Jan Kara wrote:
> I have to say I'm a bit nervous about the completely lockless handling
> here. I understand that garbage in the cgwb_frn will just result in this
> mechanism not working and possibly flushing wrong wb's but still it seems a
> bit fragile. But I don't see any cheap way of synchronizing this so I guess
> let's try how this will work in practice.

Yeah, this approach is fundamentally best-effort, so I went for low
overhead and mostly correct operation.  If something like this doesn't
cut it (w/ bug fixes and some polishing over time), my gut feeling is
that we probably should bite the bullet and synchronize cgroup memory
and inode ownerships rather than pushing further on inherently
imprecise mitigation mechanisms.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] bdi: Add bdi->id
  2019-08-15 14:46   ` Jan Kara
@ 2019-08-15 17:34     ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2019-08-15 17:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: axboe, hannes, mhocko, vdavydov.dev, cgroups, linux-mm,
	linux-block, linux-kernel, kernel-team, guro, akpm

Hello,

On Thu, Aug 15, 2019 at 04:46:23PM +0200, Jan Kara wrote:
> Although I would note that here you take effort not to recycle bdi->id so
> that you don't flush wrong devices while in patch 4 you take pretty lax
> approach to feeding garbage into the writeback system. So these two don't
> quite match to me...

So, I was trying to avoid systemic errors where the wrong thing can be
triggered repeatedly.  A wrong flush once in a blue moon shouldn't be
a big problem but if they can be triggered consistently by some
pathological behavior, it's an a lot bigger problem.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, back to index

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-03 14:01 [PATCHSET] writeback, memcg: Implement foreign inode flushing Tejun Heo
2019-08-03 14:01 ` [PATCH 1/4] writeback: Generalize and expose wb_completion Tejun Heo
2019-08-15 14:41   ` Jan Kara
2019-08-03 14:01 ` [PATCH 2/4] bdi: Add bdi->id Tejun Heo
2019-08-03 15:39   ` Matthew Wilcox
2019-08-03 15:53     ` Tejun Heo
2019-08-03 16:17       ` Matthew Wilcox
2019-08-06 23:01   ` Andrew Morton
2019-08-07 18:31     ` Tejun Heo
2019-08-07 19:00       ` Andrew Morton
2019-08-07 20:34         ` Tejun Heo
2019-08-09  0:57         ` Rik van Riel
2019-08-15 14:46   ` Jan Kara
2019-08-15 17:34     ` Tejun Heo
2019-08-03 14:01 ` [PATCH 3/4] writeback, memcg: Implement cgroup_writeback_by_id() Tejun Heo
2019-08-15 14:05   ` Jan Kara
2019-08-15 15:43     ` Tejun Heo
2019-08-15 14:54   ` Jan Kara
2019-08-15 16:12     ` Tejun Heo
2019-08-03 14:01 ` [PATCH 4/4] writeback, memcg: Implement foreign dirty flushing Tejun Heo
2019-08-06 23:03   ` Andrew Morton
2019-08-07 18:34     ` Tejun Heo
2019-08-15 14:34   ` Jan Kara
2019-08-15 17:31     ` Tejun Heo

Linux-Block Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-block/0 linux-block/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-block linux-block/ https://lore.kernel.org/linux-block \
		linux-block@vger.kernel.org linux-block@archiver.kernel.org
	public-inbox-index linux-block


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-block


AGPL code for this site: git clone https://public-inbox.org/ public-inbox