All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCHSET block/for-linus] writeback: cgroup writeback fixes
@ 2015-09-29 16:47 ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-29 16:47 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, cgroups, tytso, dedekind1, decui, kernel-team

Hello,

This patchset contains the following five patches fixing bugs in
cgroup writeback support.

 0001-writeback-laptop_mode_timer_fn-needs-rcu_read_lock-a.patch
 0002-writeback-fix-bdi_writeback-iteration-in-wakeup_dirt.patch
 0003-writeback-bdi_writeback-iteration-must-not-skip-dyin.patch
 0004-writeback-memcg-dirty_throttle_control-should-be-ini.patch
 0005-writeback-fix-incorrect-calculation-of-available-mem.patch

0001 affects laptop_mode users.  0002-0005 affect users who enable
cgroup writeback support.

This patchset is also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgwb-4.3-fixes

diffstat follows.  Thanks.

 fs/fs-writeback.c                |   32 +++++++++++++------
 include/linux/backing-dev-defs.h |    3 +
 include/linux/backing-dev.h      |   63 ---------------------------------------
 include/linux/memcontrol.h       |    5 +--
 mm/backing-dev.c                 |   17 +++++++++-
 mm/memcontrol.c                  |   35 ++++++++++-----------
 mm/page-writeback.c              |   54 +++++++++++++++++++--------------
 7 files changed, 92 insertions(+), 117 deletions(-)

--
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCHSET block/for-linus] writeback: cgroup writeback fixes
@ 2015-09-29 16:47 ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-29 16:47 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	dedekind1-Re5JQEeQqe8AvxtiuMwx3w, decui-0li6OtcxBFHby3iVrkZq2A,
	kernel-team-b10kYP2dOMg

Hello,

This patchset contains the following five patches fixing bugs in
cgroup writeback support.

 0001-writeback-laptop_mode_timer_fn-needs-rcu_read_lock-a.patch
 0002-writeback-fix-bdi_writeback-iteration-in-wakeup_dirt.patch
 0003-writeback-bdi_writeback-iteration-must-not-skip-dyin.patch
 0004-writeback-memcg-dirty_throttle_control-should-be-ini.patch
 0005-writeback-fix-incorrect-calculation-of-available-mem.patch

0001 affects laptop_mode users.  0002-0005 affect users who enable
cgroup writeback support.

This patchset is also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgwb-4.3-fixes

diffstat follows.  Thanks.

 fs/fs-writeback.c                |   32 +++++++++++++------
 include/linux/backing-dev-defs.h |    3 +
 include/linux/backing-dev.h      |   63 ---------------------------------------
 include/linux/memcontrol.h       |    5 +--
 mm/backing-dev.c                 |   17 +++++++++-
 mm/memcontrol.c                  |   35 ++++++++++-----------
 mm/page-writeback.c              |   54 +++++++++++++++++++--------------
 7 files changed, 92 insertions(+), 117 deletions(-)

--
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/5] writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration
@ 2015-09-29 16:47   ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-29 16:47 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, cgroups, tytso, dedekind1, decui, kernel-team, Tejun Heo

laptop_mode_timer_fn() was using bdi_for_each_wb() without the
required RCU locking leading to the following warning.

 WARNING: CPU: 0 PID: 0 at include/linux/backing-dev.h:415 laptop_mode_timer_fn+0x106/0x170()
 ...
 Call Trace:
  <IRQ>  [<ffffffff81480cdc>] dump_stack+0x4e/0x82
  [<ffffffff81051912>] warn_slowpath_common+0x82/0xc0
  [<ffffffff81051a0a>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8115f0e6>] laptop_mode_timer_fn+0x106/0x170
  [<ffffffff810ca8e3>] call_timer_fn+0xb3/0x2f0
  [<ffffffff810cad25>] run_timer_softirq+0x205/0x370
  [<ffffffff81056854>] __do_softirq+0xd4/0x460
  [<ffffffff81056d69>] irq_exit+0x89/0xa0
  [<ffffffff8185a892>] smp_apic_timer_interrupt+0x42/0x50
  [<ffffffff81858a44>] apic_timer_interrupt+0x84/0x90
 ...

Fix it by adding rcu_read_lock() around the iteration.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: a06fd6b10228 ("writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's")
---
 mm/page-writeback.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0a931cd..902e5f2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1965,10 +1965,12 @@ void laptop_mode_timer_fn(unsigned long data)
 	if (!bdi_has_dirty_io(&q->backing_dev_info))
 		return;
 
+	rcu_read_lock();
 	bdi_for_each_wb(wb, &q->backing_dev_info, &iter, 0)
 		if (wb_has_dirty_io(wb))
 			wb_start_writeback(wb, nr_pages, true,
 					   WB_REASON_LAPTOP_TIMER);
+	rcu_read_unlock();
 }
 
 /*
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 1/5] writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration
@ 2015-09-29 16:47   ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-29 16:47 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	dedekind1-Re5JQEeQqe8AvxtiuMwx3w, decui-0li6OtcxBFHby3iVrkZq2A,
	kernel-team-b10kYP2dOMg, Tejun Heo

laptop_mode_timer_fn() was using bdi_for_each_wb() without the
required RCU locking leading to the following warning.

 WARNING: CPU: 0 PID: 0 at include/linux/backing-dev.h:415 laptop_mode_timer_fn+0x106/0x170()
 ...
 Call Trace:
  <IRQ>  [<ffffffff81480cdc>] dump_stack+0x4e/0x82
  [<ffffffff81051912>] warn_slowpath_common+0x82/0xc0
  [<ffffffff81051a0a>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8115f0e6>] laptop_mode_timer_fn+0x106/0x170
  [<ffffffff810ca8e3>] call_timer_fn+0xb3/0x2f0
  [<ffffffff810cad25>] run_timer_softirq+0x205/0x370
  [<ffffffff81056854>] __do_softirq+0xd4/0x460
  [<ffffffff81056d69>] irq_exit+0x89/0xa0
  [<ffffffff8185a892>] smp_apic_timer_interrupt+0x42/0x50
  [<ffffffff81858a44>] apic_timer_interrupt+0x84/0x90
 ...

Fix it by adding rcu_read_lock() around the iteration.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Fixes: a06fd6b10228 ("writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's")
---
 mm/page-writeback.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0a931cd..902e5f2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1965,10 +1965,12 @@ void laptop_mode_timer_fn(unsigned long data)
 	if (!bdi_has_dirty_io(&q->backing_dev_info))
 		return;
 
+	rcu_read_lock();
 	bdi_for_each_wb(wb, &q->backing_dev_info, &iter, 0)
 		if (wb_has_dirty_io(wb))
 			wb_start_writeback(wb, nr_pages, true,
 					   WB_REASON_LAPTOP_TIMER);
+	rcu_read_unlock();
 }
 
 /*
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/5] writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback()
  2015-09-29 16:47 ` Tejun Heo
  (?)
  (?)
@ 2015-09-29 16:47 ` Tejun Heo
  2015-10-02 14:12     ` Jan Kara
  -1 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2015-09-29 16:47 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, cgroups, tytso, dedekind1, decui, kernel-team, Tejun Heo

wakeup_dirtytime_writeback() walks and wakes up all wb's of all bdi's;
unfortunately, it was always waking up bdi->wb instead of the wb being
walked.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 001fe6f617b1 ("writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's")
---
 fs/fs-writeback.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 091a364..d0da306 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1897,8 +1897,8 @@ static void wakeup_dirtytime_writeback(struct work_struct *w)
 		struct wb_iter iter;
 
 		bdi_for_each_wb(wb, bdi, &iter, 0)
-			if (!list_empty(&bdi->wb.b_dirty_time))
-				wb_wakeup(&bdi->wb);
+			if (!list_empty(&wb->b_dirty_time))
+				wb_wakeup(wb);
 	}
 	rcu_read_unlock();
 	schedule_delayed_work(&dirtytime_work, dirtytime_expire_interval * HZ);
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/5] writeback: bdi_writeback iteration must not skip dying ones
@ 2015-09-29 16:47   ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-29 16:47 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, cgroups, tytso, dedekind1, decui, kernel-team, Tejun Heo

bdi_for_each_wb() is used in several places to wake up or issue
writeback work items to all wb's (bdi_writeback's) on a given bdi.
The iteration is performed by walking bdi->cgwb_tree; however, the
tree only indexes wb's which are currently active.

For example, when a memcg gets associated with a different blkcg, the
old wb is removed from the tree so that the new one can be indexed.
The old wb starts dying from then on but will linger till all its
inodes are drained.  As these dying wb's may still host dirty inodes,
writeback operations which affect all wb's must include them.
bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
failing to sync the inodes belonging to those wb's.

This patch adds a RCU protected @bdi->wb_list which lists all wb's
beloinging to that bdi.  wb's are added on creation and removed on
release rather than on the start of destruction.  bdi_for_each_wb()
usages are replaced with list_for_each[_continue]_rcu() iterations
over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com>
Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()")
Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
---
 fs/fs-writeback.c                | 28 ++++++++++++------
 include/linux/backing-dev-defs.h |  3 ++
 include/linux/backing-dev.h      | 63 ----------------------------------------
 mm/backing-dev.c                 | 17 ++++++++++-
 mm/page-writeback.c              |  3 +-
 5 files changed, 39 insertions(+), 75 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d0da306..afa4848 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -778,19 +778,24 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
 				  struct wb_writeback_work *base_work,
 				  bool skip_if_busy)
 {
-	int next_memcg_id = 0;
-	struct bdi_writeback *wb;
-	struct wb_iter iter;
+	struct bdi_writeback *last_wb = NULL;
+	struct bdi_writeback *wb = list_entry_rcu(&bdi->wb_list,
+						struct bdi_writeback, bdi_node);
 
 	might_sleep();
 restart:
 	rcu_read_lock();
-	bdi_for_each_wb(wb, bdi, &iter, next_memcg_id) {
+	list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) {
 		DEFINE_WB_COMPLETION_ONSTACK(fallback_work_done);
 		struct wb_writeback_work fallback_work;
 		struct wb_writeback_work *work;
 		long nr_pages;
 
+		if (last_wb) {
+			wb_put(last_wb);
+			last_wb = NULL;
+		}
+
 		/* SYNC_ALL writes out I_DIRTY_TIME too */
 		if (!wb_has_dirty_io(wb) &&
 		    (base_work->sync_mode == WB_SYNC_NONE ||
@@ -819,7 +824,14 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
 
 		wb_queue_work(wb, work);
 
-		next_memcg_id = wb->memcg_css->id + 1;
+		/*
+		 * Pin @wb so that it stays on @bdi->wb_list.  This allows
+		 * continuing iteration from @wb after dropping and
+		 * regrabbing rcu read lock.
+		 */
+		wb_get(wb);
+		last_wb = wb;
+
 		rcu_read_unlock();
 		wb_wait_for_completion(bdi, &fallback_work_done);
 		goto restart;
@@ -1857,12 +1869,11 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
 	rcu_read_lock();
 	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
 		struct bdi_writeback *wb;
-		struct wb_iter iter;
 
 		if (!bdi_has_dirty_io(bdi))
 			continue;
 
-		bdi_for_each_wb(wb, bdi, &iter, 0)
+		list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
 			wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
 					   false, reason);
 	}
@@ -1894,9 +1905,8 @@ static void wakeup_dirtytime_writeback(struct work_struct *w)
 	rcu_read_lock();
 	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
 		struct bdi_writeback *wb;
-		struct wb_iter iter;
 
-		bdi_for_each_wb(wb, bdi, &iter, 0)
+		list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
 			if (!list_empty(&wb->b_dirty_time))
 				wb_wakeup(wb);
 	}
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index a23209b..1b4d69f 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -116,6 +116,8 @@ struct bdi_writeback {
 	struct list_head work_list;
 	struct delayed_work dwork;	/* work item used for writeback */
 
+	struct list_head bdi_node;	/* anchored at bdi->wb_list */
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct percpu_ref refcnt;	/* used only for !root wb's */
 	struct fprop_local_percpu memcg_completions;
@@ -150,6 +152,7 @@ struct backing_dev_info {
 	atomic_long_t tot_write_bandwidth;
 
 	struct bdi_writeback wb;  /* the root writeback info for this bdi */
+	struct list_head wb_list; /* list of all wbs */
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
 	struct rb_root cgwb_congested_tree; /* their congested states */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 5a5d79e..01a93da 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -401,61 +401,6 @@ static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
 	rcu_read_unlock();
 }
 
-struct wb_iter {
-	int			start_memcg_id;
-	struct radix_tree_iter	tree_iter;
-	void			**slot;
-};
-
-static inline struct bdi_writeback *__wb_iter_next(struct wb_iter *iter,
-						   struct backing_dev_info *bdi)
-{
-	struct radix_tree_iter *titer = &iter->tree_iter;
-
-	WARN_ON_ONCE(!rcu_read_lock_held());
-
-	if (iter->start_memcg_id >= 0) {
-		iter->slot = radix_tree_iter_init(titer, iter->start_memcg_id);
-		iter->start_memcg_id = -1;
-	} else {
-		iter->slot = radix_tree_next_slot(iter->slot, titer, 0);
-	}
-
-	if (!iter->slot)
-		iter->slot = radix_tree_next_chunk(&bdi->cgwb_tree, titer, 0);
-	if (iter->slot)
-		return *iter->slot;
-	return NULL;
-}
-
-static inline struct bdi_writeback *__wb_iter_init(struct wb_iter *iter,
-						   struct backing_dev_info *bdi,
-						   int start_memcg_id)
-{
-	iter->start_memcg_id = start_memcg_id;
-
-	if (start_memcg_id)
-		return __wb_iter_next(iter, bdi);
-	else
-		return &bdi->wb;
-}
-
-/**
- * bdi_for_each_wb - walk all wb's of a bdi in ascending memcg ID order
- * @wb_cur: cursor struct bdi_writeback pointer
- * @bdi: bdi to walk wb's of
- * @iter: pointer to struct wb_iter to be used as iteration buffer
- * @start_memcg_id: memcg ID to start iteration from
- *
- * Iterate @wb_cur through the wb's (bdi_writeback's) of @bdi in ascending
- * memcg ID order starting from @start_memcg_id.  @iter is struct wb_iter
- * to be used as temp storage during iteration.  rcu_read_lock() must be
- * held throughout iteration.
- */
-#define bdi_for_each_wb(wb_cur, bdi, iter, start_memcg_id)		\
-	for ((wb_cur) = __wb_iter_init(iter, bdi, start_memcg_id);	\
-	     (wb_cur); (wb_cur) = __wb_iter_next(iter, bdi))
-
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static inline bool inode_cgwb_enabled(struct inode *inode)
@@ -515,14 +460,6 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg)
 {
 }
 
-struct wb_iter {
-	int		next_id;
-};
-
-#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id)		\
-	for ((iter)->next_id = (start_blkcg_id);			\
-	     ({	(wb_cur) = !(iter)->next_id++ ? &(bdi)->wb : NULL; }); )
-
 static inline int inode_congested(struct inode *inode, int cong_bits)
 {
 	return wb_congested(&inode_to_bdi(inode)->wb, cong_bits);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 2df8ddc..c48070b 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -480,6 +480,10 @@ static void cgwb_release_workfn(struct work_struct *work)
 						release_work);
 	struct backing_dev_info *bdi = wb->bdi;
 
+	spin_lock_irq(&cgwb_lock);
+	list_del_rcu(&wb->bdi_node);
+	spin_unlock_irq(&cgwb_lock);
+
 	wb_shutdown(wb);
 
 	css_put(wb->memcg_css);
@@ -575,6 +579,7 @@ static int cgwb_create(struct backing_dev_info *bdi,
 		ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
 		if (!ret) {
 			atomic_inc(&bdi->usage_cnt);
+			list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);
 			list_add(&wb->memcg_node, memcg_cgwb_list);
 			list_add(&wb->blkcg_node, blkcg_cgwb_list);
 			css_get(memcg_css);
@@ -686,6 +691,9 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
 		cgwb_kill(*slot);
 
+	/* wb may get released after @bdi is freed, sever list head */
+	list_del(&bdi->wb_list);
+
 	rbtree_postorder_for_each_entry_safe(congested, congested_n,
 					&bdi->cgwb_congested_tree, rb_node) {
 		rb_erase(&congested->rb_node, &bdi->cgwb_congested_tree);
@@ -764,15 +772,22 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
 
 int bdi_init(struct backing_dev_info *bdi)
 {
+	int ret;
+
 	bdi->dev = NULL;
 
 	bdi->min_ratio = 0;
 	bdi->max_ratio = 100;
 	bdi->max_prop_frac = FPROP_FRAC_BASE;
 	INIT_LIST_HEAD(&bdi->bdi_list);
+	INIT_LIST_HEAD(&bdi->wb_list);
 	init_waitqueue_head(&bdi->wb_waitq);
 
-	return cgwb_bdi_init(bdi);
+	ret = cgwb_bdi_init(bdi);
+
+	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
+
+	return ret;
 }
 EXPORT_SYMBOL(bdi_init);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 902e5f2..0f1a94e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1956,7 +1956,6 @@ void laptop_mode_timer_fn(unsigned long data)
 	int nr_pages = global_page_state(NR_FILE_DIRTY) +
 		global_page_state(NR_UNSTABLE_NFS);
 	struct bdi_writeback *wb;
-	struct wb_iter iter;
 
 	/*
 	 * We want to write everything out, not just down to the dirty
@@ -1966,7 +1965,7 @@ void laptop_mode_timer_fn(unsigned long data)
 		return;
 
 	rcu_read_lock();
-	bdi_for_each_wb(wb, &q->backing_dev_info, &iter, 0)
+	list_for_each_entry_rcu(wb, &q->backing_dev_info.wb_list, bdi_node)
 		if (wb_has_dirty_io(wb))
 			wb_start_writeback(wb, nr_pages, true,
 					   WB_REASON_LAPTOP_TIMER);
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/5] writeback: bdi_writeback iteration must not skip dying ones
@ 2015-09-29 16:47   ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-29 16:47 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	dedekind1-Re5JQEeQqe8AvxtiuMwx3w, decui-0li6OtcxBFHby3iVrkZq2A,
	kernel-team-b10kYP2dOMg, Tejun Heo

bdi_for_each_wb() is used in several places to wake up or issue
writeback work items to all wb's (bdi_writeback's) on a given bdi.
The iteration is performed by walking bdi->cgwb_tree; however, the
tree only indexes wb's which are currently active.

For example, when a memcg gets associated with a different blkcg, the
old wb is removed from the tree so that the new one can be indexed.
The old wb starts dying from then on but will linger till all its
inodes are drained.  As these dying wb's may still host dirty inodes,
writeback operations which affect all wb's must include them.
bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
failing to sync the inodes belonging to those wb's.

This patch adds a RCU protected @bdi->wb_list which lists all wb's
beloinging to that bdi.  wb's are added on creation and removed on
release rather than on the start of destruction.  bdi_for_each_wb()
usages are replaced with list_for_each[_continue]_rcu() iterations
over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Reported-and-tested-by: Artem Bityutskiy <dedekind1-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()")
Link: http://lkml.kernel.org/g/1443012552.19983.209.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
---
 fs/fs-writeback.c                | 28 ++++++++++++------
 include/linux/backing-dev-defs.h |  3 ++
 include/linux/backing-dev.h      | 63 ----------------------------------------
 mm/backing-dev.c                 | 17 ++++++++++-
 mm/page-writeback.c              |  3 +-
 5 files changed, 39 insertions(+), 75 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d0da306..afa4848 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -778,19 +778,24 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
 				  struct wb_writeback_work *base_work,
 				  bool skip_if_busy)
 {
-	int next_memcg_id = 0;
-	struct bdi_writeback *wb;
-	struct wb_iter iter;
+	struct bdi_writeback *last_wb = NULL;
+	struct bdi_writeback *wb = list_entry_rcu(&bdi->wb_list,
+						struct bdi_writeback, bdi_node);
 
 	might_sleep();
 restart:
 	rcu_read_lock();
-	bdi_for_each_wb(wb, bdi, &iter, next_memcg_id) {
+	list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) {
 		DEFINE_WB_COMPLETION_ONSTACK(fallback_work_done);
 		struct wb_writeback_work fallback_work;
 		struct wb_writeback_work *work;
 		long nr_pages;
 
+		if (last_wb) {
+			wb_put(last_wb);
+			last_wb = NULL;
+		}
+
 		/* SYNC_ALL writes out I_DIRTY_TIME too */
 		if (!wb_has_dirty_io(wb) &&
 		    (base_work->sync_mode == WB_SYNC_NONE ||
@@ -819,7 +824,14 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
 
 		wb_queue_work(wb, work);
 
-		next_memcg_id = wb->memcg_css->id + 1;
+		/*
+		 * Pin @wb so that it stays on @bdi->wb_list.  This allows
+		 * continuing iteration from @wb after dropping and
+		 * regrabbing rcu read lock.
+		 */
+		wb_get(wb);
+		last_wb = wb;
+
 		rcu_read_unlock();
 		wb_wait_for_completion(bdi, &fallback_work_done);
 		goto restart;
@@ -1857,12 +1869,11 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
 	rcu_read_lock();
 	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
 		struct bdi_writeback *wb;
-		struct wb_iter iter;
 
 		if (!bdi_has_dirty_io(bdi))
 			continue;
 
-		bdi_for_each_wb(wb, bdi, &iter, 0)
+		list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
 			wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
 					   false, reason);
 	}
@@ -1894,9 +1905,8 @@ static void wakeup_dirtytime_writeback(struct work_struct *w)
 	rcu_read_lock();
 	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
 		struct bdi_writeback *wb;
-		struct wb_iter iter;
 
-		bdi_for_each_wb(wb, bdi, &iter, 0)
+		list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
 			if (!list_empty(&wb->b_dirty_time))
 				wb_wakeup(wb);
 	}
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index a23209b..1b4d69f 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -116,6 +116,8 @@ struct bdi_writeback {
 	struct list_head work_list;
 	struct delayed_work dwork;	/* work item used for writeback */
 
+	struct list_head bdi_node;	/* anchored at bdi->wb_list */
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct percpu_ref refcnt;	/* used only for !root wb's */
 	struct fprop_local_percpu memcg_completions;
@@ -150,6 +152,7 @@ struct backing_dev_info {
 	atomic_long_t tot_write_bandwidth;
 
 	struct bdi_writeback wb;  /* the root writeback info for this bdi */
+	struct list_head wb_list; /* list of all wbs */
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
 	struct rb_root cgwb_congested_tree; /* their congested states */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 5a5d79e..01a93da 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -401,61 +401,6 @@ static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
 	rcu_read_unlock();
 }
 
-struct wb_iter {
-	int			start_memcg_id;
-	struct radix_tree_iter	tree_iter;
-	void			**slot;
-};
-
-static inline struct bdi_writeback *__wb_iter_next(struct wb_iter *iter,
-						   struct backing_dev_info *bdi)
-{
-	struct radix_tree_iter *titer = &iter->tree_iter;
-
-	WARN_ON_ONCE(!rcu_read_lock_held());
-
-	if (iter->start_memcg_id >= 0) {
-		iter->slot = radix_tree_iter_init(titer, iter->start_memcg_id);
-		iter->start_memcg_id = -1;
-	} else {
-		iter->slot = radix_tree_next_slot(iter->slot, titer, 0);
-	}
-
-	if (!iter->slot)
-		iter->slot = radix_tree_next_chunk(&bdi->cgwb_tree, titer, 0);
-	if (iter->slot)
-		return *iter->slot;
-	return NULL;
-}
-
-static inline struct bdi_writeback *__wb_iter_init(struct wb_iter *iter,
-						   struct backing_dev_info *bdi,
-						   int start_memcg_id)
-{
-	iter->start_memcg_id = start_memcg_id;
-
-	if (start_memcg_id)
-		return __wb_iter_next(iter, bdi);
-	else
-		return &bdi->wb;
-}
-
-/**
- * bdi_for_each_wb - walk all wb's of a bdi in ascending memcg ID order
- * @wb_cur: cursor struct bdi_writeback pointer
- * @bdi: bdi to walk wb's of
- * @iter: pointer to struct wb_iter to be used as iteration buffer
- * @start_memcg_id: memcg ID to start iteration from
- *
- * Iterate @wb_cur through the wb's (bdi_writeback's) of @bdi in ascending
- * memcg ID order starting from @start_memcg_id.  @iter is struct wb_iter
- * to be used as temp storage during iteration.  rcu_read_lock() must be
- * held throughout iteration.
- */
-#define bdi_for_each_wb(wb_cur, bdi, iter, start_memcg_id)		\
-	for ((wb_cur) = __wb_iter_init(iter, bdi, start_memcg_id);	\
-	     (wb_cur); (wb_cur) = __wb_iter_next(iter, bdi))
-
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static inline bool inode_cgwb_enabled(struct inode *inode)
@@ -515,14 +460,6 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg)
 {
 }
 
-struct wb_iter {
-	int		next_id;
-};
-
-#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id)		\
-	for ((iter)->next_id = (start_blkcg_id);			\
-	     ({	(wb_cur) = !(iter)->next_id++ ? &(bdi)->wb : NULL; }); )
-
 static inline int inode_congested(struct inode *inode, int cong_bits)
 {
 	return wb_congested(&inode_to_bdi(inode)->wb, cong_bits);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 2df8ddc..c48070b 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -480,6 +480,10 @@ static void cgwb_release_workfn(struct work_struct *work)
 						release_work);
 	struct backing_dev_info *bdi = wb->bdi;
 
+	spin_lock_irq(&cgwb_lock);
+	list_del_rcu(&wb->bdi_node);
+	spin_unlock_irq(&cgwb_lock);
+
 	wb_shutdown(wb);
 
 	css_put(wb->memcg_css);
@@ -575,6 +579,7 @@ static int cgwb_create(struct backing_dev_info *bdi,
 		ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
 		if (!ret) {
 			atomic_inc(&bdi->usage_cnt);
+			list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);
 			list_add(&wb->memcg_node, memcg_cgwb_list);
 			list_add(&wb->blkcg_node, blkcg_cgwb_list);
 			css_get(memcg_css);
@@ -686,6 +691,9 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
 		cgwb_kill(*slot);
 
+	/* wb may get released after @bdi is freed, sever list head */
+	list_del(&bdi->wb_list);
+
 	rbtree_postorder_for_each_entry_safe(congested, congested_n,
 					&bdi->cgwb_congested_tree, rb_node) {
 		rb_erase(&congested->rb_node, &bdi->cgwb_congested_tree);
@@ -764,15 +772,22 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
 
 int bdi_init(struct backing_dev_info *bdi)
 {
+	int ret;
+
 	bdi->dev = NULL;
 
 	bdi->min_ratio = 0;
 	bdi->max_ratio = 100;
 	bdi->max_prop_frac = FPROP_FRAC_BASE;
 	INIT_LIST_HEAD(&bdi->bdi_list);
+	INIT_LIST_HEAD(&bdi->wb_list);
 	init_waitqueue_head(&bdi->wb_waitq);
 
-	return cgwb_bdi_init(bdi);
+	ret = cgwb_bdi_init(bdi);
+
+	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
+
+	return ret;
 }
 EXPORT_SYMBOL(bdi_init);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 902e5f2..0f1a94e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1956,7 +1956,6 @@ void laptop_mode_timer_fn(unsigned long data)
 	int nr_pages = global_page_state(NR_FILE_DIRTY) +
 		global_page_state(NR_UNSTABLE_NFS);
 	struct bdi_writeback *wb;
-	struct wb_iter iter;
 
 	/*
 	 * We want to write everything out, not just down to the dirty
@@ -1966,7 +1965,7 @@ void laptop_mode_timer_fn(unsigned long data)
 		return;
 
 	rcu_read_lock();
-	bdi_for_each_wb(wb, &q->backing_dev_info, &iter, 0)
+	list_for_each_entry_rcu(wb, &q->backing_dev_info.wb_list, bdi_node)
 		if (wb_has_dirty_io(wb))
 			wb_start_writeback(wb, nr_pages, true,
 					   WB_REASON_LAPTOP_TIMER);
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/5] writeback: memcg dirty_throttle_control should be initialized with wb->memcg_completions
@ 2015-09-29 16:47   ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-29 16:47 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, cgroups, tytso, dedekind1, decui, kernel-team, Tejun Heo

MDTC_INIT() is used to initialize dirty_throttle_control for memcg
domains.  It used DTC_INIT_COMMON() to initialized mdtc->wb and
->wb_completions which is incorrect as DTC_INIT_COMMON() sets the
latter to wb->completions instead of wb->memcg_completions.  This can
lead to wildly incorrect results when calculating the proportion of
dirty memory the memcg domain should get.

Remove DTC_INIT_COMMON() and update MDTC_INIT() to initialize
mdtc->wb_completions to wb->memcg_completions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: c2aa723a6093 ("writeback: implement memcg writeback domain based throttling")
---
 mm/page-writeback.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0f1a94e..56c0bff 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -145,9 +145,6 @@ struct dirty_throttle_control {
 	unsigned long		pos_ratio;
 };
 
-#define DTC_INIT_COMMON(__wb)	.wb = (__wb),				\
-				.wb_completions = &(__wb)->completions
-
 /*
  * Length of period for aging writeout fractions of bdis. This is an
  * arbitrarily chosen number. The longer the period, the slower fractions will
@@ -157,12 +154,16 @@ struct dirty_throttle_control {
 
 #ifdef CONFIG_CGROUP_WRITEBACK
 
-#define GDTC_INIT(__wb)		.dom = &global_wb_domain,		\
-				DTC_INIT_COMMON(__wb)
+#define GDTC_INIT(__wb)		.wb = (__wb),				\
+				.dom = &global_wb_domain,		\
+				.wb_completions = &(__wb)->completions
+
 #define GDTC_INIT_NO_WB		.dom = &global_wb_domain
-#define MDTC_INIT(__wb, __gdtc)	.dom = mem_cgroup_wb_domain(__wb),	\
-				.gdtc = __gdtc,				\
-				DTC_INIT_COMMON(__wb)
+
+#define MDTC_INIT(__wb, __gdtc)	.wb = (__wb),				\
+				.dom = mem_cgroup_wb_domain(__wb),	\
+				.wb_completions = &(__wb)->memcg_completions, \
+				.gdtc = __gdtc
 
 static bool mdtc_valid(struct dirty_throttle_control *dtc)
 {
@@ -213,7 +214,8 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
-#define GDTC_INIT(__wb)		DTC_INIT_COMMON(__wb)
+#define GDTC_INIT(__wb)		.wb = (__wb),                           \
+				.wb_completions = &(__wb)->completions
 #define GDTC_INIT_NO_WB
 #define MDTC_INIT(__wb, __gdtc)
 
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/5] writeback: memcg dirty_throttle_control should be initialized with wb->memcg_completions
@ 2015-09-29 16:47   ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-29 16:47 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	dedekind1-Re5JQEeQqe8AvxtiuMwx3w, decui-0li6OtcxBFHby3iVrkZq2A,
	kernel-team-b10kYP2dOMg, Tejun Heo

MDTC_INIT() is used to initialize dirty_throttle_control for memcg
domains.  It used DTC_INIT_COMMON() to initialized mdtc->wb and
->wb_completions which is incorrect as DTC_INIT_COMMON() sets the
latter to wb->completions instead of wb->memcg_completions.  This can
lead to wildly incorrect results when calculating the proportion of
dirty memory the memcg domain should get.

Remove DTC_INIT_COMMON() and update MDTC_INIT() to initialize
mdtc->wb_completions to wb->memcg_completions.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Fixes: c2aa723a6093 ("writeback: implement memcg writeback domain based throttling")
---
 mm/page-writeback.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0f1a94e..56c0bff 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -145,9 +145,6 @@ struct dirty_throttle_control {
 	unsigned long		pos_ratio;
 };
 
-#define DTC_INIT_COMMON(__wb)	.wb = (__wb),				\
-				.wb_completions = &(__wb)->completions
-
 /*
  * Length of period for aging writeout fractions of bdis. This is an
  * arbitrarily chosen number. The longer the period, the slower fractions will
@@ -157,12 +154,16 @@ struct dirty_throttle_control {
 
 #ifdef CONFIG_CGROUP_WRITEBACK
 
-#define GDTC_INIT(__wb)		.dom = &global_wb_domain,		\
-				DTC_INIT_COMMON(__wb)
+#define GDTC_INIT(__wb)		.wb = (__wb),				\
+				.dom = &global_wb_domain,		\
+				.wb_completions = &(__wb)->completions
+
 #define GDTC_INIT_NO_WB		.dom = &global_wb_domain
-#define MDTC_INIT(__wb, __gdtc)	.dom = mem_cgroup_wb_domain(__wb),	\
-				.gdtc = __gdtc,				\
-				DTC_INIT_COMMON(__wb)
+
+#define MDTC_INIT(__wb, __gdtc)	.wb = (__wb),				\
+				.dom = mem_cgroup_wb_domain(__wb),	\
+				.wb_completions = &(__wb)->memcg_completions, \
+				.gdtc = __gdtc
 
 static bool mdtc_valid(struct dirty_throttle_control *dtc)
 {
@@ -213,7 +214,8 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
-#define GDTC_INIT(__wb)		DTC_INIT_COMMON(__wb)
+#define GDTC_INIT(__wb)		.wb = (__wb),                           \
+				.wb_completions = &(__wb)->completions
 #define GDTC_INIT_NO_WB
 #define MDTC_INIT(__wb, __gdtc)
 
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/5] writeback: fix incorrect calculation of available memory for memcg domains
@ 2015-09-29 16:47   ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-29 16:47 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, cgroups, tytso, dedekind1, decui, kernel-team, Tejun Heo

For memcg domains, the amount of available memory was calculated as

 min(the amount currently in use + headroom according to memcg,
     total clean memory)

This isn't quite correct as what should be capped by the amount of
clean memory is the headroom, not the sum of memory in use and
headroom.  For example, if a memcg domain has a significant amount of
dirty memory, the above can lead to a value which is lower than the
current amount in use which doesn't make much sense.  In most
circumstances, the above leads to a number which is somewhat but not
drastically lower.

As the amount of memory which can be readily allocated to the memcg
domain is capped by the amount of system-wide clean memory which is
not already assigned to the memcg itself, the number we want is

 the amount currently in use +
 min(headroom according to memcg, clean memory elsewhere in the system)

This patch updates mem_cgroup_wb_stats() to return the number of
filepages and headroom instead of the calculated available pages.
mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the
above calculation from file, headroom, dirty and globally clean pages.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: c2aa723a6093 ("writeback: implement memcg writeback domain based throttling")
---
 include/linux/memcontrol.h |  5 +++--
 mm/memcontrol.c            | 35 +++++++++++++++++------------------
 mm/page-writeback.c        | 29 ++++++++++++++++++-----------
 3 files changed, 38 insertions(+), 31 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ad800e6..0df3e1a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -677,8 +677,9 @@ enum {
 
 struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
-void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
-			 unsigned long *pdirty, unsigned long *pwriteback);
+void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
+			 unsigned long *pheadroom, unsigned long *pdirty,
+			 unsigned long *pwriteback);
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ddaeba..b357ccf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3734,44 +3734,43 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
 /**
  * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg
  * @wb: bdi_writeback in question
- * @pavail: out parameter for number of available pages
+ * @pfilepages: out parameter for number of file pages
+ * @pheadroom: out parameter for number of allocatable pages according to memcg
  * @pdirty: out parameter for number of dirty pages
  * @pwriteback: out parameter for number of pages under writeback
  *
- * Determine the numbers of available, dirty, and writeback pages in @wb's
- * memcg.  Dirty and writeback are self-explanatory.  Available is a bit
- * more involved.
+ * Determine the numbers of file, headroom, dirty, and writeback pages in
+ * @wb's memcg.  File, dirty and writeback are self-explanatory.  Headroom
+ * is a bit more involved.
  *
- * A memcg's headroom is "min(max, high) - used".  The available memory is
- * calculated as the lowest headroom of itself and the ancestors plus the
- * number of pages already being used for file pages.  Note that this
- * doesn't consider the actual amount of available memory in the system.
- * The caller should further cap *@pavail accordingly.
+ * A memcg's headroom is "min(max, high) - used".  In the hierarchy, the
+ * headroom is calculated as the lowest headroom of itself and the
+ * ancestors.  Note that this doesn't consider the actual amount of
+ * available memory in the system.  The caller should further cap
+ * *@pheadroom accordingly.
  */
-void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
-			 unsigned long *pdirty, unsigned long *pwriteback)
+void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
+			 unsigned long *pheadroom, unsigned long *pdirty,
+			 unsigned long *pwriteback)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
-	unsigned long head_room = PAGE_COUNTER_MAX;
-	unsigned long file_pages;
 
 	*pdirty = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_DIRTY);
 
 	/* this should eventually include NR_UNSTABLE_NFS */
 	*pwriteback = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
+	*pfilepages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
+						     (1 << LRU_ACTIVE_FILE));
+	*pheadroom = PAGE_COUNTER_MAX;
 
-	file_pages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
-						    (1 << LRU_ACTIVE_FILE));
 	while ((parent = parent_mem_cgroup(memcg))) {
 		unsigned long ceiling = min(memcg->memory.limit, memcg->high);
 		unsigned long used = page_counter_read(&memcg->memory);
 
-		head_room = min(head_room, ceiling - min(ceiling, used));
+		*pheadroom = min(*pheadroom, ceiling - min(ceiling, used));
 		memcg = parent;
 	}
-
-	*pavail = file_pages + head_room;
 }
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 56c0bff..2c90357 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -684,13 +684,19 @@ static unsigned long hard_dirty_limit(struct wb_domain *dom,
 	return max(thresh, dom->dirty_limit);
 }
 
-/* memory available to a memcg domain is capped by system-wide clean memory */
-static void mdtc_cap_avail(struct dirty_throttle_control *mdtc)
+/*
+ * Memory which can be further allocated to a memcg domain is capped by
+ * system-wide clean memory excluding the amount being used in the domain.
+ */
+static void mdtc_calc_avail(struct dirty_throttle_control *mdtc,
+			    unsigned long filepages, unsigned long headroom)
 {
 	struct dirty_throttle_control *gdtc = mdtc_gdtc(mdtc);
-	unsigned long clean = gdtc->avail - min(gdtc->avail, gdtc->dirty);
+	unsigned long clean = filepages - min(filepages, mdtc->dirty);
+	unsigned long global_clean = gdtc->avail - min(gdtc->avail, gdtc->dirty);
+	unsigned long other_clean = global_clean - min(global_clean, clean);
 
-	mdtc->avail = min(mdtc->avail, clean);
+	mdtc->avail = filepages + min(headroom, other_clean);
 }
 
 /**
@@ -1564,16 +1570,16 @@ static void balance_dirty_pages(struct address_space *mapping,
 		}
 
 		if (mdtc) {
-			unsigned long writeback;
+			unsigned long filepages, headroom, writeback;
 
 			/*
 			 * If @wb belongs to !root memcg, repeat the same
 			 * basic calculations for the memcg domain.
 			 */
-			mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty,
-					    &writeback);
-			mdtc_cap_avail(mdtc);
+			mem_cgroup_wb_stats(wb, &filepages, &headroom,
+					    &mdtc->dirty, &writeback);
 			mdtc->dirty += writeback;
+			mdtc_calc_avail(mdtc, filepages, headroom);
 
 			domain_dirty_limits(mdtc);
 
@@ -1895,10 +1901,11 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 		return true;
 
 	if (mdtc) {
-		unsigned long writeback;
+		unsigned long filepages, headroom, writeback;
 
-		mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty, &writeback);
-		mdtc_cap_avail(mdtc);
+		mem_cgroup_wb_stats(wb, &filepages, &headroom, &mdtc->dirty,
+				    &writeback);
+		mdtc_calc_avail(mdtc, filepages, headroom);
 		domain_dirty_limits(mdtc);	/* ditto, ignore writeback */
 
 		if (mdtc->dirty > mdtc->bg_thresh)
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/5] writeback: fix incorrect calculation of available memory for memcg domains
@ 2015-09-29 16:47   ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-29 16:47 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	dedekind1-Re5JQEeQqe8AvxtiuMwx3w, decui-0li6OtcxBFHby3iVrkZq2A,
	kernel-team-b10kYP2dOMg, Tejun Heo

For memcg domains, the amount of available memory was calculated as

 min(the amount currently in use + headroom according to memcg,
     total clean memory)

This isn't quite correct as what should be capped by the amount of
clean memory is the headroom, not the sum of memory in use and
headroom.  For example, if a memcg domain has a significant amount of
dirty memory, the above can lead to a value which is lower than the
current amount in use which doesn't make much sense.  In most
circumstances, the above leads to a number which is somewhat but not
drastically lower.

As the amount of memory which can be readily allocated to the memcg
domain is capped by the amount of system-wide clean memory which is
not already assigned to the memcg itself, the number we want is

 the amount currently in use +
 min(headroom according to memcg, clean memory elsewhere in the system)

This patch updates mem_cgroup_wb_stats() to return the number of
filepages and headroom instead of the calculated available pages.
mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the
above calculation from file, headroom, dirty and globally clean pages.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Fixes: c2aa723a6093 ("writeback: implement memcg writeback domain based throttling")
---
 include/linux/memcontrol.h |  5 +++--
 mm/memcontrol.c            | 35 +++++++++++++++++------------------
 mm/page-writeback.c        | 29 ++++++++++++++++++-----------
 3 files changed, 38 insertions(+), 31 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ad800e6..0df3e1a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -677,8 +677,9 @@ enum {
 
 struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
-void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
-			 unsigned long *pdirty, unsigned long *pwriteback);
+void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
+			 unsigned long *pheadroom, unsigned long *pdirty,
+			 unsigned long *pwriteback);
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ddaeba..b357ccf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3734,44 +3734,43 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
 /**
  * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg
  * @wb: bdi_writeback in question
- * @pavail: out parameter for number of available pages
+ * @pfilepages: out parameter for number of file pages
+ * @pheadroom: out parameter for number of allocatable pages according to memcg
  * @pdirty: out parameter for number of dirty pages
  * @pwriteback: out parameter for number of pages under writeback
  *
- * Determine the numbers of available, dirty, and writeback pages in @wb's
- * memcg.  Dirty and writeback are self-explanatory.  Available is a bit
- * more involved.
+ * Determine the numbers of file, headroom, dirty, and writeback pages in
+ * @wb's memcg.  File, dirty and writeback are self-explanatory.  Headroom
+ * is a bit more involved.
  *
- * A memcg's headroom is "min(max, high) - used".  The available memory is
- * calculated as the lowest headroom of itself and the ancestors plus the
- * number of pages already being used for file pages.  Note that this
- * doesn't consider the actual amount of available memory in the system.
- * The caller should further cap *@pavail accordingly.
+ * A memcg's headroom is "min(max, high) - used".  In the hierarchy, the
+ * headroom is calculated as the lowest headroom of itself and the
+ * ancestors.  Note that this doesn't consider the actual amount of
+ * available memory in the system.  The caller should further cap
+ * *@pheadroom accordingly.
  */
-void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
-			 unsigned long *pdirty, unsigned long *pwriteback)
+void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
+			 unsigned long *pheadroom, unsigned long *pdirty,
+			 unsigned long *pwriteback)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
-	unsigned long head_room = PAGE_COUNTER_MAX;
-	unsigned long file_pages;
 
 	*pdirty = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_DIRTY);
 
 	/* this should eventually include NR_UNSTABLE_NFS */
 	*pwriteback = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
+	*pfilepages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
+						     (1 << LRU_ACTIVE_FILE));
+	*pheadroom = PAGE_COUNTER_MAX;
 
-	file_pages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
-						    (1 << LRU_ACTIVE_FILE));
 	while ((parent = parent_mem_cgroup(memcg))) {
 		unsigned long ceiling = min(memcg->memory.limit, memcg->high);
 		unsigned long used = page_counter_read(&memcg->memory);
 
-		head_room = min(head_room, ceiling - min(ceiling, used));
+		*pheadroom = min(*pheadroom, ceiling - min(ceiling, used));
 		memcg = parent;
 	}
-
-	*pavail = file_pages + head_room;
 }
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 56c0bff..2c90357 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -684,13 +684,19 @@ static unsigned long hard_dirty_limit(struct wb_domain *dom,
 	return max(thresh, dom->dirty_limit);
 }
 
-/* memory available to a memcg domain is capped by system-wide clean memory */
-static void mdtc_cap_avail(struct dirty_throttle_control *mdtc)
+/*
+ * Memory which can be further allocated to a memcg domain is capped by
+ * system-wide clean memory excluding the amount being used in the domain.
+ */
+static void mdtc_calc_avail(struct dirty_throttle_control *mdtc,
+			    unsigned long filepages, unsigned long headroom)
 {
 	struct dirty_throttle_control *gdtc = mdtc_gdtc(mdtc);
-	unsigned long clean = gdtc->avail - min(gdtc->avail, gdtc->dirty);
+	unsigned long clean = filepages - min(filepages, mdtc->dirty);
+	unsigned long global_clean = gdtc->avail - min(gdtc->avail, gdtc->dirty);
+	unsigned long other_clean = global_clean - min(global_clean, clean);
 
-	mdtc->avail = min(mdtc->avail, clean);
+	mdtc->avail = filepages + min(headroom, other_clean);
 }
 
 /**
@@ -1564,16 +1570,16 @@ static void balance_dirty_pages(struct address_space *mapping,
 		}
 
 		if (mdtc) {
-			unsigned long writeback;
+			unsigned long filepages, headroom, writeback;
 
 			/*
 			 * If @wb belongs to !root memcg, repeat the same
 			 * basic calculations for the memcg domain.
 			 */
-			mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty,
-					    &writeback);
-			mdtc_cap_avail(mdtc);
+			mem_cgroup_wb_stats(wb, &filepages, &headroom,
+					    &mdtc->dirty, &writeback);
 			mdtc->dirty += writeback;
+			mdtc_calc_avail(mdtc, filepages, headroom);
 
 			domain_dirty_limits(mdtc);
 
@@ -1895,10 +1901,11 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 		return true;
 
 	if (mdtc) {
-		unsigned long writeback;
+		unsigned long filepages, headroom, writeback;
 
-		mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty, &writeback);
-		mdtc_cap_avail(mdtc);
+		mem_cgroup_wb_stats(wb, &filepages, &headroom, &mdtc->dirty,
+				    &writeback);
+		mdtc_calc_avail(mdtc, filepages, headroom);
 		domain_dirty_limits(mdtc);	/* ditto, ignore writeback */
 
 		if (mdtc->dirty > mdtc->bg_thresh)
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCHSET block/for-linus] writeback: cgroup writeback fixes
@ 2015-09-29 16:50   ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-29 16:50 UTC (permalink / raw)
  To: jack; +Cc: linux-kernel, cgroups, tytso, dedekind1, decui, kernel-team, axboe

Oops, forgot Jan.  Jan, this thread contains five patches fixing bugs
in cgroup writeback support.

 http://lkml.kernel.org/g/1443545274-18787-1-git-send-email-tj@kernel.org

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHSET block/for-linus] writeback: cgroup writeback fixes
@ 2015-09-29 16:50   ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-29 16:50 UTC (permalink / raw)
  To: jack-IBi9RG/b67k
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	dedekind1-Re5JQEeQqe8AvxtiuMwx3w, decui-0li6OtcxBFHby3iVrkZq2A,
	kernel-team-b10kYP2dOMg, axboe-tSWWG44O7X1aa/9Udqfwiw

Oops, forgot Jan.  Jan, this thread contains five patches fixing bugs
in cgroup writeback support.

 http://lkml.kernel.org/g/1443545274-18787-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 5/5] writeback: fix incorrect calculation of available memory for memcg domains
@ 2015-09-29 17:04     ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-29 17:04 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, cgroups, tytso, dedekind1, decui, kernel-team

For memcg domains, the amount of available memory was calculated as

 min(the amount currently in use + headroom according to memcg,
     total clean memory)

This isn't quite correct as what should be capped by the amount of
clean memory is the headroom, not the sum of memory in use and
headroom.  For example, if a memcg domain has a significant amount of
dirty memory, the above can lead to a value which is lower than the
current amount in use which doesn't make much sense.  In most
circumstances, the above leads to a number which is somewhat but not
drastically lower.

As the amount of memory which can be readily allocated to the memcg
domain is capped by the amount of system-wide clean memory which is
not already assigned to the memcg itself, the number we want is

 the amount currently in use +
 min(headroom according to memcg, clean memory elsewhere in the system)

This patch updates mem_cgroup_wb_stats() to return the number of
filepages and headroom instead of the calculated available pages.
mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the
above calculation from file, headroom, dirty and globally clean pages.

v2: Dummy mem_cgroup_wb_stats() implementation wasn't updated leading
    to build failure when !CGROUP_WRITEBACK.  Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: c2aa723a6093 ("writeback: implement memcg writeback domain based throttling")
---
 include/linux/memcontrol.h |  8 +++++---
 mm/memcontrol.c            | 35 +++++++++++++++++------------------
 mm/page-writeback.c        | 29 ++++++++++++++++++-----------
 3 files changed, 40 insertions(+), 32 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ad800e6..00bec85 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -677,8 +677,9 @@ enum {
 
 struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
-void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
-			 unsigned long *pdirty, unsigned long *pwriteback);
+void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
+			 unsigned long *pheadroom, unsigned long *pdirty,
+			 unsigned long *pwriteback);
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
@@ -688,7 +689,8 @@ static inline struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
 }
 
 static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
-				       unsigned long *pavail,
+				       unsigned long *pfilepages,
+				       unsigned long *pheadroom,
 				       unsigned long *pdirty,
 				       unsigned long *pwriteback)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ddaeba..b357ccf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3734,44 +3734,43 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
 /**
  * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg
  * @wb: bdi_writeback in question
- * @pavail: out parameter for number of available pages
+ * @pfilepages: out parameter for number of file pages
+ * @pheadroom: out parameter for number of allocatable pages according to memcg
  * @pdirty: out parameter for number of dirty pages
  * @pwriteback: out parameter for number of pages under writeback
  *
- * Determine the numbers of available, dirty, and writeback pages in @wb's
- * memcg.  Dirty and writeback are self-explanatory.  Available is a bit
- * more involved.
+ * Determine the numbers of file, headroom, dirty, and writeback pages in
+ * @wb's memcg.  File, dirty and writeback are self-explanatory.  Headroom
+ * is a bit more involved.
  *
- * A memcg's headroom is "min(max, high) - used".  The available memory is
- * calculated as the lowest headroom of itself and the ancestors plus the
- * number of pages already being used for file pages.  Note that this
- * doesn't consider the actual amount of available memory in the system.
- * The caller should further cap *@pavail accordingly.
+ * A memcg's headroom is "min(max, high) - used".  In the hierarchy, the
+ * headroom is calculated as the lowest headroom of itself and the
+ * ancestors.  Note that this doesn't consider the actual amount of
+ * available memory in the system.  The caller should further cap
+ * *@pheadroom accordingly.
  */
-void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
-			 unsigned long *pdirty, unsigned long *pwriteback)
+void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
+			 unsigned long *pheadroom, unsigned long *pdirty,
+			 unsigned long *pwriteback)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
-	unsigned long head_room = PAGE_COUNTER_MAX;
-	unsigned long file_pages;
 
 	*pdirty = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_DIRTY);
 
 	/* this should eventually include NR_UNSTABLE_NFS */
 	*pwriteback = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
+	*pfilepages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
+						     (1 << LRU_ACTIVE_FILE));
+	*pheadroom = PAGE_COUNTER_MAX;
 
-	file_pages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
-						    (1 << LRU_ACTIVE_FILE));
 	while ((parent = parent_mem_cgroup(memcg))) {
 		unsigned long ceiling = min(memcg->memory.limit, memcg->high);
 		unsigned long used = page_counter_read(&memcg->memory);
 
-		head_room = min(head_room, ceiling - min(ceiling, used));
+		*pheadroom = min(*pheadroom, ceiling - min(ceiling, used));
 		memcg = parent;
 	}
-
-	*pavail = file_pages + head_room;
 }
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 56c0bff..2c90357 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -684,13 +684,19 @@ static unsigned long hard_dirty_limit(struct wb_domain *dom,
 	return max(thresh, dom->dirty_limit);
 }
 
-/* memory available to a memcg domain is capped by system-wide clean memory */
-static void mdtc_cap_avail(struct dirty_throttle_control *mdtc)
+/*
+ * Memory which can be further allocated to a memcg domain is capped by
+ * system-wide clean memory excluding the amount being used in the domain.
+ */
+static void mdtc_calc_avail(struct dirty_throttle_control *mdtc,
+			    unsigned long filepages, unsigned long headroom)
 {
 	struct dirty_throttle_control *gdtc = mdtc_gdtc(mdtc);
-	unsigned long clean = gdtc->avail - min(gdtc->avail, gdtc->dirty);
+	unsigned long clean = filepages - min(filepages, mdtc->dirty);
+	unsigned long global_clean = gdtc->avail - min(gdtc->avail, gdtc->dirty);
+	unsigned long other_clean = global_clean - min(global_clean, clean);
 
-	mdtc->avail = min(mdtc->avail, clean);
+	mdtc->avail = filepages + min(headroom, other_clean);
 }
 
 /**
@@ -1564,16 +1570,16 @@ static void balance_dirty_pages(struct address_space *mapping,
 		}
 
 		if (mdtc) {
-			unsigned long writeback;
+			unsigned long filepages, headroom, writeback;
 
 			/*
 			 * If @wb belongs to !root memcg, repeat the same
 			 * basic calculations for the memcg domain.
 			 */
-			mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty,
-					    &writeback);
-			mdtc_cap_avail(mdtc);
+			mem_cgroup_wb_stats(wb, &filepages, &headroom,
+					    &mdtc->dirty, &writeback);
 			mdtc->dirty += writeback;
+			mdtc_calc_avail(mdtc, filepages, headroom);
 
 			domain_dirty_limits(mdtc);
 
@@ -1895,10 +1901,11 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 		return true;
 
 	if (mdtc) {
-		unsigned long writeback;
+		unsigned long filepages, headroom, writeback;
 
-		mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty, &writeback);
-		mdtc_cap_avail(mdtc);
+		mem_cgroup_wb_stats(wb, &filepages, &headroom, &mdtc->dirty,
+				    &writeback);
+		mdtc_calc_avail(mdtc, filepages, headroom);
 		domain_dirty_limits(mdtc);	/* ditto, ignore writeback */
 
 		if (mdtc->dirty > mdtc->bg_thresh)
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 5/5] writeback: fix incorrect calculation of available memory for memcg domains
@ 2015-09-29 17:04     ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-29 17:04 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	dedekind1-Re5JQEeQqe8AvxtiuMwx3w, decui-0li6OtcxBFHby3iVrkZq2A,
	kernel-team-b10kYP2dOMg

For memcg domains, the amount of available memory was calculated as

 min(the amount currently in use + headroom according to memcg,
     total clean memory)

This isn't quite correct as what should be capped by the amount of
clean memory is the headroom, not the sum of memory in use and
headroom.  For example, if a memcg domain has a significant amount of
dirty memory, the above can lead to a value which is lower than the
current amount in use which doesn't make much sense.  In most
circumstances, the above leads to a number which is somewhat but not
drastically lower.

As the amount of memory which can be readily allocated to the memcg
domain is capped by the amount of system-wide clean memory which is
not already assigned to the memcg itself, the number we want is

 the amount currently in use +
 min(headroom according to memcg, clean memory elsewhere in the system)

This patch updates mem_cgroup_wb_stats() to return the number of
filepages and headroom instead of the calculated available pages.
mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the
above calculation from file, headroom, dirty and globally clean pages.

v2: Dummy mem_cgroup_wb_stats() implementation wasn't updated leading
    to build failure when !CGROUP_WRITEBACK.  Fixed.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Fixes: c2aa723a6093 ("writeback: implement memcg writeback domain based throttling")
---
 include/linux/memcontrol.h |  8 +++++---
 mm/memcontrol.c            | 35 +++++++++++++++++------------------
 mm/page-writeback.c        | 29 ++++++++++++++++++-----------
 3 files changed, 40 insertions(+), 32 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ad800e6..00bec85 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -677,8 +677,9 @@ enum {
 
 struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
-void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
-			 unsigned long *pdirty, unsigned long *pwriteback);
+void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
+			 unsigned long *pheadroom, unsigned long *pdirty,
+			 unsigned long *pwriteback);
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
@@ -688,7 +689,8 @@ static inline struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
 }
 
 static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
-				       unsigned long *pavail,
+				       unsigned long *pfilepages,
+				       unsigned long *pheadroom,
 				       unsigned long *pdirty,
 				       unsigned long *pwriteback)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ddaeba..b357ccf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3734,44 +3734,43 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
 /**
  * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg
  * @wb: bdi_writeback in question
- * @pavail: out parameter for number of available pages
+ * @pfilepages: out parameter for number of file pages
+ * @pheadroom: out parameter for number of allocatable pages according to memcg
  * @pdirty: out parameter for number of dirty pages
  * @pwriteback: out parameter for number of pages under writeback
  *
- * Determine the numbers of available, dirty, and writeback pages in @wb's
- * memcg.  Dirty and writeback are self-explanatory.  Available is a bit
- * more involved.
+ * Determine the numbers of file, headroom, dirty, and writeback pages in
+ * @wb's memcg.  File, dirty and writeback are self-explanatory.  Headroom
+ * is a bit more involved.
  *
- * A memcg's headroom is "min(max, high) - used".  The available memory is
- * calculated as the lowest headroom of itself and the ancestors plus the
- * number of pages already being used for file pages.  Note that this
- * doesn't consider the actual amount of available memory in the system.
- * The caller should further cap *@pavail accordingly.
+ * A memcg's headroom is "min(max, high) - used".  In the hierarchy, the
+ * headroom is calculated as the lowest headroom of itself and the
+ * ancestors.  Note that this doesn't consider the actual amount of
+ * available memory in the system.  The caller should further cap
+ * *@pheadroom accordingly.
  */
-void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
-			 unsigned long *pdirty, unsigned long *pwriteback)
+void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
+			 unsigned long *pheadroom, unsigned long *pdirty,
+			 unsigned long *pwriteback)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
-	unsigned long head_room = PAGE_COUNTER_MAX;
-	unsigned long file_pages;
 
 	*pdirty = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_DIRTY);
 
 	/* this should eventually include NR_UNSTABLE_NFS */
 	*pwriteback = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
+	*pfilepages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
+						     (1 << LRU_ACTIVE_FILE));
+	*pheadroom = PAGE_COUNTER_MAX;
 
-	file_pages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
-						    (1 << LRU_ACTIVE_FILE));
 	while ((parent = parent_mem_cgroup(memcg))) {
 		unsigned long ceiling = min(memcg->memory.limit, memcg->high);
 		unsigned long used = page_counter_read(&memcg->memory);
 
-		head_room = min(head_room, ceiling - min(ceiling, used));
+		*pheadroom = min(*pheadroom, ceiling - min(ceiling, used));
 		memcg = parent;
 	}
-
-	*pavail = file_pages + head_room;
 }
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 56c0bff..2c90357 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -684,13 +684,19 @@ static unsigned long hard_dirty_limit(struct wb_domain *dom,
 	return max(thresh, dom->dirty_limit);
 }
 
-/* memory available to a memcg domain is capped by system-wide clean memory */
-static void mdtc_cap_avail(struct dirty_throttle_control *mdtc)
+/*
+ * Memory which can be further allocated to a memcg domain is capped by
+ * system-wide clean memory excluding the amount being used in the domain.
+ */
+static void mdtc_calc_avail(struct dirty_throttle_control *mdtc,
+			    unsigned long filepages, unsigned long headroom)
 {
 	struct dirty_throttle_control *gdtc = mdtc_gdtc(mdtc);
-	unsigned long clean = gdtc->avail - min(gdtc->avail, gdtc->dirty);
+	unsigned long clean = filepages - min(filepages, mdtc->dirty);
+	unsigned long global_clean = gdtc->avail - min(gdtc->avail, gdtc->dirty);
+	unsigned long other_clean = global_clean - min(global_clean, clean);
 
-	mdtc->avail = min(mdtc->avail, clean);
+	mdtc->avail = filepages + min(headroom, other_clean);
 }
 
 /**
@@ -1564,16 +1570,16 @@ static void balance_dirty_pages(struct address_space *mapping,
 		}
 
 		if (mdtc) {
-			unsigned long writeback;
+			unsigned long filepages, headroom, writeback;
 
 			/*
 			 * If @wb belongs to !root memcg, repeat the same
 			 * basic calculations for the memcg domain.
 			 */
-			mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty,
-					    &writeback);
-			mdtc_cap_avail(mdtc);
+			mem_cgroup_wb_stats(wb, &filepages, &headroom,
+					    &mdtc->dirty, &writeback);
 			mdtc->dirty += writeback;
+			mdtc_calc_avail(mdtc, filepages, headroom);
 
 			domain_dirty_limits(mdtc);
 
@@ -1895,10 +1901,11 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 		return true;
 
 	if (mdtc) {
-		unsigned long writeback;
+		unsigned long filepages, headroom, writeback;
 
-		mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty, &writeback);
-		mdtc_cap_avail(mdtc);
+		mem_cgroup_wb_stats(wb, &filepages, &headroom, &mdtc->dirty,
+				    &writeback);
+		mdtc_calc_avail(mdtc, filepages, headroom);
 		domain_dirty_limits(mdtc);	/* ditto, ignore writeback */
 
 		if (mdtc->dirty > mdtc->bg_thresh)
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] writeback: bdi_writeback iteration must not skip dying ones
@ 2015-10-02 14:12     ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2015-10-02 14:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, linux-kernel, cgroups, tytso, dedekind1, decui, kernel-team

On Tue 29-09-15 12:47:52, Tejun Heo wrote:
> bdi_for_each_wb() is used in several places to wake up or issue
> writeback work items to all wb's (bdi_writeback's) on a given bdi.
> The iteration is performed by walking bdi->cgwb_tree; however, the
> tree only indexes wb's which are currently active.
> 
> For example, when a memcg gets associated with a different blkcg, the
> old wb is removed from the tree so that the new one can be indexed.
> The old wb starts dying from then on but will linger till all its
> inodes are drained.  As these dying wb's may still host dirty inodes,
> writeback operations which affect all wb's must include them.
> bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
> failing to sync the inodes belonging to those wb's.
> 
> This patch adds a RCU protected @bdi->wb_list which lists all wb's
> beloinging to that bdi.  wb's are added on creation and removed on
> release rather than on the start of destruction.  bdi_for_each_wb()
> usages are replaced with list_for_each[_continue]_rcu() iterations
> over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com>
> Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()")
> Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
> ---
>  fs/fs-writeback.c                | 28 ++++++++++++------
>  include/linux/backing-dev-defs.h |  3 ++
>  include/linux/backing-dev.h      | 63 ----------------------------------------
>  mm/backing-dev.c                 | 17 ++++++++++-
>  mm/page-writeback.c              |  3 +-
>  5 files changed, 39 insertions(+), 75 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index d0da306..afa4848 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -778,19 +778,24 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
>  				  struct wb_writeback_work *base_work,
>  				  bool skip_if_busy)
>  {
> -	int next_memcg_id = 0;
> -	struct bdi_writeback *wb;
> -	struct wb_iter iter;
> +	struct bdi_writeback *last_wb = NULL;
> +	struct bdi_writeback *wb = list_entry_rcu(&bdi->wb_list,
> +						struct bdi_writeback, bdi_node);
>  
>  	might_sleep();
>  restart:
>  	rcu_read_lock();
> -	bdi_for_each_wb(wb, bdi, &iter, next_memcg_id) {
> +	list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) {
>  		DEFINE_WB_COMPLETION_ONSTACK(fallback_work_done);
>  		struct wb_writeback_work fallback_work;
>  		struct wb_writeback_work *work;
>  		long nr_pages;
>  
> +		if (last_wb) {
> +			wb_put(last_wb);
> +			last_wb = NULL;
> +		}

But you seem to forget to drop last_wb reference in case this was the last
wb in the list, don't you?

> +
>  		/* SYNC_ALL writes out I_DIRTY_TIME too */
>  		if (!wb_has_dirty_io(wb) &&
>  		    (base_work->sync_mode == WB_SYNC_NONE ||
> @@ -819,7 +824,14 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
>  
>  		wb_queue_work(wb, work);
>  
> -		next_memcg_id = wb->memcg_css->id + 1;
> +		/*
> +		 * Pin @wb so that it stays on @bdi->wb_list.  This allows
> +		 * continuing iteration from @wb after dropping and
> +		 * regrabbing rcu read lock.
> +		 */
> +		wb_get(wb);
> +		last_wb = wb;
> +
>  		rcu_read_unlock();
>  		wb_wait_for_completion(bdi, &fallback_work_done);
>  		goto restart;
...
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 2df8ddc..c48070b 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -686,6 +691,9 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
>  	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
>  		cgwb_kill(*slot);
>  
> +	/* wb may get released after @bdi is freed, sever list head */
> +	list_del(&bdi->wb_list);
> +

But we wait for bdi->usage_cnt to drop to 0 which means there's no wb,
don't we? What am I missing?

> @@ -764,15 +772,22 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
>  
>  int bdi_init(struct backing_dev_info *bdi)
>  {
> +	int ret;
> +
>  	bdi->dev = NULL;
>  
>  	bdi->min_ratio = 0;
>  	bdi->max_ratio = 100;
>  	bdi->max_prop_frac = FPROP_FRAC_BASE;
>  	INIT_LIST_HEAD(&bdi->bdi_list);
> +	INIT_LIST_HEAD(&bdi->wb_list);
>  	init_waitqueue_head(&bdi->wb_waitq);
>  
> -	return cgwb_bdi_init(bdi);
> +	ret = cgwb_bdi_init(bdi);
> +
> +	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);

Won't this be more logical in cgwb_bdi_init()?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] writeback: bdi_writeback iteration must not skip dying ones
@ 2015-10-02 14:12     ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2015-10-02 14:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	dedekind1-Re5JQEeQqe8AvxtiuMwx3w, decui-0li6OtcxBFHby3iVrkZq2A,
	kernel-team-b10kYP2dOMg

On Tue 29-09-15 12:47:52, Tejun Heo wrote:
> bdi_for_each_wb() is used in several places to wake up or issue
> writeback work items to all wb's (bdi_writeback's) on a given bdi.
> The iteration is performed by walking bdi->cgwb_tree; however, the
> tree only indexes wb's which are currently active.
> 
> For example, when a memcg gets associated with a different blkcg, the
> old wb is removed from the tree so that the new one can be indexed.
> The old wb starts dying from then on but will linger till all its
> inodes are drained.  As these dying wb's may still host dirty inodes,
> writeback operations which affect all wb's must include them.
> bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
> failing to sync the inodes belonging to those wb's.
> 
> This patch adds a RCU protected @bdi->wb_list which lists all wb's
> beloinging to that bdi.  wb's are added on creation and removed on
> release rather than on the start of destruction.  bdi_for_each_wb()
> usages are replaced with list_for_each[_continue]_rcu() iterations
> over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Reported-and-tested-by: Artem Bityutskiy <dedekind1-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()")
> Link: http://lkml.kernel.org/g/1443012552.19983.209.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
> ---
>  fs/fs-writeback.c                | 28 ++++++++++++------
>  include/linux/backing-dev-defs.h |  3 ++
>  include/linux/backing-dev.h      | 63 ----------------------------------------
>  mm/backing-dev.c                 | 17 ++++++++++-
>  mm/page-writeback.c              |  3 +-
>  5 files changed, 39 insertions(+), 75 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index d0da306..afa4848 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -778,19 +778,24 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
>  				  struct wb_writeback_work *base_work,
>  				  bool skip_if_busy)
>  {
> -	int next_memcg_id = 0;
> -	struct bdi_writeback *wb;
> -	struct wb_iter iter;
> +	struct bdi_writeback *last_wb = NULL;
> +	struct bdi_writeback *wb = list_entry_rcu(&bdi->wb_list,
> +						struct bdi_writeback, bdi_node);
>  
>  	might_sleep();
>  restart:
>  	rcu_read_lock();
> -	bdi_for_each_wb(wb, bdi, &iter, next_memcg_id) {
> +	list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) {
>  		DEFINE_WB_COMPLETION_ONSTACK(fallback_work_done);
>  		struct wb_writeback_work fallback_work;
>  		struct wb_writeback_work *work;
>  		long nr_pages;
>  
> +		if (last_wb) {
> +			wb_put(last_wb);
> +			last_wb = NULL;
> +		}

But you seem to forget to drop last_wb reference in case this was the last
wb in the list, don't you?

> +
>  		/* SYNC_ALL writes out I_DIRTY_TIME too */
>  		if (!wb_has_dirty_io(wb) &&
>  		    (base_work->sync_mode == WB_SYNC_NONE ||
> @@ -819,7 +824,14 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
>  
>  		wb_queue_work(wb, work);
>  
> -		next_memcg_id = wb->memcg_css->id + 1;
> +		/*
> +		 * Pin @wb so that it stays on @bdi->wb_list.  This allows
> +		 * continuing iteration from @wb after dropping and
> +		 * regrabbing rcu read lock.
> +		 */
> +		wb_get(wb);
> +		last_wb = wb;
> +
>  		rcu_read_unlock();
>  		wb_wait_for_completion(bdi, &fallback_work_done);
>  		goto restart;
...
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 2df8ddc..c48070b 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -686,6 +691,9 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
>  	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
>  		cgwb_kill(*slot);
>  
> +	/* wb may get released after @bdi is freed, sever list head */
> +	list_del(&bdi->wb_list);
> +

But we wait for bdi->usage_cnt to drop to 0 which means there's no wb,
don't we? What am I missing?

> @@ -764,15 +772,22 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
>  
>  int bdi_init(struct backing_dev_info *bdi)
>  {
> +	int ret;
> +
>  	bdi->dev = NULL;
>  
>  	bdi->min_ratio = 0;
>  	bdi->max_ratio = 100;
>  	bdi->max_prop_frac = FPROP_FRAC_BASE;
>  	INIT_LIST_HEAD(&bdi->bdi_list);
> +	INIT_LIST_HEAD(&bdi->wb_list);
>  	init_waitqueue_head(&bdi->wb_waitq);
>  
> -	return cgwb_bdi_init(bdi);
> +	ret = cgwb_bdi_init(bdi);
> +
> +	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);

Won't this be more logical in cgwb_bdi_init()?

								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback()
@ 2015-10-02 14:12     ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2015-10-02 14:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, linux-kernel, cgroups, tytso, dedekind1, decui, kernel-team

On Tue 29-09-15 12:47:51, Tejun Heo wrote:
> wakeup_dirtytime_writeback() walks and wakes up all wb's of all bdi's;
> unfortunately, it was always waking up bdi->wb instead of the wb being
> walked.  Fix it.

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.com>

								Honza

> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Fixes: 001fe6f617b1 ("writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's")
> ---
>  fs/fs-writeback.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 091a364..d0da306 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1897,8 +1897,8 @@ static void wakeup_dirtytime_writeback(struct work_struct *w)
>  		struct wb_iter iter;
>  
>  		bdi_for_each_wb(wb, bdi, &iter, 0)
> -			if (!list_empty(&bdi->wb.b_dirty_time))
> -				wb_wakeup(&bdi->wb);
> +			if (!list_empty(&wb->b_dirty_time))
> +				wb_wakeup(wb);
>  	}
>  	rcu_read_unlock();
>  	schedule_delayed_work(&dirtytime_work, dirtytime_expire_interval * HZ);
> -- 
> 2.4.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback()
@ 2015-10-02 14:12     ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2015-10-02 14:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	dedekind1-Re5JQEeQqe8AvxtiuMwx3w, decui-0li6OtcxBFHby3iVrkZq2A,
	kernel-team-b10kYP2dOMg

On Tue 29-09-15 12:47:51, Tejun Heo wrote:
> wakeup_dirtytime_writeback() walks and wakes up all wb's of all bdi's;
> unfortunately, it was always waking up bdi->wb instead of the wb being
> walked.  Fix it.

Looks good. You can add:

Reviewed-by: Jan Kara <jack-IBi9RG/b67k@public.gmane.org>

								Honza

> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Fixes: 001fe6f617b1 ("writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's")
> ---
>  fs/fs-writeback.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 091a364..d0da306 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1897,8 +1897,8 @@ static void wakeup_dirtytime_writeback(struct work_struct *w)
>  		struct wb_iter iter;
>  
>  		bdi_for_each_wb(wb, bdi, &iter, 0)
> -			if (!list_empty(&bdi->wb.b_dirty_time))
> -				wb_wakeup(&bdi->wb);
> +			if (!list_empty(&wb->b_dirty_time))
> +				wb_wakeup(wb);
>  	}
>  	rcu_read_unlock();
>  	schedule_delayed_work(&dirtytime_work, dirtytime_expire_interval * HZ);
> -- 
> 2.4.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration
@ 2015-10-02 14:13     ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2015-10-02 14:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, linux-kernel, cgroups, tytso, dedekind1, decui, kernel-team

On Tue 29-09-15 12:47:50, Tejun Heo wrote:
> laptop_mode_timer_fn() was using bdi_for_each_wb() without the
> required RCU locking leading to the following warning.
> 
>  WARNING: CPU: 0 PID: 0 at include/linux/backing-dev.h:415 laptop_mode_timer_fn+0x106/0x170()
>  ...
>  Call Trace:
>   <IRQ>  [<ffffffff81480cdc>] dump_stack+0x4e/0x82
>   [<ffffffff81051912>] warn_slowpath_common+0x82/0xc0
>   [<ffffffff81051a0a>] warn_slowpath_null+0x1a/0x20
>   [<ffffffff8115f0e6>] laptop_mode_timer_fn+0x106/0x170
>   [<ffffffff810ca8e3>] call_timer_fn+0xb3/0x2f0
>   [<ffffffff810cad25>] run_timer_softirq+0x205/0x370
>   [<ffffffff81056854>] __do_softirq+0xd4/0x460
>   [<ffffffff81056d69>] irq_exit+0x89/0xa0
>   [<ffffffff8185a892>] smp_apic_timer_interrupt+0x42/0x50
>   [<ffffffff81858a44>] apic_timer_interrupt+0x84/0x90
>  ...
> 
> Fix it by adding rcu_read_lock() around the iteration.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Fixes: a06fd6b10228 ("writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's")

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.com>

								Honza
> ---
>  mm/page-writeback.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 0a931cd..902e5f2 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1965,10 +1965,12 @@ void laptop_mode_timer_fn(unsigned long data)
>  	if (!bdi_has_dirty_io(&q->backing_dev_info))
>  		return;
>  
> +	rcu_read_lock();
>  	bdi_for_each_wb(wb, &q->backing_dev_info, &iter, 0)
>  		if (wb_has_dirty_io(wb))
>  			wb_start_writeback(wb, nr_pages, true,
>  					   WB_REASON_LAPTOP_TIMER);
> +	rcu_read_unlock();
>  }
>  
>  /*
> -- 
> 2.4.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration
@ 2015-10-02 14:13     ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2015-10-02 14:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	dedekind1-Re5JQEeQqe8AvxtiuMwx3w, decui-0li6OtcxBFHby3iVrkZq2A,
	kernel-team-b10kYP2dOMg

On Tue 29-09-15 12:47:50, Tejun Heo wrote:
> laptop_mode_timer_fn() was using bdi_for_each_wb() without the
> required RCU locking leading to the following warning.
> 
>  WARNING: CPU: 0 PID: 0 at include/linux/backing-dev.h:415 laptop_mode_timer_fn+0x106/0x170()
>  ...
>  Call Trace:
>   <IRQ>  [<ffffffff81480cdc>] dump_stack+0x4e/0x82
>   [<ffffffff81051912>] warn_slowpath_common+0x82/0xc0
>   [<ffffffff81051a0a>] warn_slowpath_null+0x1a/0x20
>   [<ffffffff8115f0e6>] laptop_mode_timer_fn+0x106/0x170
>   [<ffffffff810ca8e3>] call_timer_fn+0xb3/0x2f0
>   [<ffffffff810cad25>] run_timer_softirq+0x205/0x370
>   [<ffffffff81056854>] __do_softirq+0xd4/0x460
>   [<ffffffff81056d69>] irq_exit+0x89/0xa0
>   [<ffffffff8185a892>] smp_apic_timer_interrupt+0x42/0x50
>   [<ffffffff81858a44>] apic_timer_interrupt+0x84/0x90
>  ...
> 
> Fix it by adding rcu_read_lock() around the iteration.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Fixes: a06fd6b10228 ("writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's")

Looks good. You can add:

Reviewed-by: Jan Kara <jack-IBi9RG/b67k@public.gmane.org>

								Honza
> ---
>  mm/page-writeback.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 0a931cd..902e5f2 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1965,10 +1965,12 @@ void laptop_mode_timer_fn(unsigned long data)
>  	if (!bdi_has_dirty_io(&q->backing_dev_info))
>  		return;
>  
> +	rcu_read_lock();
>  	bdi_for_each_wb(wb, &q->backing_dev_info, &iter, 0)
>  		if (wb_has_dirty_io(wb))
>  			wb_start_writeback(wb, nr_pages, true,
>  					   WB_REASON_LAPTOP_TIMER);
> +	rcu_read_unlock();
>  }
>  
>  /*
> -- 
> 2.4.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] writeback: bdi_writeback iteration must not skip dying ones
@ 2015-10-02 18:19       ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-10-02 18:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: axboe, linux-kernel, cgroups, tytso, dedekind1, decui, kernel-team

On Fri, Oct 02, 2015 at 04:12:12PM +0200, Jan Kara wrote:
> > +		if (last_wb) {
> > +			wb_put(last_wb);
> > +			last_wb = NULL;
> > +		}
> 
> But you seem to forget to drop last_wb reference in case this was the last
> wb in the list, don't you?

You're right.  Will update the patch.

> > @@ -686,6 +691,9 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
> >  	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
> >  		cgwb_kill(*slot);
> >  
> > +	/* wb may get released after @bdi is freed, sever list head */
> > +	list_del(&bdi->wb_list);
> > +
> 
> But we wait for bdi->usage_cnt to drop to 0 which means there's no wb,
> don't we? What am I missing?

And I forgot that we were doing that.  This isn't necessary.

> > @@ -764,15 +772,22 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
> >  
> >  int bdi_init(struct backing_dev_info *bdi)
> >  {
> > +	int ret;
> > +
> >  	bdi->dev = NULL;
> >  
> >  	bdi->min_ratio = 0;
> >  	bdi->max_ratio = 100;
> >  	bdi->max_prop_frac = FPROP_FRAC_BASE;
> >  	INIT_LIST_HEAD(&bdi->bdi_list);
> > +	INIT_LIST_HEAD(&bdi->wb_list);
> >  	init_waitqueue_head(&bdi->wb_waitq);
> >  
> > -	return cgwb_bdi_init(bdi);
> > +	ret = cgwb_bdi_init(bdi);
> > +
> > +	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
> 
> Won't this be more logical in cgwb_bdi_init()?

bdi->wb_list exists whether cgwb is enabled or not, so if we move this
to cgwb_bdi_init(), we'll be duplicating it in both paths.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] writeback: bdi_writeback iteration must not skip dying ones
@ 2015-10-02 18:19       ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-10-02 18:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	dedekind1-Re5JQEeQqe8AvxtiuMwx3w, decui-0li6OtcxBFHby3iVrkZq2A,
	kernel-team-b10kYP2dOMg

On Fri, Oct 02, 2015 at 04:12:12PM +0200, Jan Kara wrote:
> > +		if (last_wb) {
> > +			wb_put(last_wb);
> > +			last_wb = NULL;
> > +		}
> 
> But you seem to forget to drop last_wb reference in case this was the last
> wb in the list, don't you?

You're right.  Will update the patch.

> > @@ -686,6 +691,9 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
> >  	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
> >  		cgwb_kill(*slot);
> >  
> > +	/* wb may get released after @bdi is freed, sever list head */
> > +	list_del(&bdi->wb_list);
> > +
> 
> But we wait for bdi->usage_cnt to drop to 0 which means there's no wb,
> don't we? What am I missing?

And I forgot that we were doing that.  This isn't necessary.

> > @@ -764,15 +772,22 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
> >  
> >  int bdi_init(struct backing_dev_info *bdi)
> >  {
> > +	int ret;
> > +
> >  	bdi->dev = NULL;
> >  
> >  	bdi->min_ratio = 0;
> >  	bdi->max_ratio = 100;
> >  	bdi->max_prop_frac = FPROP_FRAC_BASE;
> >  	INIT_LIST_HEAD(&bdi->bdi_list);
> > +	INIT_LIST_HEAD(&bdi->wb_list);
> >  	init_waitqueue_head(&bdi->wb_waitq);
> >  
> > -	return cgwb_bdi_init(bdi);
> > +	ret = cgwb_bdi_init(bdi);
> > +
> > +	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
> 
> Won't this be more logical in cgwb_bdi_init()?

bdi->wb_list exists whether cgwb is enabled or not, so if we move this
to cgwb_bdi_init(), we'll be duplicating it in both paths.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 3/5] writeback: bdi_writeback iteration must not skip dying ones
  2015-09-29 16:47   ` Tejun Heo
  (?)
  (?)
@ 2015-10-02 18:47   ` Tejun Heo
  -1 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-10-02 18:47 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, cgroups, tytso, dedekind1, decui, kernel-team

bdi_for_each_wb() is used in several places to wake up or issue
writeback work items to all wb's (bdi_writeback's) on a given bdi.
The iteration is performed by walking bdi->cgwb_tree; however, the
tree only indexes wb's which are currently active.

For example, when a memcg gets associated with a different blkcg, the
old wb is removed from the tree so that the new one can be indexed.
The old wb starts dying from then on but will linger till all its
inodes are drained.  As these dying wb's may still host dirty inodes,
writeback operations which affect all wb's must include them.
bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
failing to sync the inodes belonging to those wb's.

This patch adds a RCU protected @bdi->wb_list which lists all wb's
beloinging to that bdi.  wb's are added on creation and removed on
release rather than on the start of destruction.  bdi_for_each_wb()
usages are replaced with list_for_each[_continue]_rcu() iterations
over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.

v2: Updated as per Jan.  last_wb ref leak in bdi_split_work_to_wbs()
    fixed and unnecessary list head severing in cgwb_bdi_destroy()
    removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com>
Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()")
Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c                |   31 +++++++++++++------
 include/linux/backing-dev-defs.h |    3 +
 include/linux/backing-dev.h      |   63 ---------------------------------------
 mm/backing-dev.c                 |   14 ++++++++
 mm/page-writeback.c              |    3 -
 5 files changed, 39 insertions(+), 75 deletions(-)

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -778,19 +778,24 @@ static void bdi_split_work_to_wbs(struct
 				  struct wb_writeback_work *base_work,
 				  bool skip_if_busy)
 {
-	int next_memcg_id = 0;
-	struct bdi_writeback *wb;
-	struct wb_iter iter;
+	struct bdi_writeback *last_wb = NULL;
+	struct bdi_writeback *wb = list_entry_rcu(&bdi->wb_list,
+						struct bdi_writeback, bdi_node);
 
 	might_sleep();
 restart:
 	rcu_read_lock();
-	bdi_for_each_wb(wb, bdi, &iter, next_memcg_id) {
+	list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) {
 		DEFINE_WB_COMPLETION_ONSTACK(fallback_work_done);
 		struct wb_writeback_work fallback_work;
 		struct wb_writeback_work *work;
 		long nr_pages;
 
+		if (last_wb) {
+			wb_put(last_wb);
+			last_wb = NULL;
+		}
+
 		/* SYNC_ALL writes out I_DIRTY_TIME too */
 		if (!wb_has_dirty_io(wb) &&
 		    (base_work->sync_mode == WB_SYNC_NONE ||
@@ -819,12 +824,22 @@ restart:
 
 		wb_queue_work(wb, work);
 
-		next_memcg_id = wb->memcg_css->id + 1;
+		/*
+		 * Pin @wb so that it stays on @bdi->wb_list.  This allows
+		 * continuing iteration from @wb after dropping and
+		 * regrabbing rcu read lock.
+		 */
+		wb_get(wb);
+		last_wb = wb;
+
 		rcu_read_unlock();
 		wb_wait_for_completion(bdi, &fallback_work_done);
 		goto restart;
 	}
 	rcu_read_unlock();
+
+	if (last_wb)
+		wb_put(last_wb);
 }
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
@@ -1857,12 +1872,11 @@ void wakeup_flusher_threads(long nr_page
 	rcu_read_lock();
 	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
 		struct bdi_writeback *wb;
-		struct wb_iter iter;
 
 		if (!bdi_has_dirty_io(bdi))
 			continue;
 
-		bdi_for_each_wb(wb, bdi, &iter, 0)
+		list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
 			wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
 					   false, reason);
 	}
@@ -1894,9 +1908,8 @@ static void wakeup_dirtytime_writeback(s
 	rcu_read_lock();
 	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
 		struct bdi_writeback *wb;
-		struct wb_iter iter;
 
-		bdi_for_each_wb(wb, bdi, &iter, 0)
+		list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
 			if (!list_empty(&wb->b_dirty_time))
 				wb_wakeup(wb);
 	}
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -116,6 +116,8 @@ struct bdi_writeback {
 	struct list_head work_list;
 	struct delayed_work dwork;	/* work item used for writeback */
 
+	struct list_head bdi_node;	/* anchored at bdi->wb_list */
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct percpu_ref refcnt;	/* used only for !root wb's */
 	struct fprop_local_percpu memcg_completions;
@@ -150,6 +152,7 @@ struct backing_dev_info {
 	atomic_long_t tot_write_bandwidth;
 
 	struct bdi_writeback wb;  /* the root writeback info for this bdi */
+	struct list_head wb_list; /* list of all wbs */
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
 	struct rb_root cgwb_congested_tree; /* their congested states */
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -401,61 +401,6 @@ static inline void unlocked_inode_to_wb_
 	rcu_read_unlock();
 }
 
-struct wb_iter {
-	int			start_memcg_id;
-	struct radix_tree_iter	tree_iter;
-	void			**slot;
-};
-
-static inline struct bdi_writeback *__wb_iter_next(struct wb_iter *iter,
-						   struct backing_dev_info *bdi)
-{
-	struct radix_tree_iter *titer = &iter->tree_iter;
-
-	WARN_ON_ONCE(!rcu_read_lock_held());
-
-	if (iter->start_memcg_id >= 0) {
-		iter->slot = radix_tree_iter_init(titer, iter->start_memcg_id);
-		iter->start_memcg_id = -1;
-	} else {
-		iter->slot = radix_tree_next_slot(iter->slot, titer, 0);
-	}
-
-	if (!iter->slot)
-		iter->slot = radix_tree_next_chunk(&bdi->cgwb_tree, titer, 0);
-	if (iter->slot)
-		return *iter->slot;
-	return NULL;
-}
-
-static inline struct bdi_writeback *__wb_iter_init(struct wb_iter *iter,
-						   struct backing_dev_info *bdi,
-						   int start_memcg_id)
-{
-	iter->start_memcg_id = start_memcg_id;
-
-	if (start_memcg_id)
-		return __wb_iter_next(iter, bdi);
-	else
-		return &bdi->wb;
-}
-
-/**
- * bdi_for_each_wb - walk all wb's of a bdi in ascending memcg ID order
- * @wb_cur: cursor struct bdi_writeback pointer
- * @bdi: bdi to walk wb's of
- * @iter: pointer to struct wb_iter to be used as iteration buffer
- * @start_memcg_id: memcg ID to start iteration from
- *
- * Iterate @wb_cur through the wb's (bdi_writeback's) of @bdi in ascending
- * memcg ID order starting from @start_memcg_id.  @iter is struct wb_iter
- * to be used as temp storage during iteration.  rcu_read_lock() must be
- * held throughout iteration.
- */
-#define bdi_for_each_wb(wb_cur, bdi, iter, start_memcg_id)		\
-	for ((wb_cur) = __wb_iter_init(iter, bdi, start_memcg_id);	\
-	     (wb_cur); (wb_cur) = __wb_iter_next(iter, bdi))
-
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static inline bool inode_cgwb_enabled(struct inode *inode)
@@ -515,14 +460,6 @@ static inline void wb_blkcg_offline(stru
 {
 }
 
-struct wb_iter {
-	int		next_id;
-};
-
-#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id)		\
-	for ((iter)->next_id = (start_blkcg_id);			\
-	     ({	(wb_cur) = !(iter)->next_id++ ? &(bdi)->wb : NULL; }); )
-
 static inline int inode_congested(struct inode *inode, int cong_bits)
 {
 	return wb_congested(&inode_to_bdi(inode)->wb, cong_bits);
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -480,6 +480,10 @@ static void cgwb_release_workfn(struct w
 						release_work);
 	struct backing_dev_info *bdi = wb->bdi;
 
+	spin_lock_irq(&cgwb_lock);
+	list_del_rcu(&wb->bdi_node);
+	spin_unlock_irq(&cgwb_lock);
+
 	wb_shutdown(wb);
 
 	css_put(wb->memcg_css);
@@ -575,6 +579,7 @@ static int cgwb_create(struct backing_de
 		ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
 		if (!ret) {
 			atomic_inc(&bdi->usage_cnt);
+			list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);
 			list_add(&wb->memcg_node, memcg_cgwb_list);
 			list_add(&wb->blkcg_node, blkcg_cgwb_list);
 			css_get(memcg_css);
@@ -764,15 +769,22 @@ static void cgwb_bdi_destroy(struct back
 
 int bdi_init(struct backing_dev_info *bdi)
 {
+	int ret;
+
 	bdi->dev = NULL;
 
 	bdi->min_ratio = 0;
 	bdi->max_ratio = 100;
 	bdi->max_prop_frac = FPROP_FRAC_BASE;
 	INIT_LIST_HEAD(&bdi->bdi_list);
+	INIT_LIST_HEAD(&bdi->wb_list);
 	init_waitqueue_head(&bdi->wb_waitq);
 
-	return cgwb_bdi_init(bdi);
+	ret = cgwb_bdi_init(bdi);
+
+	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
+
+	return ret;
 }
 EXPORT_SYMBOL(bdi_init);
 
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1956,7 +1956,6 @@ void laptop_mode_timer_fn(unsigned long
 	int nr_pages = global_page_state(NR_FILE_DIRTY) +
 		global_page_state(NR_UNSTABLE_NFS);
 	struct bdi_writeback *wb;
-	struct wb_iter iter;
 
 	/*
 	 * We want to write everything out, not just down to the dirty
@@ -1966,7 +1965,7 @@ void laptop_mode_timer_fn(unsigned long
 		return;
 
 	rcu_read_lock();
-	bdi_for_each_wb(wb, &q->backing_dev_info, &iter, 0)
+	list_for_each_entry_rcu(wb, &q->backing_dev_info.wb_list, bdi_node)
 		if (wb_has_dirty_io(wb))
 			wb_start_writeback(wb, nr_pages, true,
 					   WB_REASON_LAPTOP_TIMER);

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2015-10-02 18:47 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-29 16:47 [PATCHSET block/for-linus] writeback: cgroup writeback fixes Tejun Heo
2015-09-29 16:47 ` Tejun Heo
2015-09-29 16:47 ` [PATCH 1/5] writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration Tejun Heo
2015-09-29 16:47   ` Tejun Heo
2015-10-02 14:13   ` Jan Kara
2015-10-02 14:13     ` Jan Kara
2015-09-29 16:47 ` [PATCH 2/5] writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback() Tejun Heo
2015-10-02 14:12   ` Jan Kara
2015-10-02 14:12     ` Jan Kara
2015-09-29 16:47 ` [PATCH 3/5] writeback: bdi_writeback iteration must not skip dying ones Tejun Heo
2015-09-29 16:47   ` Tejun Heo
2015-10-02 14:12   ` Jan Kara
2015-10-02 14:12     ` Jan Kara
2015-10-02 18:19     ` Tejun Heo
2015-10-02 18:19       ` Tejun Heo
2015-10-02 18:47   ` [PATCH v2 " Tejun Heo
2015-09-29 16:47 ` [PATCH 4/5] writeback: memcg dirty_throttle_control should be initialized with wb->memcg_completions Tejun Heo
2015-09-29 16:47   ` Tejun Heo
2015-09-29 16:47 ` [PATCH 5/5] writeback: fix incorrect calculation of available memory for memcg domains Tejun Heo
2015-09-29 16:47   ` Tejun Heo
2015-09-29 17:04   ` [PATCH v2 " Tejun Heo
2015-09-29 17:04     ` Tejun Heo
2015-09-29 16:50 ` [PATCHSET block/for-linus] writeback: cgroup writeback fixes Tejun Heo
2015-09-29 16:50   ` Tejun Heo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.