* [PATCHSET] writeback: convert writeback to unbound workqueue
@ 2013-03-07 21:44 Tejun Heo
  2013-03-07 21:44 ` [PATCH 1/4] implement current_is_workqueue_rescuer() Tejun Heo
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Tejun Heo @ 2013-03-07 21:44 UTC (permalink / raw)
  To: axboe, laijs, fengguang.wu; +Cc: linux-kernel, jmoyer

Hello,

There's no reason for writeback to implement its own worker pool when
using workqueue is much simpler and more efficient.  This patchset
replaces writeback's custom worker pool with unbound workqueue and
also exports it to userland using WQ_SYSFS so that it can be tuned
from userland as requested a couple releases ago.

This patchset contains the following four patches.

 0001-implement-current_is_workqueue_rescuer.patch
 0002-writeback-remove-unused-bdi_pending_list.patch
 0003-writeback-replace-custom-worker-pool-implementation-.patch
 0004-writeback-expose-the-bdi_wq-workqueue.patch

0001-0002 are prep patches.  0003 does the conversion.  0004 makes
bdi_wq visible to userland.

This patchset is on top of v3.9-rc1 + "workqueue: implement workqueue
with custom worker attributes" patchset[1] and available in the
following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-writeback-conversion

diffstat follows.  Thanks.

 fs/fs-writeback.c                |  102 ++++-----------
 include/linux/backing-dev.h      |   16 --
 include/linux/workqueue.h        |    1
 include/trace/events/writeback.h |    5
 kernel/workqueue.c               |   13 +
 mm/backing-dev.c                 |  259 ++++-----------------------------------
 6 files changed, 80 insertions(+), 316 deletions(-)

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel/1449826



* [PATCH 1/4] implement current_is_workqueue_rescuer()
  2013-03-07 21:44 [PATCHSET] writeback: convert writeback to unbound workqueue Tejun Heo
@ 2013-03-07 21:44 ` Tejun Heo
  2013-03-13  0:42   ` Tejun Heo
  2013-03-07 21:44 ` [PATCH 2/4] writeback: remove unused bdi_pending_list Tejun Heo
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2013-03-07 21:44 UTC (permalink / raw)
  To: axboe, laijs, fengguang.wu; +Cc: linux-kernel, jmoyer, Tejun Heo

Implement a function which queries whether the current task is running
off a workqueue rescuer.  This will be used to convert writeback to
workqueue.
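
As an illustration, here is a minimal sketch of how a work function
running on a WQ_MEM_RECLAIM workqueue could use it.  The example_*
names and example_wq are hypothetical; the actual user is the
writeback conversion later in this series.

	/* hypothetical helpers and workqueue, for illustration only */
	static struct workqueue_struct *example_wq;
	static void example_do_some(struct work_struct *work);
	static void example_do_all(struct work_struct *work);

	static void example_workfn(struct work_struct *work)
	{
		if (current_is_workqueue_rescuer()) {
			/*
			 * We're running off the rescuer, which means the
			 * workqueue couldn't get a regular worker, likely
			 * due to memory pressure.  Do a bounded chunk and
			 * requeue so the rescuer can serve other work
			 * items too.
			 */
			example_do_some(work);
			queue_work(example_wq, work);
			return;
		}
		/* normal path: process everything in one go */
		example_do_all(work);
	}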

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/workqueue.h |  1 +
 kernel/workqueue.c        | 13 +++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 7f6d29a..df30763 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -451,6 +451,7 @@ extern bool cancel_delayed_work_sync(struct delayed_work *dwork);
 
 extern void workqueue_set_max_active(struct workqueue_struct *wq,
 				     int max_active);
+extern bool current_is_workqueue_rescuer(void);
 extern bool workqueue_congested(int cpu, struct workqueue_struct *wq);
 extern unsigned int work_busy(struct work_struct *work);
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 904a901..af79dd8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4062,6 +4062,19 @@ void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
 EXPORT_SYMBOL_GPL(workqueue_set_max_active);
 
 /**
+ * current_is_workqueue_rescuer - is %current workqueue rescuer?
+ *
+ * Determine whether %current is a workqueue rescuer.  Can be used from
+ * work functions to determine whether it's being run off the rescuer task.
+ */
+bool current_is_workqueue_rescuer(void)
+{
+	struct worker *worker = current_wq_worker();
+
+	return worker && worker == worker->current_pwq->wq->rescuer;
+}
+
+/**
  * workqueue_congested - test whether a workqueue is congested
  * @cpu: CPU in question
  * @wq: target workqueue
-- 
1.8.1.4



* [PATCH 2/4] writeback: remove unused bdi_pending_list
  2013-03-07 21:44 [PATCHSET] writeback: convert writeback to unbound workqueue Tejun Heo
  2013-03-07 21:44 ` [PATCH 1/4] implement current_is_workqueue_rescuer() Tejun Heo
@ 2013-03-07 21:44 ` Tejun Heo
  2013-03-12 12:02   ` Jan Kara
  2013-03-07 21:44 ` [PATCH 3/4] writeback: replace custom worker pool implementation with unbound workqueue Tejun Heo
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2013-03-07 21:44 UTC (permalink / raw)
  To: axboe, laijs, fengguang.wu; +Cc: linux-kernel, jmoyer, Tejun Heo

There's no user left.  Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Fengguang Wu <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h | 1 -
 mm/backing-dev.c            | 4 +---
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3504599..a5ef27f 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -130,7 +130,6 @@ void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
-extern struct list_head bdi_pending_list;
 
 static inline int wb_has_dirty_io(struct bdi_writeback *wb)
 {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 41733c5..657569b 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -31,13 +31,11 @@ EXPORT_SYMBOL_GPL(noop_backing_dev_info);
 static struct class *bdi_class;
 
 /*
- * bdi_lock protects updates to bdi_list and bdi_pending_list, as well as
- * reader side protection for bdi_pending_list. bdi_list has RCU reader side
+ * bdi_lock protects updates to bdi_list. bdi_list has RCU reader side
  * locking.
  */
 DEFINE_SPINLOCK(bdi_lock);
 LIST_HEAD(bdi_list);
-LIST_HEAD(bdi_pending_list);
 
 void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2)
 {
-- 
1.8.1.4



* [PATCH 3/4] writeback: replace custom worker pool implementation with unbound workqueue
  2013-03-07 21:44 [PATCHSET] writeback: convert writeback to unbound workqueue Tejun Heo
  2013-03-07 21:44 ` [PATCH 1/4] implement current_is_workqueue_rescuer() Tejun Heo
  2013-03-07 21:44 ` [PATCH 2/4] writeback: remove unused bdi_pending_list Tejun Heo
@ 2013-03-07 21:44 ` Tejun Heo
  2013-03-12 15:05   ` Jan Kara
  2013-03-07 21:44 ` [PATCH 4/4] writeback: expose the bdi_wq workqueue Tejun Heo
  2013-03-12 15:06 ` [PATCHSET] writeback: convert writeback to unbound workqueue Jens Axboe
  4 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2013-03-07 21:44 UTC (permalink / raw)
  To: axboe, laijs, fengguang.wu; +Cc: linux-kernel, jmoyer, Tejun Heo

Writeback implements its own worker pool - each bdi can be associated
with a worker thread which is created and destroyed dynamically.  The
worker thread for the default bdi is always present and serves as the
"forker" thread which forks off worker threads for other bdis.

There's no reason for writeback to implement its own worker pool when
using an unbound workqueue instead is much simpler and more efficient.
This patch replaces the custom worker pool implementation in writeback
with an unbound workqueue.

The conversion isn't too complicated, but the following points are
worth mentioning.

* bdi_writeback->last_active, task and wakeup_timer are removed.
  delayed_work ->dwork is added instead.  Explicit timer handling is
  no longer necessary.  Everything works by queueing / modding /
  flushing / canceling the delayed_work item (see the lifecycle
  sketch after this list).

* bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
  bdi_writeback->dwork.  On each execution, it processes
  bdi->work_list and reschedules itself if there are more things to
  do.

  The function also handles low-mem condition, which used to be
  handled by the forker thread.  If the function is running off a
  rescuer thread, it only writes out a limited number of pages so that
  the rescuer can serve other bdis too.  This preserves the flusher
  creation failure behavior of the forker thread.

* INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
  bdi_writeback_workfn() about on-going bdi unregistration so that it
  always drains work_list even if it's running off the rescuer.  Note
  that the original code was broken in this regard.  Under memory
  pressure, a bdi could finish unregistration with non-empty
  work_list.

* The default bdi is no longer special.  It now is treated the same as
  any other bdi and bdi_cap_flush_forker() is removed.

* BDI_pending is no longer used.  Removed.

* Some tracepoints become non-applicable.  The following TPs are
  removed - writeback_nothread, writeback_wake_thread,
  writeback_wake_forker_thread, writeback_thread_start,
  writeback_thread_stop.
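
For reference, the delayed_work lifecycle that replaces the explicit
timer handling boils down to the following calls.  This is not new
code, just the calls from the patch below collected in one place:

	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn); /* bdi init */
	mod_delayed_work(bdi_wq, &wb->dwork, 0);	/* run writeback now */
	mod_delayed_work(bdi_wq, &wb->dwork, timeout);	/* (re)arm the wakeup */
	queue_delayed_work(bdi_wq, &wb->dwork, delay);	/* reschedule from workfn */
	flush_delayed_work(&wb->dwork);			/* drain on shutdown */
	cancel_delayed_work_sync(&wb->dwork);		/* final shutdown */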

Everything, including devices coming and going away and rescuer
operation under simulated memory pressure, seems to work fine in my
test setup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
---
 fs/fs-writeback.c                | 102 +++++-----------
 include/linux/backing-dev.h      |  15 +--
 include/trace/events/writeback.h |   5 -
 mm/backing-dev.c                 | 255 +++++----------------------------------
 4 files changed, 65 insertions(+), 312 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 21f46fb..8067d37 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -22,7 +22,6 @@
 #include <linux/mm.h>
 #include <linux/pagemap.h>
 #include <linux/kthread.h>
-#include <linux/freezer.h>
 #include <linux/writeback.h>
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
@@ -88,20 +87,6 @@ static inline struct inode *wb_inode(struct list_head *head)
 #define CREATE_TRACE_POINTS
 #include <trace/events/writeback.h>
 
-/* Wakeup flusher thread or forker thread to fork it. Requires bdi->wb_lock. */
-static void bdi_wakeup_flusher(struct backing_dev_info *bdi)
-{
-	if (bdi->wb.task) {
-		wake_up_process(bdi->wb.task);
-	} else {
-		/*
-		 * The bdi thread isn't there, wake up the forker thread which
-		 * will create and run it.
-		 */
-		wake_up_process(default_backing_dev_info.wb.task);
-	}
-}
-
 static void bdi_queue_work(struct backing_dev_info *bdi,
 			   struct wb_writeback_work *work)
 {
@@ -109,10 +94,9 @@ static void bdi_queue_work(struct backing_dev_info *bdi,
 
 	spin_lock_bh(&bdi->wb_lock);
 	list_add_tail(&work->list, &bdi->work_list);
-	if (!bdi->wb.task)
-		trace_writeback_nothread(bdi, work);
-	bdi_wakeup_flusher(bdi);
 	spin_unlock_bh(&bdi->wb_lock);
+
+	mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
 }
 
 static void
@@ -127,10 +111,8 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
 	 */
 	work = kzalloc(sizeof(*work), GFP_ATOMIC);
 	if (!work) {
-		if (bdi->wb.task) {
-			trace_writeback_nowork(bdi);
-			wake_up_process(bdi->wb.task);
-		}
+		trace_writeback_nowork(bdi);
+		mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
 		return;
 	}
 
@@ -177,9 +159,7 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi)
 	 * writeback as soon as there is no other work to do.
 	 */
 	trace_writeback_wake_background(bdi);
-	spin_lock_bh(&bdi->wb_lock);
-	bdi_wakeup_flusher(bdi);
-	spin_unlock_bh(&bdi->wb_lock);
+	mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
 }
 
 /*
@@ -1020,66 +1000,48 @@ long wb_do_writeback(struct bdi_writeback *wb, int force_wait)
 
 /*
  * Handle writeback of dirty data for the device backed by this bdi. Also
- * wakes up periodically and does kupdated style flushing.
+ * reschedules periodically and does kupdated style flushing.
  */
-int bdi_writeback_thread(void *data)
+void bdi_writeback_workfn(struct work_struct *work)
 {
-	struct bdi_writeback *wb = data;
+	struct bdi_writeback *wb = container_of(to_delayed_work(work),
+						struct bdi_writeback, dwork);
 	struct backing_dev_info *bdi = wb->bdi;
 	long pages_written;
 
 	current->flags |= PF_SWAPWRITE;
-	set_freezable();
-	wb->last_active = jiffies;
-
-	/*
-	 * Our parent may run at a different priority, just set us to normal
-	 */
-	set_user_nice(current, 0);
-
-	trace_writeback_thread_start(bdi);
 
-	while (!kthread_freezable_should_stop(NULL)) {
+	if (likely(!current_is_workqueue_rescuer() ||
+		   list_empty(&bdi->bdi_list))) {
 		/*
-		 * Remove own delayed wake-up timer, since we are already awake
-		 * and we'll take care of the periodic write-back.
+		 * The normal path.  Keep writing back @bdi until its
+		 * work_list is empty.  Note that this path is also taken
+		 * if @bdi is shutting down even when we're running off the
+		 * rescuer as work_list needs to be drained.
 		 */
-		del_timer(&wb->wakeup_timer);
-
-		pages_written = wb_do_writeback(wb, 0);
-
+		do {
+			pages_written = wb_do_writeback(wb, 0);
+			trace_writeback_pages_written(pages_written);
+		} while (!list_empty(&bdi->work_list));
+	} else {
+		/*
+		 * bdi_wq can't get enough workers and we're running off
+		 * the emergency worker.  Don't hog it.  Hopefully, 1024 is
+		 * enough for efficient IO.
+		 */
+		pages_written = writeback_inodes_wb(&bdi->wb, 1024,
+						    WB_REASON_FORKER_THREAD);
 		trace_writeback_pages_written(pages_written);
-
-		if (pages_written)
-			wb->last_active = jiffies;
-
-		set_current_state(TASK_INTERRUPTIBLE);
-		if (!list_empty(&bdi->work_list) || kthread_should_stop()) {
-			__set_current_state(TASK_RUNNING);
-			continue;
-		}
-
-		if (wb_has_dirty_io(wb) && dirty_writeback_interval)
-			schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
-		else {
-			/*
-			 * We have nothing to do, so can go sleep without any
-			 * timeout and save power. When a work is queued or
-			 * something is made dirty - we will be woken up.
-			 */
-			schedule();
-		}
 	}
 
-	/* Flush any work that raced with us exiting */
-	if (!list_empty(&bdi->work_list))
-		wb_do_writeback(wb, 1);
+	if (!list_empty(&bdi->work_list) ||
+	    (wb_has_dirty_io(wb) && dirty_writeback_interval))
+		queue_delayed_work(bdi_wq, &wb->dwork,
+			msecs_to_jiffies(dirty_writeback_interval * 10));
 
-	trace_writeback_thread_stop(bdi);
-	return 0;
+	current->flags &= ~PF_SWAPWRITE;
 }
 
-
 /*
  * Start writeback of `nr_pages' pages.  If `nr_pages' is zero, write back
  * the whole world.
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index a5ef27f..c388155 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -18,6 +18,7 @@
 #include <linux/writeback.h>
 #include <linux/atomic.h>
 #include <linux/sysctl.h>
+#include <linux/workqueue.h>
 
 struct page;
 struct device;
@@ -27,7 +28,6 @@ struct dentry;
  * Bits in backing_dev_info.state
  */
 enum bdi_state {
-	BDI_pending,		/* On its way to being activated */
 	BDI_wb_alloc,		/* Default embedded wb allocated */
 	BDI_async_congested,	/* The async (write) queue is getting full */
 	BDI_sync_congested,	/* The sync queue is getting full */
@@ -53,10 +53,8 @@ struct bdi_writeback {
 	unsigned int nr;
 
 	unsigned long last_old_flush;	/* last old data flush */
-	unsigned long last_active;	/* last time bdi thread was active */
 
-	struct task_struct *task;	/* writeback thread */
-	struct timer_list wakeup_timer; /* used for delayed bdi thread wakeup */
+	struct delayed_work dwork;	/* work item used for writeback */
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
@@ -123,7 +121,7 @@ int bdi_setup_and_register(struct backing_dev_info *, char *, unsigned int);
 void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
 			enum wb_reason reason);
 void bdi_start_background_writeback(struct backing_dev_info *bdi);
-int bdi_writeback_thread(void *data);
+void bdi_writeback_workfn(struct work_struct *work);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
 void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
@@ -131,6 +129,8 @@ void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
 
+extern struct workqueue_struct *bdi_wq;
+
 static inline int wb_has_dirty_io(struct bdi_writeback *wb)
 {
 	return !list_empty(&wb->b_dirty) ||
@@ -335,11 +335,6 @@ static inline bool bdi_cap_swap_backed(struct backing_dev_info *bdi)
 	return bdi->capabilities & BDI_CAP_SWAP_BACKED;
 }
 
-static inline bool bdi_cap_flush_forker(struct backing_dev_info *bdi)
-{
-	return bdi == &default_backing_dev_info;
-}
-
 static inline bool mapping_cap_writeback_dirty(struct address_space *mapping)
 {
 	return bdi_cap_writeback_dirty(mapping->backing_dev_info);
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 6a16fd2..464ea82 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -183,7 +183,6 @@ DECLARE_EVENT_CLASS(writeback_work_class,
 DEFINE_EVENT(writeback_work_class, name, \
 	TP_PROTO(struct backing_dev_info *bdi, struct wb_writeback_work *work), \
 	TP_ARGS(bdi, work))
-DEFINE_WRITEBACK_WORK_EVENT(writeback_nothread);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_queue);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_exec);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_start);
@@ -222,12 +221,8 @@ DEFINE_EVENT(writeback_class, name, \
 
 DEFINE_WRITEBACK_EVENT(writeback_nowork);
 DEFINE_WRITEBACK_EVENT(writeback_wake_background);
-DEFINE_WRITEBACK_EVENT(writeback_wake_thread);
-DEFINE_WRITEBACK_EVENT(writeback_wake_forker_thread);
 DEFINE_WRITEBACK_EVENT(writeback_bdi_register);
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
-DEFINE_WRITEBACK_EVENT(writeback_thread_start);
-DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 657569b..2857d4f 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -37,6 +37,9 @@ static struct class *bdi_class;
 DEFINE_SPINLOCK(bdi_lock);
 LIST_HEAD(bdi_list);
 
+/* bdi_wq serves all asynchronous writeback tasks */
+struct workqueue_struct *bdi_wq;
+
 void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2)
 {
 	if (wb1 < wb2) {
@@ -255,6 +258,11 @@ static int __init default_bdi_init(void)
 {
 	int err;
 
+	bdi_wq = alloc_workqueue("writeback", WQ_MEM_RECLAIM | WQ_FREEZABLE |
+					      WQ_UNBOUND, 0);
+	if (!bdi_wq)
+		return -ENOMEM;
+
 	err = bdi_init(&default_backing_dev_info);
 	if (!err)
 		bdi_register(&default_backing_dev_info, NULL, "default");
@@ -269,26 +277,6 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi)
 	return wb_has_dirty_io(&bdi->wb);
 }
 
-static void wakeup_timer_fn(unsigned long data)
-{
-	struct backing_dev_info *bdi = (struct backing_dev_info *)data;
-
-	spin_lock_bh(&bdi->wb_lock);
-	if (bdi->wb.task) {
-		trace_writeback_wake_thread(bdi);
-		wake_up_process(bdi->wb.task);
-	} else if (bdi->dev) {
-		/*
-		 * When bdi tasks are inactive for long time, they are killed.
-		 * In this case we have to wake-up the forker thread which
-		 * should create and run the bdi thread.
-		 */
-		trace_writeback_wake_forker_thread(bdi);
-		wake_up_process(default_backing_dev_info.wb.task);
-	}
-	spin_unlock_bh(&bdi->wb_lock);
-}
-
 /*
  * This function is used when the first inode for this bdi is marked dirty. It
  * wakes-up the corresponding bdi thread which should then take care of the
@@ -305,176 +293,7 @@ void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi)
 	unsigned long timeout;
 
 	timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
-	mod_timer(&bdi->wb.wakeup_timer, jiffies + timeout);
-}
-
-/*
- * Calculate the longest interval (jiffies) bdi threads are allowed to be
- * inactive.
- */
-static unsigned long bdi_longest_inactive(void)
-{
-	unsigned long interval;
-
-	interval = msecs_to_jiffies(dirty_writeback_interval * 10);
-	return max(5UL * 60 * HZ, interval);
-}
-
-/*
- * Clear pending bit and wakeup anybody waiting for flusher thread creation or
- * shutdown
- */
-static void bdi_clear_pending(struct backing_dev_info *bdi)
-{
-	clear_bit(BDI_pending, &bdi->state);
-	smp_mb__after_clear_bit();
-	wake_up_bit(&bdi->state, BDI_pending);
-}
-
-static int bdi_forker_thread(void *ptr)
-{
-	struct bdi_writeback *me = ptr;
-
-	current->flags |= PF_SWAPWRITE;
-	set_freezable();
-
-	/*
-	 * Our parent may run at a different priority, just set us to normal
-	 */
-	set_user_nice(current, 0);
-
-	for (;;) {
-		struct task_struct *task = NULL;
-		struct backing_dev_info *bdi;
-		enum {
-			NO_ACTION,   /* Nothing to do */
-			FORK_THREAD, /* Fork bdi thread */
-			KILL_THREAD, /* Kill inactive bdi thread */
-		} action = NO_ACTION;
-
-		/*
-		 * Temporary measure, we want to make sure we don't see
-		 * dirty data on the default backing_dev_info
-		 */
-		if (wb_has_dirty_io(me) || !list_empty(&me->bdi->work_list)) {
-			del_timer(&me->wakeup_timer);
-			wb_do_writeback(me, 0);
-		}
-
-		spin_lock_bh(&bdi_lock);
-		/*
-		 * In the following loop we are going to check whether we have
-		 * some work to do without any synchronization with tasks
-		 * waking us up to do work for them. Set the task state here
-		 * so that we don't miss wakeups after verifying conditions.
-		 */
-		set_current_state(TASK_INTERRUPTIBLE);
-
-		list_for_each_entry(bdi, &bdi_list, bdi_list) {
-			bool have_dirty_io;
-
-			if (!bdi_cap_writeback_dirty(bdi) ||
-			     bdi_cap_flush_forker(bdi))
-				continue;
-
-			WARN(!test_bit(BDI_registered, &bdi->state),
-			     "bdi %p/%s is not registered!\n", bdi, bdi->name);
-
-			have_dirty_io = !list_empty(&bdi->work_list) ||
-					wb_has_dirty_io(&bdi->wb);
-
-			/*
-			 * If the bdi has work to do, but the thread does not
-			 * exist - create it.
-			 */
-			if (!bdi->wb.task && have_dirty_io) {
-				/*
-				 * Set the pending bit - if someone will try to
-				 * unregister this bdi - it'll wait on this bit.
-				 */
-				set_bit(BDI_pending, &bdi->state);
-				action = FORK_THREAD;
-				break;
-			}
-
-			spin_lock(&bdi->wb_lock);
-
-			/*
-			 * If there is no work to do and the bdi thread was
-			 * inactive long enough - kill it. The wb_lock is taken
-			 * to make sure no-one adds more work to this bdi and
-			 * wakes the bdi thread up.
-			 */
-			if (bdi->wb.task && !have_dirty_io &&
-			    time_after(jiffies, bdi->wb.last_active +
-						bdi_longest_inactive())) {
-				task = bdi->wb.task;
-				bdi->wb.task = NULL;
-				spin_unlock(&bdi->wb_lock);
-				set_bit(BDI_pending, &bdi->state);
-				action = KILL_THREAD;
-				break;
-			}
-			spin_unlock(&bdi->wb_lock);
-		}
-		spin_unlock_bh(&bdi_lock);
-
-		/* Keep working if default bdi still has things to do */
-		if (!list_empty(&me->bdi->work_list))
-			__set_current_state(TASK_RUNNING);
-
-		switch (action) {
-		case FORK_THREAD:
-			__set_current_state(TASK_RUNNING);
-			task = kthread_create(bdi_writeback_thread, &bdi->wb,
-					      "flush-%s", dev_name(bdi->dev));
-			if (IS_ERR(task)) {
-				/*
-				 * If thread creation fails, force writeout of
-				 * the bdi from the thread. Hopefully 1024 is
-				 * large enough for efficient IO.
-				 */
-				writeback_inodes_wb(&bdi->wb, 1024,
-						    WB_REASON_FORKER_THREAD);
-			} else {
-				/*
-				 * The spinlock makes sure we do not lose
-				 * wake-ups when racing with 'bdi_queue_work()'.
-				 * And as soon as the bdi thread is visible, we
-				 * can start it.
-				 */
-				spin_lock_bh(&bdi->wb_lock);
-				bdi->wb.task = task;
-				spin_unlock_bh(&bdi->wb_lock);
-				wake_up_process(task);
-			}
-			bdi_clear_pending(bdi);
-			break;
-
-		case KILL_THREAD:
-			__set_current_state(TASK_RUNNING);
-			kthread_stop(task);
-			bdi_clear_pending(bdi);
-			break;
-
-		case NO_ACTION:
-			if (!wb_has_dirty_io(me) || !dirty_writeback_interval)
-				/*
-				 * There are no dirty data. The only thing we
-				 * should now care about is checking for
-				 * inactive bdi threads and killing them. Thus,
-				 * let's sleep for longer time, save energy and
-				 * be friendly for battery-driven devices.
-				 */
-				schedule_timeout(bdi_longest_inactive());
-			else
-				schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
-			try_to_freeze();
-			break;
-		}
-	}
-
-	return 0;
+	mod_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
 }
 
 /*
@@ -487,6 +306,9 @@ static void bdi_remove_from_list(struct backing_dev_info *bdi)
 	spin_unlock_bh(&bdi_lock);
 
 	synchronize_rcu_expedited();
+
+	/* bdi_list is now unused, clear it to mark @bdi dying */
+	INIT_LIST_HEAD(&bdi->bdi_list);
 }
 
 int bdi_register(struct backing_dev_info *bdi, struct device *parent,
@@ -506,20 +328,6 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 
 	bdi->dev = dev;
 
-	/*
-	 * Just start the forker thread for our default backing_dev_info,
-	 * and add other bdi's to the list. They will get a thread created
-	 * on-demand when they need it.
-	 */
-	if (bdi_cap_flush_forker(bdi)) {
-		struct bdi_writeback *wb = &bdi->wb;
-
-		wb->task = kthread_run(bdi_forker_thread, wb, "bdi-%s",
-						dev_name(dev));
-		if (IS_ERR(wb->task))
-			return PTR_ERR(wb->task);
-	}
-
 	bdi_debug_register(bdi, dev_name(dev));
 	set_bit(BDI_registered, &bdi->state);
 
@@ -543,8 +351,6 @@ EXPORT_SYMBOL(bdi_register_dev);
  */
 static void bdi_wb_shutdown(struct backing_dev_info *bdi)
 {
-	struct task_struct *task;
-
 	if (!bdi_cap_writeback_dirty(bdi))
 		return;
 
@@ -554,22 +360,20 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi)
 	bdi_remove_from_list(bdi);
 
 	/*
-	 * If setup is pending, wait for that to complete first
+	 * Drain work list and shutdown the delayed_work.  At this point,
+	 * @bdi->bdi_list is empty telling bdi_writeback_workfn() that @bdi
+	 * is dying and its work_list needs to be drained no matter what.
 	 */
-	wait_on_bit(&bdi->state, BDI_pending, bdi_sched_wait,
-			TASK_UNINTERRUPTIBLE);
+	mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+	flush_delayed_work(&bdi->wb.dwork);
+	WARN_ON(!list_empty(&bdi->work_list));
 
 	/*
-	 * Finally, kill the kernel thread. We don't need to be RCU
-	 * safe anymore, since the bdi is gone from visibility.
+	 * This shouldn't be necessary unless @bdi for some reason has
+	 * unflushed dirty IO after work_list is drained.  Do it anyway
+	 * just in case.
 	 */
-	spin_lock_bh(&bdi->wb_lock);
-	task = bdi->wb.task;
-	bdi->wb.task = NULL;
-	spin_unlock_bh(&bdi->wb_lock);
-
-	if (task)
-		kthread_stop(task);
+	cancel_delayed_work_sync(&bdi->wb.dwork);
 }
 
 /*
@@ -595,10 +399,8 @@ void bdi_unregister(struct backing_dev_info *bdi)
 		bdi_set_min_ratio(bdi, 0);
 		trace_writeback_bdi_unregister(bdi);
 		bdi_prune_sb(bdi);
-		del_timer_sync(&bdi->wb.wakeup_timer);
 
-		if (!bdi_cap_flush_forker(bdi))
-			bdi_wb_shutdown(bdi);
+		bdi_wb_shutdown(bdi);
 		bdi_debug_unregister(bdi);
 
 		spin_lock_bh(&bdi->wb_lock);
@@ -620,7 +422,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
 	spin_lock_init(&wb->list_lock);
-	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
+	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
 }
 
 /*
@@ -693,12 +495,11 @@ void bdi_destroy(struct backing_dev_info *bdi)
 	bdi_unregister(bdi);
 
 	/*
-	 * If bdi_unregister() had already been called earlier, the
-	 * wakeup_timer could still be armed because bdi_prune_sb()
-	 * can race with the bdi_wakeup_thread_delayed() calls from
-	 * __mark_inode_dirty().
+	 * If bdi_unregister() had already been called earlier, the dwork
+	 * could still be pending because bdi_prune_sb() can race with the
+	 * bdi_wakeup_thread_delayed() calls from __mark_inode_dirty().
 	 */
-	del_timer_sync(&bdi->wb.wakeup_timer);
+	cancel_delayed_work_sync(&bdi->wb.dwork);
 
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
 		percpu_counter_destroy(&bdi->bdi_stat[i]);
-- 
1.8.1.4



* [PATCH 4/4] writeback: expose the bdi_wq workqueue
  2013-03-07 21:44 [PATCHSET] writeback: convert writeback to unbound workqueue Tejun Heo
                   ` (2 preceding siblings ...)
  2013-03-07 21:44 ` [PATCH 3/4] writeback: replace custom worker pool implementation with unbound workqueue Tejun Heo
@ 2013-03-07 21:44 ` Tejun Heo
  2013-03-12 15:06 ` [PATCHSET] writeback: convert writeback to unbound workqueue Jens Axboe
  4 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2013-03-07 21:44 UTC (permalink / raw)
  To: axboe, laijs, fengguang.wu
  Cc: linux-kernel, jmoyer, Tejun Heo, Kay Sievers, Greg Kroah-Hartman

There are cases where userland wants to tweak the priority and
affinity of writeback flushers.  Expose bdi_wq to userland by setting
WQ_SYSFS.  It appears under /sys/bus/workqueue/devices/writeback/ and
allows adjusting maximum concurrency level, cpumask and nice level.
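
As a usage sketch, a small userland program (or an echo from a shell)
can tweak the knobs.  The following assumes the nice level attribute
is named "nice" as listed above; the value is an arbitrary example:

	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/sys/bus/workqueue/devices/writeback/nice", "w");

		if (!f)
			return 1;
		fprintf(f, "%d\n", -10);	/* arbitrary example value */
		return fclose(f) != 0;
	}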

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 mm/backing-dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 2857d4f..5025174 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -259,7 +259,7 @@ static int __init default_bdi_init(void)
 	int err;
 
 	bdi_wq = alloc_workqueue("writeback", WQ_MEM_RECLAIM | WQ_FREEZABLE |
-					      WQ_UNBOUND, 0);
+					      WQ_UNBOUND | WQ_SYSFS, 0);
 	if (!bdi_wq)
 		return -ENOMEM;
 
-- 
1.8.1.4



* Re: [PATCH 2/4] writeback: remove unused bdi_pending_list
  2013-03-07 21:44 ` [PATCH 2/4] writeback: remove unused bdi_pending_list Tejun Heo
@ 2013-03-12 12:02   ` Jan Kara
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2013-03-12 12:02 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, laijs, fengguang.wu, linux-kernel, jmoyer

On Thu 07-03-13 13:44:07, Tejun Heo wrote:
> There's no user left.  Remove it.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Fengguang Wu <fengguang.wu@intel.com>
  Looks good. You can add
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 3/4] writeback: replace custom worker pool implementation with unbound workqueue
  2013-03-07 21:44 ` [PATCH 3/4] writeback: replace custom worker pool implementation with unbound workqueue Tejun Heo
@ 2013-03-12 15:05   ` Jan Kara
  2013-03-18 22:32     ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2013-03-12 15:05 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, laijs, fengguang.wu, linux-kernel, jmoyer

On Thu 07-03-13 13:44:08, Tejun Heo wrote:
> Writeback implements its own worker pool - each bdi can be associated
> with a worker thread which is created and destroyed dynamically.  The
> worker thread for the default bdi is always present and serves as the
> "forker" thread which forks off worker threads for other bdis.
> 
> There's no reason for writeback to implement its own worker pool when
> using an unbound workqueue instead is much simpler and more efficient.
> This patch replaces the custom worker pool implementation in writeback
> with an unbound workqueue.
> 
> The conversion isn't too complicated, but the following points are
> worth mentioning.
> 
> * bdi_writeback->last_active, task and wakeup_timer are removed.
>   delayed_work ->dwork is added instead.  Explicit timer handling is
>   no longer necessary.  Everything works by either queueing / modding
>   / flushing / canceling the delayed_work item.
> 
> * bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
>   bdi_writeback->dwork.  On each execution, it processes
>   bdi->work_list and reschedules itself if there are more things to
>   do.
> 
>   The function also handles low-mem condition, which used to be
>   handled by the forker thread.  If the function is running off a
>   rescuer thread, it only writes out a limited number of pages so that
>   the rescuer can serve other bdis too.  This preserves the flusher
>   creation failure behavior of the forker thread.
> 
> * INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
>   bdi_writeback_workfn() about on-going bdi unregistration so that it
>   always drains work_list even if it's running off the rescuer.  Note
>   that the original code was broken in this regard.  Under memory
>   pressure, a bdi could finish unregistration with non-empty
>   work_list.
> 
> * The default bdi is no longer special.  It now is treated the same as
>   any other bdi and bdi_cap_flush_forker() is removed.
> 
> * BDI_pending is no longer used.  Removed.
> 
> * Some tracepoints become non-applicable.  The following TPs are
>   removed - writeback_nothread, writeback_wake_thread,
>   writeback_wake_forker_thread, writeback_thread_start,
>   writeback_thread_stop.
> 
> Everything, including devices coming and going away and rescuer
> operation under simulated memory pressure, seems to work fine in my
> test setup.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Fengguang Wu <fengguang.wu@intel.com>
> Cc: Jeff Moyer <jmoyer@redhat.com>
  The conversion looks OK. So you can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCHSET] writeback: convert writeback to unbound workqueue
  2013-03-07 21:44 [PATCHSET] writeback: convert writeback to unbound workqueue Tejun Heo
                   ` (3 preceding siblings ...)
  2013-03-07 21:44 ` [PATCH 4/4] writeback: expose the bdi_wq workqueue Tejun Heo
@ 2013-03-12 15:06 ` Jens Axboe
  2013-03-12 17:10   ` Tejun Heo
  4 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2013-03-12 15:06 UTC (permalink / raw)
  To: Tejun Heo; +Cc: laijs, fengguang.wu, linux-kernel, jmoyer

On Thu, Mar 07 2013, Tejun Heo wrote:
> Hello,
> 
> There's no reason for writeback to implement its own worker pool when
> using workqueue is much simpler and more efficient.  This patchset
> replaces writeback's custom worker pool with unbound workqueue and
> also exports it to userland using WQ_SYSFS so that it can be tuned
> from userland as requested a couple releases ago.
> 
> This patchset contains the following four patches.
> 
>  0001-implement-current_is_workqueue_rescuer.patch
>  0002-writeback-remove-unused-bdi_pending_list.patch
>  0003-writeback-replace-custom-worker-pool-implementation-.patch
>  0004-writeback-expose-the-bdi_wq-workqueue.patch
> 
> 0001-0002 are prep patches.  0003 does the conversion.  0004 makes
> bdi_wq visible to userland.
> 
> This patchset is on top of v3.9-rc1 + "workqueue: implement workqueue
> with custom worker attributes" patchset[1] and available in the
> following git branch.
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-writeback-conversion

I like it, diffstat looks nice too :-)

Have you done any performance testing, or just functional verification?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHSET] writeback: convert writeback to unbound workqueue
  2013-03-12 15:06 ` [PATCHSET] writeback: convert writeback to unbound workqueue Jens Axboe
@ 2013-03-12 17:10   ` Tejun Heo
  0 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2013-03-12 17:10 UTC (permalink / raw)
  To: Jens Axboe; +Cc: laijs, fengguang.wu, linux-kernel, jmoyer, Jan Kara

Hey, Jens.

On Tue, Mar 12, 2013 at 04:06:33PM +0100, Jens Axboe wrote:
> >  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-writeback-conversion
> 
> I like it, diffstat looks nice too :-)
> 
> Have you done any performance testing, or just functional verification?

Only functional testing at this point.  I'm building NUMA awareness into
workqueue and planning on doing a comparison of before-wq, after-wq, and
after-wq-with-NUMA-awareness.  I don't expect any tangible difference
between before-wq and after-wq tho.  It would be great if someone could
recommend a test scenario which can emphasize the CPU overhead of
writeback tasks.  Any ideas?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/4] implement current_is_workqueue_rescuer()
  2013-03-07 21:44 ` [PATCH 1/4] implement current_is_workqueue_rescuer() Tejun Heo
@ 2013-03-13  0:42   ` Tejun Heo
  0 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2013-03-13  0:42 UTC (permalink / raw)
  To: axboe, laijs, fengguang.wu; +Cc: linux-kernel, jmoyer

On Thu, Mar 07, 2013 at 01:44:06PM -0800, Tejun Heo wrote:
> Implement a function which queries whether it currently is running off
> a workqueue rescuer.  This will be used to convert writeback to
> workqueue.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Applied this one to wq/for-3.10.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] writeback: replace custom worker pool implementation with unbound workqueue
  2013-03-12 15:05   ` Jan Kara
@ 2013-03-18 22:32     ` Jan Kara
  2013-03-18 22:35       ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2013-03-18 22:32 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, laijs, fengguang.wu, linux-kernel, jmoyer

On Tue 12-03-13 16:05:10, Jan Kara wrote:
> On Thu 07-03-13 13:44:08, Tejun Heo wrote:
> > Writeback implements its own worker pool - each bdi can be associated
> > with a worker thread which is created and destroyed dynamically.  The
> > worker thread for the default bdi is always present and serves as the
> > "forker" thread which forks off worker threads for other bdis.
> > 
> > There's no reason for writeback to implement its own worker pool when
> > using an unbound workqueue instead is much simpler and more efficient.
> > This patch replaces the custom worker pool implementation in writeback
> > with an unbound workqueue.
  I realized there may be one issue - so far we have a clear identification of
which thread works for which bdi in the thread name (flush-x:y naming).
That was useful when debugging things. Now with your worker pool this is
lost, am I right? Would it be possible to restore that?
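
(For reference, the flush-x:y names came from the forker code that patch
3/4 removes - roughly this kthread_create() call, where dev_name() on a
bdi device yields the major:minor pair:

	/* pre-conversion: one dedicated flusher thread per bdi */
	task = kthread_create(bdi_writeback_thread, &bdi->wb,
			      "flush-%s", dev_name(bdi->dev));

so the flusher for device 8:0 showed up in ps as "flush-8:0".)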

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] writeback: replace custom worker pool implementation with unbound workqueue
  2013-03-18 22:32     ` Jan Kara
@ 2013-03-18 22:35       ` Tejun Heo
  2013-03-19 15:46         ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2013-03-18 22:35 UTC (permalink / raw)
  To: Jan Kara; +Cc: axboe, laijs, fengguang.wu, linux-kernel, jmoyer

Hello, Jan.

On Mon, Mar 18, 2013 at 11:32:44PM +0100, Jan Kara wrote:
>   I realized there may be one issue - so far we have a clear identification of
> which thread works for which bdi in the thread name (flush-x:y naming).
> That was useful when debugging things. Now with your worker pool this is
> lost, am I right? Would it be possible to restore that?

Unfortunately not directly.  It shouldn't be difficult to tell who's
working on what from workqueue and writeback tracepoints tho.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] writeback: replace custom worker pool implementation with unbound workqueue
  2013-03-18 22:35       ` Tejun Heo
@ 2013-03-19 15:46         ` Jan Kara
  2013-03-19 17:28           ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2013-03-19 15:46 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jan Kara, axboe, laijs, fengguang.wu, linux-kernel, jmoyer

On Mon 18-03-13 15:35:26, Tejun Heo wrote:
> On Mon, Mar 18, 2013 at 11:32:44PM +0100, Jan Kara wrote:
> >   I realized there may be one issue - so far we have a clear identification of
> > which thread works for which bdi in the thread name (flush-x:y naming).
> > That was useful when debugging things. Now with your worker pool this is
> > lost, am I right? Would it be possible to restore that?
> 
> Unfortunately not directly.  It shouldn't be difficult to tell who's
> working on what from workqueue and writeback tracepoints tho.
  Well, but what you often get is just the output of sysrq-w or sysrq-t, or
a splat from the scheduler about a stuck task. You often don't have the
comfort of tracing... Can't we somehow change the 'comm' of the task when
it starts processing work for some bdi?
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] writeback: replace custom worker pool implementation with unbound workqueue
  2013-03-19 15:46         ` Jan Kara
@ 2013-03-19 17:28           ` Tejun Heo
  2013-03-21  1:57             ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2013-03-19 17:28 UTC (permalink / raw)
  To: Jan Kara; +Cc: axboe, laijs, fengguang.wu, linux-kernel, jmoyer

Hello, Jan.

On Tue, Mar 19, 2013 at 8:46 AM, Jan Kara <jack@suse.cz> wrote:
>   Well, but what you often get is just the output of sysrq-w or sysrq-t, or
> a splat from the scheduler about a stuck task. You often don't have the
> comfort of tracing... Can't we somehow change the 'comm' of the task when
> it starts processing work for some bdi?

You sure can, but I'd prefer not to do that. If you wanna do it
properly, you have to grab the task lock every time a work item starts
execution. I'm not sure how beneficial having the block device
identifier would be. The backtrace would be there all the same. Is
identifying the block device that important?
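
To make the cost concrete, here is a rough sketch of that approach (not
something I'm proposing - note that get_task_comm() and set_task_comm()
each take task_lock(), and this would run for every single work item):

	static void bdi_writeback_workfn(struct work_struct *work)
	{
		struct bdi_writeback *wb = container_of(to_delayed_work(work),
							struct bdi_writeback, dwork);
		char saved[TASK_COMM_LEN], comm[TASK_COMM_LEN];

		/* task_lock() round 1: save the generic kworker name */
		get_task_comm(saved, current);
		snprintf(comm, sizeof(comm), "flush-%s", dev_name(wb->bdi->dev));
		/* task_lock() round 2: rename for this work item */
		set_task_comm(current, comm);

		/* ... actual writeback ... */

		/* task_lock() round 3: restore the original name */
		set_task_comm(current, saved);
	}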

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] writeback: replace custom worker pool implementation with unbound workqueue
  2013-03-19 17:28           ` Tejun Heo
@ 2013-03-21  1:57             ` Dave Chinner
  2013-03-21  5:07               ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2013-03-21  1:57 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jan Kara, axboe, laijs, fengguang.wu, linux-kernel, jmoyer

On Tue, Mar 19, 2013 at 10:28:24AM -0700, Tejun Heo wrote:
> Hello, Jan.
> 
> On Tue, Mar 19, 2013 at 8:46 AM, Jan Kara <jack@suse.cz> wrote:
> >   Well, but what you often get is just the output of sysrq-w or sysrq-t, or
> > a splat from the scheduler about a stuck task. You often don't have the
> > comfort of tracing... Can't we somehow change the 'comm' of the task when
> > it starts processing work for some bdi?
> 
> You sure can, but I'd prefer not to do that. If you wanna do it
> properly, you have to grab the task lock every time a work item starts
> execution. I'm not sure how beneficial having the block device
> identifier would be. The backtrace would be there all the same. Is
> identifying the block device that important?

When you have a system that has 50+ active filesystems (pretty
common in distributed storage environments where every disk has
its own filesystem), knowing which filesystem(s) are getting stuck
in writeback from the sysrq-w or hangcheck output is pretty damn
important....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] writeback: replace custom worker pool implementation with unbound workqueue
  2013-03-21  1:57             ` Dave Chinner
@ 2013-03-21  5:07               ` Tejun Heo
  2013-03-21 11:32                 ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2013-03-21  5:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, axboe, laijs, fengguang.wu, linux-kernel, jmoyer

Hello, Dave.

On Thu, Mar 21, 2013 at 12:57:21PM +1100, Dave Chinner wrote:
> When you have a system that has 50+ active filesystems (pretty
> common in distributed storage environments where every disk has
> its own filesystem), knowing which filesystem(s) are getting stuck
> in writeback from the sysrq-w or hangcheck output is pretty damn
> important....

Hmm... I guess, then, what we can do is add a worker description
which is printed out with the task dump.  I.e., a work function can call
work_set_desc(fmt, arg....).  It won't be expensive and can be
implemented relatively easily.  Does that sound good enough?
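
Roughly, bdi_writeback_workfn() would then start with something like this
(work_set_desc() is only the proposed name here - nothing exists yet):

	static void bdi_writeback_workfn(struct work_struct *work)
	{
		struct bdi_writeback *wb = container_of(to_delayed_work(work),
							struct bdi_writeback, dwork);

		/* cheap per-item note, printed only when the task is dumped */
		work_set_desc("flush-%s", dev_name(wb->bdi->dev));

		/* ... writeback proper ... */
	}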

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] writeback: replace custom worker pool implementation with unbound workqueue
  2013-03-21  5:07               ` Tejun Heo
@ 2013-03-21 11:32                 ` Jan Kara
  2013-03-21 13:08                   ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2013-03-21 11:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Dave Chinner, Jan Kara, axboe, laijs, fengguang.wu, linux-kernel, jmoyer

On Wed 20-03-13 22:07:18, Tejun Heo wrote:
> Hello, Dave.
> 
> On Thu, Mar 21, 2013 at 12:57:21PM +1100, Dave Chinner wrote:
> > When you have a system that has 50+ active filesystems (pretty
> > common in distributed storage environments where every disk has
> > its own filesystem), knowing which filesystem(s) are getting stuck
> > in writeback from the sysrq-w or hangcheck output is pretty damn
> > important....
> 
> Hmm... I guess, then, what we can do is add a worker description
> which is printed out with the task dump.  I.e., a work function can call
> work_set_desc(fmt, arg....).  It won't be expensive and can be
> implemented relatively easily.  Does that sound good enough?
  Yeah, that sounds good! Thanks!

							Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] writeback: replace custom worker pool implementation with unbound workqueue
  2013-03-21 11:32                 ` Jan Kara
@ 2013-03-21 13:08                   ` Christoph Hellwig
  2013-03-21 18:37                     ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2013-03-21 13:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tejun Heo, Dave Chinner, axboe, laijs, fengguang.wu,
	linux-kernel, jmoyer

On Thu, Mar 21, 2013 at 12:32:52PM +0100, Jan Kara wrote:
> On Wed 20-03-13 22:07:18, Tejun Heo wrote:
> > Hello, Dave.
> > 
> > On Thu, Mar 21, 2013 at 12:57:21PM +1100, Dave Chinner wrote:
> > > When you have a system that has 50+ active filesystems (pretty
> > > common in distributed storage environments where every disk has
> > > its own filesystem), knowing which filesystem(s) are getting stuck
> > > in writeback from the sysrq-w or hangcheck output is pretty damn
> > > important....
> > 
> > Hmm... I guess, then, what we can do is add a worker description
> > which is printed out with the task dump.  I.e., a work function can call
> > work_set_desc(fmt, arg....).  It won't be expensive and can be
> > implemented relatively easily.  Does that sound good enough?
>   Yeah, that sounds good! Thanks!

How about automatically doing this based on the workqueue name?


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] writeback: replace custom worker pool implementation with unbound workqueue
  2013-03-21 13:08                   ` Christoph Hellwig
@ 2013-03-21 18:37                     ` Tejun Heo
  0 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2013-03-21 18:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Dave Chinner, axboe, laijs, fengguang.wu, linux-kernel, jmoyer

Hey, Christoph.

On Thu, Mar 21, 2013 at 09:08:19AM -0400, Christoph Hellwig wrote:
> > > Hmm... I guess, then, what we can do is add a worker description
> > > which is printed out with the task dump.  I.e., a work function can call
> > > work_set_desc(fmt, arg....).  It won't be expensive and can be
> > > implemented relatively easily.  Does that sound good enough?
> >   Yeah, that sounds good! Thanks!
> 
> How about automatically doing this based on the workqueue name?

We already have that information and just need to print it.  The thing
is, some users, including writeback, want further identifying information
beyond the workqueue itself.  The same workqueue would be serving all
writeback work items, but it would be nice to be able to distinguish
which specific device was being serviced, so we need more information
than just the associated workqueue.
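
E.g. after the conversion, kicking writeback looks roughly like this for
every bdi in the system:

	/* sda, sdb, an NFS mount - they all queue on the same bdi_wq */
	mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);

so the workqueue name by itself can never narrow a dump down to a device.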

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2013-03-21 18:37 UTC | newest]

Thread overview: 19+ messages
2013-03-07 21:44 [PATCHSET] writeback: convert writeback to unbound workqueue Tejun Heo
2013-03-07 21:44 ` [PATCH 1/4] implement current_is_workqueue_rescuer() Tejun Heo
2013-03-13  0:42   ` Tejun Heo
2013-03-07 21:44 ` [PATCH 2/4] writeback: remove unused bdi_pending_list Tejun Heo
2013-03-12 12:02   ` Jan Kara
2013-03-07 21:44 ` [PATCH 3/4] writeback: replace custom worker pool implementation with unbound workqueue Tejun Heo
2013-03-12 15:05   ` Jan Kara
2013-03-18 22:32     ` Jan Kara
2013-03-18 22:35       ` Tejun Heo
2013-03-19 15:46         ` Jan Kara
2013-03-19 17:28           ` Tejun Heo
2013-03-21  1:57             ` Dave Chinner
2013-03-21  5:07               ` Tejun Heo
2013-03-21 11:32                 ` Jan Kara
2013-03-21 13:08                   ` Christoph Hellwig
2013-03-21 18:37                     ` Tejun Heo
2013-03-07 21:44 ` [PATCH 4/4] writeback: expose the bdi_wq workqueue Tejun Heo
2013-03-12 15:06 ` [PATCHSET] writeback: convert writeback to unbound workqueue Jens Axboe
2013-03-12 17:10   ` Tejun Heo
