linux-kernel.vger.kernel.org archive mirror
* [PATCHSET RFC] sched, jbd2: mark sleeps on journal->j_checkpoint_mutex as iowait
@ 2016-10-28 16:58 Tejun Heo
  2016-10-28 16:58 ` [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() Tejun Heo
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Tejun Heo @ 2016-10-28 16:58 UTC (permalink / raw)
  To: torvalds, akpm, mingo, peterz, axboe, tytso, jack, adilger.kernel
  Cc: linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo

Hello,

When there's heavy metadata operation traffic on ext4, the journal
fills up quickly and the majority of filesystem users end up blocking
on journal->j_checkpoint_mutex with a stacktrace similar to the
following.

 [<ffffffff8c32e758>] __jbd2_log_wait_for_space+0xb8/0x1d0
 [<ffffffff8c3285f6>] add_transaction_credits+0x286/0x2a0
 [<ffffffff8c32876c>] start_this_handle+0x10c/0x400
 [<ffffffff8c328c5b>] jbd2__journal_start+0xdb/0x1e0
 [<ffffffff8c30ee5d>] __ext4_journal_start_sb+0x6d/0x120
 [<ffffffff8c2d713e>] __ext4_new_inode+0x64e/0x1330
 [<ffffffff8c2e9bf0>] ext4_create+0xc0/0x1c0
 [<ffffffff8c2570fd>] path_openat+0x124d/0x1380
 [<ffffffff8c258501>] do_filp_open+0x91/0x100
 [<ffffffff8c2462d0>] do_sys_open+0x130/0x220
 [<ffffffff8c2463de>] SyS_open+0x1e/0x20
 [<ffffffff8c7ec5b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
 [<ffffffffffffffff>] 0xffffffffffffffff

Because the sleeps on the mutex aren't accounted as iowait, the system
doesn't show the usual signs of being bogged down by IOs - both iowait
and /proc/stat:procs_blocked stay misleadingly low.  While propagation
of iowait through locking constructs is far from strict, heavy
contention on j_checkpoint_mutex is easy to trigger and obviously
IO-bound, and getting the accounting right can help users track down
the issue quite a bit.

Due to the way io_schedule() is implemented, it is currently hairy to
add an io variant of an existing interface - the schedule() call
itself, which is usually buried deep, would have to be replaced with
io_schedule().  As we already have current->in_iowait to mark the task
as sleeping for iowait, this can be made easy by breaking up
io_schedule() into multiple steps so that the preparation and marking
can be done before calling an existing interface and the actual iowait
accounting can be done from inside the scheduler.
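
For example, with the helpers introduced below, an iowait-annotated
variant of an existing lock operation becomes a thin wrapper (rough
sketch; this is essentially what mutex_lock_io() in 0003 does):

	token = io_schedule_prepare();	/* mark ->in_iowait, flush plugged IOs */
	mutex_lock(lock);		/* existing interface, unchanged */
	io_schedule_finish(token);	/* restore the previous ->in_iowait */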

What do you think?

This patchset contains the following four patches.

 0001-sched-move-IO-scheduling-accounting-from-io_schedule.patch
 0002-sched-separate-out-io_schedule_prepare-and-io_schedu.patch
 0003-mutex-add-mutex_lock_io.patch
 0004-jbd2-use-mutex_lock_io-for-journal-j_checkpoint_mute.patch

0001-0002 implement io_schedule_prepare/finish().
0003 implements mutex_lock_io() using io_schedule_prepare/finish().
0004 uses mutex_lock_io() on journal->j_checkpoint_mutex.

This patchset is also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-mutex_lock_io

Thanks, diffstat follows.

 fs/jbd2/commit.c       |    2 -
 fs/jbd2/journal.c      |   14 ++++++-------
 include/linux/mutex.h  |    4 +++
 include/linux/sched.h  |    8 ++-----
 kernel/locking/mutex.c |   24 ++++++++++++++++++++++
 kernel/sched/core.c    |   52 +++++++++++++++++++++++++++++++++++++------------
 6 files changed, 79 insertions(+), 25 deletions(-)

--
tejun


* [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()
  2016-10-28 16:58 [PATCHSET RFC] sched, jbd2: mark sleeps on journal->j_checkpoint_mutex as iowait Tejun Heo
@ 2016-10-28 16:58 ` Tejun Heo
  2016-10-28 18:27   ` Peter Zijlstra
                     ` (2 more replies)
  2016-10-28 16:58 ` [PATCH 2/4] sched: separate out io_schedule_prepare() and io_schedule_finish() Tejun Heo
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 22+ messages in thread
From: Tejun Heo @ 2016-10-28 16:58 UTC (permalink / raw)
  To: torvalds, akpm, mingo, peterz, axboe, tytso, jack, adilger.kernel
  Cc: linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo, Tejun Heo

For an interface to support blocking for IOs, it must call
io_schedule() instead of schedule().  This makes it tedious to add IO
blocking to existing interfaces as the switching between schedule()
and io_schedule() is often buried deep.

As we already have a way to mark the task as IO scheduling, this can
be made easier by separating out io_schedule() into multiple steps so
that IO schedule preparation can be performed before invoking a
blocking interface and the actual accounting happens inside
schedule().

io_schedule_timeout() does the following three things prior to calling
schedule_timeout().

 1. Mark the task as scheduling for IO.
 2. Flush out plugged IOs.
 3. Account the IO scheduling.

#1 and #2 can be performed in the preparation step while #3 must be
done close to the actual scheduling.  This patch moves #3 into
__schedule() so that later patches can separate out preparation and
finish steps from io_schedule().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
---
 kernel/sched/core.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 94732d1..f6baa38 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
 	unsigned long *switch_count;
 	struct pin_cookie cookie;
 	struct rq *rq;
-	int cpu;
+	int cpu, in_iowait;
 
 	cpu = smp_processor_id();
 	rq = cpu_rq(cpu);
 	prev = rq->curr;
+	in_iowait = prev->in_iowait;
+
+	if (in_iowait) {
+		delayacct_blkio_start();
+		atomic_inc(&rq->nr_iowait);
+	}
 
 	schedule_debug(prev);
 
@@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
 	}
 
 	balance_callback(rq);
+
+	if (in_iowait) {
+		atomic_dec(&rq->nr_iowait);
+		delayacct_blkio_end();
+	}
 }
 
 void __noreturn do_task_dead(void)
@@ -5063,19 +5074,13 @@ EXPORT_SYMBOL_GPL(yield_to);
 long __sched io_schedule_timeout(long timeout)
 {
 	int old_iowait = current->in_iowait;
-	struct rq *rq;
 	long ret;
 
 	current->in_iowait = 1;
 	blk_schedule_flush_plug(current);
 
-	delayacct_blkio_start();
-	rq = raw_rq();
-	atomic_inc(&rq->nr_iowait);
 	ret = schedule_timeout(timeout);
 	current->in_iowait = old_iowait;
-	atomic_dec(&rq->nr_iowait);
-	delayacct_blkio_end();
 
 	return ret;
 }
-- 
2.7.4


* [PATCH 2/4] sched: separate out io_schedule_prepare() and io_schedule_finish()
  2016-10-28 16:58 [PATCHSET RFC] sched, jbd2: mark sleeps on journal->j_checkpoint_mutex as iowait Tejun Heo
  2016-10-28 16:58 ` [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() Tejun Heo
@ 2016-10-28 16:58 ` Tejun Heo
  2017-01-14 12:49   ` [tip:sched/core] sched/core: Separate " tip-bot for Tejun Heo
  2016-10-28 16:58 ` [PATCH 3/4] mutex: add mutex_lock_io() Tejun Heo
  2016-10-28 16:58 ` [PATCH 4/4] jbd2: use mutex_lock_io() for journal->j_checkpoint_mutex Tejun Heo
  3 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2016-10-28 16:58 UTC (permalink / raw)
  To: torvalds, akpm, mingo, peterz, axboe, tytso, jack, adilger.kernel
  Cc: linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo, Tejun Heo

Now that IO schedule accounting is done inside __schedule(),
io_schedule() can be split into three steps - prep, schedule, and
finish - where the schedule part doesn't need any special annotation.
This allows marking a sleep as iowait by simply wrapping an existing
blocking function with io_schedule_prepare() and io_schedule_finish().

Because task_struct->in_iowait is a single bit, the caller of
io_schedule_prepare() needs to record and then pass its state to
io_schedule_finish() to be safe regarding nesting.  While this isn't
the prettiest, these functions are mostly going to be used by core
functions and we don't want to use more space for ->in_iowait.
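
For illustration (not part of the patch), nested use stays correct
because each level restores the state it observed:

	int outer, inner;

	outer = io_schedule_prepare();	/* ->in_iowait: 0 -> 1, outer == 0 */
	inner = io_schedule_prepare();	/* already 1, inner == 1 */
	io_schedule_finish(inner);	/* writes 1 back, still marked */
	io_schedule_finish(outer);	/* writes 0 back, unmarked */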

While at it, as it's simple to do now, reimplement io_schedule()
without unnecessarily going through io_schedule_timeout().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
---
 include/linux/sched.h |  8 +++-----
 kernel/sched/core.c   | 33 ++++++++++++++++++++++++++++-----
 2 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..c025f77 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -441,12 +441,10 @@ extern signed long schedule_timeout_idle(signed long timeout);
 asmlinkage void schedule(void);
 extern void schedule_preempt_disabled(void);
 
+extern int __must_check io_schedule_prepare(void);
+extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
-
-static inline void io_schedule(void)
-{
-	io_schedule_timeout(MAX_SCHEDULE_TIMEOUT);
-}
+extern void io_schedule(void);
 
 void __noreturn do_task_dead(void);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f6baa38..30d3185 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5067,25 +5067,48 @@ int __sched yield_to(struct task_struct *p, bool preempt)
 }
 EXPORT_SYMBOL_GPL(yield_to);
 
+int io_schedule_prepare(void)
+{
+	int old_iowait = current->in_iowait;
+
+	current->in_iowait = 1;
+	blk_schedule_flush_plug(current);
+
+	return old_iowait;
+}
+
+void io_schedule_finish(int token)
+{
+	current->in_iowait = token;
+}
+
 /*
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
  */
 long __sched io_schedule_timeout(long timeout)
 {
-	int old_iowait = current->in_iowait;
+	int token;
 	long ret;
 
-	current->in_iowait = 1;
-	blk_schedule_flush_plug(current);
-
+	token = io_schedule_prepare();
 	ret = schedule_timeout(timeout);
-	current->in_iowait = old_iowait;
+	io_schedule_finish(token);
 
 	return ret;
 }
 EXPORT_SYMBOL(io_schedule_timeout);
 
+void io_schedule(void)
+{
+	int token;
+
+	token = io_schedule_prepare();
+	schedule();
+	io_schedule_finish(token);
+}
+EXPORT_SYMBOL(io_schedule);
+
 /**
  * sys_sched_get_priority_max - return maximum RT priority.
  * @policy: scheduling class.
-- 
2.7.4


* [PATCH 3/4] mutex: add mutex_lock_io()
  2016-10-28 16:58 [PATCHSET RFC] sched, jbd2: mark sleeps on journal->j_checkpoint_mutex as iowait Tejun Heo
  2016-10-28 16:58 ` [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() Tejun Heo
  2016-10-28 16:58 ` [PATCH 2/4] sched: separate out io_schedule_prepare() and io_schedule_finish() Tejun Heo
@ 2016-10-28 16:58 ` Tejun Heo
  2017-01-14 12:50   ` [tip:sched/core] locking/mutex, sched/wait: Add mutex_lock_io() tip-bot for Tejun Heo
  2016-10-28 16:58 ` [PATCH 4/4] jbd2: use mutex_lock_io() for journal->j_checkpoint_mutex Tejun Heo
  3 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2016-10-28 16:58 UTC (permalink / raw)
  To: torvalds, akpm, mingo, peterz, axboe, tytso, jack, adilger.kernel
  Cc: linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo, Tejun Heo

We sometimes end up propagating IO blocking through mutexes; however,
because there currently is no way of annotating mutex sleeps as
iowait, there are cases where iowait and /proc/stat:procs_blocked
report misleading numbers obscuring the actual state of the system.

This patch adds mutex_lock_io() so that mutex sleeps can be marked as
iowait in those cases.
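
Usage is a drop-in replacement for mutex_lock() on mutexes whose
holders are expected to be doing IO, e.g. (illustrative; the jbd2
conversion in the next patch does exactly this):

	mutex_lock_io(&journal->j_checkpoint_mutex);	/* sleep counts as iowait */
	err = jbd2_log_do_checkpoint(journal);
	mutex_unlock(&journal->j_checkpoint_mutex);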

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
---
 include/linux/mutex.h  |  4 ++++
 kernel/locking/mutex.c | 24 ++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 2cb7531..5d77ddd 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -142,10 +142,12 @@ extern int __must_check mutex_lock_interruptible_nested(struct mutex *lock,
 					unsigned int subclass);
 extern int __must_check mutex_lock_killable_nested(struct mutex *lock,
 					unsigned int subclass);
+extern void mutex_lock_io_nested(struct mutex *lock, unsigned int subclass);
 
 #define mutex_lock(lock) mutex_lock_nested(lock, 0)
 #define mutex_lock_interruptible(lock) mutex_lock_interruptible_nested(lock, 0)
 #define mutex_lock_killable(lock) mutex_lock_killable_nested(lock, 0)
+#define mutex_lock_io(lock) mutex_lock_io_nested(lock, 0)
 
 #define mutex_lock_nest_lock(lock, nest_lock)				\
 do {									\
@@ -157,11 +159,13 @@ do {									\
 extern void mutex_lock(struct mutex *lock);
 extern int __must_check mutex_lock_interruptible(struct mutex *lock);
 extern int __must_check mutex_lock_killable(struct mutex *lock);
+extern void mutex_lock_io(struct mutex *lock);
 
 # define mutex_lock_nested(lock, subclass) mutex_lock(lock)
 # define mutex_lock_interruptible_nested(lock, subclass) mutex_lock_interruptible(lock)
 # define mutex_lock_killable_nested(lock, subclass) mutex_lock_killable(lock)
 # define mutex_lock_nest_lock(lock, nest_lock) mutex_lock(lock)
+# define mutex_lock_io_nested(lock, subclass) mutex_lock_io(lock)
 #endif
 
 /*
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..a37709c 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -653,6 +653,20 @@ mutex_lock_interruptible_nested(struct mutex *lock, unsigned int subclass)
 
 EXPORT_SYMBOL_GPL(mutex_lock_interruptible_nested);
 
+void __sched
+mutex_lock_io_nested(struct mutex *lock, unsigned int subclass)
+{
+	int token;
+
+	might_sleep();
+
+	token = io_schedule_prepare();
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE,
+			    subclass, NULL, _RET_IP_, NULL, 0);
+	io_schedule_finish(token);
+}
+EXPORT_SYMBOL_GPL(mutex_lock_io_nested);
+
 static inline int
 ww_mutex_deadlock_injection(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
 {
@@ -816,6 +830,16 @@ int __sched mutex_lock_killable(struct mutex *lock)
 }
 EXPORT_SYMBOL(mutex_lock_killable);
 
+void __sched mutex_lock_io(struct mutex *lock)
+{
+	int token;
+
+	token = io_schedule_prepare();
+	mutex_lock(lock);
+	io_schedule_finish(token);
+}
+EXPORT_SYMBOL_GPL(mutex_lock_io);
+
 __visible void __sched
 __mutex_lock_slowpath(atomic_t *lock_count)
 {
-- 
2.7.4


* [PATCH 4/4] jbd2: use mutex_lock_io() for journal->j_checkpoint_mutex
  2016-10-28 16:58 [PATCHSET RFC] sched, jbd2: mark sleeps on journal->j_checkpoint_mutex as iowait Tejun Heo
                   ` (2 preceding siblings ...)
  2016-10-28 16:58 ` [PATCH 3/4] mutex: add mutex_lock_io() Tejun Heo
@ 2016-10-28 16:58 ` Tejun Heo
  2017-01-14 12:51   ` [tip:sched/core] fs/jbd2, locking/mutex, sched/wait: Use " tip-bot for Tejun Heo
  3 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2016-10-28 16:58 UTC (permalink / raw)
  To: torvalds, akpm, mingo, peterz, axboe, tytso, jack, adilger.kernel
  Cc: linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo, Tejun Heo

When an ext4 fs is bogged down by a lot of metadata IOs (in the
reported case, it was deletion of millions of files, but any massive
amount of journal writes would do), after the journal is filled up,
tasks which try to access the filesystem and aren't currently
performing the journal writes end up waiting in
__jbd2_log_wait_for_space() for journal->j_checkpoint_mutex.

Because those mutex sleeps aren't marked as iowait, this condition can
lead to misleadingly low iowait and /proc/stat:procs_blocked.  While
iowait propagation is far from strict, this condition can be triggered
fairly easily and annotating these sleeps correctly helps initial
diagnosis quite a bit.

Use the new mutex_lock_io() for journal->j_checkpoint_mutex so that
these sleeps are properly marked as iowait.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mingbo Wan <mingbo@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
---
 fs/jbd2/commit.c  |  2 +-
 fs/jbd2/journal.c | 14 +++++++-------
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 31f8ca0..e772881 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -392,7 +392,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	/* Do we need to erase the effects of a prior jbd2_journal_flush? */
 	if (journal->j_flags & JBD2_FLUSHED) {
 		jbd_debug(3, "super block updated\n");
-		mutex_lock(&journal->j_checkpoint_mutex);
+		mutex_lock_io(&journal->j_checkpoint_mutex);
 		/*
 		 * We hold j_checkpoint_mutex so tail cannot change under us.
 		 * We don't need any special data guarantees for writing sb
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 927da49..0aa7d06 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -944,7 +944,7 @@ int __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block)
  */
 void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block)
 {
-	mutex_lock(&journal->j_checkpoint_mutex);
+	mutex_lock_io(&journal->j_checkpoint_mutex);
 	if (tid_gt(tid, journal->j_tail_sequence))
 		__jbd2_update_log_tail(journal, tid, block);
 	mutex_unlock(&journal->j_checkpoint_mutex);
@@ -1304,7 +1304,7 @@ static int journal_reset(journal_t *journal)
 		journal->j_flags |= JBD2_FLUSHED;
 	} else {
 		/* Lock here to make assertions happy... */
-		mutex_lock(&journal->j_checkpoint_mutex);
+		mutex_lock_io(&journal->j_checkpoint_mutex);
 		/*
 		 * Update log tail information. We use WRITE_FUA since new
 		 * transaction will start reusing journal space and so we
@@ -1691,7 +1691,7 @@ int jbd2_journal_destroy(journal_t *journal)
 	spin_lock(&journal->j_list_lock);
 	while (journal->j_checkpoint_transactions != NULL) {
 		spin_unlock(&journal->j_list_lock);
-		mutex_lock(&journal->j_checkpoint_mutex);
+		mutex_lock_io(&journal->j_checkpoint_mutex);
 		err = jbd2_log_do_checkpoint(journal);
 		mutex_unlock(&journal->j_checkpoint_mutex);
 		/*
@@ -1713,7 +1713,7 @@ int jbd2_journal_destroy(journal_t *journal)
 
 	if (journal->j_sb_buffer) {
 		if (!is_journal_aborted(journal)) {
-			mutex_lock(&journal->j_checkpoint_mutex);
+			mutex_lock_io(&journal->j_checkpoint_mutex);
 
 			write_lock(&journal->j_state_lock);
 			journal->j_tail_sequence =
@@ -1954,7 +1954,7 @@ int jbd2_journal_flush(journal_t *journal)
 	spin_lock(&journal->j_list_lock);
 	while (!err && journal->j_checkpoint_transactions != NULL) {
 		spin_unlock(&journal->j_list_lock);
-		mutex_lock(&journal->j_checkpoint_mutex);
+		mutex_lock_io(&journal->j_checkpoint_mutex);
 		err = jbd2_log_do_checkpoint(journal);
 		mutex_unlock(&journal->j_checkpoint_mutex);
 		spin_lock(&journal->j_list_lock);
@@ -1964,7 +1964,7 @@ int jbd2_journal_flush(journal_t *journal)
 	if (is_journal_aborted(journal))
 		return -EIO;
 
-	mutex_lock(&journal->j_checkpoint_mutex);
+	mutex_lock_io(&journal->j_checkpoint_mutex);
 	if (!err) {
 		err = jbd2_cleanup_journal_tail(journal);
 		if (err < 0) {
@@ -2024,7 +2024,7 @@ int jbd2_journal_wipe(journal_t *journal, int write)
 	err = jbd2_journal_skip_recovery(journal);
 	if (write) {
 		/* Lock to make assertions happy... */
-		mutex_lock(&journal->j_checkpoint_mutex);
+		mutex_lock_io(&journal->j_checkpoint_mutex);
 		jbd2_mark_journal_empty(journal, WRITE_FUA);
 		mutex_unlock(&journal->j_checkpoint_mutex);
 	}
-- 
2.7.4


* Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()
  2016-10-28 16:58 ` [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() Tejun Heo
@ 2016-10-28 18:27   ` Peter Zijlstra
  2016-10-28 19:07     ` Peter Zijlstra
  2016-11-03 15:33   ` Pavan Kondeti
  2016-12-06 21:29   ` [PATCH v2 " Tejun Heo
  2 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2016-10-28 18:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, akpm, mingo, axboe, tytso, jack, adilger.kernel,
	linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo

On Fri, Oct 28, 2016 at 12:58:09PM -0400, Tejun Heo wrote:
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
>  	unsigned long *switch_count;
>  	struct pin_cookie cookie;
>  	struct rq *rq;
> -	int cpu;
> +	int cpu, in_iowait;
>  
>  	cpu = smp_processor_id();
>  	rq = cpu_rq(cpu);
>  	prev = rq->curr;
> +	in_iowait = prev->in_iowait;
> +
> +	if (in_iowait) {
> +		delayacct_blkio_start();
> +		atomic_inc(&rq->nr_iowait);
> +	}
>  
>  	schedule_debug(prev);
>  
> @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
>  	}
>  
>  	balance_callback(rq);
> +
> +	if (in_iowait) {
> +		atomic_dec(&rq->nr_iowait);
> +		delayacct_blkio_end();
> +	}
>  }
>  
>  void __noreturn do_task_dead(void)

Urgh, can't say I like this much. It moves two branches into the
schedule path.

Nor do I really like the idea of having to annotate special mutexes for
the iowait crap.

I'll think more after KS/LPC etc..


* Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()
  2016-10-28 18:27   ` Peter Zijlstra
@ 2016-10-28 19:07     ` Peter Zijlstra
  2016-10-28 19:12       ` Tejun Heo
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2016-10-28 19:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, akpm, mingo, axboe, tytso, jack, adilger.kernel,
	linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo

On Fri, Oct 28, 2016 at 08:27:12PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 12:58:09PM -0400, Tejun Heo wrote:
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
> >  	unsigned long *switch_count;
> >  	struct pin_cookie cookie;
> >  	struct rq *rq;
> > -	int cpu;
> > +	int cpu, in_iowait;
> >  
> >  	cpu = smp_processor_id();
> >  	rq = cpu_rq(cpu);
> >  	prev = rq->curr;
> > +	in_iowait = prev->in_iowait;
> > +
> > +	if (in_iowait) {
> > +		delayacct_blkio_start();
> > +		atomic_inc(&rq->nr_iowait);
> > +	}
> >  
> >  	schedule_debug(prev);
> >  
> > @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
> >  	}
> >  
> >  	balance_callback(rq);
> > +
> > +	if (in_iowait) {
> > +		atomic_dec(&rq->nr_iowait);
> > +		delayacct_blkio_end();
> > +	}
> >  }
> >  
> >  void __noreturn do_task_dead(void)
> 
> Urgh, can't say I like this much. It moves two branches into the
> schedule path.
> 
> Nor do I really like the idea of having to annotate special mutexes for
> the iowait crap.
> 
> I'll think more after KS/LPC etc..

One alternative is to inherit the iowait state of the task we block on.
That'll not get rid of the branches much, but it will remove the new
mutex APIs.


* Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()
  2016-10-28 19:07     ` Peter Zijlstra
@ 2016-10-28 19:12       ` Tejun Heo
  2016-10-29  3:21         ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2016-10-28 19:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, torvalds, akpm, mingo, axboe, tytso, jack,
	adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel,
	kernel-team, mingbo

Hello, Peter.

On Fri, Oct 28, 2016 at 09:07:02PM +0200, Peter Zijlstra wrote:
> One alternative is to inherit the iowait state of the task we block on.
> That'll not get rid of the branches much, but it will remove the new
> mutex APIs.

Yeah, thought about that briefly but we don't necessarily track mutex
or other synchronization construct owners, things get gnarly with
rwsems (the inode ones sometimes end up in a similar situation), and
we'll probably end up dealing with some surprising propagations down
the line.  That said, getting such automatic propagation working would
be great.

Thanks.

-- 
tejun


* Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()
  2016-10-28 19:12       ` Tejun Heo
@ 2016-10-29  3:21         ` Peter Zijlstra
  2016-10-31 16:45           ` Tejun Heo
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2016-10-29  3:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tejun Heo, torvalds, akpm, mingo, axboe, tytso, jack,
	adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel,
	kernel-team, mingbo

On Fri, Oct 28, 2016 at 03:12:32PM -0400, Tejun Heo wrote:
> Hello, Peter.
> 
> On Fri, Oct 28, 2016 at 09:07:02PM +0200, Peter Zijlstra wrote:
> > One alternative is to inherit the iowait state of the task we block on.
> > That'll not get rid of the branches much, but it will remove the new
> > mutex APIs.
> 
> Yeah, thought about that briefly but we don't necessarily track mutex

This one I actually fixed and should be in -next. And it would be
sufficient to cover the use case here.

> or other synchronization construct owners, things get gnarly with
> rwsems (the inode ones sometimes end up in a similar situation), and
> we'll probably end up dealing with some surprising propagations down
> the line.

rwsems could be done for writers only.


* Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()
  2016-10-29  3:21         ` Peter Zijlstra
@ 2016-10-31 16:45           ` Tejun Heo
  2016-12-06 21:30             ` Tejun Heo
  0 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2016-10-31 16:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, torvalds, akpm, mingo, axboe, tytso, jack,
	adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel,
	kernel-team, mingbo

Hello,

On Sat, Oct 29, 2016 at 05:21:26AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 03:12:32PM -0400, Tejun Heo wrote:
> > Hello, Peter.
> > 
> > On Fri, Oct 28, 2016 at 09:07:02PM +0200, Peter Zijlstra wrote:
> > > One alternative is to inherit the iowait state of the task we block on.
> > > That'll not get rid of the branches much, but it will remove the new
> > > mutex APIs.
> > 
> > Yeah, thought about that briefly but we don't necessarily track mutex
> 
> This one I actually fixed and should be in -next. And it would be
> sufficient to cover the use case here.

Tracking the owners of mutexes and rwsems does help quite a bit.  I
don't think it's as simple as inheriting io sleep state from the
current owner tho.  The owner might be running or in a non-IO sleep
when others try to grab the mutex.  It is an option to ignore those
cases but this would have a real possibility to lead to surprising
results in some corner cases.  If we choose to propagate dynamically,
it becomes an a lot more complex problem and I don't think it'd be
justfiable.

Unless there can be a simple enough and reliable solution, I think
it'd be better to stick with explicit marking.

Thanks.

-- 
tejun


* Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()
  2016-10-28 16:58 ` [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() Tejun Heo
  2016-10-28 18:27   ` Peter Zijlstra
@ 2016-11-03 15:33   ` Pavan Kondeti
  2016-11-08 22:51     ` Tejun Heo
  2016-12-06 21:29   ` [PATCH v2 " Tejun Heo
  2 siblings, 1 reply; 22+ messages in thread
From: Pavan Kondeti @ 2016-11-03 15:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, akpm, mingo, Peter Zijlstra, axboe, tytso, jack,
	adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel,
	kernel-team, mingbo

Hi Tejun,

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 94732d1..f6baa38 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
>         unsigned long *switch_count;
>         struct pin_cookie cookie;
>         struct rq *rq;
> -       int cpu;
> +       int cpu, in_iowait;
>
>         cpu = smp_processor_id();
>         rq = cpu_rq(cpu);
>         prev = rq->curr;
> +       in_iowait = prev->in_iowait;
> +
> +       if (in_iowait) {
> +               delayacct_blkio_start();
> +               atomic_inc(&rq->nr_iowait);
> +       }
>
>         schedule_debug(prev);
>
> @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
>         }
>
>         balance_callback(rq);
> +
> +       if (in_iowait) {
> +               atomic_dec(&rq->nr_iowait);
> +               delayacct_blkio_end();
> +       }
>  }

I think the nr_iowait update can go wrong here.

When the task migrates to a different CPU upon wakeup, this rq points
to a different CPU than the one on which nr_iowait was incremented
earlier.
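
i.e. the decrement would need to use the rq on which the increment was
done, something like (untested sketch):

	struct rq *prev_rq = rq;	/* rq at __schedule() entry */

	if (in_iowait)
		atomic_inc(&prev_rq->nr_iowait);

	/* ... context switch; rq may now point to another CPU ... */

	if (in_iowait)
		atomic_dec(&prev_rq->nr_iowait);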

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project


* Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()
  2016-11-03 15:33   ` Pavan Kondeti
@ 2016-11-08 22:51     ` Tejun Heo
  0 siblings, 0 replies; 22+ messages in thread
From: Tejun Heo @ 2016-11-08 22:51 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: torvalds, akpm, mingo, Peter Zijlstra, axboe, tytso, jack,
	adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel,
	kernel-team, mingbo

Hello,

On Thu, Nov 03, 2016 at 09:03:45PM +0530, Pavan Kondeti wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 94732d1..f6baa38 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
> >         unsigned long *switch_count;
> >         struct pin_cookie cookie;
> >         struct rq *rq;
> > -       int cpu;
> > +       int cpu, in_iowait;
> >
> >         cpu = smp_processor_id();
> >         rq = cpu_rq(cpu);
> >         prev = rq->curr;
> > +       in_iowait = prev->in_iowait;
> > +
> > +       if (in_iowait) {
> > +               delayacct_blkio_start();
> > +               atomic_inc(&rq->nr_iowait);
> > +       }
> >
> >         schedule_debug(prev);
> >
> > @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
> >         }
> >
> >         balance_callback(rq);
> > +
> > +       if (in_iowait) {
> > +               atomic_dec(&rq->nr_iowait);
> > +               delayacct_blkio_end();
> > +       }
> >  }
> 
> I think, the nr_iowait update can go wrong here.
> 
> When the task migrates to a different CPU upon wakeup, this rq points
> to a different CPU from the one on which nr_iowait is incremented
> before.

Ah, you're right, it should remember the original rq.

Thanks.

-- 
tejun


* [PATCH v2 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()
  2016-10-28 16:58 ` [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() Tejun Heo
  2016-10-28 18:27   ` Peter Zijlstra
  2016-11-03 15:33   ` Pavan Kondeti
@ 2016-12-06 21:29   ` Tejun Heo
  2016-12-07  9:35     ` Peter Zijlstra
  2 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2016-12-06 21:29 UTC (permalink / raw)
  To: torvalds, akpm, mingo, peterz, axboe, tytso, jack, adilger.kernel
  Cc: linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo

For an interface to support blocking for IOs, it must call
io_schedule() instead of schedule().  This makes it tedious to add IO
blocking to existing interfaces as the switching between schedule()
and io_schedule() is often buried deep.

As we already have a way to mark the task as IO scheduling, this can
be made easier by separating out io_schedule() into multiple steps so
that IO schedule preparation can be performed before invoking a
blocking interface and the actual accounting happens inside
schedule().

io_schedule_timeout() does the following three things prior to calling
schedule_timeout().

 1. Mark the task as scheduling for IO.
 2. Flush out plugged IOs.
 3. Account the IO scheduling.

#1 and #2 can be performed in the preparation step while #3 must be
done close to the actual scheduling.  This patch moves #3 into
__schedule() so that later patches can separate out preparation and
finish steps from io_schedule().

v2: Remember the rq in @prev_rq and use it for decrementing nr_iowait
    to avoid misattributing the count after the task gets migrated to
    another CPU.  Noticed by Pavan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavan Kondeti <pkondeti@codeaurora.org>
---
 kernel/sched/core.c |   23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3335,12 +3335,18 @@ static void __sched notrace __schedule(b
 	struct task_struct *prev, *next;
 	unsigned long *switch_count;
 	struct pin_cookie cookie;
-	struct rq *rq;
-	int cpu;
+	struct rq *rq, *prev_rq;
+	int cpu, in_iowait;
 
 	cpu = smp_processor_id();
-	rq = cpu_rq(cpu);
+	rq = prev_rq = cpu_rq(cpu);
 	prev = rq->curr;
+	in_iowait = prev->in_iowait;
+
+	if (in_iowait) {
+		delayacct_blkio_start();
+		atomic_inc(&rq->nr_iowait);
+	}
 
 	schedule_debug(prev);
 
@@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(b
 	}
 
 	balance_callback(rq);
+
+	if (in_iowait) {
+		atomic_dec(&prev_rq->nr_iowait);
+		delayacct_blkio_end();
+	}
 }
 
 void __noreturn do_task_dead(void)
@@ -5063,19 +5074,13 @@ EXPORT_SYMBOL_GPL(yield_to);
 long __sched io_schedule_timeout(long timeout)
 {
 	int old_iowait = current->in_iowait;
-	struct rq *rq;
 	long ret;
 
 	current->in_iowait = 1;
 	blk_schedule_flush_plug(current);
 
-	delayacct_blkio_start();
-	rq = raw_rq();
-	atomic_inc(&rq->nr_iowait);
 	ret = schedule_timeout(timeout);
 	current->in_iowait = old_iowait;
-	atomic_dec(&rq->nr_iowait);
-	delayacct_blkio_end();
 
 	return ret;
 }


* Re: [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()
  2016-10-31 16:45           ` Tejun Heo
@ 2016-12-06 21:30             ` Tejun Heo
  0 siblings, 0 replies; 22+ messages in thread
From: Tejun Heo @ 2016-12-06 21:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, torvalds, akpm, mingo, axboe, tytso, jack,
	adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel,
	kernel-team, mingbo

Hello,

On Mon, Oct 31, 2016 at 10:45:56AM -0600, Tejun Heo wrote:
> Tracking the owners of mutexes and rwsems does help quite a bit.  I
> don't think it's as simple as inheriting io sleep state from the
> current owner tho.  The owner might be running or in a non-IO sleep
> when others try to grab the mutex.  It is an option to ignore those
> cases but this would have a real possibility to lead to surprising
> results in some corner cases.  If we choose to propagate dynamically,
> it becomes an a lot more complex problem and I don't think it'd be
> justfiable.
> 
> Unless there can be a simple enough and reliable solution, I think
> it'd be better to stick with explicit marking.

Just posted the fixed version for the first patch.  Any more thoughts
on this?

Thanks.

-- 
tejun


* Re: [PATCH v2 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()
  2016-12-06 21:29   ` [PATCH v2 " Tejun Heo
@ 2016-12-07  9:35     ` Peter Zijlstra
  2016-12-07 20:48       ` [PATCH v3 1/4] sched: move IO scheduling accounting from io_schedule_timeout() into scheduler Tejun Heo
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2016-12-07  9:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, akpm, mingo, axboe, tytso, jack, adilger.kernel,
	linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo

On Tue, Dec 06, 2016 at 04:29:35PM -0500, Tejun Heo wrote:

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3335,12 +3335,18 @@ static void __sched notrace __schedule(b
>  	struct task_struct *prev, *next;
>  	unsigned long *switch_count;
>  	struct pin_cookie cookie;
> -	struct rq *rq;
> -	int cpu;
> +	struct rq *rq, *prev_rq;
> +	int cpu, in_iowait;
>  
>  	cpu = smp_processor_id();
> -	rq = cpu_rq(cpu);
> +	rq = prev_rq = cpu_rq(cpu);
>  	prev = rq->curr;
> +	in_iowait = prev->in_iowait;
> +
> +	if (in_iowait) {
> +		delayacct_blkio_start();
> +		atomic_inc(&rq->nr_iowait);
> +	}
>  
>  	schedule_debug(prev);
>  
> @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(b
>  	}
>  
>  	balance_callback(rq);
> +
> +	if (in_iowait) {
> +		atomic_dec(&prev_rq->nr_iowait);
> +		delayacct_blkio_end();
> +	}
>  }
>  
>  void __noreturn do_task_dead(void)

So I really dislike this; it mucks with the fast paths for something of
dubious utility.

At the very least move this to the actual blocking paths such that
regular context switches don't see this overhead (also delayacct is
horrific crap).

A little something like so... completely untested.

---
 kernel/sched/core.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b08fb257856..935bcd91f4a4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2085,11 +2085,24 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	p->sched_contributes_to_load = !!task_contributes_to_load(p);
 	p->state = TASK_WAKING;
 
+	if (p->in_iowait) {
+		delayacct_blkio_end();
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
 		set_task_cpu(p, cpu);
 	}
+
+#else /* CONFIG_SMP */
+
+	if (p->in_iowait) {
+		delayacct_blkio_end();
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
 #endif /* CONFIG_SMP */
 
 	ttwu_queue(p, cpu, wake_flags);
@@ -2139,8 +2152,13 @@ static void try_to_wake_up_local(struct task_struct *p, struct pin_cookie cookie
 
 	trace_sched_waking(p);
 
-	if (!task_on_rq_queued(p))
+	if (!task_on_rq_queued(p)) {
+		if (p->in_iowait) {
+			delayacct_blkio_end();
+			atomic_dec(&task_rq(p)->nr_iowait);
+		}
 		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+	}
 
 	ttwu_do_wakeup(rq, p, 0, cookie);
 	ttwu_stat(p, smp_processor_id(), 0);
@@ -2948,6 +2966,36 @@ unsigned long long nr_context_switches(void)
 	return sum;
 }
 
+/*
+ * IO-wait accounting, and how it's mostly bollocks (on SMP).
+ *
+ * The idea behind IO-wait accounting is to account the idle time that we could
+ * have spent running if it were not for IO. That is, if we were to improve the
+ * storage performance, we'd have a proportional reduction in IO-wait time.
+ *
+ * This all works nicely on UP, where, when a task blocks on IO, we account
+ * idle time as IO-wait, because if the storage were faster, it could've been
+ * running and we'd not be idle.
+ *
+ * This has been extended to SMP, by doing the same for each CPU. This however
+ * is broken.
+ *
+ * Imagine for instance the case where two tasks block on one CPU, only the one
+ * CPU will have IO-wait accounted, while the other has regular idle. Even
+ * though, if the storage were faster, both could've run at the same time,
+ * utilising both CPUs.
+ *
+ * This means, that when looking globally, the current IO-wait accounting on
+ * SMP is a lower bound, by reason of under accounting.
+ *
+ * Worse, since the numbers are provided per CPU, they are sometimes
+ * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly
+ * associated with any one particular CPU, it can wake to another CPU than it
+ * blocked on. This means the per CPU IO-wait number is meaningless.
+ *
+ * Task CPU affinities can make all that even more 'interesting'.
+ */
+
 unsigned long nr_iowait(void)
 {
 	unsigned long i, sum = 0;
@@ -2958,6 +3006,13 @@ unsigned long nr_iowait(void)
 	return sum;
 }
 
+/*
+ * Consumers of these two interfaces, like for example the cpufreq menu
+ * governor are using nonsensical data. Boosting frequency for a CPU that has
+ * IO-wait which might not even end up running the task when it does become
+ * runnable.
+ */
+
 unsigned long nr_iowait_cpu(int cpu)
 {
 	struct rq *this = cpu_rq(cpu);
@@ -3369,6 +3424,11 @@ static void __sched notrace __schedule(bool preempt)
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 			prev->on_rq = 0;
 
+			if (prev->in_iowait) {
+				atomic_inc(&rq->nr_iowait);
+				delayacct_blkio_start();
+			}
+
 			/*
 			 * If a worker went to sleep, notify and ask workqueue
 			 * whether it wants to wake up a task to maintain


* [PATCH v3 1/4] sched: move IO scheduling accounting from io_schedule_timeout() into scheduler
  2016-12-07  9:35     ` Peter Zijlstra
@ 2016-12-07 20:48       ` Tejun Heo
  2017-01-14 12:49         ` [tip:sched/core] sched/core: " tip-bot for Tejun Heo
  0 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2016-12-07 20:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, akpm, mingo, axboe, tytso, jack, adilger.kernel,
	linux-ext4, linux-fsdevel, linux-kernel, kernel-team, mingbo

Hello,

Yeah, that works.  Here's v3 based on your patch.  The other patches
still apply correctly.

Thanks.

------ 8< ------
For an interface to support blocking for IOs, it must call
io_schedule() instead of schedule().  This makes it tedious to add IO
blocking to existing interfaces as the switching between schedule()
and io_schedule() is often buried deep.

As we already have a way to mark the task as IO scheduling, this can
be made easier by separating out io_schedule() into multiple steps so
that IO schedule preparation can be performed before invoking a
blocking interface and the actual accounting happens inside the
scheduler.

io_schedule_timeout() does the following three things prior to calling
schedule_timeout().

 1. Mark the task as scheduling for IO.
 2. Flush out plugged IOs.
 3. Account the IO scheduling.

#1 and #2 can be performed in the preparation step while #3 must be
done close to the actual scheduling.  This patch moves #3 into the
scheduler so that later patches can separate out preparation and
finish steps from io_schedule().

v3: Replaced with PeterZ's implementation which performs nr_iowait
    accounting in the sleep and wake up path to avoid unnecessarily
    burdening non sleeping paths in __schedule().

v2: Remember the rq in @prev_rq and use it for decrementing nr_iowait
    to avoid misattributing the count after the task gets migrated to
    another CPU.  Noticed by Pavan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Patch-originally-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavan Kondeti <pkondeti@codeaurora.org>
---
 kernel/sched/core.c |   68 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 61 insertions(+), 7 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2085,11 +2085,24 @@ try_to_wake_up(struct task_struct *p, un
 	p->sched_contributes_to_load = !!task_contributes_to_load(p);
 	p->state = TASK_WAKING;
 
+	if (p->in_iowait) {
+		delayacct_blkio_end();
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
 		set_task_cpu(p, cpu);
 	}
+
+#else /* CONFIG_SMP */
+
+	if (p->in_iowait) {
+		delayacct_blkio_end();
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
 #endif /* CONFIG_SMP */
 
 	ttwu_queue(p, cpu, wake_flags);
@@ -2139,8 +2152,13 @@ static void try_to_wake_up_local(struct
 
 	trace_sched_waking(p);
 
-	if (!task_on_rq_queued(p))
+	if (!task_on_rq_queued(p)) {
+		if (p->in_iowait) {
+			delayacct_blkio_end();
+			atomic_dec(&rq->nr_iowait);
+		}
 		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+	}
 
 	ttwu_do_wakeup(rq, p, 0, cookie);
 	ttwu_stat(p, smp_processor_id(), 0);
@@ -2948,6 +2966,36 @@ unsigned long long nr_context_switches(v
 	return sum;
 }
 
+/*
+ * IO-wait accounting, and how it's mostly bollocks (on SMP).
+ *
+ * The idea behind IO-wait accounting is to account the idle time that we could
+ * have spent running if it were not for IO. That is, if we were to improve the
+ * storage performance, we'd have a proportional reduction in IO-wait time.
+ *
+ * This all works nicely on UP, where, when a task blocks on IO, we account
+ * idle time as IO-wait, because if the storage were faster, it could've been
+ * running and we'd not be idle.
+ *
+ * This has been extended to SMP, by doing the same for each CPU. This however
+ * is broken.
+ *
+ * Imagine for instance the case where two tasks block on one CPU, only the one
+ * CPU will have IO-wait accounted, while the other has regular idle. Even
+ * though, if the storage were faster, both could've run at the same time,
+ * utilising both CPUs.
+ *
+ * This means, that when looking globally, the current IO-wait accounting on
+ * SMP is a lower bound, by reason of under accounting.
+ *
+ * Worse, since the numbers are provided per CPU, they are sometimes
+ * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly
+ * associated with any one particular CPU, it can wake to another CPU than it
+ * blocked on. This means the per CPU IO-wait number is meaningless.
+ *
+ * Task CPU affinities can make all that even more 'interesting'.
+ */
+
 unsigned long nr_iowait(void)
 {
 	unsigned long i, sum = 0;
@@ -2958,6 +3006,13 @@ unsigned long nr_iowait(void)
 	return sum;
 }
 
+/*
+ * Consumers of these two interfaces, like for example the cpufreq menu
+ * governor are using nonsensical data. Boosting frequency for a CPU that has
+ * IO-wait which might not even end up running the task when it does become
+ * runnable.
+ */
+
 unsigned long nr_iowait_cpu(int cpu)
 {
 	struct rq *this = cpu_rq(cpu);
@@ -3369,6 +3424,11 @@ static void __sched notrace __schedule(b
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 			prev->on_rq = 0;
 
+			if (prev->in_iowait) {
+				atomic_inc(&rq->nr_iowait);
+				delayacct_blkio_start();
+			}
+
 			/*
 			 * If a worker went to sleep, notify and ask workqueue
 			 * whether it wants to wake up a task to maintain
@@ -5063,19 +5123,13 @@ EXPORT_SYMBOL_GPL(yield_to);
 long __sched io_schedule_timeout(long timeout)
 {
 	int old_iowait = current->in_iowait;
-	struct rq *rq;
 	long ret;
 
 	current->in_iowait = 1;
 	blk_schedule_flush_plug(current);
 
-	delayacct_blkio_start();
-	rq = raw_rq();
-	atomic_inc(&rq->nr_iowait);
 	ret = schedule_timeout(timeout);
 	current->in_iowait = old_iowait;
-	atomic_dec(&rq->nr_iowait);
-	delayacct_blkio_end();
 
 	return ret;
 }


* [tip:sched/core] sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler
  2016-12-07 20:48       ` [PATCH v3 1/4] sched: move IO scheduling accounting from io_schedule_timeout() into scheduler Tejun Heo
@ 2017-01-14 12:49         ` tip-bot for Tejun Heo
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Tejun Heo @ 2017-01-14 12:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, tglx, tj, mingo, efault, linux-kernel, peterz, hpa

Commit-ID:  e33a9bba85a869b85fd0a7f16e21a5ec8977e325
Gitweb:     http://git.kernel.org/tip/e33a9bba85a869b85fd0a7f16e21a5ec8977e325
Author:     Tejun Heo <tj@kernel.org>
AuthorDate: Wed, 7 Dec 2016 15:48:41 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sat, 14 Jan 2017 11:30:03 +0100

sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler

For an interface to support blocking for IOs, it must call
io_schedule() instead of schedule().  This makes it tedious to add IO
blocking to existing interfaces as the switching between schedule()
and io_schedule() is often buried deep.

As we already have a way to mark the task as IO scheduling, this can
be made easier by separating out io_schedule() into multiple steps so
that IO schedule preparation can be performed before invoking a
blocking interface and the actual accounting happens inside the
scheduler.

io_schedule_timeout() does the following three things prior to calling
schedule_timeout().

 1. Mark the task as scheduling for IO.
 2. Flush out plugged IOs.
 3. Account the IO scheduling.


#1 and #2 can be performed in the preparation step while #3 must be
done close to the actual scheduling.  This patch moves #3 into the
scheduler so that later patches can separate out preparation and
finish steps from io_schedule().

Patch-originally-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adilger.kernel@dilger.ca
Cc: akpm@linux-foundation.org
Cc: axboe@kernel.dk
Cc: jack@suse.com
Cc: kernel-team@fb.com
Cc: mingbo@fb.com
Cc: tytso@mit.edu
Link: http://lkml.kernel.org/r/20161207204841.GA22296@htj.duckdns.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 61 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 96a4267..9fd3716 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2089,11 +2089,24 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	p->sched_contributes_to_load = !!task_contributes_to_load(p);
 	p->state = TASK_WAKING;
 
+	if (p->in_iowait) {
+		delayacct_blkio_end();
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
 		set_task_cpu(p, cpu);
 	}
+
+#else /* CONFIG_SMP */
+
+	if (p->in_iowait) {
+		delayacct_blkio_end();
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
 #endif /* CONFIG_SMP */
 
 	ttwu_queue(p, cpu, wake_flags);
@@ -2143,8 +2156,13 @@ static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf)
 
 	trace_sched_waking(p);
 
-	if (!task_on_rq_queued(p))
+	if (!task_on_rq_queued(p)) {
+		if (p->in_iowait) {
+			delayacct_blkio_end();
+			atomic_dec(&rq->nr_iowait);
+		}
 		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+	}
 
 	ttwu_do_wakeup(rq, p, 0, rf);
 	ttwu_stat(p, smp_processor_id(), 0);
@@ -2956,6 +2974,36 @@ unsigned long long nr_context_switches(void)
 	return sum;
 }
 
+/*
+ * IO-wait accounting, and how it's mostly bollocks (on SMP).
+ *
+ * The idea behind IO-wait accounting is to account the idle time that we could
+ * have spent running if it were not for IO. That is, if we were to improve the
+ * storage performance, we'd have a proportional reduction in IO-wait time.
+ *
+ * This all works nicely on UP, where, when a task blocks on IO, we account
+ * idle time as IO-wait, because if the storage were faster, it could've been
+ * running and we'd not be idle.
+ *
+ * This has been extended to SMP, by doing the same for each CPU. This however
+ * is broken.
+ *
+ * Imagine for instance the case where two tasks block on one CPU, only the one
+ * CPU will have IO-wait accounted, while the other has regular idle. Even
+ * though, if the storage were faster, both could've run at the same time,
+ * utilising both CPUs.
+ *
+ * This means, that when looking globally, the current IO-wait accounting on
+ * SMP is a lower bound, by reason of under accounting.
+ *
+ * Worse, since the numbers are provided per CPU, they are sometimes
+ * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly
+ * associated with any one particular CPU, it can wake to another CPU than it
+ * blocked on. This means the per CPU IO-wait number is meaningless.
+ *
+ * Task CPU affinities can make all that even more 'interesting'.
+ */
+
 unsigned long nr_iowait(void)
 {
 	unsigned long i, sum = 0;
@@ -2966,6 +3014,13 @@ unsigned long nr_iowait(void)
 	return sum;
 }
 
+/*
+ * Consumers of these two interfaces, like for example the cpufreq menu
+ * governor are using nonsensical data. Boosting frequency for a CPU that has
+ * IO-wait which might not even end up running the task when it does become
+ * runnable.
+ */
+
 unsigned long nr_iowait_cpu(int cpu)
 {
 	struct rq *this = cpu_rq(cpu);
@@ -3377,6 +3432,11 @@ static void __sched notrace __schedule(bool preempt)
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 			prev->on_rq = 0;
 
+			if (prev->in_iowait) {
+				atomic_inc(&rq->nr_iowait);
+				delayacct_blkio_start();
+			}
+
 			/*
 			 * If a worker went to sleep, notify and ask workqueue
 			 * whether it wants to wake up a task to maintain
@@ -5075,19 +5135,13 @@ EXPORT_SYMBOL_GPL(yield_to);
 long __sched io_schedule_timeout(long timeout)
 {
 	int old_iowait = current->in_iowait;
-	struct rq *rq;
 	long ret;
 
 	current->in_iowait = 1;
 	blk_schedule_flush_plug(current);
 
-	delayacct_blkio_start();
-	rq = raw_rq();
-	atomic_inc(&rq->nr_iowait);
 	ret = schedule_timeout(timeout);
 	current->in_iowait = old_iowait;
-	atomic_dec(&rq->nr_iowait);
-	delayacct_blkio_end();
 
 	return ret;
 }


* [tip:sched/core] sched/core: Separate out io_schedule_prepare() and io_schedule_finish()
  2016-10-28 16:58 ` [PATCH 2/4] sched: separate out io_schedule_prepare() and io_schedule_finish() Tejun Heo
@ 2017-01-14 12:49   ` tip-bot for Tejun Heo
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Tejun Heo @ 2017-01-14 12:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, peterz, tj, mingo, tglx, efault, hpa, akpm,
	linux-kernel, axboe

Commit-ID:  10ab56434f2f633a51e432ee8b7c29e12438e163
Gitweb:     http://git.kernel.org/tip/10ab56434f2f633a51e432ee8b7c29e12438e163
Author:     Tejun Heo <tj@kernel.org>
AuthorDate: Fri, 28 Oct 2016 12:58:10 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sat, 14 Jan 2017 11:30:04 +0100

sched/core: Separate out io_schedule_prepare() and io_schedule_finish()

Now that IO schedule accounting is done inside __schedule(),
io_schedule() can be split into three steps - prep, schedule, and
finish - where the schedule part doesn't need any special annotation.
This allows marking a sleep as iowait by simply wrapping an existing
blocking function with io_schedule_prepare() and io_schedule_finish().

Because task_struct->in_iowait is a single bit, the caller of
io_schedule_prepare() needs to record and then pass its state to
io_schedule_finish() to be safe regarding nesting.  While this isn't
the prettiest, these functions are mostly going to be used by core
functions and we don't want to use more space for ->in_iowait.

While at it, as it's simple to do now, reimplement io_schedule()
without unnecessarily going through io_schedule_timeout().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adilger.kernel@dilger.ca
Cc: jack@suse.com
Cc: kernel-team@fb.com
Cc: mingbo@fb.com
Cc: tytso@mit.edu
Link: http://lkml.kernel.org/r/1477673892-28940-3-git-send-email-tj@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  8 +++-----
 kernel/sched/core.c   | 33 ++++++++++++++++++++++++++++-----
 2 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 94a48bb..a8daed9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -461,12 +461,10 @@ extern signed long schedule_timeout_idle(signed long timeout);
 asmlinkage void schedule(void);
 extern void schedule_preempt_disabled(void);
 
+extern int __must_check io_schedule_prepare(void);
+extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
-
-static inline void io_schedule(void)
-{
-	io_schedule_timeout(MAX_SCHEDULE_TIMEOUT);
-}
+extern void io_schedule(void);
 
 void __noreturn do_task_dead(void);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9fd3716..49ce1cb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5128,25 +5128,48 @@ out_irq:
 }
 EXPORT_SYMBOL_GPL(yield_to);
 
+int io_schedule_prepare(void)
+{
+	int old_iowait = current->in_iowait;
+
+	current->in_iowait = 1;
+	blk_schedule_flush_plug(current);
+
+	return old_iowait;
+}
+
+void io_schedule_finish(int token)
+{
+	current->in_iowait = token;
+}
+
 /*
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
  */
 long __sched io_schedule_timeout(long timeout)
 {
-	int old_iowait = current->in_iowait;
+	int token;
 	long ret;
 
-	current->in_iowait = 1;
-	blk_schedule_flush_plug(current);
-
+	token = io_schedule_prepare();
 	ret = schedule_timeout(timeout);
-	current->in_iowait = old_iowait;
+	io_schedule_finish(token);
 
 	return ret;
 }
 EXPORT_SYMBOL(io_schedule_timeout);
 
+void io_schedule(void)
+{
+	int token;
+
+	token = io_schedule_prepare();
+	schedule();
+	io_schedule_finish(token);
+}
+EXPORT_SYMBOL(io_schedule);
+
 /**
  * sys_sched_get_priority_max - return maximum RT priority.
  * @policy: scheduling class.

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:sched/core] locking/mutex, sched/wait: Add mutex_lock_io()
  2016-10-28 16:58 ` [PATCH 3/4] mutex: add mutex_lock_io() Tejun Heo
@ 2017-01-14 12:50   ` tip-bot for Tejun Heo
  2017-01-14 14:13     ` Mike Galbraith
  0 siblings, 1 reply; 22+ messages in thread
From: tip-bot for Tejun Heo @ 2017-01-14 12:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: akpm, linux-kernel, torvalds, efault, axboe, tglx, mingo, peterz,
	tj, hpa

Commit-ID:  1460cb65a10f6c7a6e3a1c76513338861a0a43b6
Gitweb:     http://git.kernel.org/tip/1460cb65a10f6c7a6e3a1c76513338861a0a43b6
Author:     Tejun Heo <tj@kernel.org>
AuthorDate: Fri, 28 Oct 2016 12:58:11 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sat, 14 Jan 2017 11:30:05 +0100

locking/mutex, sched/wait: Add mutex_lock_io()

We sometimes end up propagating IO blocking through mutexes; however,
because there currently is no way of annotating mutex sleeps as
iowait, there are cases where iowait and /proc/stat:procs_blocked
report misleading numbers obscuring the actual state of the system.

This patch adds mutex_lock_io() so that mutex sleeps can be marked as
iowait in those cases.
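
A rough usage sketch (hypothetical code, not part of this patch -
foo_io_mutex and do_foo_io() are made-up names): a mutex that is
routinely held while IO is in flight can simply switch its lock calls
over, so that waiters show up as iowait:

	static DEFINE_MUTEX(foo_io_mutex);

	static void foo_write_back(void)
	{
		mutex_lock_io(&foo_io_mutex);	/* sleep here is accounted as iowait */
		do_foo_io();			/* IO submitted and waited on under the mutex */
		mutex_unlock(&foo_io_mutex);
	}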

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adilger.kernel@dilger.ca
Cc: jack@suse.com
Cc: kernel-team@fb.com
Cc: mingbo@fb.com
Cc: tytso@mit.edu
Link: http://lkml.kernel.org/r/1477673892-28940-4-git-send-email-tj@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mutex.h  |  4 ++++
 kernel/locking/mutex.c | 24 ++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index b97870f..980ba16 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -156,10 +156,12 @@ extern int __must_check mutex_lock_interruptible_nested(struct mutex *lock,
 					unsigned int subclass);
 extern int __must_check mutex_lock_killable_nested(struct mutex *lock,
 					unsigned int subclass);
+extern void mutex_lock_io_nested(struct mutex *lock, unsigned int subclass);
 
 #define mutex_lock(lock) mutex_lock_nested(lock, 0)
 #define mutex_lock_interruptible(lock) mutex_lock_interruptible_nested(lock, 0)
 #define mutex_lock_killable(lock) mutex_lock_killable_nested(lock, 0)
+#define mutex_lock_io(lock) mutex_lock_io_nested(lock, 0)
 
 #define mutex_lock_nest_lock(lock, nest_lock)				\
 do {									\
@@ -171,11 +173,13 @@ do {									\
 extern void mutex_lock(struct mutex *lock);
 extern int __must_check mutex_lock_interruptible(struct mutex *lock);
 extern int __must_check mutex_lock_killable(struct mutex *lock);
+extern void mutex_lock_io(struct mutex *lock);
 
 # define mutex_lock_nested(lock, subclass) mutex_lock(lock)
 # define mutex_lock_interruptible_nested(lock, subclass) mutex_lock_interruptible(lock)
 # define mutex_lock_killable_nested(lock, subclass) mutex_lock_killable(lock)
 # define mutex_lock_nest_lock(lock, nest_lock) mutex_lock(lock)
+# define mutex_lock_nest_io(lock, nest_lock) mutex_io(lock)
 #endif
 
 /*
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 9b34961..8464a5c 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -783,6 +783,20 @@ mutex_lock_interruptible_nested(struct mutex *lock, unsigned int subclass)
 }
 EXPORT_SYMBOL_GPL(mutex_lock_interruptible_nested);
 
+void __sched
+mutex_lock_io_nested(struct mutex *lock, unsigned int subclass)
+{
+	int token;
+
+	might_sleep();
+
+	token = io_schedule_prepare();
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE,
+			    subclass, NULL, _RET_IP_, NULL, 0);
+	io_schedule_finish(token);
+}
+EXPORT_SYMBOL_GPL(mutex_lock_io_nested);
+
 static inline int
 ww_mutex_deadlock_injection(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
 {
@@ -950,6 +964,16 @@ int __sched mutex_lock_killable(struct mutex *lock)
 }
 EXPORT_SYMBOL(mutex_lock_killable);
 
+void __sched mutex_lock_io(struct mutex *lock)
+{
+	int token;
+
+	token = io_schedule_prepare();
+	mutex_lock(lock);
+	io_schedule_finish(token);
+}
+EXPORT_SYMBOL_GPL(mutex_lock_io);
+
 static noinline void __sched
 __mutex_lock_slowpath(struct mutex *lock)
 {

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:sched/core] fs/jbd2, locking/mutex, sched/wait: Use mutex_lock_io() for journal->j_checkpoint_mutex
  2016-10-28 16:58 ` [PATCH 4/4] jbd2: use mutex_lock_io() for journal->j_checkpoint_mutex Tejun Heo
@ 2017-01-14 12:51   ` tip-bot for Tejun Heo
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Tejun Heo @ 2017-01-14 12:51 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, mingo, jack, adilger.kernel, tglx, mingbo, efault, akpm,
	torvalds, axboe, peterz, tj, tytso, linux-kernel

Commit-ID:  6fa7aa50b2c48400bbd045daf3a2498882eb0596
Gitweb:     http://git.kernel.org/tip/6fa7aa50b2c48400bbd045daf3a2498882eb0596
Author:     Tejun Heo <tj@kernel.org>
AuthorDate: Fri, 28 Oct 2016 12:58:12 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sat, 14 Jan 2017 11:30:06 +0100

fs/jbd2, locking/mutex, sched/wait: Use mutex_lock_io() for journal->j_checkpoint_mutex

When an ext4 fs is bogged down by a lot of metadata IOs (in the
reported case, it was deletion of millions of files, but any massive
amount of journal writes would do), after the journal is filled up,
tasks which try to access the filesystem and aren't currently
performing the journal writes end up waiting in
__jbd2_log_wait_for_space() for journal->j_checkpoint_mutex.

Because those mutex sleeps aren't marked as iowait, this condition can
lead to misleadingly low iowait and /proc/stat:procs_blocked.  While
iowait propagation is far from strict, this condition can be triggered
fairly easily and annotating these sleeps correctly helps initial
diagnosis quite a bit.

Use the new mutex_lock_io() for journal->j_checkpoint_mutex so that
these sleeps are properly marked as iowait.

Reported-by: Mingbo Wan <mingbo@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Link: http://lkml.kernel.org/r/1477673892-28940-5-git-send-email-tj@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 fs/jbd2/commit.c  |  2 +-
 fs/jbd2/journal.c | 12 ++++++------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 8c51436..b6b194e 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -393,7 +393,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	/* Do we need to erase the effects of a prior jbd2_journal_flush? */
 	if (journal->j_flags & JBD2_FLUSHED) {
 		jbd_debug(3, "super block updated\n");
-		mutex_lock(&journal->j_checkpoint_mutex);
+		mutex_lock_io(&journal->j_checkpoint_mutex);
 		/*
 		 * We hold j_checkpoint_mutex so tail cannot change under us.
 		 * We don't need any special data guarantees for writing sb
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index a097048..d8a5d0a 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -944,7 +944,7 @@ out:
  */
 void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block)
 {
-	mutex_lock(&journal->j_checkpoint_mutex);
+	mutex_lock_io(&journal->j_checkpoint_mutex);
 	if (tid_gt(tid, journal->j_tail_sequence))
 		__jbd2_update_log_tail(journal, tid, block);
 	mutex_unlock(&journal->j_checkpoint_mutex);
@@ -1304,7 +1304,7 @@ static int journal_reset(journal_t *journal)
 		journal->j_flags |= JBD2_FLUSHED;
 	} else {
 		/* Lock here to make assertions happy... */
-		mutex_lock(&journal->j_checkpoint_mutex);
+		mutex_lock_io(&journal->j_checkpoint_mutex);
 		/*
 		 * Update log tail information. We use REQ_FUA since new
 		 * transaction will start reusing journal space and so we
@@ -1691,7 +1691,7 @@ int jbd2_journal_destroy(journal_t *journal)
 	spin_lock(&journal->j_list_lock);
 	while (journal->j_checkpoint_transactions != NULL) {
 		spin_unlock(&journal->j_list_lock);
-		mutex_lock(&journal->j_checkpoint_mutex);
+		mutex_lock_io(&journal->j_checkpoint_mutex);
 		err = jbd2_log_do_checkpoint(journal);
 		mutex_unlock(&journal->j_checkpoint_mutex);
 		/*
@@ -1713,7 +1713,7 @@ int jbd2_journal_destroy(journal_t *journal)
 
 	if (journal->j_sb_buffer) {
 		if (!is_journal_aborted(journal)) {
-			mutex_lock(&journal->j_checkpoint_mutex);
+			mutex_lock_io(&journal->j_checkpoint_mutex);
 
 			write_lock(&journal->j_state_lock);
 			journal->j_tail_sequence =
@@ -1955,7 +1955,7 @@ int jbd2_journal_flush(journal_t *journal)
 	spin_lock(&journal->j_list_lock);
 	while (!err && journal->j_checkpoint_transactions != NULL) {
 		spin_unlock(&journal->j_list_lock);
-		mutex_lock(&journal->j_checkpoint_mutex);
+		mutex_lock_io(&journal->j_checkpoint_mutex);
 		err = jbd2_log_do_checkpoint(journal);
 		mutex_unlock(&journal->j_checkpoint_mutex);
 		spin_lock(&journal->j_list_lock);
@@ -1965,7 +1965,7 @@ int jbd2_journal_flush(journal_t *journal)
 	if (is_journal_aborted(journal))
 		return -EIO;
 
-	mutex_lock(&journal->j_checkpoint_mutex);
+	mutex_lock_io(&journal->j_checkpoint_mutex);
 	if (!err) {
 		err = jbd2_cleanup_journal_tail(journal);
 		if (err < 0) {

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [tip:sched/core] locking/mutex, sched/wait: Add mutex_lock_io()
  2017-01-14 12:50   ` [tip:sched/core] locking/mutex, sched/wait: Add mutex_lock_io() tip-bot for Tejun Heo
@ 2017-01-14 14:13     ` Mike Galbraith
  2017-01-14 16:10       ` Ingo Molnar
  0 siblings, 1 reply; 22+ messages in thread
From: Mike Galbraith @ 2017-01-14 14:13 UTC (permalink / raw)
  To: torvalds, linux-kernel, akpm, axboe, mingo, tglx, peterz, tj,
	hpa, linux-tip-commits

On Sat, 2017-01-14 at 04:50 -0800, tip-bot for Tejun Heo wrote:

> diff --git a/include/linux/mutex.h b/include/linux/mutex.h
> index b97870f..980ba16 100644
> --- a/include/linux/mutex.h
> +++ b/include/linux/mutex.h

> @@ -171,11 +173,13 @@ do {									\
>  extern void mutex_lock(struct mutex *lock);
>  extern int __must_check mutex_lock_interruptible(struct mutex *lock);
>  extern int __must_check mutex_lock_killable(struct mutex *lock);
> +extern void mutex_lock_io(struct mutex *lock);
>  
>  # define mutex_lock_nested(lock, subclass) mutex_lock(lock)
>  # define mutex_lock_interruptible_nested(lock, subclass) mutex_lock_interruptible(lock)
>  # define mutex_lock_killable_nested(lock, subclass) mutex_lock_killable(lock)
>  # define mutex_lock_nest_lock(lock, nest_lock) mutex_lock(lock)
> +# define mutex_lock_nest_io(lock, nest_lock) mutex_io(lock)
                                                ^^^^^^^^^^^^^^ typo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:sched/core] locking/mutex, sched/wait: Add mutex_lock_io()
  2017-01-14 14:13     ` Mike Galbraith
@ 2017-01-14 16:10       ` Ingo Molnar
  0 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2017-01-14 16:10 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: torvalds, linux-kernel, akpm, axboe, tglx, peterz, tj, hpa,
	linux-tip-commits


* Mike Galbraith <efault@gmx.de> wrote:

> On Sat, 2017-01-14 at 04:50 -0800, tip-bot for Tejun Heo wrote:
> 
> > diff --git a/include/linux/mutex.h b/include/linux/mutex.h
> > index b97870f..980ba16 100644
> > --- a/include/linux/mutex.h
> > +++ b/include/linux/mutex.h
> 
> > @@ -171,11 +173,13 @@ do {									\
> >  extern void mutex_lock(struct mutex *lock);
> >  extern int __must_check mutex_lock_interruptible(struct mutex *lock);
> >  extern int __must_check mutex_lock_killable(struct mutex *lock);
> > +extern void mutex_lock_io(struct mutex *lock);
> >  
> >  # define mutex_lock_nested(lock, subclass) mutex_lock(lock)
> >  # define mutex_lock_interruptible_nested(lock, subclass) mutex_lock_interruptible(lock)
> >  # define mutex_lock_killable_nested(lock, subclass) mutex_lock_killable(lock)
> >  # define mutex_lock_nest_lock(lock, nest_lock) mutex_lock(lock)
> > +# define mutex_lock_nest_io(lock, nest_lock) mutex_io(lock)
>                                                 ^^^^^^^^^^^^^^ typo

Double typo in fact I think, that should be something like:

# define mutex_lock_io_nested(lock, subclass) mutex_lock(lock)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-01-14 16:10 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-28 16:58 [PATCHSET RFC] sched, jbd2: mark sleeps on journal->j_checkpoint_mutex as iowait Tejun Heo
2016-10-28 16:58 ` [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() Tejun Heo
2016-10-28 18:27   ` Peter Zijlstra
2016-10-28 19:07     ` Peter Zijlstra
2016-10-28 19:12       ` Tejun Heo
2016-10-29  3:21         ` Peter Zijlstra
2016-10-31 16:45           ` Tejun Heo
2016-12-06 21:30             ` Tejun Heo
2016-11-03 15:33   ` Pavan Kondeti
2016-11-08 22:51     ` Tejun Heo
2016-12-06 21:29   ` [PATCH v2 " Tejun Heo
2016-12-07  9:35     ` Peter Zijlstra
2016-12-07 20:48       ` [PATCH v3 1/4] sched: move IO scheduling accounting from io_schedule_timeout() into scheduler Tejun Heo
2017-01-14 12:49         ` [tip:sched/core] sched/core: " tip-bot for Tejun Heo
2016-10-28 16:58 ` [PATCH 2/4] sched: separate out io_schedule_prepare() and io_schedule_finish() Tejun Heo
2017-01-14 12:49   ` [tip:sched/core] sched/core: Separate " tip-bot for Tejun Heo
2016-10-28 16:58 ` [PATCH 3/4] mutex: add mutex_lock_io() Tejun Heo
2017-01-14 12:50   ` [tip:sched/core] locking/mutex, sched/wait: Add mutex_lock_io() tip-bot for Tejun Heo
2017-01-14 14:13     ` Mike Galbraith
2017-01-14 16:10       ` Ingo Molnar
2016-10-28 16:58 ` [PATCH 4/4] jbd2: use mutex_lock_io() for journal->j_checkpoint_mutex Tejun Heo
2017-01-14 12:51   ` [tip:sched/core] fs/jbd2, locking/mutex, sched/wait: Use " tip-bot for Tejun Heo
