linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
@ 2008-10-02  3:00 Arjan van de Ven
  2008-10-02  4:56 ` Andrew Morton
  0 siblings, 1 reply; 76+ messages in thread
From: Arjan van de Ven @ 2008-10-02  3:00 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel; +Cc: Alan Cox


From: Arjan van de Ven <arjan@linux.intel.com>
Date: Wed, 1 Oct 2008 19:58:18 -0700
Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority

With latencytop, I noticed that the (in memory) atime updates during a
kernel build had latencies of 6 seconds or longer; this is obviously not so
nice behavior. Other EXT3 journal related operations had similar or even
longer latencies.

Digging into this a bit more, it appears to be an interaction between EXT3
and CFQ in that CFQ tries to be fair to everyone, including kjournald.
However, in reality, kjournald is "special" in that it does a lot of journal
work on behalf of other processes and effectively this leads to a twisted 
kind of "mass priority inversion" type of behavior.

The good news is that CFQ already has the infrastructure to make certain
processes special... JBD just wasn't using that quite yet.

The patch below makes kjournald of the IOPRIO_CLASS_RT priority to break
this priority inversion behavior. With this patch, the latencies for atime
updates (and similar operation) go down by a factor of 3x to 4x !

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 fs/ioprio.c            |    3 ++-
 fs/jbd/journal.c       |   12 ++++++++++++
 include/linux/ioprio.h |    2 ++
 3 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/fs/ioprio.c b/fs/ioprio.c
index da3cc46..3bd95dc 100644
--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -27,7 +27,7 @@
 #include <linux/security.h>
 #include <linux/pid_namespace.h>
 
-static int set_task_ioprio(struct task_struct *task, int ioprio)
+int set_task_ioprio(struct task_struct *task, int ioprio)
 {
 	int err;
 	struct io_context *ioc;
@@ -64,6 +64,7 @@ static int set_task_ioprio(struct task_struct *task, int ioprio)
 	task_unlock(task);
 	return err;
 }
+EXPORT_SYMBOL_GPL(set_task_ioprio);
 
 asmlinkage long sys_ioprio_set(int which, int who, int ioprio)
 {
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index aa7143a..2ed3d8f 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -36,6 +36,7 @@
 #include <linux/poison.h>
 #include <linux/proc_fs.h>
 #include <linux/debugfs.h>
+#include <linux/ioprio.h>
 
 #include <asm/uaccess.h>
 #include <asm/page.h>
@@ -131,6 +132,17 @@ static int kjournald(void *arg)
 			journal->j_commit_interval / HZ);
 
 	/*
+	 * kjournald is the process on which most other processes depend on
+	 * for doing the filesystem portion of their IO. As such, there exists
+	 * the equivalent of a priority inversion situation, where kjournald
+	 * would get less priority as it should.
+	 *
+	 * For this reason we set to "medium real time priority", which is higher
+	 * than regular tasks, but not infinitely powerful.
+	 */
+	set_task_ioprio(current, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 4));
+
+	/*
 	 * And now, wait forever for commit wakeup events.
 	 */
 	spin_lock(&journal->j_state_lock);
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index f98a656..76dad48 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -86,4 +86,6 @@ static inline int task_nice_ioclass(struct task_struct *task)
  */
 extern int ioprio_best(unsigned short aprio, unsigned short bprio);
 
+extern int set_task_ioprio(struct task_struct *task, int ioprio);
+
 #endif
-- 
1.5.5.1



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02  3:00 [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority Arjan van de Ven
@ 2008-10-02  4:56 ` Andrew Morton
  2008-10-02  6:27   ` Jens Axboe
                     ` (2 more replies)
  0 siblings, 3 replies; 76+ messages in thread
From: Andrew Morton @ 2008-10-02  4:56 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Jens Axboe, linux-kernel, Alan Cox

On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven <arjan@infradead.org> wrote:

> Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority

You proposed this a while back and it didn't happen and I forget
why and the changelog doesn't mention any of that?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02  4:56 ` Andrew Morton
@ 2008-10-02  6:27   ` Jens Axboe
  2008-10-02  6:55     ` Andrew Morton
  2008-10-02  6:57   ` Andi Kleen
  2008-10-02 13:05   ` Arjan van de Ven
  2 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2008-10-02  6:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Arjan van de Ven, linux-kernel, Alan Cox

On Wed, Oct 01 2008, Andrew Morton wrote:
> On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven <arjan@infradead.org> wrote:
> 
> > Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
> 
> You proposed this a while back and it didn't happen and I forget
> why and the changelog doesn't mention any of that?

I think you called for benchmark results, which I don't think happened.
The patch definitely makes sense, so we should just make sure that we
don't regress elsewhere. Honestly, I'd be surprised if we did...

How about I just toss it into the 2.6.28 testing mix, plenty of time for
testing and such?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02  6:27   ` Jens Axboe
@ 2008-10-02  6:55     ` Andrew Morton
  2008-10-02  7:45       ` Jens Axboe
  2008-10-02 13:12       ` Arjan van de Ven
  0 siblings, 2 replies; 76+ messages in thread
From: Andrew Morton @ 2008-10-02  6:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Arjan van de Ven, linux-kernel, Alan Cox

On Thu, 2 Oct 2008 08:27:37 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:

> On Wed, Oct 01 2008, Andrew Morton wrote:
> > On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven <arjan@infradead.org> wrote:
> > 
> > > Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
> > 
> > You proposed this a while back and it didn't happen and I forget
> > why and the changelog doesn't mention any of that?
> 
> I think you called for benchmark results, which I don't think happened.
> The patch definitely makes sense, so we should just make sure that we
> don't regress elsewhere. Honestly, I'd be surprised if we did...

Now I think about it, didn't the earlier patch tweak CPU priority and
not IO priority?  I forget.  <kicks the changelog again>

> How about I just toss it into the 2.6.28 testing mix, plenty of time for
> testing and such?

Many performance regressions don't get noticed for six or twelve
months, by which time everyone is suffering from them (see
kernel/sched.c).

kjournald does huge amounts of not-terribly-important writeback.  One
obvious risk is that by making all that bulk writeback high-priority,
read-latency-sensitive applications might suffer latency spikes.



Now, kjournald is _supposed_ to be mostly asynchronous wrt foreground
operations.  And once upon a time (seven years ago) it mostly was.  But
there was some horrid race which I ended up fixing by introducing one
point where synchronous userspace actions had to block behind kjournald
activity.  I spent quite some time on it but couldn't come up with
anything better.  It had fairly bad effects on some workloads.

I've forgotten where that code is now, but I don't think it was ever
revisited.  It should be.

So.  Where are these atime updaters getting blocked?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02  4:56 ` Andrew Morton
  2008-10-02  6:27   ` Jens Axboe
@ 2008-10-02  6:57   ` Andi Kleen
  2008-10-02  7:55     ` Jens Axboe
  2008-10-02 13:05   ` Arjan van de Ven
  2 siblings, 1 reply; 76+ messages in thread
From: Andi Kleen @ 2008-10-02  6:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Arjan van de Ven, Jens Axboe, linux-kernel, Alan Cox

Andrew Morton <akpm@linux-foundation.org> writes:

> On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven <arjan@infradead.org> wrote:
>
>> Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
>
> You proposed this a while back and it didn't happen and I forget
> why and the changelog doesn't mention any of that?

XFS tried this some time ago too.

I think the issue was that real user supplied RT applications don't want to 
compete with a "pseudo RT" kjournald.

So it would really need a new priority class between RT and normal priority.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02  6:55     ` Andrew Morton
@ 2008-10-02  7:45       ` Jens Axboe
  2008-10-02  8:03         ` Andrew Morton
  2008-10-02 13:12       ` Arjan van de Ven
  1 sibling, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2008-10-02  7:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Arjan van de Ven, linux-kernel, Alan Cox

On Wed, Oct 01 2008, Andrew Morton wrote:
> On Thu, 2 Oct 2008 08:27:37 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > On Wed, Oct 01 2008, Andrew Morton wrote:
> > > On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven <arjan@infradead.org> wrote:
> > > 
> > > > Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
> > > 
> > > You proposed this a while back and it didn't happen and I forget
> > > why and the changelog doesn't mention any of that?
> > 
> > I think you called for benchmark results, which I don't think happened.
> > The patch definitely makes sense, so we should just make sure that we
> > don't regress elsewhere. Honestly, I'd be surprised if we did...
> 
> Now I think about it, didn't the earlier patch tweak CPU priority and
> not IO priority?  I forget.  <kicks the changelog again>
> 
> > How about I just toss it into the 2.6.28 testing mix, plenty of time for
> > testing and such?
> 
> Many performance regressions don't get noticed for six or twelve
> months, by which time everyone is suffering from them (see
> kernel/sched.c).
> 
> kjournald does huge amounts of not-terribly-important writeback.  One
> obvious risk is that by making all that bulk writeback high-priority,
> read-latency-sensitive applications might suffer latency spikes.
> 
> 
> 
> Now, kjournald is _supposed_ to be mostly asynchronous wrt foreground
> operations.  And once upon a time (seven years ago) it mostly was.  But
> there was some horrid race which I ended up fixing by introducing one
> point where synchronous userspace actions had to block behind kjournald
> activity.  I spent quite some time on it but couldn't come up with
> anything better.  It had fairly bad effects on some workloads.
> 
> I've forgotten where that code is now, but I don't think it was ever
> revisited.  It should be.
> 
> So.  Where are these atime updaters getting blocked?

Behind other IO activity I suppose, since it's marked async. A more
appropriate fix may be to mark atime updates as sync IO.

Once upon a time I actually did start xxxing meta data IO from the file
system, in ext3. That was mainly for blktrace so we could inspect what
data was what at the trace end, but if we had full coverage (or better,
at least), we could use that to bump priority as well. Probably the
priority should be left at the default, but the IO should be upgraded to
SYNC instead. That'll ensure that it goes out fairly quickly (like other
IO at the same class), but not get preferential treatment apart from
that. Seems like the right thing to do.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02  6:57   ` Andi Kleen
@ 2008-10-02  7:55     ` Jens Axboe
  2008-10-02  9:33       ` Dave Chinner
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2008-10-02  7:55 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, Arjan van de Ven, linux-kernel, Alan Cox

On Thu, Oct 02 2008, Andi Kleen wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
> 
> > On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven <arjan@infradead.org> wrote:
> >
> >> Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
> >
> > You proposed this a while back and it didn't happen and I forget
> > why and the changelog doesn't mention any of that?
> 
> XFS tried this some time ago too.
> 
> I think the issue was that real user supplied RT applications don't want to 
> compete with a "pseudo RT" kjournald.
> 
> So it would really need a new priority class between RT and normal priority.

Good point. I think we should mark the IO as sync, and maintain the same
priority level. Any IO that ends up being waited on is sync by
definition, we just need to expand the coverage a bit.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02  7:45       ` Jens Axboe
@ 2008-10-02  8:03         ` Andrew Morton
  2008-10-02  8:22           ` Jens Axboe
  2008-10-02 12:04           ` Theodore Tso
  0 siblings, 2 replies; 76+ messages in thread
From: Andrew Morton @ 2008-10-02  8:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Arjan van de Ven, linux-kernel, Alan Cox

On Thu, 2 Oct 2008 09:45:24 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:

> > So.  Where are these atime updaters getting blocked?
> 
> Behind other IO activity I suppose, since it's marked async. A more
> appropriate fix may be to mark atime updates as sync IO.

No, they might be getting blocked at a higher level.

An async atime update gets recorded into the current transaction. 
kjournald is working on the committing transaction.  We try to keep
those separated, to prevent user processes from getting blocked behind
kjournald activity.

But sometimes that doesn't work (including the place where I knowingly
broke it).  If we can find and fix the offending piece of jbd logic (a
big if) then all is peachy.

If the above theory turns out to be true then diddling IO priorities
is but a workaround.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02  8:03         ` Andrew Morton
@ 2008-10-02  8:22           ` Jens Axboe
  2008-10-02  8:43             ` Andrew Morton
  2008-10-02 12:04           ` Theodore Tso
  1 sibling, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2008-10-02  8:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Arjan van de Ven, linux-kernel, Alan Cox

On Thu, Oct 02 2008, Andrew Morton wrote:
> On Thu, 2 Oct 2008 09:45:24 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > > So.  Where are these atime updaters getting blocked?
> > 
> > Behind other IO activity I suppose, since it's marked async. A more
> > appropriate fix may be to mark atime updates as sync IO.
> 
> No, they might be getting blocked at a higher level.
> 
> An async atime update gets recorded into the current transaction. 
> kjournald is working on the committing transaction.  We try to keep
> those separated, to prevent user processes from getting blocked behind
> kjournald activity.
> 
> But sometimes that doesn't work (including the place where I knowingly
> broke it).  If we can find and fix the offending piece of jbd logic (a
> big if) then all is peachy.
> 
> If the above theory turns out to be true then diddling IO priorities
> is but a workaround.

If diddling with io priorities makes a big difference (and Arjan said it
does), it's clearly stuck inside the io scheduler waiting to be
dispatched. If it was marked sync, it would be going through much
faster. As async, it'll get mixed in with other async writeout and that
doesn't get a lot of attention in case of sync activity.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02  8:22           ` Jens Axboe
@ 2008-10-02  8:43             ` Andrew Morton
  2008-10-02  8:46               ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Morton @ 2008-10-02  8:43 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Arjan van de Ven, linux-kernel, Alan Cox

On Thu, 2 Oct 2008 10:22:13 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:

> On Thu, Oct 02 2008, Andrew Morton wrote:
> > On Thu, 2 Oct 2008 09:45:24 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
> > 
> > > > So.  Where are these atime updaters getting blocked?
> > > 
> > > Behind other IO activity I suppose, since it's marked async. A more
> > > appropriate fix may be to mark atime updates as sync IO.
> > 
> > No, they might be getting blocked at a higher level.
> > 
> > An async atime update gets recorded into the current transaction. 
> > kjournald is working on the committing transaction.  We try to keep
> > those separated, to prevent user processes from getting blocked behind
> > kjournald activity.
> > 
> > But sometimes that doesn't work (including the place where I knowingly
> > broke it).  If we can find and fix the offending piece of jbd logic (a
> > big if) then all is peachy.
> > 
> > If the above theory turns out to be true then diddling IO priorities
> > is but a workaround.
> 
> If diddling with io priorities makes a big difference (and Arjan said it
> does), it's clearly stuck inside the io scheduler waiting to be
> dispatched. If it was marked sync, it would be going through much
> faster. As async, it'll get mixed in with other async writeout and that
> doesn't get a lot of attention in case of sync activity.
> 

Maybe.  And maybe marking all journal writeout and all commit activity
as sync would have the same effect with less downside.

But it'd be better to not wait on this IO at all, if poss..

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02  8:43             ` Andrew Morton
@ 2008-10-02  8:46               ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2008-10-02  8:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Arjan van de Ven, linux-kernel, Alan Cox

On Thu, Oct 02 2008, Andrew Morton wrote:
> On Thu, 2 Oct 2008 10:22:13 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > On Thu, Oct 02 2008, Andrew Morton wrote:
> > > On Thu, 2 Oct 2008 09:45:24 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
> > > 
> > > > > So.  Where are these atime updaters getting blocked?
> > > > 
> > > > Behind other IO activity I suppose, since it's marked async. A more
> > > > appropriate fix may be to mark atime updates as sync IO.
> > > 
> > > No, they might be getting blocked at a higher level.
> > > 
> > > An async atime update gets recorded into the current transaction. 
> > > kjournald is working on the committing transaction.  We try to keep
> > > those separated, to prevent user processes from getting blocked behind
> > > kjournald activity.
> > > 
> > > But sometimes that doesn't work (including the place where I knowingly
> > > broke it).  If we can find and fix the offending piece of jbd logic (a
> > > big if) then all is peachy.
> > > 
> > > If the above theory turns out to be true then diddling IO priorities
> > > is but a workaround.
> > 
> > If diddling with io priorities makes a big difference (and Arjan said it
> > does), it's clearly stuck inside the io scheduler waiting to be
> > dispatched. If it was marked sync, it would be going through much
> > faster. As async, it'll get mixed in with other async writeout and that
> > doesn't get a lot of attention in case of sync activity.
> > 
> 
> Maybe.  And maybe marking all journal writeout and all commit activity
> as sync would have the same effect with less downside.
> 
> But it'd be better to not wait on this IO at all, if poss..

Agree, would be much better...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02  7:55     ` Jens Axboe
@ 2008-10-02  9:33       ` Dave Chinner
  2008-10-02  9:45         ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Dave Chinner @ 2008-10-02  9:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andi Kleen, Andrew Morton, Arjan van de Ven, linux-kernel, Alan Cox

On Thu, Oct 02, 2008 at 09:55:11AM +0200, Jens Axboe wrote:
> On Thu, Oct 02 2008, Andi Kleen wrote:
> > Andrew Morton <akpm@linux-foundation.org> writes:
> > 
> > > On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven <arjan@infradead.org> wrote:
> > >
> > >> Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
> > >
> > > You proposed this a while back and it didn't happen and I forget
> > > why and the changelog doesn't mention any of that?
> > 
> > XFS tried this some time ago too.
> > 
> > I think the issue was that real user supplied RT applications don't want to 
> > compete with a "pseudo RT" kjournald.
> > 
> > So it would really need a new priority class between RT and normal priority.
> 
> Good point. I think we should mark the IO as sync, and maintain the same
> priority level. Any IO that ends up being waited on is sync by
> definition, we just need to expand the coverage a bit.

That's what XFS has always done - mark the journal I/O as sync.
Still, once you load up the elevator, the sync I/O can still get
delayed for hundreds of milliseconds before dispatch, which was
why I started looking at boosting the priority of the log I/O.
It proved to be much more effective at getting the log I/O
dispatched than the existing "mark it sync" technique....

The RT folk were happy with the idea of journal I/O using the
highest non-RT priority for the journal, but I never got around
to testing that out as I had a bunnch of other stuff to fix at
the time.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02  9:33       ` Dave Chinner
@ 2008-10-02  9:45         ` Jens Axboe
  2008-10-02 13:14           ` Arjan van de Ven
  2008-10-02 19:04           ` Arjan van de Ven
  0 siblings, 2 replies; 76+ messages in thread
From: Jens Axboe @ 2008-10-02  9:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andi Kleen, Andrew Morton, Arjan van de Ven, linux-kernel, Alan Cox

On Thu, Oct 02 2008, Dave Chinner wrote:
> On Thu, Oct 02, 2008 at 09:55:11AM +0200, Jens Axboe wrote:
> > On Thu, Oct 02 2008, Andi Kleen wrote:
> > > Andrew Morton <akpm@linux-foundation.org> writes:
> > > 
> > > > On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven <arjan@infradead.org> wrote:
> > > >
> > > >> Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
> > > >
> > > > You proposed this a while back and it didn't happen and I forget
> > > > why and the changelog doesn't mention any of that?
> > > 
> > > XFS tried this some time ago too.
> > > 
> > > I think the issue was that real user supplied RT applications don't want to 
> > > compete with a "pseudo RT" kjournald.
> > > 
> > > So it would really need a new priority class between RT and normal priority.
> > 
> > Good point. I think we should mark the IO as sync, and maintain the same
> > priority level. Any IO that ends up being waited on is sync by
> > definition, we just need to expand the coverage a bit.
> 
> That's what XFS has always done - mark the journal I/O as sync.
> Still, once you load up the elevator, the sync I/O can still get
> delayed for hundreds of milliseconds before dispatch, which was
> why I started looking at boosting the priority of the log I/O.
> It proved to be much more effective at getting the log I/O
> dispatched than the existing "mark it sync" technique....

Sure, just marking it as sync is not a magic bullet. It'll be in the
first priority for that class, but it'll share bandwidth with other
processes. So if you have lots of IO going on, it can take hundreds of
miliseconds before being dispatched.

RT will always be faster, but mainly by virtue of there being no RT IO
in the first place. And of course preferential treatment is given to
this higher priority scheduling class.

> The RT folk were happy with the idea of journal I/O using the
> highest non-RT priority for the journal, but I never got around
> to testing that out as I had a bunnch of other stuff to fix at
> the time.

That's a good idea, just bump the priority a little bit. Arjan, did you
test that out? I'd suggest just trying prio level 0 and still using
best-effort scheduling. Probably still need the sync marking, would be
interesting to experiment with though.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02  8:03         ` Andrew Morton
  2008-10-02  8:22           ` Jens Axboe
@ 2008-10-02 12:04           ` Theodore Tso
  2008-10-02 13:16             ` Arjan van de Ven
  1 sibling, 1 reply; 76+ messages in thread
From: Theodore Tso @ 2008-10-02 12:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, Arjan van de Ven, linux-kernel, Alan Cox

On Thu, Oct 02, 2008 at 01:03:15AM -0700, Andrew Morton wrote:
> 
> An async atime update gets recorded into the current transaction. 
> kjournald is working on the committing transaction.  We try to keep
> those separated, to prevent user processes from getting blocked behind
> kjournald activity.
> 

This is true unless the journal gets too full, and we need to do a
checkpoint operation --- at which point, everything stops.  If this
was metadata-intensive a benchmark, and the journal wasn't large
enough, this could be the problem.  (And if you make the journal
bigger, then when you *do* finally get forced to do a checkpoint
operation, things get stalled for even longer.)

Arjan, is this *really* about atime updates?  I thought most poeple
these days run with noatime or relatime.  If people *really* want true
atime semantics, the best way to solve this problem would be to have
two dirty flags in the inode --- an "atime dirty" and a "dirty" flag.
The atime dirty bit would not actually cause the inode to get written
to disk, unless either (a) we are unmounting the filesystem, or (b) we
are trying to shrink the inode cache due to memory pressure.  If when
we write the inode out to disk, only the atime dirty bit is set, we
can also skip journalling the inode table block.  So if there are
people who really care about true atime semantics, without getting
killed by the I/O writes, there are some solutions we can pursue.

But if this is really about the "entangled fsync problem", where we
have a large number of processes writing a large amount of async data,
and then we have a single process writing a small amount of data and
then calling fsync(), then that's a different (and very long-standing)
problem in ext3/4.  Raising the I/O priority is probably the only
thing we can do in this circumstance.  We could try to do some kind of
complex priority inheritance scheme, but it would certainly be much
simpler to raise the I/O priority.  We could choose a level just below
realtime priority, but the reality is that if a real-time priority is
trying to write to the filesystem, and we are doing a checkpointing
opration, we're going to be blocking the real-time process anyway, and
it will be a priority inversion.  So perhaps the simplest and best
algorithm would be to use a priority level just below real-time when
doing a normal commit, but if we start to do a checkpoint, we go to
IOPRIO_CLASS_RT.

> But sometimes that doesn't work (including the place where I knowingly
> broke it).  If we can find and fix the offending piece of jbd logic (a
> big if) then all is peachy.

Do we have workloads that can easily demonstrate this problem?  If so,
we can add some tracing code which will allow us to see which theory
is correct, and what is actually happening.

						- Ted

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02  4:56 ` Andrew Morton
  2008-10-02  6:27   ` Jens Axboe
  2008-10-02  6:57   ` Andi Kleen
@ 2008-10-02 13:05   ` Arjan van de Ven
  2008-10-02 17:11     ` Jens Axboe
  2 siblings, 1 reply; 76+ messages in thread
From: Arjan van de Ven @ 2008-10-02 13:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, linux-kernel, Alan Cox

On Wed, 1 Oct 2008 21:56:38 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven
> <arjan@infradead.org> wrote:
> 
> > Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
> 
> You proposed this a while back

one year ago to be precise

>and it didn't happen and I forget
> why and the changelog doesn't mention any of that?

Jens didn't like the APi I used back then.
Since then the kernel grew (by Jens) a clean API to do the same...


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02  6:55     ` Andrew Morton
  2008-10-02  7:45       ` Jens Axboe
@ 2008-10-02 13:12       ` Arjan van de Ven
  2008-10-02 20:24         ` Andrew Morton
  1 sibling, 1 reply; 76+ messages in thread
From: Arjan van de Ven @ 2008-10-02 13:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, linux-kernel, Alan Cox

On Wed, 1 Oct 2008 23:55:01 -0700
> 
> I've forgotten where that code is now, but I don't think it was ever
> revisited.  It should be.
> 
> So.  Where are these atime updaters getting blocked?

my reproducer is sadly very simple (claws-mail is my mail client that uses maildir)

Process claws-mail (4896)                  Total: 2829.7 msec 
EXT3: Waiting for journal access                  2491.0 msec         88.4 % 
Writing back inodes				   160.9 msec          5.7 % 
synchronous write                                   78.8 msec          3.0 %

is an example of such a trace (this is with patch, without patch the numbers are about 3x bigger)

Waiting for journal access is "journal_get_write_access"
Writing back inodes is "writeback_inodes"
synchronous write is "do_sync_write"


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02  9:45         ` Jens Axboe
@ 2008-10-02 13:14           ` Arjan van de Ven
  2008-10-02 13:27             ` Jens Axboe
  2008-10-02 19:04           ` Arjan van de Ven
  1 sibling, 1 reply; 76+ messages in thread
From: Arjan van de Ven @ 2008-10-02 13:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dave Chinner, Andi Kleen, Andrew Morton, linux-kernel, Alan Cox

On Thu, 2 Oct 2008 11:45:37 +0200
Jens Axboe <jens.axboe@oracle.com> wrote:

> 
> That's a good idea, just bump the priority a little bit. Arjan, did
> you test that out? I'd suggest just trying prio level 0 and still
> using best-effort scheduling. Probably still need the sync marking,
> would be interesting to experiment with though.

I looked at 0 but it appears the 0 is the default for everyone...
if everyone just defaulted to > 0 then yes I would have picked 0.


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02 12:04           ` Theodore Tso
@ 2008-10-02 13:16             ` Arjan van de Ven
  2008-10-02 13:46               ` Theodore Tso
  0 siblings, 1 reply; 76+ messages in thread
From: Arjan van de Ven @ 2008-10-02 13:16 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Andrew Morton, Jens Axboe, linux-kernel, Alan Cox

On Thu, 2 Oct 2008 08:04:44 -0400
Theodore Tso <tytso@mit.edu> wrote:

> 
> > But sometimes that doesn't work (including the place where I
> > knowingly broke it).  If we can find and fix the offending piece of
> > jbd logic (a big if) then all is peachy.
> 
> Do we have workloads that can easily demonstrate this problem?  If so,
> we can add some tracing code which will allow us to see which theory
> is correct, and what is actually happening.

I can very easily reproduce it; my mail client (claws) stalls due to
this several seconds at least once every 5 to 10 minutes...
(usually when I'm typing an email.. grumble)



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02 13:14           ` Arjan van de Ven
@ 2008-10-02 13:27             ` Jens Axboe
  2008-10-02 13:36               ` Arjan van de Ven
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2008-10-02 13:27 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Dave Chinner, Andi Kleen, Andrew Morton, linux-kernel, Alan Cox

On Thu, Oct 02 2008, Arjan van de Ven wrote:
> On Thu, 2 Oct 2008 11:45:37 +0200
> Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > 
> > That's a good idea, just bump the priority a little bit. Arjan, did
> > you test that out? I'd suggest just trying prio level 0 and still
> > using best-effort scheduling. Probably still need the sync marking,
> > would be interesting to experiment with though.
> 
> I looked at 0 but it appears the 0 is the default for everyone...
> if everyone just defaulted to > 0 then yes I would have picked 0.

That's not correct, class BE and value 4 is the default (and the code
defaults to that, if you haven't set a value yourself):

#define IOPRIO_NORM     (4)
static inline int task_ioprio(struct io_context *ioc)
{
        if (ioprio_valid(ioc->ioprio))
                return IOPRIO_PRIO_DATA(ioc->ioprio);

        return IOPRIO_NORM;
}

static inline int task_ioprio_class(struct io_context *ioc)
{
        if (ioprio_valid(ioc->ioprio))
                return IOPRIO_PRIO_CLASS(ioc->ioprio);

        return IOPRIO_CLASS_BE;
}

So if you use IOPRIO_CLASS_BE and 0 for the ioprio, you will have the
highest priority of the default scheduling class.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02 13:27             ` Jens Axboe
@ 2008-10-02 13:36               ` Arjan van de Ven
  2008-10-02 13:47                 ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Arjan van de Ven @ 2008-10-02 13:36 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dave Chinner, Andi Kleen, Andrew Morton, linux-kernel, Alan Cox

On Thu, 2 Oct 2008 15:27:47 +0200
Jens Axboe <jens.axboe@oracle.com> wrote:

> On Thu, Oct 02 2008, Arjan van de Ven wrote:
> > On Thu, 2 Oct 2008 11:45:37 +0200
> > Jens Axboe <jens.axboe@oracle.com> wrote:
> > 
> > > 
> > > That's a good idea, just bump the priority a little bit. Arjan,
> > > did you test that out? I'd suggest just trying prio level 0 and
> > > still using best-effort scheduling. Probably still need the sync
> > > marking, would be interesting to experiment with though.
> > 
> > I looked at 0 but it appears the 0 is the default for everyone...
> > if everyone just defaulted to > 0 then yes I would have picked 0.
> 
> That's not correct, class BE and value 4 is the default (and the code
> defaults to that, if you haven't set a value yourself):
> 
> #define IOPRIO_NORM     (4)
> static inline int task_ioprio(struct io_context *ioc)
> {
>         if (ioprio_valid(ioc->ioprio))
>                 return IOPRIO_PRIO_DATA(ioc->ioprio);
> 
>         return IOPRIO_NORM;
> }
> 
> static inline int task_ioprio_class(struct io_context *ioc)
> {
>         if (ioprio_valid(ioc->ioprio))
>                 return IOPRIO_PRIO_CLASS(ioc->ioprio);
> 
>         return IOPRIO_CLASS_BE;
> }
> 
> So if you use IOPRIO_CLASS_BE and 0 for the ioprio, you will have the
> highest priority of the default scheduling class.

ok

I checked not by looking at the code, but running ionice -p <pid> on a
bunch of things and they came back as 0 

> 


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02 13:16             ` Arjan van de Ven
@ 2008-10-02 13:46               ` Theodore Tso
  2008-10-02 14:33                 ` Arjan van de Ven
  0 siblings, 1 reply; 76+ messages in thread
From: Theodore Tso @ 2008-10-02 13:46 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Andrew Morton, Jens Axboe, linux-kernel, Alan Cox

On Thu, Oct 02, 2008 at 06:16:29AM -0700, Arjan van de Ven wrote:
> On Thu, 2 Oct 2008 08:04:44 -0400
> Theodore Tso <tytso@mit.edu> wrote:
> 
> > 
> > > But sometimes that doesn't work (including the place where I
> > > knowingly broke it).  If we can find and fix the offending piece of
> > > jbd logic (a big if) then all is peachy.
> > 
> > Do we have workloads that can easily demonstrate this problem?  If so,
> > we can add some tracing code which will allow us to see which theory
> > is correct, and what is actually happening.
> 
> I can very easily reproduce it; my mail client (claws) stalls due to
> this several seconds at least once every 5 to 10 minutes...
> (usually when I'm typing an email.. grumble)

Are you running with noatime or relatime?

One quick thing to try that might show what is going on is to run
debugfs on the device where your mail directory is located, while
claws is running, and then use the debugfs command "logdump -a" to
dump out the contents of the journal.  You can then use ncheck and
icheck to take the FS block numbers from logdump and translate them to
inode numbers, and then from inode numbers to pathnames.  That might
give us some insight as to what is going on.

I can whip up a patch which adds some markers which we could use to
find out more information about what is happening.

     	      		  	     	- Ted

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02 13:36               ` Arjan van de Ven
@ 2008-10-02 13:47                 ` Jens Axboe
  2008-10-02 14:26                   ` Arjan van de Ven
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2008-10-02 13:47 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Dave Chinner, Andi Kleen, Andrew Morton, linux-kernel, Alan Cox

On Thu, Oct 02 2008, Arjan van de Ven wrote:
> On Thu, 2 Oct 2008 15:27:47 +0200
> Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > On Thu, Oct 02 2008, Arjan van de Ven wrote:
> > > On Thu, 2 Oct 2008 11:45:37 +0200
> > > Jens Axboe <jens.axboe@oracle.com> wrote:
> > > 
> > > > 
> > > > That's a good idea, just bump the priority a little bit. Arjan,
> > > > did you test that out? I'd suggest just trying prio level 0 and
> > > > still using best-effort scheduling. Probably still need the sync
> > > > marking, would be interesting to experiment with though.
> > > 
> > > I looked at 0 but it appears the 0 is the default for everyone...
> > > if everyone just defaulted to > 0 then yes I would have picked 0.
> > 
> > That's not correct, class BE and value 4 is the default (and the code
> > defaults to that, if you haven't set a value yourself):
> > 
> > #define IOPRIO_NORM     (4)
> > static inline int task_ioprio(struct io_context *ioc)
> > {
> >         if (ioprio_valid(ioc->ioprio))
> >                 return IOPRIO_PRIO_DATA(ioc->ioprio);
> > 
> >         return IOPRIO_NORM;
> > }
> > 
> > static inline int task_ioprio_class(struct io_context *ioc)
> > {
> >         if (ioprio_valid(ioc->ioprio))
> >                 return IOPRIO_PRIO_CLASS(ioc->ioprio);
> > 
> >         return IOPRIO_CLASS_BE;
> > }
> > 
> > So if you use IOPRIO_CLASS_BE and 0 for the ioprio, you will have the
> > highest priority of the default scheduling class.
> 
> ok
> 
> I checked not by looking at the code, but running ionice -p <pid> on a
> bunch of things and they came back as 0 

Yes, it'll report '0' which means 'not set'. The kernel inteprets 'not
set' as the default values, BE/4. There's a big diffence, since '0'
means that we track CPU nice values where as if it returned be/4 then
that is a strict/fixed setting.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02 13:47                 ` Jens Axboe
@ 2008-10-02 14:26                   ` Arjan van de Ven
  2008-10-02 16:42                     ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Arjan van de Ven @ 2008-10-02 14:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dave Chinner, Andi Kleen, Andrew Morton, linux-kernel, Alan Cox

On Thu, 2 Oct 2008 15:47:37 +0200
Jens Axboe <jens.axboe@oracle.com> wrote:

> Yes, it'll report '0' which means 'not set'. The kernel inteprets 'not
> set' as the default values, BE/4. There's a big diffence, since '0'
> means that we track CPU nice values where as if it returned be/4 then
> that is a strict/fixed setting.

argh. "0" means both "not set" and "highest priority".

how about we fix this?

right now "query + set" isn't idempotent.....
it should be able to be that.


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02 13:46               ` Theodore Tso
@ 2008-10-02 14:33                 ` Arjan van de Ven
  2008-10-04 14:12                   ` Theodore Tso
  0 siblings, 1 reply; 76+ messages in thread
From: Arjan van de Ven @ 2008-10-02 14:33 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Andrew Morton, Jens Axboe, linux-kernel, Alan Cox

On Thu, 2 Oct 2008 09:46:59 -0400
Theodore Tso <tytso@mit.edu> wrote:

> On Thu, Oct 02, 2008 at 06:16:29AM -0700, Arjan van de Ven wrote:
> > On Thu, 2 Oct 2008 08:04:44 -0400
> > Theodore Tso <tytso@mit.edu> wrote:
> > 
> > > 
> > > > But sometimes that doesn't work (including the place where I
> > > > knowingly broke it).  If we can find and fix the offending
> > > > piece of jbd logic (a big if) then all is peachy.
> > > 
> > > Do we have workloads that can easily demonstrate this problem?
> > > If so, we can add some tracing code which will allow us to see
> > > which theory is correct, and what is actually happening.
> > 
> > I can very easily reproduce it; my mail client (claws) stalls due to
> > this several seconds at least once every 5 to 10 minutes...
> > (usually when I'm typing an email.. grumble)
> 
> Are you running with noatime or relatime?

yes; I have

mount -o remount,relatime / &
for i in `pidof kjournald` ; do ionice -c1 -p $i ; done

in my /etc/rc.local 


> One quick thing to try that might show what is going on is to run
> debugfs on the device where your mail directory is located, while
> claws is running, and then use the debugfs command "logdump -a" to
> dump out the contents of the journal.  You can then use ncheck and
> icheck to take the FS block numbers from logdump and translate them to
> inode numbers, and then from inode numbers to pathnames.  That might
> give us some insight as to what is going on.

I'll try (sadly I'm in a training session all day today so I might not
get to it quickly)
 
> I can whip up a patch which adds some markers which we could use to
> find out more information about what is happening.

interesting testcase of the markers concept.


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02 14:26                   ` Arjan van de Ven
@ 2008-10-02 16:42                     ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2008-10-02 16:42 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Dave Chinner, Andi Kleen, Andrew Morton, linux-kernel, Alan Cox

On Thu, Oct 02 2008, Arjan van de Ven wrote:
> On Thu, 2 Oct 2008 15:47:37 +0200
> Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > Yes, it'll report '0' which means 'not set'. The kernel inteprets 'not
> > set' as the default values, BE/4. There's a big diffence, since '0'
> > means that we track CPU nice values where as if it returned be/4 then
> > that is a strict/fixed setting.
> 
> argh. "0" means both "not set" and "highest priority".

Arjan, please read the interface, it does not...

0 means not set, period. What matters is the class setting, if that is
non-zero then it is a valid setting. See

#define IOPRIO_PRIO_VALUE(class, data)  (((class) << IOPRIO_CLASS_SHIFT) | data)

So highest priority BE is IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 0), which
is of course valid.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02 13:05   ` Arjan van de Ven
@ 2008-10-02 17:11     ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2008-10-02 17:11 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Andrew Morton, linux-kernel, Alan Cox

On Thu, Oct 02 2008, Arjan van de Ven wrote:
> On Wed, 1 Oct 2008 21:56:38 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven
> > <arjan@infradead.org> wrote:
> > 
> > > Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
> > 
> > You proposed this a while back
> 
> one year ago to be precise
> 
> >and it didn't happen and I forget
> > why and the changelog doesn't mention any of that?
> 
> Jens didn't like the APi I used back then.
> Since then the kernel grew (by Jens) a clean API to do the same...

Actually I don't think I changed anything, did I? IIRC, I just pointed
you at the API :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02  9:45         ` Jens Axboe
  2008-10-02 13:14           ` Arjan van de Ven
@ 2008-10-02 19:04           ` Arjan van de Ven
  2008-10-02 19:22             ` Jens Axboe
  1 sibling, 1 reply; 76+ messages in thread
From: Arjan van de Ven @ 2008-10-02 19:04 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dave Chinner, Andi Kleen, Andrew Morton, linux-kernel, Alan Cox

On Thu, 2 Oct 2008 11:45:37 +0200
Jens Axboe <jens.axboe@oracle.com> wrote:

> > The RT folk were happy with the idea of journal I/O using the
> > highest non-RT priority for the journal, but I never got around
> > to testing that out as I had a bunnch of other stuff to fix at
> > the time.
> 
> That's a good idea, just bump the priority a little bit. Arjan, did
> you test that out? I'd suggest just trying prio level 0 and still
> using best-effort scheduling. Probably still need the sync marking,
> would be interesting to experiment with though.
> 

ok 0 works ok enough in quick testing as well...... updated patch below

>From df64cc4e2ab0c102bbac609dd948958a6f804fd3 Mon Sep 17 00:00:00 2001
From: Arjan van de Ven <arjan@linux.intel.com>
Date: Wed, 1 Oct 2008 19:58:18 -0700
Subject: [PATCH] Give kjournald a higher io priority

With latencytop, I noticed that the (in memory) file updates during my
workload (reading mail) had latencies of 6 seconds or longer; this is
obviously not so nice behavior. Other EXT3 journal related operations had
similar or even longer latencies.

Digging into this a bit more, it appears to be an interaction between EXT3
and CFQ in that CFQ tries to be fair to everyone, including kjournald.
However, in reality, kjournald is "special" in that it does a lot of journal
work and effectively this leads to a twisted kind of "mass priority
inversion" type of behavior.

The good news is that CFQ already has the infrastructure to make certain
processes special... JBD just wasn't using that quite yet.

The patch below makes kjournald of a slighlty higher priority than normal
applications, reducing these latencies significantly.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 fs/ioprio.c            |    3 ++-
 fs/jbd/journal.c       |   12 ++++++++++++
 include/linux/ioprio.h |    2 ++
 3 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/fs/ioprio.c b/fs/ioprio.c
index da3cc46..3bd95dc 100644
--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -27,7 +27,7 @@
 #include <linux/security.h>
 #include <linux/pid_namespace.h>
 
-static int set_task_ioprio(struct task_struct *task, int ioprio)
+int set_task_ioprio(struct task_struct *task, int ioprio)
 {
 	int err;
 	struct io_context *ioc;
@@ -64,6 +64,7 @@ static int set_task_ioprio(struct task_struct *task, int ioprio)
 	task_unlock(task);
 	return err;
 }
+EXPORT_SYMBOL_GPL(set_task_ioprio);
 
 asmlinkage long sys_ioprio_set(int which, int who, int ioprio)
 {
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index aa7143a..a859a46 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -36,6 +36,7 @@
 #include <linux/poison.h>
 #include <linux/proc_fs.h>
 #include <linux/debugfs.h>
+#include <linux/ioprio.h>
 
 #include <asm/uaccess.h>
 #include <asm/page.h>
@@ -131,6 +132,17 @@ static int kjournald(void *arg)
 			journal->j_commit_interval / HZ);
 
 	/*
+	 * kjournald is the process on which most other processes depend on
+	 * for doing the filesystem portion of their IO. As such, there exists
+	 * the equivalent of a priority inversion situation, where kjournald
+	 * would get less priority as it should.
+	 *
+	 * For this reason we set to "medium real time priority", which is higher
+	 * than regular tasks, but not infinitely powerful.
+	 */
+	set_task_ioprio(current, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 0));
+
+	/*
 	 * And now, wait forever for commit wakeup events.
 	 */
 	spin_lock(&journal->j_state_lock);
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index f98a656..76dad48 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -86,4 +86,6 @@ static inline int task_nice_ioclass(struct task_struct *task)
  */
 extern int ioprio_best(unsigned short aprio, unsigned short bprio);
 
+extern int set_task_ioprio(struct task_struct *task, int ioprio);
+
 #endif
-- 
1.5.5.1



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02 19:04           ` Arjan van de Ven
@ 2008-10-02 19:22             ` Jens Axboe
  2008-10-02 21:37               ` Andrew Morton
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2008-10-02 19:22 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Dave Chinner, Andi Kleen, Andrew Morton, linux-kernel, Alan Cox

On Thu, Oct 02 2008, Arjan van de Ven wrote:
> On Thu, 2 Oct 2008 11:45:37 +0200
> Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > > The RT folk were happy with the idea of journal I/O using the
> > > highest non-RT priority for the journal, but I never got around
> > > to testing that out as I had a bunnch of other stuff to fix at
> > > the time.
> > 
> > That's a good idea, just bump the priority a little bit. Arjan, did
> > you test that out? I'd suggest just trying prio level 0 and still
> > using best-effort scheduling. Probably still need the sync marking,
> > would be interesting to experiment with though.
> > 
> 
> ok 0 works ok enough in quick testing as well...... updated patch below
> 
> From df64cc4e2ab0c102bbac609dd948958a6f804fd3 Mon Sep 17 00:00:00 2001
> From: Arjan van de Ven <arjan@linux.intel.com>
> Date: Wed, 1 Oct 2008 19:58:18 -0700
> Subject: [PATCH] Give kjournald a higher io priority
> 
> With latencytop, I noticed that the (in memory) file updates during my
> workload (reading mail) had latencies of 6 seconds or longer; this is
> obviously not so nice behavior. Other EXT3 journal related operations had
> similar or even longer latencies.
> 
> Digging into this a bit more, it appears to be an interaction between EXT3
> and CFQ in that CFQ tries to be fair to everyone, including kjournald.
> However, in reality, kjournald is "special" in that it does a lot of journal
> work and effectively this leads to a twisted kind of "mass priority
> inversion" type of behavior.
> 
> The good news is that CFQ already has the infrastructure to make certain
> processes special... JBD just wasn't using that quite yet.
> 
> The patch below makes kjournald of a slighlty higher priority than normal
> applications, reducing these latencies significantly.
> 
> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> ---
>  fs/ioprio.c            |    3 ++-
>  fs/jbd/journal.c       |   12 ++++++++++++
>  include/linux/ioprio.h |    2 ++
>  3 files changed, 16 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/ioprio.c b/fs/ioprio.c
> index da3cc46..3bd95dc 100644
> --- a/fs/ioprio.c
> +++ b/fs/ioprio.c
> @@ -27,7 +27,7 @@
>  #include <linux/security.h>
>  #include <linux/pid_namespace.h>
>  
> -static int set_task_ioprio(struct task_struct *task, int ioprio)
> +int set_task_ioprio(struct task_struct *task, int ioprio)
>  {
>  	int err;
>  	struct io_context *ioc;
> @@ -64,6 +64,7 @@ static int set_task_ioprio(struct task_struct *task, int ioprio)
>  	task_unlock(task);
>  	return err;
>  }
> +EXPORT_SYMBOL_GPL(set_task_ioprio);
>  
>  asmlinkage long sys_ioprio_set(int which, int who, int ioprio)
>  {
> diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
> index aa7143a..a859a46 100644
> --- a/fs/jbd/journal.c
> +++ b/fs/jbd/journal.c
> @@ -36,6 +36,7 @@
>  #include <linux/poison.h>
>  #include <linux/proc_fs.h>
>  #include <linux/debugfs.h>
> +#include <linux/ioprio.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/page.h>
> @@ -131,6 +132,17 @@ static int kjournald(void *arg)
>  			journal->j_commit_interval / HZ);
>  
>  	/*
> +	 * kjournald is the process on which most other processes depend on
> +	 * for doing the filesystem portion of their IO. As such, there exists
> +	 * the equivalent of a priority inversion situation, where kjournald
> +	 * would get less priority as it should.
> +	 *
> +	 * For this reason we set to "medium real time priority", which is higher
> +	 * than regular tasks, but not infinitely powerful.
> +	 */
> +	set_task_ioprio(current, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 0));
> +
> +	/*
>  	 * And now, wait forever for commit wakeup events.
>  	 */
>  	spin_lock(&journal->j_state_lock);
> diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
> index f98a656..76dad48 100644
> --- a/include/linux/ioprio.h
> +++ b/include/linux/ioprio.h
> @@ -86,4 +86,6 @@ static inline int task_nice_ioclass(struct task_struct *task)
>   */
>  extern int ioprio_best(unsigned short aprio, unsigned short bprio);
>  
> +extern int set_task_ioprio(struct task_struct *task, int ioprio);
> +
>  #endif
> -- 
> 1.5.5.1

Can we agree on this patch?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02 13:12       ` Arjan van de Ven
@ 2008-10-02 20:24         ` Andrew Morton
  2008-10-03  4:01           ` Arjan van de Ven
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Morton @ 2008-10-02 20:24 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Jens Axboe, linux-kernel, Alan Cox

On Thu, 2 Oct 2008 06:12:36 -0700 Arjan van de Ven <arjan@infradead.org> wrote:

> On Wed, 1 Oct 2008 23:55:01 -0700
> > 
> > I've forgotten where that code is now, but I don't think it was ever
> > revisited.  It should be.
> > 
> > So.  Where are these atime updaters getting blocked?
> 
> my reproducer is sadly very simple (claws-mail is my mail client that uses maildir)
> 
> Process claws-mail (4896)                  Total: 2829.7 msec 
> EXT3: Waiting for journal access                  2491.0 msec         88.4 % 
> Writing back inodes				   160.9 msec          5.7 % 
> synchronous write                                   78.8 msec          3.0 %
> 
> is an example of such a trace (this is with patch, without patch the numbers are about 3x bigger)
> 
> Waiting for journal access is "journal_get_write_access"
> Writing back inodes is "writeback_inodes"
> synchronous write is "do_sync_write"
> 

Right.  Probably the lock_buffer() in do_get_write_access().  kjournald
is checkpointing the committing transaction (writing metadata buffers
back into the fs) and a user process operating on the current
transaction is trying to get access to one of those buffers but has to
wait for the writeout to complete first.

It wasn't always thus.  Back in, umm, 2.5.0 we did

	/*
	 * The buffer_locked() || buffer_dirty() tests here are simply an
	 * optimisation tweak.  If anyone else in the system decides to
	 * lock this buffer later on, we'll blow up.  There doesn't seem
	 * to be a good reason why they should do this.
	 */
	if (jh->b_cp_transaction &&
	    (buffer_locked(jh2bh(jh)) || buffer_dirty(jh2bh(jh)))) {
		unlock_journal(journal);
		lock_buffer(jh2bh(jh));

and I _think_ it was the loss of that which hurt us a lot. 
773fc4c63442fbd8237b4805627f6906143204a8 or thereabouts in the old git
tree.

It would be very good if we could again decouple the committing and
current transactions, but I fear that none of us remember sufficiently
well how it all works (or, more importantly, how it all doesn't work
when you make a change).

Of course, that could all be wrong and we could be stuck somewhere
else.  A good way to diagnose this stuff would be

--- a/kernel/sched.c~a
+++ a/kernel/sched.c
@@ -5567,10 +5567,14 @@ EXPORT_SYMBOL(yield);
 void __sched io_schedule(void)
 {
 	struct rq *rq = &__raw_get_cpu_var(runqueues);
+	unsigned long in, out;
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
+	in = jiffies;
 	schedule();
+	out = jiffies;
+	WARN_ON(time_after(out, in + 1 * HZ));
 	atomic_dec(&rq->nr_iowait);
 	delayacct_blkio_end();
 }
_

perhaps for varying values of "1".

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02 19:22             ` Jens Axboe
@ 2008-10-02 21:37               ` Andrew Morton
  2008-10-02 23:58                 ` Dave Chinner
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Morton @ 2008-10-02 21:37 UTC (permalink / raw)
  To: Jens Axboe; +Cc: arjan, david, andi, linux-kernel, alan

On Thu, 2 Oct 2008 21:22:23 +0200
Jens Axboe <jens.axboe@oracle.com> wrote:

> > @@ -131,6 +132,17 @@ static int kjournald(void *arg)
> >  			journal->j_commit_interval / HZ);
> >  
> >  	/*
> > +	 * kjournald is the process on which most other processes depend on
> > +	 * for doing the filesystem portion of their IO. As such, there exists
> > +	 * the equivalent of a priority inversion situation, where kjournald
> > +	 * would get less priority as it should.
> > +	 *
> > +	 * For this reason we set to "medium real time priority", which is higher
> > +	 * than regular tasks, but not infinitely powerful.
> > +	 */
> > +	set_task_ioprio(current, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 0));
> > +
> > +	/*
> >  	 * And now, wait forever for commit wakeup events.
> >  	 */
> >  	spin_lock(&journal->j_state_lock);
> > diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
> > index f98a656..76dad48 100644
> > --- a/include/linux/ioprio.h
> > +++ b/include/linux/ioprio.h
> > @@ -86,4 +86,6 @@ static inline int task_nice_ioclass(struct task_struct *task)
> >   */
> >  extern int ioprio_best(unsigned short aprio, unsigned short bprio);
> >  
> > +extern int set_task_ioprio(struct task_struct *task, int ioprio);
> > +
> >  #endif
> > -- 
> > 1.5.5.1
> 
> Can we agree on this patch?

This change will cause _all_ kjournald writeout to have elevated
priority.  The majority of that writeout (in data=ordered mode) is file
data, which we didn't intend to change.

The risk here is that this will *worsen* latency for plain old read(),
because now kjournald writeout will be favoured.

There is in fact a good argument for _reducing_ kjournald's IO
priority, not increasing it!

A better approach might be to mark the relevant buffers/bios as needing
higher priority at submit_bh() time (if that's possible).  At least
that way we don't accidentally elevate the priority of the bulk data.


It's a bit of a hack, sorry :(

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02 21:37               ` Andrew Morton
@ 2008-10-02 23:58                 ` Dave Chinner
  2008-10-03  0:06                   ` Andrew Morton
  0 siblings, 1 reply; 76+ messages in thread
From: Dave Chinner @ 2008-10-02 23:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, arjan, andi, linux-kernel, alan

On Thu, Oct 02, 2008 at 02:37:13PM -0700, Andrew Morton wrote:
> On Thu, 2 Oct 2008 21:22:23 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
> > Can we agree on this patch?
> 
> This change will cause _all_ kjournald writeout to have elevated
> priority.  The majority of that writeout (in data=ordered mode) is file
> data, which we didn't intend to change.
> 
> The risk here is that this will *worsen* latency for plain old read(),
> because now kjournald writeout will be favoured.
> 
> There is in fact a good argument for _reducing_ kjournald's IO
> priority, not increasing it!
> 
> A better approach might be to mark the relevant buffers/bios as needing
> higher priority at submit_bh() time (if that's possible).  At least
> that way we don't accidentally elevate the priority of the bulk data.

You can do that for submit_bio() by calling bio_set_prio() before
submision - I did that for elevating only the XFS journal I/O.
submit_bh() doesn't have any way of passing a priority through to it
right now...

I should resurrect the XFS patches I had an retest them....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02 23:58                 ` Dave Chinner
@ 2008-10-03  0:06                   ` Andrew Morton
  2008-10-03  0:20                     ` Andrew Morton
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Morton @ 2008-10-03  0:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: jens.axboe, arjan, andi, linux-kernel, alan

On Fri, 3 Oct 2008 09:58:49 +1000
Dave Chinner <david@fromorbit.com> wrote:

> On Thu, Oct 02, 2008 at 02:37:13PM -0700, Andrew Morton wrote:
> > On Thu, 2 Oct 2008 21:22:23 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
> > > Can we agree on this patch?
> > 
> > This change will cause _all_ kjournald writeout to have elevated
> > priority.  The majority of that writeout (in data=ordered mode) is file
> > data, which we didn't intend to change.
> > 
> > The risk here is that this will *worsen* latency for plain old read(),
> > because now kjournald writeout will be favoured.
> > 
> > There is in fact a good argument for _reducing_ kjournald's IO
> > priority, not increasing it!
> > 
> > A better approach might be to mark the relevant buffers/bios as needing
> > higher priority at submit_bh() time (if that's possible).  At least
> > that way we don't accidentally elevate the priority of the bulk data.
> 
> You can do that for submit_bio() by calling bio_set_prio() before
> submision - I did that for elevating only the XFS journal I/O.
> submit_bh() doesn't have any way of passing a priority through to it
> right now...

Yup.  There are plenty of spare bits in buffer_head.b_state. 
set_buffer_kludge()?


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-03  0:06                   ` Andrew Morton
@ 2008-10-03  0:20                     ` Andrew Morton
  0 siblings, 0 replies; 76+ messages in thread
From: Andrew Morton @ 2008-10-03  0:20 UTC (permalink / raw)
  To: david, jens.axboe, arjan, andi, linux-kernel, alan

On Thu, 2 Oct 2008 17:06:46 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 3 Oct 2008 09:58:49 +1000
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Thu, Oct 02, 2008 at 02:37:13PM -0700, Andrew Morton wrote:
> > > On Thu, 2 Oct 2008 21:22:23 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
> > > > Can we agree on this patch?
> > > 
> > > This change will cause _all_ kjournald writeout to have elevated
> > > priority.  The majority of that writeout (in data=ordered mode) is file
> > > data, which we didn't intend to change.
> > > 
> > > The risk here is that this will *worsen* latency for plain old read(),
> > > because now kjournald writeout will be favoured.
> > > 
> > > There is in fact a good argument for _reducing_ kjournald's IO
> > > priority, not increasing it!
> > > 
> > > A better approach might be to mark the relevant buffers/bios as needing
> > > higher priority at submit_bh() time (if that's possible).  At least
> > > that way we don't accidentally elevate the priority of the bulk data.
> > 
> > You can do that for submit_bio() by calling bio_set_prio() before
> > submision - I did that for elevating only the XFS journal I/O.
> > submit_bh() doesn't have any way of passing a priority through to it
> > right now...
> 
> Yup.  There are plenty of spare bits in buffer_head.b_state. 
> set_buffer_kludge()?
> 

And if we do decide to run set_buffer_kludge() against these blocks
then it should be done at the time when they are marked dirty, when JBD
refiles them for checkpoint writeback.

This is because kjournald isn't the only process which writes these
buffers.  Pretty often (usually?) it is pdflush.  The increased IO
prioritisation is a property of the buffer, not of the task which
submits it for IO.  I suspect.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02 20:24         ` Andrew Morton
@ 2008-10-03  4:01           ` Arjan van de Ven
  2008-10-03  4:23             ` Arjan van de Ven
  0 siblings, 1 reply; 76+ messages in thread
From: Arjan van de Ven @ 2008-10-03  4:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, linux-kernel, Alan Cox

On Thu, 2 Oct 2008 13:24:57 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> Of course, that could all be wrong and we could be stuck somewhere
> else.  A good way to diagnose this stuff would be
> 
> --- a/kernel/sched.c~a
> +++ a/kernel/sched.c
> @@ -5567,10 +5567,14 @@ EXPORT_SYMBOL(yield);
>  void __sched io_schedule(void)
>  {
>  	struct rq *rq = &__raw_get_cpu_var(runqueues);
> +	unsigned long in, out;
>  
>  	delayacct_blkio_start();
>  	atomic_inc(&rq->nr_iowait);
> +	in = jiffies;
>  	schedule();
> +	out = jiffies;
> +	WARN_ON(time_after(out, in + 1 * HZ));
>  	atomic_dec(&rq->nr_iowait);
>  	delayacct_blkio_end();
>  }
> _
> 
> perhaps for varying values of "1".

I'll run with this. (well with a value of "1000" for 1 ;-)

Also I'll tell latencytop to give me the full backtrace as well ;-)


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-03  4:01           ` Arjan van de Ven
@ 2008-10-03  4:23             ` Arjan van de Ven
  2008-10-03  4:40               ` Andrew Morton
  0 siblings, 1 reply; 76+ messages in thread
From: Arjan van de Ven @ 2008-10-03  4:23 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Andrew Morton, Jens Axboe, linux-kernel, Alan Cox

On Thu, 2 Oct 2008 21:01:17 -0700
Arjan van de Ven <arjan@infradead.org> wrote:

> > _
> > 
> > perhaps for varying values of "1".
> 

caught a few, a few were over 3 1/2 seconds in stall time 
(specifically I know this for the last one)

(I've stripped out the ? entries to keep them reasonable)

[  410.168277] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0()
[  410.168347] Pid: 699, comm: kjournald Not tainted 2.6.27-rc8-tip #50
[  410.168366]  [<c042ee64>] warn_on_slowpath+0x41/0x65
[  410.168414]  [<c070ec83>] io_schedule+0x77/0xb0
[  410.168421]  [<c04abc72>] sync_buffer+0x33/0x37
[  410.168429]  [<c070f123>] __wait_on_bit+0x36/0x5d
[  410.168445]  [<c070f1f5>] out_of_line_wait_on_bit+0xab/0xb3
[  410.168471]  [<c04abbd1>] __wait_on_buffer+0x19/0x1c
[  410.168478]  [<c04de796>] journal_commit_transaction+0x484/0xcb2
[  410.168519]  [<c04e14ae>] kjournald+0xc7/0x1ea
[  410.168544]  [<c043fb87>] kthread+0x3b/0x61
[  410.168559]  [<c040499f>] kernel_thread_helper+0x7/0x10
[  410.168567]  =======================
[  410.168572] ---[ end trace de523043f88bd9a7 ]---
[  451.605034] ------------[ cut here ]------------
[  451.605041] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0()
[  451.605114] Pid: 699, comm: kjournald Tainted: G        W 2.6.27-rc8-tip #50
[  451.605133]  [<c042ee64>] warn_on_slowpath+0x41/0x65
[  451.605182]  [<c070ec83>] io_schedule+0x77/0xb0
[  451.605189]  [<c04abc72>] sync_buffer+0x33/0x37
[  451.605198]  [<c070f123>] __wait_on_bit+0x36/0x5d
[  451.605213]  [<c070f1f5>] out_of_line_wait_on_bit+0xab/0xb3
[  451.605240]  [<c04abbd1>] __wait_on_buffer+0x19/0x1c
[  451.605248]  [<c04de796>] journal_commit_transaction+0x484/0xcb2
[  451.605300]  [<c04e14ae>] kjournald+0xc7/0x1ea
[  451.605324]  [<c043fb87>] kthread+0x3b/0x61
[  451.605339]  [<c040499f>] kernel_thread_helper+0x7/0x10
[  451.605348]  =======================
[  451.605353] ---[ end trace de523043f88bd9a7 ]---
[  514.780993] ------------[ cut here ]------------
[  514.781001] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0()
[  514.781074] Pid: 699, comm: kjournald Tainted: G        W 2.6.27-rc8-tip #50
[  514.781092]  [<c042ee64>] warn_on_slowpath+0x41/0x65
[  514.781152]  [<c070ec83>] io_schedule+0x77/0xb0
[  514.781159]  [<c04abc72>] sync_buffer+0x33/0x37
[  514.781167]  [<c070f010>] __wait_on_bit_lock+0x34/0x5e
[  514.781183]  [<c070f0e5>] out_of_line_wait_on_bit_lock+0xab/0xb3
[  514.781210]  [<c04abfa1>] __lock_buffer+0x24/0x2a
[  514.781218]  [<c04de597>] journal_commit_transaction+0x285/0xcb2
[  514.781288]  [<c04e14ae>] kjournald+0xc7/0x1ea
[  514.781313]  [<c043fb87>] kthread+0x3b/0x61
[  514.781334]  [<c040499f>] kernel_thread_helper+0x7/0x10
[  514.781345]  =======================
[  514.781352] ---[ end trace de523043f88bd9a7 ]---
[  517.064300] ------------[ cut here ]------------
[  517.064379] Pid: 699, comm: kjournald Tainted: G        W 2.6.27-rc8-tip #50
[  517.064398]  [<c042ee64>] warn_on_slowpath+0x41/0x65
[  517.064474]  [<c070ec83>] io_schedule+0x77/0xb0
[  517.064481]  [<c04abc72>] sync_buffer+0x33/0x37
[  517.064490]  [<c070f123>] __wait_on_bit+0x36/0x5d
[  517.064505]  [<c070f1f5>] out_of_line_wait_on_bit+0xab/0xb3
[  517.064530]  [<c04abbd1>] __wait_on_buffer+0x19/0x1c
[  517.064538]  [<c04de796>] journal_commit_transaction+0x484/0xcb2
[  517.064579]  [<c04e14ae>] kjournald+0xc7/0x1ea
[  517.064603]  [<c043fb87>] kthread+0x3b/0x61
[  517.064618]  [<c040499f>] kernel_thread_helper+0x7/0x10
[  517.064626]  =======================
[  517.064631] ---[ end trace de523043f88bd9a7 ]---
[  517.067487] ------------[ cut here ]------------
[  517.067493] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0()
[  517.067556] Pid: 4320, comm: claws-mail Tainted: G        W 2.6.27-rc8-tip #50
[  517.067572]  [<c042ee64>] warn_on_slowpath+0x41/0x65
[  517.067652]  [<c070ec83>] io_schedule+0x77/0xb0
[  517.067659]  [<c04abc72>] sync_buffer+0x33/0x37
[  517.067666]  [<c070f010>] __wait_on_bit_lock+0x34/0x5e
[  517.067682]  [<c070f0e5>] out_of_line_wait_on_bit_lock+0xab/0xb3
[  517.067707]  [<c04abfa1>] __lock_buffer+0x24/0x2a
[  517.067715]  [<c04dd7fc>] do_get_write_access+0x64/0x3b1
[  517.067743]  [<c04ddb64>] journal_get_write_access+0x1b/0x2a
[  517.067752]  [<c04da374>] __ext3_journal_get_write_access+0x19/0x3c
[  517.067761]  [<c04cf672>] ext3_reserve_inode_write+0x34/0x68
[  517.067769]  [<c04cf6d5>] ext3_mark_inode_dirty+0x2f/0x46
[  517.067777]  [<c04cf7f7>] ext3_dirty_inode+0x53/0x67
[  517.067784]  [<c04a7bed>] __mark_inode_dirty+0x29/0x144
[  517.067794]  [<c049e60f>] file_update_time+0x80/0xa9
[  517.067803]  [<c046b66c>] __generic_file_aio_write_nolock+0x2f0/0x41b
[  517.067842]  [<c046bf0d>] generic_file_aio_write+0x5a/0xb7
[  517.067850]  [<c04cdc65>] ext3_file_write+0x1a/0x89
[  517.067858]  [<c048da41>] do_sync_write+0xab/0xe9
[  517.067896]  [<c048e302>] vfs_write+0x8a/0x12e
[  517.067903]  [<c048e43f>] sys_write+0x3b/0x60
[  517.067910]  [<c0403b0b>] sysenter_do_call+0x12/0x2f
[  517.067919]  =======================
[  517.067923] ---[ end trace de523043f88bd9a7 ]---


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-03  4:23             ` Arjan van de Ven
@ 2008-10-03  4:40               ` Andrew Morton
  2008-10-03  4:43                 ` Arjan van de Ven
  2008-10-03  4:45                 ` Arjan van de Ven
  0 siblings, 2 replies; 76+ messages in thread
From: Andrew Morton @ 2008-10-03  4:40 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Jens Axboe, linux-kernel, Alan Cox

On Thu, 2 Oct 2008 21:23:55 -0700 Arjan van de Ven <arjan@infradead.org> wrote:

> On Thu, 2 Oct 2008 21:01:17 -0700
> Arjan van de Ven <arjan@infradead.org> wrote:
> 
> > > _
> > > 
> > > perhaps for varying values of "1".
> > 
> 
> caught a few, a few were over 3 1/2 seconds in stall time 
> (specifically I know this for the last one)
> 
> (I've stripped out the ? entries to keep them reasonable)
> 
> [  410.168277] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0()
> [  410.168347] Pid: 699, comm: kjournald Not tainted 2.6.27-rc8-tip #50
> [  410.168366]  [<c042ee64>] warn_on_slowpath+0x41/0x65
> [  410.168414]  [<c070ec83>] io_schedule+0x77/0xb0
> [  410.168421]  [<c04abc72>] sync_buffer+0x33/0x37
> [  410.168429]  [<c070f123>] __wait_on_bit+0x36/0x5d
> [  410.168445]  [<c070f1f5>] out_of_line_wait_on_bit+0xab/0xb3
> [  410.168471]  [<c04abbd1>] __wait_on_buffer+0x19/0x1c
> [  410.168478]  [<c04de796>] journal_commit_transaction+0x484/0xcb2
> [  410.168519]  [<c04e14ae>] kjournald+0xc7/0x1ea
> [  410.168544]  [<c043fb87>] kthread+0x3b/0x61
> [  410.168559]  [<c040499f>] kernel_thread_helper+0x7/0x10
> [  410.168567]  =======================
> [  410.168572] ---[ end trace de523043f88bd9a7 ]---
> [  451.605034] ------------[ cut here ]------------
> [  451.605041] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0()
> [  451.605114] Pid: 699, comm: kjournald Tainted: G        W 2.6.27-rc8-tip #50
>
> ...
>

hm, they're all kjournald getting stuck on lock_buffer().  That _may_
be related to the one we care about, but it doesn't seem likely.


> [  517.067556] Pid: 4320, comm: claws-mail Tainted: G        W 2.6.27-rc8-tip #50
> [  517.067572]  [<c042ee64>] warn_on_slowpath+0x41/0x65
> [  517.067652]  [<c070ec83>] io_schedule+0x77/0xb0
> [  517.067659]  [<c04abc72>] sync_buffer+0x33/0x37
> [  517.067666]  [<c070f010>] __wait_on_bit_lock+0x34/0x5e
> [  517.067682]  [<c070f0e5>] out_of_line_wait_on_bit_lock+0xab/0xb3
> [  517.067707]  [<c04abfa1>] __lock_buffer+0x24/0x2a
> [  517.067715]  [<c04dd7fc>] do_get_write_access+0x64/0x3b1
> [  517.067743]  [<c04ddb64>] journal_get_write_access+0x1b/0x2a
> [  517.067752]  [<c04da374>] __ext3_journal_get_write_access+0x19/0x3c
> [  517.067761]  [<c04cf672>] ext3_reserve_inode_write+0x34/0x68
> [  517.067769]  [<c04cf6d5>] ext3_mark_inode_dirty+0x2f/0x46
> [  517.067777]  [<c04cf7f7>] ext3_dirty_inode+0x53/0x67
> [  517.067784]  [<c04a7bed>] __mark_inode_dirty+0x29/0x144
> [  517.067794]  [<c049e60f>] file_update_time+0x80/0xa9
> [  517.067803]  [<c046b66c>] __generic_file_aio_write_nolock+0x2f0/0x41b
> [  517.067842]  [<c046bf0d>] generic_file_aio_write+0x5a/0xb7
> [  517.067850]  [<c04cdc65>] ext3_file_write+0x1a/0x89
> [  517.067858]  [<c048da41>] do_sync_write+0xab/0xe9
> [  517.067896]  [<c048e302>] vfs_write+0x8a/0x12e
> [  517.067903]  [<c048e43f>] sys_write+0x3b/0x60
> [  517.067910]  [<c0403b0b>] sysenter_do_call+0x12/0x2f
> [  517.067919]  =======================
> [  517.067923] ---[ end trace de523043f88bd9a7 ]---

That's the one - the lock_buffer() in do_get_write_access().  It's a
major contention site and it'd be a major win if we could fix it.  Even
if we resorted to some nasty thing like taking a temp copy of the
buffer's contents.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-03  4:40               ` Andrew Morton
@ 2008-10-03  4:43                 ` Arjan van de Ven
  2008-10-03  4:50                   ` Andrew Morton
  2008-10-03  4:45                 ` Arjan van de Ven
  1 sibling, 1 reply; 76+ messages in thread
From: Arjan van de Ven @ 2008-10-03  4:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, linux-kernel, Alan Cox

> [  517.067572]  [<c042ee64>] warn_on_slowpath+0x41/0x65
> [  517.067652]  [<c070ec83>] io_schedule+0x77/0xb0
> [  517.067659]  [<c04abc72>] sync_buffer+0x33/0x37
> [  517.067666]  [<c070f010>] __wait_on_bit_lock+0x34/0x5e
> [  517.067682]  [<c070f0e5>] out_of_line_wait_on_bit_lock+0xab/0xb3
> [  517.067707]  [<c04abfa1>] __lock_buffer+0x24/0x2a
> [  517.067715]  [<c04dd7fc>] do_get_write_access+0x64/0x3b1
> [  517.067743]  [<c04ddb64>] journal_get_write_access+0x1b/0x2a
> [  517.067752]  [<c04da374>] __ext3_journal_get_write_access+0x19/0x3c
> [  517.067761]  [<c04cf672>] ext3_reserve_inode_write+0x34/0x68
> [  517.067769]  [<c04cf6d5>] ext3_mark_inode_dirty+0x2f/0x46
> [  517.067777]  [<c04cf7f7>] ext3_dirty_inode+0x53/0x67
> [  517.067784]  [<c04a7bed>] __mark_inode_dirty+0x29/0x144
> [  517.067794]  [<c049e60f>] file_update_time+0x80/0xa9
> [  517.067803]  [<c046b66c>] __generic_file_aio_write_nolock+0x2f0/0x41b
> [  517.067842]  [<c046bf0d>] generic_file_aio_write+0x5a/0xb7
> [  517.067850]  [<c04cdc65>] ext3_file_write+0x1a/0x89
> [  517.067858]  [<c048da41>] do_sync_write+0xab/0xe9
> [  517.067896]  [<c048e302>] vfs_write+0x8a/0x12e
> [  517.067903]  [<c048e43f>] sys_write+0x3b/0x60
> [  517.067910]  [<c0403b0b>] sysenter_do_call+0x12/0x2f
> [  517.067919]  =======================
> [  517.067923] ---[ end trace de523043f88bd9a7 ]---  

> That's the one - the lock_buffer() in do_get_write_access().  It's a
> major contention site and it'd be a major win if we could fix it.
> Even if we resorted to some nasty thing like taking a temp copy of the
> buffer's contents.

I also notice it's part of "file_update_time". Do we really need to go all the way 
down to this level of synchronicity for that?
(I also randomly wonder if we, in the write path, dirty the inode twice, once for size once for item, and
if we then also reserve two slots in the journal for that..... but I'm showing
my total ignorance of JBD internals here)

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-03  4:40               ` Andrew Morton
  2008-10-03  4:43                 ` Arjan van de Ven
@ 2008-10-03  4:45                 ` Arjan van de Ven
  1 sibling, 0 replies; 76+ messages in thread
From: Arjan van de Ven @ 2008-10-03  4:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, linux-kernel, Alan Cox

On Thu, 2 Oct 2008 21:40:00 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 2 Oct 2008 21:23:55 -0700 Arjan van de Ven
> <arjan@infradead.org> wrote:
> 
> > On Thu, 2 Oct 2008 21:01:17 -0700
> > Arjan van de Ven <arjan@infradead.org> wrote:
> > 
> > > > _
> > > > 
> > > > perhaps for varying values of "1".
> > > 
> > 
> > caught a few, a few were over 3 1/2 seconds in stall time 
> > (specifically I know this for the last one)
> > 
> > (I've stripped out the ? entries to keep them reasonable)
> > 
> > [  410.168277] WARNING: at kernel/sched.c:5652
> > io_schedule+0x77/0xb0() [  410.168347] Pid: 699, comm: kjournald
> > Not tainted 2.6.27-rc8-tip #50 [  410.168366]  [<c042ee64>]
> > warn_on_slowpath+0x41/0x65 [  410.168414]  [<c070ec83>]
> > io_schedule+0x77/0xb0 [  410.168421]  [<c04abc72>]
> > sync_buffer+0x33/0x37 [  410.168429]  [<c070f123>]
> > __wait_on_bit+0x36/0x5d [  410.168445]  [<c070f1f5>]
> > out_of_line_wait_on_bit+0xab/0xb3 [  410.168471]  [<c04abbd1>]
> > __wait_on_buffer+0x19/0x1c [  410.168478]  [<c04de796>]
> > journal_commit_transaction+0x484/0xcb2 [  410.168519]  [<c04e14ae>]
> > kjournald+0xc7/0x1ea [  410.168544]  [<c043fb87>] kthread+0x3b/0x61
> > [  410.168559]  [<c040499f>] kernel_thread_helper+0x7/0x10
> > [  410.168567]  =======================
> > [  410.168572] ---[ end trace de523043f88bd9a7 ]---
> > [  451.605034] ------------[ cut here ]------------
> > [  451.605041] WARNING: at kernel/sched.c:5652
> > io_schedule+0x77/0xb0() [  451.605114] Pid: 699, comm: kjournald
> > Tainted: G        W 2.6.27-rc8-tip #50
> >
> > ...
> >
> 
> hm, they're all kjournald getting stuck on lock_buffer().  That _may_
> be related to the one we care about, but it doesn't seem likely.

one of them is (based on timestamps of the printk) is less than 3
seconds before the "real" one, while the real one's delay was 4 seconds.
To me that implies they're at least somewhat correlated...


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-03  4:43                 ` Arjan van de Ven
@ 2008-10-03  4:50                   ` Andrew Morton
  2008-10-03  5:00                     ` Arjan van de Ven
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Morton @ 2008-10-03  4:50 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Jens Axboe, linux-kernel, Alan Cox

On Thu, 2 Oct 2008 21:43:53 -0700 Arjan van de Ven <arjan@infradead.org> wrote:

> > [  517.067572]  [<c042ee64>] warn_on_slowpath+0x41/0x65
> > [  517.067652]  [<c070ec83>] io_schedule+0x77/0xb0
> > [  517.067659]  [<c04abc72>] sync_buffer+0x33/0x37
> > [  517.067666]  [<c070f010>] __wait_on_bit_lock+0x34/0x5e
> > [  517.067682]  [<c070f0e5>] out_of_line_wait_on_bit_lock+0xab/0xb3
> > [  517.067707]  [<c04abfa1>] __lock_buffer+0x24/0x2a
> > [  517.067715]  [<c04dd7fc>] do_get_write_access+0x64/0x3b1
> > [  517.067743]  [<c04ddb64>] journal_get_write_access+0x1b/0x2a
> > [  517.067752]  [<c04da374>] __ext3_journal_get_write_access+0x19/0x3c
> > [  517.067761]  [<c04cf672>] ext3_reserve_inode_write+0x34/0x68
> > [  517.067769]  [<c04cf6d5>] ext3_mark_inode_dirty+0x2f/0x46
> > [  517.067777]  [<c04cf7f7>] ext3_dirty_inode+0x53/0x67
> > [  517.067784]  [<c04a7bed>] __mark_inode_dirty+0x29/0x144
> > [  517.067794]  [<c049e60f>] file_update_time+0x80/0xa9
> > [  517.067803]  [<c046b66c>] __generic_file_aio_write_nolock+0x2f0/0x41b
> > [  517.067842]  [<c046bf0d>] generic_file_aio_write+0x5a/0xb7
> > [  517.067850]  [<c04cdc65>] ext3_file_write+0x1a/0x89
> > [  517.067858]  [<c048da41>] do_sync_write+0xab/0xe9
> > [  517.067896]  [<c048e302>] vfs_write+0x8a/0x12e
> > [  517.067903]  [<c048e43f>] sys_write+0x3b/0x60
> > [  517.067910]  [<c0403b0b>] sysenter_do_call+0x12/0x2f
> > [  517.067919]  =======================
> > [  517.067923] ---[ end trace de523043f88bd9a7 ]---  
> 
> > That's the one - the lock_buffer() in do_get_write_access().  It's a
> > major contention site and it'd be a major win if we could fix it.
> > Even if we resorted to some nasty thing like taking a temp copy of the
> > buffer's contents.
> 
> I also notice it's part of "file_update_time". Do we really need to go all the way 
> down to this level of synchronicity for that?

Well, we've tossed that around many times but never implemented it. 
Once you get into the details it gets a bit nasty.  Need to keep the
dirtiness state in the VFS (or fs) inode, and going backwards from a
plain old buffer_head at commit time isn't possible.  We usually
tempfixed the problem by adding increasingly fancy ways of not doing the
atime update at all.

Of course, fixing this running-vs-committing contention point would fix
a lot more things than just atime updates.

> (I also randomly wonder if we, in the write path, dirty the inode twice, once for size once for item, and
> if we then also reserve two slots in the journal for that.....

That shouldn't be the case - once we have write access to the buffer it
remains freely modifiable for the rest of the transaction period.  I
think.

> but I'm showing
> my total ignorance of JBD internals here)

I'm going on senile memories of JDB five years ago, but the concepts
didn't change much.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-03  4:50                   ` Andrew Morton
@ 2008-10-03  5:00                     ` Arjan van de Ven
  2008-10-03  5:24                       ` Andrew Morton
  0 siblings, 1 reply; 76+ messages in thread
From: Arjan van de Ven @ 2008-10-03  5:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, linux-kernel, Alan Cox

On Thu, 2 Oct 2008 21:50:26 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> > 
> > I also notice it's part of "file_update_time". Do we really need to
> > go all the way down to this level of synchronicity for that?
> 
> Well, we've tossed that around many times but never implemented it. 
> Once you get into the details it gets a bit nasty.  Need to keep the
> dirtiness state in the VFS (or fs) inode, and going backwards from a
> plain old buffer_head at commit time isn't possible.  We usually
> tempfixed the problem by adding increasingly fancy ways of not doing
> the atime update at all.

given that this is the write path, I was assuming this was mtime rather
than atime; doesn't change your answer though.
> 
> Of course, fixing this running-vs-committing contention point would
> fix a lot more things than just atime updates.

yes clearly. It's waaay above my paygrade to hack on though; JBD is one
of those places in the kernel that scare me for doing fundamental
changes ;-(

> 
> > (I also randomly wonder if we, in the write path, dirty the inode
> > twice, once for size once for item, and if we then also reserve two
> > slots in the journal for that.....
> 
> That shouldn't be the case - once we have write access to the buffer
> it remains freely modifiable for the rest of the transaction period.
> I think.

I hope you're right otherwise we'd always hit this; once for the size
change, then block for the mtime. That would thoroughly suck; so much
so that you just must be right.



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-03  5:00                     ` Arjan van de Ven
@ 2008-10-03  5:24                       ` Andrew Morton
  2008-10-03 17:21                         ` Arjan van de Ven
  2008-10-09  3:00                         ` Theodore Tso
  0 siblings, 2 replies; 76+ messages in thread
From: Andrew Morton @ 2008-10-03  5:24 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Jens Axboe, linux-kernel, Alan Cox

On Thu, 2 Oct 2008 22:00:40 -0700 Arjan van de Ven <arjan@infradead.org> wrote:

> > Of course, fixing this running-vs-committing contention point would
> > fix a lot more things than just atime updates.
> 
> yes clearly. It's waaay above my paygrade to hack on though; JBD is one
> of those places in the kernel that scare me for doing fundamental
> changes ;-(

If someone has an hour or so to kill, we can pretend to fix it:


 fs/ext3/super.c         |   12 +++++++++++-
 fs/jbd/transaction.c    |   13 +++++++++++--
 include/linux/ext3_fs.h |    1 +
 include/linux/jbd.h     |    1 +
 4 files changed, 24 insertions(+), 3 deletions(-)

diff -puN fs/ext3/super.c~a fs/ext3/super.c
--- a/fs/ext3/super.c~a
+++ a/fs/ext3/super.c
@@ -617,6 +617,8 @@ static int ext3_show_options(struct seq_
 		seq_puts(seq, ",barrier=1");
 	if (test_opt(sb, NOBH))
 		seq_puts(seq, ",nobh");
+	if (test_opt(sb, AKPM))
+		seq_puts(seq, ",akpm");
 
 	if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA)
 		seq_puts(seq, ",data=journal");
@@ -757,7 +759,7 @@ enum {
 	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
 	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
 	Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
-	Opt_grpquota
+	Opt_grpquota, Opt_akpm,
 };
 
 static match_table_t tokens = {
@@ -808,6 +810,7 @@ static match_table_t tokens = {
 	{Opt_usrquota, "usrquota"},
 	{Opt_barrier, "barrier=%u"},
 	{Opt_resize, "resize"},
+	{Opt_akpm, "akpm"},
 	{Opt_err, NULL},
 };
 
@@ -1097,6 +1100,9 @@ set_qf_format:
 			set_opt(sbi->s_mount_opt, QUOTA);
 			set_opt(sbi->s_mount_opt, GRPQUOTA);
 			break;
+		case Opt_akpm:
+			set_opt(sbi->s_mount_opt, AKPM);
+			break;
 		case Opt_noquota:
 			if (sb_any_quota_enabled(sb) ||
 			    sb_any_quota_suspended(sb)) {
@@ -1986,6 +1992,10 @@ static void ext3_init_journal_params(str
 		journal->j_flags |= JFS_BARRIER;
 	else
 		journal->j_flags &= ~JFS_BARRIER;
+	if (test_opt(sb, AKPM))
+		journal->j_flags |= JFS_AKPM;
+	else
+		journal->j_flags &= ~JFS_AKPM;
 	spin_unlock(&journal->j_state_lock);
 }
 
diff -puN include/linux/ext3_fs.h~a include/linux/ext3_fs.h
--- a/include/linux/ext3_fs.h~a
+++ a/include/linux/ext3_fs.h
@@ -380,6 +380,7 @@ struct ext3_inode {
 #define EXT3_MOUNT_QUOTA		0x80000 /* Some quota option set */
 #define EXT3_MOUNT_USRQUOTA		0x100000 /* "old" user quota */
 #define EXT3_MOUNT_GRPQUOTA		0x200000 /* "old" group quota */
+#define EXT3_MOUNT_AKPM			0x400000
 
 /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
 #ifndef _LINUX_EXT2_FS_H
diff -puN include/linux/jbd.h~a include/linux/jbd.h
--- a/include/linux/jbd.h~a
+++ a/include/linux/jbd.h
@@ -816,6 +816,7 @@ struct journal_s
 #define JFS_FLUSHED	0x008	/* The journal superblock has been flushed */
 #define JFS_LOADED	0x010	/* The journal superblock has been loaded */
 #define JFS_BARRIER	0x020	/* Use IDE barriers */
+#define JFS_AKPM	0x040
 
 /*
  * Function declarations for the journaling transaction and buffer
diff -puN fs/jbd/transaction.c~a fs/jbd/transaction.c
--- a/fs/jbd/transaction.c~a
+++ a/fs/jbd/transaction.c
@@ -537,6 +537,7 @@ do_get_write_access(handle_t *handle, st
 	int error;
 	char *frozen_buffer = NULL;
 	int need_copy = 0;
+	int locked;
 
 	if (is_handle_aborted(handle))
 		return -EROFS;
@@ -552,7 +553,14 @@ repeat:
 
 	/* @@@ Need to check for errors here at some point. */
 
-	lock_buffer(bh);
+	locked = 0;
+	if (journal->j_flags & JFS_AKPM) {
+		if (trylock_buffer(bh))
+			locked = 1;	/* lolz */
+	} else {
+		lock_buffer(bh);
+		locked = 1;
+	}
 	jbd_lock_bh_state(bh);
 
 	/* We now hold the buffer lock so it is safe to query the buffer
@@ -591,7 +599,8 @@ repeat:
 		jbd_unexpected_dirty_buffer(jh);
 	}
 
-	unlock_buffer(bh);
+	if (locked)
+		unlock_buffer(bh);
 
 	error = -EROFS;
 	if (is_handle_aborted(handle)) {
_


(might emit nasty warnings or assertion failures which will need to be
disabled)


Mount a junk partition with `-oakpm' and run some benchmarks.  If the
results are "wow" then it's worth spending time on.  If the results are
"meh" then we can not bother..


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-03  5:24                       ` Andrew Morton
@ 2008-10-03 17:21                         ` Arjan van de Ven
  2008-10-09  3:00                         ` Theodore Tso
  1 sibling, 0 replies; 76+ messages in thread
From: Arjan van de Ven @ 2008-10-03 17:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, linux-kernel, Alan Cox

On Thu, 2 Oct 2008 22:24:38 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> 
> Mount a junk partition with `-oakpm' and run some benchmarks.  If the
> results are "wow" then it's worth spending time on.  If the results
> are "meh" then we can not bother..
> 

I've been running with this for a few hours now
while I can't see IO is perfectly smooth now, it does feel a lot better
and latencytop isn't showing me the huge spikes in latency either in
the mail client.
(there's a lot of little stuff, but not the 4+ seconds)
-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02 14:33                 ` Arjan van de Ven
@ 2008-10-04 14:12                   ` Theodore Tso
  2008-10-04 17:14                     ` Joseph Fannin
  0 siblings, 1 reply; 76+ messages in thread
From: Theodore Tso @ 2008-10-04 14:12 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Andrew Morton, Jens Axboe, linux-kernel, Alan Cox

[-- Attachment #1: Type: text/plain, Size: 1616 bytes --]

On Thu, Oct 02, 2008 at 07:33:04AM -0700, Arjan van de Ven wrote:
> > I can whip up a patch which adds some markers which we could use to
> > find out more information about what is happening.
> 
> interesting testcase of the markers concept.

Sorry for the delay, I ran into a minor bug in the Modules.marker
generation support that prevented Systemtap from being able to use
markers.  (It was busted since 2.6.27-rc1, so I guess that gives us
some sense how often developers use Systemtap.  :-)

It looks like Andrew's workaround seems to help you out, but if you're
willing to run this while your mail reader is running, and correlate
it with with the large latency spikes, we might get some interesting
results.

Anyway, here's the patch (against ext4, although could pretty easily
move this to ext3 --- but you can mount an ext3 filesystem as ext4,
although for the moment you do have to run the command "tune2fs -E
test_fs /dev/hdXX" first), and a sample systemtap script.  You'll also
need the markers patch, and of course the latest systemtap from their
git repository.

						- Ted


P.S.  Make sure you install systemtap in /usr/local, instead of trying
to run it out of the build tree.  See for an interesting report from
Roland McGrath about what happens if you make this mistake:

	http://sources.redhat.com/ml/systemtap/2008-q3/msg00809.html

I really think Systemtap has a lot of potential if only they could get
past some "minor usability concerns".  So one thing that I think would
really help the Systemtap folks is if more people gave them more, ah,
"constructive feedback" to their mailing list.

[-- Attachment #2: add-debugging-markers --]
[-- Type: text/plain, Size: 3244 bytes --]

ext4: Add debugging markers that can be used by systemtap

This debugging markers are designed to debug problems such as the
random filesystem latency problems reported by Arjan.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index c37d1e8..f8b57a2 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -28,6 +28,7 @@
 #include <linux/writeback.h>
 #include <linux/jbd2.h>
 #include <linux/blkdev.h>
+#include <linux/marker.h>
 #include "ext4.h"
 #include "ext4_jbd2.h"
 
@@ -51,6 +52,10 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
 
 	J_ASSERT(ext4_journal_current_handle() == NULL);
 
+	trace_mark(ext4_sync_file, "datasync %d dev %d ino %ld parent %ld",
+		   datasync, inode->i_sb->s_dev, inode->i_ino,
+		   dentry->d_parent->d_inode->i_ino);
+
 	/*
 	 * data=writeback:
 	 *  The caller's filemap_fdatawrite()/wait will sync the data.
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 515af05..68b0301 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -35,6 +35,7 @@
 #include <linux/quotaops.h>
 #include <linux/seq_file.h>
 #include <linux/proc_fs.h>
+#include <linux/marker.h>
 #include <linux/log2.h>
 #include <linux/crc16.h>
 #include <asm/uaccess.h>
@@ -2950,6 +2951,7 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
 {
 	tid_t target;
 
+	trace_mark(ext4_sync_fs, "dev %d wait %d", sb->s_dev, wait);
 	sb->s_dirt = 0;
 	if (jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, &target)) {
 		if (wait)
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 91389c8..72bec15 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -20,6 +20,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/jbd2.h>
+#include <linux/marker.h>
 #include <linux/errno.h>
 #include <linux/slab.h>
 
@@ -313,6 +329,8 @@ int jbd2_log_do_checkpoint(journal_t *journal)
 	 * journal straight away.
 	 */
 	result = jbd2_cleanup_journal_tail(journal);
+	trace_mark(jbd2_checkpoint, "dev %d need_checkpoint %d",
+		   journal->j_fs_dev->bd_dev, result);
 	jbd_debug(1, "cleanup_journal_tail returned %d\n", result);
 	if (result <= 0)
 		return result;
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index b091e53..ecb485b 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -16,6 +16,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/jbd2.h>
+#include <linux/marker.h>
 #include <linux/errno.h>
 #include <linux/slab.h>
 #include <linux/mm.h>
@@ -368,6 +369,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	commit_transaction = journal->j_running_transaction;
 	J_ASSERT(commit_transaction->t_state == T_RUNNING);
 
+	trace_mark(jbd2_start_commit, "dev %d transaction %d",
+		   journal->j_fs_dev->bd_dev, commit_transaction->t_tid);
 	jbd_debug(1, "JBD: starting commit of transaction %d\n",
 			commit_transaction->t_tid);
 
@@ -985,6 +988,9 @@ restart_loop:
 	}
 	spin_unlock(&journal->j_list_lock);
 
+	trace_mark(jbd2_end_commit, "dev %d transaction %d head %d",
+		   journal->j_fs_dev->bd_dev, commit_transaction->t_tid,
+		   journal->j_tail_sequence);
 	jbd_debug(1, "JBD: commit %d complete, head %d\n",
 		  journal->j_commit_sequence, journal->j_tail_sequence);
 

[-- Attachment #3: markers-patch --]
[-- Type: text/plain, Size: 2614 bytes --]

commit d80d36745e53d211293e358bc124dd68a3c88ba9
Author: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Date:   Fri Oct 3 15:54:36 2008 -0400

    Marker depmod fix core kernel list
    
    * Theodore Ts'o (tytso@mit.edu) wrote:
    >
    > I've been playing with adding some markers into ext4 to see if they
    > could be useful in solving some problems along with Systemtap.  It
    > appears, though, that as of 2.6.27-rc8, markers defined in code which is
    > compiled directly into the kernel (i.e., not as modules) don't show up
    > in Module.markers:
    >
    > kvm_trace_entryexit arch/x86/kvm/kvm-intel  %u %p %u %u %u %u %u %u
    > kvm_trace_handler arch/x86/kvm/kvm-intel  %u %p %u %u %u %u %u %u
    > kvm_trace_entryexit arch/x86/kvm/kvm-amd  %u %p %u %u %u %u %u %u
    > kvm_trace_handler arch/x86/kvm/kvm-amd  %u %p %u %u %u %u %u %u
    >
    > (Note the lack of any of the kernel_sched_* markers, and the markers I
    > added for ext4_* and jbd2_* are missing as wel.)
    >
    > Systemtap apparently depends on in-kernel trace_mark being recorded in
    > Module.markers, and apparently it's been claimed that it used to be
    > there.  Is this a bug in systemtap, or in how Module.markers is getting
    > built?   And is there a file that contains the equivalent information
    > for markers located in non-modules code?
    >
    > Thanks, regards,
    >
    
    I think the problem comes from this patch :
    commit d35cb360c29956510b2fe1a953bd4968536f7216
    "markers: fix duplicate modpost entry"
    
    Here is a fix that should take care if this problem. Given I am not the
    modpost expert, let's see if I can get an ACK from Sam.
    
    Thanks for the bug report!
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
    CC: Theodore Ts'o <tytso@mit.edu>
    CC: David Smith <dsmith@redhat.com>
    CC: Roland McGrath <roland@redhat.com>
    CC: Sam Ravnborg <sam@ravnborg.org>
    CC: Wenji Huang <wenji.huang@oracle.com>
    CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
index 418cd7d..8e0de6a 100644
--- a/scripts/mod/modpost.c
+++ b/scripts/mod/modpost.c
@@ -1986,11 +1986,13 @@ static void read_markers(const char *fname)
 
 		mod = find_module(modname);
 		if (!mod) {
-			if (is_vmlinux(modname))
-				have_vmlinux = 1;
 			mod = new_module(NOFAIL(strdup(modname)));
 			mod->skip = 1;
 		}
+		if (is_vmlinux(modname)) {
+			have_vmlinux = 1;
+			mod->skip = 0;
+		}
 
 		if (!mod->skip)
 			add_marker(mod, marker, fmt);

[-- Attachment #4: ext4-marker.stp --]
[-- Type: text/plain, Size: 774 bytes --]

probe kernel.mark("ext4_sync_file")
{
	t = gettimeofday_ms();
	printf("%d.%d:ext4_sync_file: datasync %d ino %d parent %d\n",
		t / 1000, t % 1000, $arg1, $arg3, $arg4)
}

probe kernel.mark("ext4_sync_fs")
{
	t = gettimeofday_ms();
	printf("%d.%d:ext4_sync_fs: wait %d\n", t / 1000, t % 1000,
		$arg2)
}

probe kernel.mark("jbd2_start_commit")
{
	t = gettimeofday_ms();
	printf("%d.%d:jbd2_start_commit: transaction %d\n",
		t / 1000, t % 1000, $arg2)
}

probe kernel.mark("jbd2_end_commit")
{
	t = gettimeofday_ms();
	printf("%d.%d:jbd2_end_commit: transaction %d head %d\n",
	       t / 1000, t % 1000, $arg2, $arg3)
}

probe kernel.mark("jbd2_checkpoint")
{
	t = gettimeofday_ms();
	printf("%d.%d:jbd2_checkpoint: need_checkpoint %d\n", 
		t / 1000, t % 1000, $arg2);
}


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-04 14:12                   ` Theodore Tso
@ 2008-10-04 17:14                     ` Joseph Fannin
  2008-10-04 21:27                       ` Theodore Tso
  0 siblings, 1 reply; 76+ messages in thread
From: Joseph Fannin @ 2008-10-04 17:14 UTC (permalink / raw)
  To: Theodore Tso, Arjan van de Ven, Andrew Morton, Jens Axboe,
	linux-kernel, Alan Cox

On Sat, Oct 04, 2008 at 10:12:52AM -0400, Theodore Tso wrote:

> Anyway, here's the patch (against ext4, although could pretty easily
> move this to ext3 --- but you can mount an ext3 filesystem as ext4,
> although for the moment you do have to run the command "tune2fs -E
> test_fs /dev/hdXX" first) [...]

Shouldn't he also be sure to mount the FS with the "noextents" option,
or has ext4 learned not to to enable extents if they're not already
enabled?

--
Joseph Fannin
jfannin@gmail.com


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-04 17:14                     ` Joseph Fannin
@ 2008-10-04 21:27                       ` Theodore Tso
  0 siblings, 0 replies; 76+ messages in thread
From: Theodore Tso @ 2008-10-04 21:27 UTC (permalink / raw)
  To: Arjan van de Ven, Andrew Morton, Jens Axboe, linux-kernel, Alan Cox

On Sat, Oct 04, 2008 at 01:14:45PM -0400, Joseph Fannin wrote:
> On Sat, Oct 04, 2008 at 10:12:52AM -0400, Theodore Tso wrote:
> 
> > Anyway, here's the patch (against ext4, although could pretty easily
> > move this to ext3 --- but you can mount an ext3 filesystem as ext4,
> > although for the moment you do have to run the command "tune2fs -E
> > test_fs /dev/hdXX" first) [...]
> 
> Shouldn't he also be sure to mount the FS with the "noextents" option,
> or has ext4 learned not to to enable extents if they're not already
> enabled?

The latter; ext4 will no longer enable extents if not already enabled
as of commit e4079a11, dated July 11th.  That means 2.6.26 with any of
the ext4 patchsets (highly recommended if you want to use ext4) or any
mainline release post 2.6.26-git4/2.6.27-rc1.

A number of the ext4 tutorials/howto's are available on the web page
are out of date, including the one on IBM developer works.  The best
place to get up-to-date information on using ext4 is
http://ext4.wiki.kernel.org.

Regards,

						- Ted

P.S.  Note that there will be advantages to using even an unconverted
filesystem with the ext3 filesystem format with the ext4 code base,
even if you don't enable extents or any other ext4-specific feature;
the delayed allocation and soon-to-be-added inode table readahead
features (in the 2.6.27-rc8-ext4-2 patchset, and which will be pushed
during the next merge window) in the ext4 code base will improve
performance even on an "ext3 filesystem".

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-03  5:24                       ` Andrew Morton
  2008-10-03 17:21                         ` Arjan van de Ven
@ 2008-10-09  3:00                         ` Theodore Tso
  2008-10-09  3:38                           ` Andrew Morton
  1 sibling, 1 reply; 76+ messages in thread
From: Theodore Tso @ 2008-10-09  3:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Arjan van de Ven, Jens Axboe, linux-kernel, Alan Cox, linux-ext4

On Thu, Oct 02, 2008 at 10:24:38PM -0700, Andrew Morton wrote:
> 
> Mount a junk partition with `-oakpm' and run some benchmarks.  If the
> results are "wow" then it's worth spending time on.  If the results are
> "meh" then we can not bother..
> 

I've ported the patch to the ext4 filesystem, and dropped it into the
unstable portion of the ext4 patch queue.  If we can get someone (hi,
Ric!) to run fs_mark with and without -o akpm_lock_hack, I suspect we
will find that it makes quite a large difference on that particular
benchmark, since it is fsync-heavy to force a large number of
transaction, and the creation of the inodes should cause multiple
blocks that will be entangled between the current and committing
transactions.

	      	     	     	      	      	 - Ted

ext4: akpm's locking hack to fix locking delays

This is a port of the following patch from Andrew Morton to ext4:

	http://lkml.org/lkml/2008/10/3/22

This fixes a major contention problem in do_get_write_access() when a
buffer is modified in both the current and committing transaction.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: akpm@linux-foundation.org
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f46a513..23822fb 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -540,6 +540,7 @@ do {									       \
 #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT	0x1000000 /* Journal Async Commit */
 #define EXT4_MOUNT_I_VERSION            0x2000000 /* i_version support */
 #define EXT4_MOUNT_DELALLOC		0x8000000 /* Delalloc support */
+#define EXT4_MOUNT_AKPM_LOCK_HACK	0x10000000 /* akpm lock hack */
 /* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
 #ifndef _LINUX_EXT2_FS_H
 #define clear_opt(o, opt)		o &= ~EXT4_MOUNT_##opt
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 67ebefb..f4e7157 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -752,6 +752,8 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
 		seq_puts(seq, ",journal_async_commit");
 	if (test_opt(sb, NOBH))
 		seq_puts(seq, ",nobh");
+	if (test_opt(sb, AKPM_LOCK_HACK))
+		seq_puts(seq, ",akpm_lock_hack");
 	if (!test_opt(sb, EXTENTS))
 		seq_puts(seq, ",noextents");
 	if (test_opt(sb, I_VERSION))
@@ -911,7 +913,7 @@ enum {
 	Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
 	Opt_grpquota, Opt_extents, Opt_noextents, Opt_i_version,
 	Opt_mballoc, Opt_nomballoc, Opt_stripe, Opt_delalloc, Opt_nodelalloc,
-	Opt_inode_readahead_blks
+	Opt_inode_readahead_blks, Opt_akpm_lock_hack,
 };
 
 static match_table_t tokens = {
@@ -973,6 +975,7 @@ static match_table_t tokens = {
 	{Opt_delalloc, "delalloc"},
 	{Opt_nodelalloc, "nodelalloc"},
 	{Opt_inode_readahead_blks, "inode_readahead_blks=%u"},
+	{Opt_akpm_lock_hack, "akpm_lock_hack"},
 	{Opt_err, NULL},
 };
 
@@ -1382,6 +1385,9 @@ set_qf_format:
 				return 0;
 			sbi->s_inode_readahead_blks = option;
 			break;
+		case Opt_akpm_lock_hack:
+			set_opt(sbi->s_mount_opt, AKPM_LOCK_HACK);
+			break;
 		default:
 			printk(KERN_ERR
 			       "EXT4-fs: Unrecognized mount option \"%s\" "
@@ -2534,6 +2540,10 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
 		journal->j_flags |= JBD2_BARRIER;
 	else
 		journal->j_flags &= ~JBD2_BARRIER;
+	if (test_opt(sb, AKPM_LOCK_HACK))
+		journal->j_flags |= JBD2_LOCK_HACK;
+	else
+		journal->j_flags &= ~JBD2_LOCK_HACK;
 	spin_unlock(&journal->j_state_lock);
 }
 
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index e5d5405..32c288a 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -546,6 +546,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 	int error;
 	char *frozen_buffer = NULL;
 	int need_copy = 0;
+	int locked = 0;
 
 	if (is_handle_aborted(handle))
 		return -EROFS;
@@ -561,7 +562,13 @@ repeat:
 
 	/* @@@ Need to check for errors here at some point. */
 
-	lock_buffer(bh);
+	if (journal->j_flags & JBD2_LOCK_HACK) {
+		if (trylock_buffer(bh))
+			locked = 1;	/* lolz */
+	} else {
+		lock_buffer(bh);
+		locked = 1;
+	}
 	jbd_lock_bh_state(bh);
 
 	/* We now hold the buffer lock so it is safe to query the buffer
@@ -600,7 +607,8 @@ repeat:
 		jbd_unexpected_dirty_buffer(jh);
 	}
 
-	unlock_buffer(bh);
+	if (locked)
+		unlock_buffer(bh);
 
 	error = -EROFS;
 	if (is_handle_aborted(handle)) {
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 66c3499..c614dae 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -967,6 +967,7 @@ struct journal_s
 #define JBD2_FLUSHED	0x008	/* The journal superblock has been flushed */
 #define JBD2_LOADED	0x010	/* The journal superblock has been loaded */
 #define JBD2_BARRIER	0x020	/* Use IDE barriers */
+#define JBD2_LOCK_HACK	0x040	/* akpm's locking hack */
 
 /*
  * Function declarations for the journaling transaction and buffer

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-09  3:00                         ` Theodore Tso
@ 2008-10-09  3:38                           ` Andrew Morton
  0 siblings, 0 replies; 76+ messages in thread
From: Andrew Morton @ 2008-10-09  3:38 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Arjan van de Ven, Jens Axboe, linux-kernel, Alan Cox, linux-ext4

On Wed, 8 Oct 2008 23:00:54 -0400 Theodore Tso <tytso@mit.edu> wrote:

> On Thu, Oct 02, 2008 at 10:24:38PM -0700, Andrew Morton wrote:
> > 
> > Mount a junk partition with `-oakpm' and run some benchmarks.  If the
> > results are "wow" then it's worth spending time on.  If the results are
> > "meh" then we can not bother..
> > 
> 
> I've ported the patch to the ext4 filesystem, and dropped it into the
> unstable portion of the ext4 patch queue.

Useful, thanks.

>  If we can get someone (hi,
> Ric!) to run fs_mark with and without -o akpm_lock_hack, I suspect we
> will find that it makes quite a large difference on that particular
> benchmark, since it is fsync-heavy to force a large number of
> transaction, and the creation of the inodes should cause multiple
> blocks that will be entangled between the current and committing
> transactions.
> 

fsync?  Yes, I suppose so.  Repeated modifications to the same
inodes/directories/bitmaps blocks/etc will hurt.

A quick test on other quantified workloads would be useful too.  If the
results look promising then someone(tm) will need to work out how to
fix this for real.

> 
> ext4: akpm's locking hack to fix locking delays
> 
> This is a port of the following patch from Andrew Morton to ext4:
> 
> 	http://lkml.org/lkml/2008/10/3/22
> 
> This fixes a major contention problem in do_get_write_access() when a
> buffer is modified in both the current and committing transaction.

More specifically: "under checkpoint writeback in the committing
transaction when the committing transaction requests write access".


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-07 22:22                     ` Dave Chinner
@ 2008-10-09  8:48                       ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2008-10-09  8:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Aaron Carroll, Bodo Eggert, Andi Kleen, Andrew Morton,
	Arjan van de Ven, linux-kernel, Alan Cox

On Wed, Oct 08 2008, Dave Chinner wrote:
> On Tue, Oct 07, 2008 at 08:06:53PM +0200, Jens Axboe wrote:
> > On Mon, Oct 06 2008, Dave Chinner wrote:
> > > On Sat, Oct 04, 2008 at 05:45:00PM +1000, Aaron Carroll wrote:
> > > > Dave Chinner wrote:
> > > >> On Thu, Oct 02, 2008 at 05:32:04PM +0200, Bodo Eggert wrote:
> > > >>> Sounds like you need a priority class besides sync and async.
> > > >>
> > > >> There's BIO_META now as well, which I was testing at the same time
> > > >> as RT priority. Marking all the metadata I/O as BIO_META did help,
> > > >> but once again I never got to determining if that was a result of
> > > >> the different tagging or the priority increase.
> > > >
> > > > What exactly do you want META to mean?  Strict prioritisation over
> > > > all other non-META requests, or just more frequent and/or larger
> > > > dispatches?  Should META requests be sorted?
> > > 
> > > The real question is "what was it supposed to mean"? AFAICT, it was
> > > added to a couple of filesystems to be used to tag superblock read
> > > I/O. Why - I don't know - there's a distinct lack of documentation
> > > surrounding these bio flags. :/
> > 
> > It was added to be able to differentiate between data and meta data IO
> > when using blktrace, that is all.
> 
> Ok.
> 
> > > Realistically, I'm not sure that having a separate queue for
> > > BIO_META will buy us anything, given that noop is quite often the
> > > fastest scheduler for XFS because it enables interleaved metadata
> > > I/O to be merged with data I/O. Like I said, I was not able to spend
> > > the time to determine exactly how BIO_META affected I/O patterns, so
> > > I can't really comment on whether it is really necessary or not.
> > 
> > There's no seperate queue for meta data IO anywhere. CFQ will give
> > _slight_ preference to meta data IO as a side effect, preferring the
> > meta IO for otherwise same IO in what to serve next in the same queue.
> > And it will not allow preemption of a meta data IO for a data IO.
> > 
> > So using meta should not yield any important boosts by itself.
> 
> Which means that performance increase I saw on CFQ was a result of
> removing the BIO_SYNC tagging "optimisation" XFS uses for metadata,
> not from adding BIO_META.....
> 
> Could you please document what these tags actually mean and do
> so that other people don't get as confused as me about this
> stuff....

Sure, I've added such a patch for 2.6.28.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-07 18:06                   ` Jens Axboe
@ 2008-10-07 22:22                     ` Dave Chinner
  2008-10-09  8:48                       ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Dave Chinner @ 2008-10-07 22:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Aaron Carroll, Bodo Eggert, Andi Kleen, Andrew Morton,
	Arjan van de Ven, linux-kernel, Alan Cox

On Tue, Oct 07, 2008 at 08:06:53PM +0200, Jens Axboe wrote:
> On Mon, Oct 06 2008, Dave Chinner wrote:
> > On Sat, Oct 04, 2008 at 05:45:00PM +1000, Aaron Carroll wrote:
> > > Dave Chinner wrote:
> > >> On Thu, Oct 02, 2008 at 05:32:04PM +0200, Bodo Eggert wrote:
> > >>> Sounds like you need a priority class besides sync and async.
> > >>
> > >> There's BIO_META now as well, which I was testing at the same time
> > >> as RT priority. Marking all the metadata I/O as BIO_META did help,
> > >> but once again I never got to determining if that was a result of
> > >> the different tagging or the priority increase.
> > >
> > > What exactly do you want META to mean?  Strict prioritisation over
> > > all other non-META requests, or just more frequent and/or larger
> > > dispatches?  Should META requests be sorted?
> > 
> > The real question is "what was it supposed to mean"? AFAICT, it was
> > added to a couple of filesystems to be used to tag superblock read
> > I/O. Why - I don't know - there's a distinct lack of documentation
> > surrounding these bio flags. :/
> 
> It was added to be able to differentiate between data and meta data IO
> when using blktrace, that is all.

Ok.

> > Realistically, I'm not sure that having a separate queue for
> > BIO_META will buy us anything, given that noop is quite often the
> > fastest scheduler for XFS because it enables interleaved metadata
> > I/O to be merged with data I/O. Like I said, I was not able to spend
> > the time to determine exactly how BIO_META affected I/O patterns, so
> > I can't really comment on whether it is really necessary or not.
> 
> There's no seperate queue for meta data IO anywhere. CFQ will give
> _slight_ preference to meta data IO as a side effect, preferring the
> meta IO for otherwise same IO in what to serve next in the same queue.
> And it will not allow preemption of a meta data IO for a data IO.
> 
> So using meta should not yield any important boosts by itself.

Which means that performance increase I saw on CFQ was a result of
removing the BIO_SYNC tagging "optimisation" XFS uses for metadata,
not from adding BIO_META.....

Could you please document what these tags actually mean and do
so that other people don't get as confused as me about this
stuff....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-06  3:18                 ` Dave Chinner
@ 2008-10-07 18:06                   ` Jens Axboe
  2008-10-07 22:22                     ` Dave Chinner
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2008-10-07 18:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Aaron Carroll, Bodo Eggert, Andi Kleen, Andrew Morton,
	Arjan van de Ven, linux-kernel, Alan Cox

On Mon, Oct 06 2008, Dave Chinner wrote:
> On Sat, Oct 04, 2008 at 05:45:00PM +1000, Aaron Carroll wrote:
> > Dave Chinner wrote:
> >> On Thu, Oct 02, 2008 at 05:32:04PM +0200, Bodo Eggert wrote:
> >>> Sounds like you need a priority class besides sync and async.
> >>
> >> There's BIO_META now as well, which I was testing at the same time
> >> as RT priority. Marking all the metadata I/O as BIO_META did help,
> >> but once again I never got to determining if that was a result of
> >> the different tagging or the priority increase.
> >
> > What exactly do you want META to mean?  Strict prioritisation over
> > all other non-META requests, or just more frequent and/or larger
> > dispatches?  Should META requests be sorted?
> 
> The real question is "what was it supposed to mean"? AFAICT, it was
> added to a couple of filesystems to be used to tag superblock read
> I/O. Why - I don't know - there's a distinct lack of documentation
> surrounding these bio flags. :/

It was added to be able to differentiate between data and meta data IO
when using blktrace, that is all.

> Realistically, I'm not sure that having a separate queue for
> BIO_META will buy us anything, given that noop is quite often the
> fastest scheduler for XFS because it enables interleaved metadata
> I/O to be merged with data I/O. Like I said, I was not able to spend
> the time to determine exactly how BIO_META affected I/O patterns, so
> I can't really comment on whether it is really necessary or not.

There's no seperate queue for meta data IO anywhere. CFQ will give
_slight_ preference to meta data IO as a side effect, preferring the
meta IO for otherwise same IO in what to serve next in the same queue.
And it will not allow preemption of a meta data IO for a data IO.

So using meta should not yield any important boosts by itself.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-04  7:45               ` Aaron Carroll
@ 2008-10-06  3:18                 ` Dave Chinner
  2008-10-07 18:06                   ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Dave Chinner @ 2008-10-06  3:18 UTC (permalink / raw)
  To: Aaron Carroll
  Cc: Bodo Eggert, Jens Axboe, Andi Kleen, Andrew Morton,
	Arjan van de Ven, linux-kernel, Alan Cox

On Sat, Oct 04, 2008 at 05:45:00PM +1000, Aaron Carroll wrote:
> Dave Chinner wrote:
>> On Thu, Oct 02, 2008 at 05:32:04PM +0200, Bodo Eggert wrote:
>>> Sounds like you need a priority class besides sync and async.
>>
>> There's BIO_META now as well, which I was testing at the same time
>> as RT priority. Marking all the metadata I/O as BIO_META did help,
>> but once again I never got to determining if that was a result of
>> the different tagging or the priority increase.
>
> What exactly do you want META to mean?  Strict prioritisation over
> all other non-META requests, or just more frequent and/or larger
> dispatches?  Should META requests be sorted?

The real question is "what was it supposed to mean"? AFAICT, it was
added to a couple of filesystems to be used to tag superblock read
I/O. Why - I don't know - there's a distinct lack of documentation
surrounding these bio flags. :/

Realistically, I'm not sure that having a separate queue for
BIO_META will buy us anything, given that noop is quite often the
fastest scheduler for XFS because it enables interleaved metadata
I/O to be merged with data I/O. Like I said, I was not able to spend
the time to determine exactly how BIO_META affected I/O patterns, so
I can't really comment on whether it is really necessary or not.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02 23:34             ` Dave Chinner
@ 2008-10-04  7:45               ` Aaron Carroll
  2008-10-06  3:18                 ` Dave Chinner
  0 siblings, 1 reply; 76+ messages in thread
From: Aaron Carroll @ 2008-10-04  7:45 UTC (permalink / raw)
  To: david
  Cc: Bodo Eggert, Jens Axboe, Andi Kleen, Andrew Morton,
	Arjan van de Ven, linux-kernel, Alan Cox

Dave Chinner wrote:
> On Thu, Oct 02, 2008 at 05:32:04PM +0200, Bodo Eggert wrote:
>> Sounds like you need a priority class besides sync and async.
> 
> There's BIO_META now as well, which I was testing at the same time
> as RT priority. Marking all the metadata I/O as BIO_META did help,
> but once again I never got to determining if that was a result of
> the different tagging or the priority increase.

What exactly do you want META to mean?  Strict prioritisation over
all other non-META requests, or just more frequent and/or larger
dispatches?  Should META requests be sorted?

> However, given that only CFQ understand BIO_META, I suspect that
> changing the way XFS uses BIO_SYNC to be a combination of BIO_META
> and BIO_SYNC would cause significant regressions on other
> schedulers.....

That shouldn't be a problem.  noop doesn't care about any of that stuff,
and deadline doesn't care about BIO_SYNC (more on that below).  If the
bios that use META are a subset of those that currently use SYNC, then
we can temporarily change AS to treat META and SYNC equally.  Only CFQ
would change in behaviour.

So deadline should probably support BIO_SYNC... below is a patch to do
that.  It doesn't have much of an effect on postmark or compilebench on
a single spindle, but I'm guessing that's not the workload or hardware
that is expected to benefit.


---

Subject: [PATCH] deadline-iosched: support SYNC bio/request flag

Support sync/async requests in deadline rather than read/write, as is
done in AS and CFQ.

Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au>
---
 block/deadline-iosched.c |   63 ++++++++++++++++++++++++---------------------
 1 files changed, 34 insertions(+), 29 deletions(-)

diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 342448c..b2cfd47 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -23,6 +23,11 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
 
+enum {
+	REQ_ASYNC,
+	REQ_SYNC,
+};
+
 struct deadline_data {
 	/*
 	 * run time data
@@ -53,7 +58,7 @@ struct deadline_data {
 
 static void deadline_move_request(struct deadline_data *, struct request *);
 
-#define RQ_RB_ROOT(dd, rq)	(&(dd)->sort_list[rq_data_dir((rq))])
+#define RQ_RB_ROOT(dd, rq)	(&(dd)->sort_list[rq_is_sync((rq))])
 
 /*
  * get the request after `rq' in sector-sorted order
@@ -86,7 +91,7 @@ retry:
 static inline void
 deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
 {
-	const int data_dir = rq_data_dir(rq);
+	const int data_dir = rq_is_sync(rq);
 
 	if (dd->next_rq[data_dir] == rq)
 		dd->next_rq[data_dir] = deadline_latter_request(rq);
@@ -101,7 +106,7 @@ static void
 deadline_add_request(struct request_queue *q, struct request *rq)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
-	const int data_dir = rq_data_dir(rq);
+	const int data_dir = rq_is_sync(rq);
 
 	deadline_add_rq_rb(dd, rq);
 
@@ -206,10 +211,10 @@ deadline_move_to_dispatch(struct deadline_data *dd, struct request *rq)
 static void
 deadline_move_request(struct deadline_data *dd, struct request *rq)
 {
-	const int data_dir = rq_data_dir(rq);
+	const int data_dir = rq_is_sync(rq);
 
-	dd->next_rq[READ] = NULL;
-	dd->next_rq[WRITE] = NULL;
+	dd->next_rq[REQ_SYNC] = NULL;
+	dd->next_rq[REQ_ASYNC] = NULL;
 	dd->next_rq[data_dir] = deadline_latter_request(rq);
 
 	dd->last_sector = rq->sector + rq->nr_sectors;
@@ -245,18 +250,18 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
 static int deadline_dispatch_requests(struct request_queue *q, int force)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
-	const int reads = !list_empty(&dd->fifo_list[READ]);
-	const int writes = !list_empty(&dd->fifo_list[WRITE]);
+	const int reads = !list_empty(&dd->fifo_list[REQ_SYNC]);
+	const int writes = !list_empty(&dd->fifo_list[REQ_ASYNC]);
 	struct request *rq;
 	int data_dir;
 
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	if (dd->next_rq[WRITE])
-		rq = dd->next_rq[WRITE];
+	if (dd->next_rq[REQ_ASYNC])
+		rq = dd->next_rq[REQ_ASYNC];
 	else
-		rq = dd->next_rq[READ];
+		rq = dd->next_rq[REQ_SYNC];
 
 	if (rq) {
 		/* we have a "next request" */
@@ -276,12 +281,12 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[REQ_SYNC]));
 
 		if (writes && (dd->starved++ >= dd->writes_starved))
 			goto dispatch_writes;
 
-		data_dir = READ;
+		data_dir = REQ_SYNC;
 
 		goto dispatch_find_request;
 	}
@@ -292,11 +297,11 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[REQ_ASYNC]));
 
 		dd->starved = 0;
 
-		data_dir = WRITE;
+		data_dir = REQ_ASYNC;
 
 		goto dispatch_find_request;
 	}
@@ -338,16 +343,16 @@ static int deadline_queue_empty(struct request_queue *q)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
 
-	return list_empty(&dd->fifo_list[WRITE])
-		&& list_empty(&dd->fifo_list[READ]);
+	return list_empty(&dd->fifo_list[REQ_ASYNC])
+		&& list_empty(&dd->fifo_list[REQ_SYNC]);
 }
 
 static void deadline_exit_queue(elevator_t *e)
 {
 	struct deadline_data *dd = e->elevator_data;
 
-	BUG_ON(!list_empty(&dd->fifo_list[READ]));
-	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
+	BUG_ON(!list_empty(&dd->fifo_list[REQ_SYNC]));
+	BUG_ON(!list_empty(&dd->fifo_list[REQ_ASYNC]));
 
 	kfree(dd);
 }
@@ -363,12 +368,12 @@ static void *deadline_init_queue(struct request_queue *q)
 	if (!dd)
 		return NULL;
 
-	INIT_LIST_HEAD(&dd->fifo_list[READ]);
-	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
-	dd->sort_list[READ] = RB_ROOT;
-	dd->sort_list[WRITE] = RB_ROOT;
-	dd->fifo_expire[READ] = read_expire;
-	dd->fifo_expire[WRITE] = write_expire;
+	INIT_LIST_HEAD(&dd->fifo_list[REQ_SYNC]);
+	INIT_LIST_HEAD(&dd->fifo_list[REQ_ASYNC]);
+	dd->sort_list[REQ_SYNC] = RB_ROOT;
+	dd->sort_list[REQ_ASYNC] = RB_ROOT;
+	dd->fifo_expire[REQ_SYNC] = read_expire;
+	dd->fifo_expire[REQ_ASYNC] = write_expire;
 	dd->writes_starved = writes_starved;
 	dd->front_merges = 1;
 	dd->fifo_batch = fifo_batch;
@@ -403,8 +408,8 @@ static ssize_t __FUNC(elevator_t *e, char *page)			\
 		__data = jiffies_to_msecs(__data);			\
 	return deadline_var_show(__data, (page));			\
 }
-SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
-SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
+SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[REQ_SYNC], 1);
+SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[REQ_ASYNC], 1);
 SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
 SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
 SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
@@ -426,8 +431,8 @@ static ssize_t __FUNC(elevator_t *e, const char *page, size_t count)	\
 		*(__PTR) = __data;					\
 	return ret;							\
 }
-STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
-STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
+STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[REQ_SYNC], 0, INT_MAX, 1);
+STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[REQ_ASYNC], 0, INT_MAX, 1);
 STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
 STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
 STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
-- 
1.6.0.2

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
  2008-10-02 15:32           ` Bodo Eggert
@ 2008-10-02 23:34             ` Dave Chinner
  2008-10-04  7:45               ` Aaron Carroll
  0 siblings, 1 reply; 76+ messages in thread
From: Dave Chinner @ 2008-10-02 23:34 UTC (permalink / raw)
  To: Bodo Eggert
  Cc: Jens Axboe, Andi Kleen, Andrew Morton, Arjan van de Ven,
	linux-kernel, Alan Cox

On Thu, Oct 02, 2008 at 05:32:04PM +0200, Bodo Eggert wrote:
> Jens Axboe <jens.axboe@oracle.com> wrote:
> > On Thu, Oct 02 2008, Dave Chinner wrote:
> >> On Thu, Oct 02, 2008 at 09:55:11AM +0200, Jens Axboe wrote:
> 
> >> > Good point. I think we should mark the IO as sync, and maintain the same
> >> > priority level. Any IO that ends up being waited on is sync by
> >> > definition, we just need to expand the coverage a bit.
> >> 
> >> That's what XFS has always done - mark the journal I/O as sync.
> >> Still, once you load up the elevator, the sync I/O can still get
> >> delayed for hundreds of milliseconds before dispatch, which was
> >> why I started looking at boosting the priority of the log I/O.
> >> It proved to be much more effective at getting the log I/O
> >> dispatched than the existing "mark it sync" technique....
> > 
> > Sure, just marking it as sync is not a magic bullet. It'll be in the
> > first priority for that class, but it'll share bandwidth with other
> > processes. So if you have lots of IO going on, it can take hundreds of
> > miliseconds before being dispatched.
> 
> Sounds like you need a priority class besides sync and async.

There's BIO_META now as well, which I was testing at the same time
as RT priority. Marking all the metadata I/O as BIO_META did help,
but once again I never got to determining if that was a result of
the different tagging or the priority increase.

However, given that only CFQ understand BIO_META, I suspect that
changing the way XFS uses BIO_SYNC to be a combination of BIO_META
and BIO_SYNC would cause significant regressions on other
schedulers.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
       [not found]         ` <bisYW-3HQ-13@gated-at.bofh.it>
@ 2008-10-02 15:32           ` Bodo Eggert
  2008-10-02 23:34             ` Dave Chinner
  0 siblings, 1 reply; 76+ messages in thread
From: Bodo Eggert @ 2008-10-02 15:32 UTC (permalink / raw)
  To: Jens Axboe, Dave Chinner, Andi Kleen, Andrew Morton,
	Arjan van de Ven, linux-kernel, Alan Cox

Jens Axboe <jens.axboe@oracle.com> wrote:
> On Thu, Oct 02 2008, Dave Chinner wrote:
>> On Thu, Oct 02, 2008 at 09:55:11AM +0200, Jens Axboe wrote:

>> > Good point. I think we should mark the IO as sync, and maintain the same
>> > priority level. Any IO that ends up being waited on is sync by
>> > definition, we just need to expand the coverage a bit.
>> 
>> That's what XFS has always done - mark the journal I/O as sync.
>> Still, once you load up the elevator, the sync I/O can still get
>> delayed for hundreds of milliseconds before dispatch, which was
>> why I started looking at boosting the priority of the log I/O.
>> It proved to be much more effective at getting the log I/O
>> dispatched than the existing "mark it sync" technique....
> 
> Sure, just marking it as sync is not a magic bullet. It'll be in the
> first priority for that class, but it'll share bandwidth with other
> processes. So if you have lots of IO going on, it can take hundreds of
> miliseconds before being dispatched.

Sounds like you need a priority class besides sync and async.



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-11-16 18:35             ` Ray Lee
@ 2007-11-16 18:39               ` Alan D. Brunelle
  0 siblings, 0 replies; 76+ messages in thread
From: Alan D. Brunelle @ 2007-11-16 18:39 UTC (permalink / raw)
  To: Ray Lee
  Cc: Andrew Morton, Rik van Riel, arjan, jens.axboe, mingo, linux-kernel

Ray Lee wrote:

> 
> Out of curiosity, what are the mount options for the freshly created
> ext3 fs? In particular, are you using noatime, nodiratime?
> 
> Ray

Nope, just mount. However, the tool I'm using to read the large file & 
overwrite the large file does open with O_NOATIME for reads...

The tool used to read the files in the read-a-tree test is dd, and I 
doubt(?) it does a O_NOATIME...

Alan

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-11-16 16:25           ` Alan D. Brunelle
  2007-11-16 16:40             ` Alan D. Brunelle
@ 2007-11-16 18:35             ` Ray Lee
  2007-11-16 18:39               ` Alan D. Brunelle
  1 sibling, 1 reply; 76+ messages in thread
From: Ray Lee @ 2007-11-16 18:35 UTC (permalink / raw)
  To: Alan D. Brunelle
  Cc: Andrew Morton, Rik van Riel, arjan, jens.axboe, mingo, linux-kernel

On Nov 16, 2007 8:25 AM, Alan D. Brunelle <alan.brunelle@hp.com> wrote:
> Here are the results for the latest tests, some notes:
>
> o  The machine actually has 8GiB of RAM, so the tests still may end up
> using (some) page cache. (But at least it was the same for both kernels!
> :-) )
>
> o  Sorry the results took so long - the updated tree size caused the
> runs to take > 12 hours...
>
> o  The longer runs seemed to bring down the standard deviation a bit,
> although they are still quite large.
>
> o  10 runs per test (read large file, read a tree, overwrite large
> file), with averages presented.
>
> o  1st 4 columns (min, avg, max, std dev) refer to the average run
> lengths for the tests - real time, in seconds
>
> o  The last 3 columns are extracted from iostat results over the course
> of the whole run.
>
> o  The read a tree test certainly stands out - the other 2 large file
> manipulations have the two kernels within a couple of percent, but the
> read a tree test has Arjan's patch taking about 47%(!) longer on
> average. The increased %iowait & %system time in all 3 cases is interesting.
>
>
> Read large file:
>
> Kernel  Min    Avg    Max   Std Dev    %user  %system  %iowait
> --------------------------------------------------------------
> base :  201.6  215.1  275.5   22.8     0.26%    4.69%   33.54%
> arjan:  198.0  210.3  261.5   18.5     0.33%   10.24%   54.00%
>
> Read a tree:
>
> Kernel  Min    Avg    Max   Std Dev    %user  %system  %iowait
> --------------------------------------------------------------
> base : 3518.2 4631.3 5991.3  784.6     0.19%    3.29%   23.56%
> arjan: 5731.6 6849.8 7777.4  731.6     0.32%    9.90%   52.70%
>
> Overwrite large file:
>
> Kernel  Min    Avg    Max   Std Dev    %user  %system  %iowait
> --------------------------------------------------------------
> base :  104.2  147.7  239.5   38.4      0.02%    0.05%   1.08%
> arjan:  106.2  149.7  239.2   38.4      0.25%    0.79%  14.97%
>
> Let me know if there is anything else I can do to elaborate, or if you
> have suggestions for further testing.

Out of curiosity, what are the mount options for the freshly created
ext3 fs? In particular, are you using noatime, nodiratime?

Ray

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-11-16 16:25           ` Alan D. Brunelle
@ 2007-11-16 16:40             ` Alan D. Brunelle
  2007-11-16 18:35             ` Ray Lee
  1 sibling, 0 replies; 76+ messages in thread
From: Alan D. Brunelle @ 2007-11-16 16:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, arjan, jens.axboe, mingo, linux-kernel

Alan D. Brunelle wrote:
> 
> Read large file:
> 
> Kernel  Min    Avg    Max   Std Dev    %user  %system  %iowait
> --------------------------------------------------------------
> base :  201.6  215.1  275.5   22.8     0.26%    4.69%   33.54%
> arjan:  198.0  210.3  261.5   18.5     0.33%   10.24%   54.00%
> 
> Read a tree:
> 
> Kernel  Min    Avg    Max   Std Dev    %user  %system  %iowait
> --------------------------------------------------------------
> base : 3518.2 4631.3 5991.3  784.6     0.19%    3.29%   23.56%
> arjan: 5731.6 6849.8 7777.4  731.6     0.32%    9.90%   52.70%
> 
> Overwrite large file:
> 
> Kernel  Min    Avg    Max   Std Dev    %user  %system  %iowait
> --------------------------------------------------------------
> base :  104.2  147.7  239.5   38.4      0.02%    0.05%   1.08%
> arjan:  106.2  149.7  239.2   38.4      0.25%    0.79%  14.97%
> 
>

I'm going to try and do some clean up work on the iostat CPU results - 
the reason %user & %system are so low is (I think) because they also 
include a lot of 0% results from the tail of the runs (as the unmount is 
going on I think). I'm going to try and extract results for just the 
"meat" of the runs.

Alan

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-11-14 17:14         ` Andrew Morton
  2007-11-14 17:18           ` Ingo Molnar
  2007-11-14 19:24           ` Alan D. Brunelle
@ 2007-11-16 16:25           ` Alan D. Brunelle
  2007-11-16 16:40             ` Alan D. Brunelle
  2007-11-16 18:35             ` Ray Lee
  2 siblings, 2 replies; 76+ messages in thread
From: Alan D. Brunelle @ 2007-11-16 16:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, arjan, jens.axboe, mingo, linux-kernel

Here are the results for the latest tests, some notes:

o  The machine actually has 8GiB of RAM, so the tests still may end up 
using (some) page cache. (But at least it was the same for both kernels! 
:-) )

o  Sorry the results took so long - the updated tree size caused the 
runs to take > 12 hours...

o  The longer runs seemed to bring down the standard deviation a bit, 
although they are still quite large.

o  10 runs per test (read large file, read a tree, overwrite large 
file), with averages presented.

o  1st 4 columns (min, avg, max, std dev) refer to the average run 
lengths for the tests - real time, in seconds

o  The last 3 columns are extracted from iostat results over the course 
of the whole run.

o  The read a tree test certainly stands out - the other 2 large file 
manipulations have the two kernels within a couple of percent, but the 
read a tree test has Arjan's patch taking about 47%(!) longer on 
average. The increased %iowait & %system time in all 3 cases is interesting.


Read large file:

Kernel  Min    Avg    Max   Std Dev    %user  %system  %iowait
--------------------------------------------------------------
base :  201.6  215.1  275.5   22.8     0.26%    4.69%   33.54%
arjan:  198.0  210.3  261.5   18.5     0.33%   10.24%   54.00%

Read a tree:

Kernel  Min    Avg    Max   Std Dev    %user  %system  %iowait
--------------------------------------------------------------
base : 3518.2 4631.3 5991.3  784.6     0.19%    3.29%   23.56%
arjan: 5731.6 6849.8 7777.4  731.6     0.32%    9.90%   52.70%

Overwrite large file:

Kernel  Min    Avg    Max   Std Dev    %user  %system  %iowait
--------------------------------------------------------------
base :  104.2  147.7  239.5   38.4      0.02%    0.05%   1.08%
arjan:  106.2  149.7  239.2   38.4      0.25%    0.79%  14.97%

Let me know if there is anything else I can do to elaborate, or if you 
have suggestions for further testing.

Alan

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-11-14 19:24           ` Alan D. Brunelle
  2007-11-14 19:50             ` Arjan van de Ven
@ 2007-11-14 19:56             ` Alan D. Brunelle
  1 sibling, 0 replies; 76+ messages in thread
From: Alan D. Brunelle @ 2007-11-14 19:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, arjan, Jens Axboe, mingo, linux-kernel

Oh, and the runs were done in single-user mode...

Alan

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-11-14 19:24           ` Alan D. Brunelle
@ 2007-11-14 19:50             ` Arjan van de Ven
  2007-11-14 19:56             ` Alan D. Brunelle
  1 sibling, 0 replies; 76+ messages in thread
From: Arjan van de Ven @ 2007-11-14 19:50 UTC (permalink / raw)
  To: Alan D. Brunelle
  Cc: Andrew Morton, Rik van Riel, Jens Axboe, mingo, linux-kernel

On Wed, 14 Nov 2007 14:24:03 -0500
"Alan D. Brunelle" <Alan.Brunelle@hp.com> wrote:
> >   
> 
> The test works like this:
> 
>    1. I ensure that the device under test (DUT) is set to run the CFQ
>       scheduler.
>          1. It is a Fibre Channel 72GiB disk
>          2. Single partition...
>    2. Put an Ext3 FS on the partition (mkfs.ext3 -b 4096)
>    3. Mount the device, and then:
>          1. Put an 8GiB file on the new FS
>          2. Put 3 copies of a Linux tree (w/ objs & kernel & such)
> onto the FS in separate directories
>                1. Note: I'm going to do runs with 6 copies to each
>                   directory tree to get to about 4.2GiB per directory
> tree 4. Then, for each of the tests:
>          1. Remount the device (purge page cache by umount & then
> mount) 2. Start up a copy of 1 kernel tree to another tree (you hadn't
>             specified if the copy in the background should be to a new
>             area or not, so I'm just re-using the same area so we
> don't have to worry about removing the old). I keep doing the copy
>             as long as the tests are going
>          3. Perform the test (10 times)
> 
> The tests are:
> 
>     * Linear read of a large file (8GiB)
>     * Tree read (foreach file in the tree, dd it to /dev/null)
>     * Overwrite of that large file: was doing 256KiB random&direct
>       read/writes, will go down to 4KiB read/writes as that is more
>       realistic I'd guess
> 

ok so the obvious meta-question is this: what does it mean that your
test takes longer or shorter. I can see IO "capacity" (trying to avoid
the use bandwidth here) moves from the foreground test to the
background test (and/or other way around)... but if that was starved
previously... it could or could not be the right result.
What do you think the measure of "it's at least not worse" is? Is there
any way to get to that concept? (and then looking at if that got met is
the second step ;( )

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-11-14 17:51             ` Arjan van de Ven
  2007-11-14 18:55               ` Ingo Molnar
@ 2007-11-14 19:43               ` Alan D. Brunelle
  1 sibling, 0 replies; 76+ messages in thread
From: Alan D. Brunelle @ 2007-11-14 19:43 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Andrew Morton, Rik van Riel, jens.axboe, linux-kernel

Arjan van de Ven wrote:
> On Wed, 14 Nov 2007 18:18:05 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
>
>   
>> * Andrew Morton <akpm@linux-foundation.org> wrote:
>>
>>     
>>> ooh, more performance testing.  Thanks
>>>
>>>       
>>>>     * The overwriter task (on an 8GiB file), average over 10 runs:
>>>>           o 2.6.24 - 300.88226 seconds
>>>>           o 2.6.24 + Arjan's patch - 403.85505 seconds
>>>>
>>>>     * The read-a-different-kernel-tree task, average over 10 runs:
>>>>           o 2.6.24 - 46.8145945549 seconds
>>>>           o 2.6.24 + Arjan's patch - 39.6430601119 seconds
>>>>
>>>>     * The large-linear-read task (on an 8GiB file), average over
>>>> 10 runs: o 2.6.24 - 290.32522 seconds
>>>>           o 2.6.24 + Arjan's patch - 386.34860 seconds
>>>>         
>>> These are *large* differences, making this a very signifcant
>>> patch. Much care is needed now.
>>>       
>> and the numbers suggest it's mostly a severe performance regression. 
>> That's not what i have expected - ho hum. Apologies for my earlier 
>> "please merge it already!" whining.
>>     
>
> that's.. not automatic; it depends on what the right thing is :-(
> What for sure changes is that who gets to do IO changes. Some of the
> tests we ran internally (we didn't publish yet because we saw REALLY
> large variations for most of them even without any patch) show for
> example that "dbench" got slower. But.. dbench gets slower when things
> get more fair, and faster when things get unfair. What conclusion you
> draw out of that is a whole different matter and depends on exactly
> what the test is doing, and what is the right thing for the OS to do in
> terms of who gets to do the IO.
>
> THis makes the patch more tricky than the one line change suggests, and
> this is also why I haven't published a ton of data yet; it's hard to
> get useful tests for this (and the variation of the 2.6.23+ kernels
> makes it even harder to do anything meaningful ;-( )
>
>
>   
I'd also like to point out here that the run-to-run deviation was indeed 
quite large for both the unpatched- and patched-kernels, I'll report on 
that information with the next set of results...

Alan

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-11-14 17:14         ` Andrew Morton
  2007-11-14 17:18           ` Ingo Molnar
@ 2007-11-14 19:24           ` Alan D. Brunelle
  2007-11-14 19:50             ` Arjan van de Ven
  2007-11-14 19:56             ` Alan D. Brunelle
  2007-11-16 16:25           ` Alan D. Brunelle
  2 siblings, 2 replies; 76+ messages in thread
From: Alan D. Brunelle @ 2007-11-14 19:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, arjan, Jens Axboe, mingo, linux-kernel

Andrew Morton wrote:
> (cc lkml restored, with permission)
>
> On Wed, 14 Nov 2007 10:48:10 -0500 "Alan D. Brunelle" <Alan.Brunelle@hp.com> wrote:
>
>   
>> Andrew Morton wrote:
>>     
>>> On Mon, 15 Oct 2007 16:13:15 -0400
>>> Rik van Riel <riel@redhat.com> wrote:
>>>
>>>   
>>>       
>>>> Since you have been involved a lot with ext3 development,
>>>> which kinds of workloads do you think will show a performance
>>>> degradation with Arjan's patch?   What should I test?
>>>>     
>>>>         
>>> Gee.  Expect the unexpected ;)
>>>
>>> One problem might be when kjournald is doing its ordered-mode data
>>> writeback at the start of commit.  That writeout will now be
>>> higher-priority and might impact other tasks which are doing synchronous
>>> file overwrites (ie: no commits) or O_DIRECT reads or writes or just plain
>>> old reads.
>>>
>>> If the aggregate number of seeks over the long term is the same as before
>>> then of course the overall throughput should be the same, in which case the
>>> impact might only be upon latency.  However if for some reason the total
>>> number of seeks is increased then there will be throughput impacts as well.
>>>
>>> So as a starting point I guess one could set up a
>>> copy-a-kernel-tree-in-a-loop running in the background and then see what
>>> impact that has upon a large-linear-read, upon a
>>> read-a-different-kernel-tree and upon some database-style multithreaded
>>> O_DIRECT reader/overwriter.
>>> -
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at  http://www.tux.org/lkml/
>>>   
>>>       
>> Hi folks -
>>
>> I noticed this thread recently (sorry, month late and a dollar short), 
>> but it was very interesting to me, and as I had some cycles yesterday I 
>> did a quick run of what Andrew suggested here (using 2.6.24-rc2 w/out 
>> and then w/ Ajran's patch). After doing the runs last night I'm not 
>> overly happy with the test setup, but before redoing that, I thought I'd 
>> ask to make sure that this patch was still being considered?
>>     
>
> I'd consider its status to be "might be a good idea, more performance
> testing needed".
>
>   
>> And, btw, the results from the first pass were rather mixed -
>>     
>
> ooh, more performance testing.  Thanks
>
>   
>>     * The overwriter task (on an 8GiB file), average over 10 runs:
>>           o 2.6.24 - 300.88226 seconds
>>           o 2.6.24 + Arjan's patch - 403.85505 seconds
>>
>>     * The read-a-different-kernel-tree task, average over 10 runs:
>>           o 2.6.24 - 46.8145945549 seconds
>>           o 2.6.24 + Arjan's patch - 39.6430601119 seconds
>>
>>     * The large-linear-read task (on an 8GiB file), average over 10 runs:
>>           o 2.6.24 - 290.32522 seconds
>>           o 2.6.24 + Arjan's patch - 386.34860 seconds
>>     
>
> These are *large* differences, making this a very signifcant patch.  Much
> care is needed now.
>
> Could you expand a bit on what you're testing here?  I think that in one
> process you're doing a continuous copy-a-kernel-tree and in the other
> process you're the above three things, yes?
>   

The test works like this:

   1. I ensure that the device under test (DUT) is set to run the CFQ
      scheduler.
         1. It is a Fibre Channel 72GiB disk
         2. Single partition...
   2. Put an Ext3 FS on the partition (mkfs.ext3 -b 4096)
   3. Mount the device, and then:
         1. Put an 8GiB file on the new FS
         2. Put 3 copies of a Linux tree (w/ objs & kernel & such) onto
            the FS in separate directories
               1. Note: I'm going to do runs with 6 copies to each
                  directory tree to get to about 4.2GiB per directory tree
   4. Then, for each of the tests:
         1. Remount the device (purge page cache by umount & then mount)
         2. Start up a copy of 1 kernel tree to another tree (you hadn't
            specified if the copy in the background should be to a new
            area or not, so I'm just re-using the same area so we don't
            have to worry about removing the old). I keep doing the copy
            as long as the tests are going
         3. Perform the test (10 times)

The tests are:

    * Linear read of a large file (8GiB)
    * Tree read (foreach file in the tree, dd it to /dev/null)
    * Overwrite of that large file: was doing 256KiB random&direct
      read/writes, will go down to 4KiB read/writes as that is more
      realistic I'd guess

I'm going to try and get the comparisons done by tomorrow, the results 
should be very different due to the changes noted above (going to 4.2GiB 
trees instead of 700MiB, going to 4K instead of 256K read/writes). This 
may cause the runs to be much longer, and then I won't get it done as 
quickly...

> I guess the other things we should look at are the impact on the
> continuously-copy-a-kernel-tree process and also the overall IO throughput.
>  These things will of course be related.  If the overall system-wide IO
> throughput increases with the patch then we probably have a no-brainer.  If
> (as I suspect) the overall IO throughput is decreased then this will be a
> difficult call.
>   

I'll add in continuous 'iostat' grabs, and present data on that too - it 
would contain both generic IO information as well as grabbing 
CPU-related stuff (in particular %system).

>   
>> This was performed on a 4-way IA64 w/ 4GiB RAM w/ a FC disk containing 
>> an Ext3 (4KiB block) FS. The reason I woke up this morning and wasn't 
>> too happy with the benchmarking harness is that a kernel source tree is 
>> less than 1GiB and size, and thus the back ground task of reading it 
>> might all fit in the page cache. To do this right, I'd use another tree 
>> that has >4GiB of data stored in it. (And likewise, I'd change the 
>> read-a-different-kernel-tree task to use a larger tree as well.)
>>     
>
> hm, yes.  Back in the days when I used to do useful things I'd do most
> testing of this sort on 256MB, 128MB or even 64MB machines.  So that data
> would get tossed out of cache quickly so that I could use smaller working
> sets so that the tests could be executed faster.
>
>   
>> But before I do that, I want to make sure that the path is still being 
>> considered.
>>     
>
> Sure, it hasn't been ruled out.  Especially as those time deltas you're
> measuring are so large.  We haven't seen changes in IO throughput like that
> in years.  We just need to work out if they're net-positive or net-negative ;)
>
> This will end up being a pretty large hunk of work I expect.
>   
I'll get results out when I have the changes made to the script 
(outlined above), and the runs done.

Alan

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-11-14 17:51             ` Arjan van de Ven
@ 2007-11-14 18:55               ` Ingo Molnar
  2007-11-14 19:43               ` Alan D. Brunelle
  1 sibling, 0 replies; 76+ messages in thread
From: Ingo Molnar @ 2007-11-14 18:55 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, Alan D. Brunelle, Rik van Riel, jens.axboe, linux-kernel


* Arjan van de Ven <arjan@infradead.org> wrote:

> > > >     * The read-a-different-kernel-tree task, average over 10 runs:
> > > >           o 2.6.24 - 46.8145945549 seconds
> > > >           o 2.6.24 + Arjan's patch - 39.6430601119 seconds
> > > > 
> > > >     * The large-linear-read task (on an 8GiB file), average over
> > > > 10 runs: o 2.6.24 - 290.32522 seconds
> > > >           o 2.6.24 + Arjan's patch - 386.34860 seconds
> > > 
> > > These are *large* differences, making this a very signifcant
> > > patch. Much care is needed now.
> > 
> > and the numbers suggest it's mostly a severe performance regression. 
> > That's not what i have expected - ho hum. Apologies for my earlier 
> > "please merge it already!" whining.
> 
> that's.. not automatic; it depends on what the right thing is :-( What 
> for sure changes is that who gets to do IO changes. Some of the tests 
> we ran internally (we didn't publish yet because we saw REALLY large 
> variations for most of them even without any patch) show for example 
> that "dbench" got slower. But.. dbench gets slower when things get 
> more fair, and faster when things get unfair. What conclusion you draw 
> out of that is a whole different matter and depends on exactly what 
> the test is doing, and what is the right thing for the OS to do in 
> terms of who gets to do the IO.

yeah, i'd agree to not too much faith into dbench results.

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-11-14 17:18           ` Ingo Molnar
@ 2007-11-14 17:51             ` Arjan van de Ven
  2007-11-14 18:55               ` Ingo Molnar
  2007-11-14 19:43               ` Alan D. Brunelle
  0 siblings, 2 replies; 76+ messages in thread
From: Arjan van de Ven @ 2007-11-14 17:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Alan D. Brunelle, Rik van Riel, jens.axboe, linux-kernel

On Wed, 14 Nov 2007 18:18:05 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > ooh, more performance testing.  Thanks
> > 
> > >     * The overwriter task (on an 8GiB file), average over 10 runs:
> > >           o 2.6.24 - 300.88226 seconds
> > >           o 2.6.24 + Arjan's patch - 403.85505 seconds
> > > 
> > >     * The read-a-different-kernel-tree task, average over 10 runs:
> > >           o 2.6.24 - 46.8145945549 seconds
> > >           o 2.6.24 + Arjan's patch - 39.6430601119 seconds
> > > 
> > >     * The large-linear-read task (on an 8GiB file), average over
> > > 10 runs: o 2.6.24 - 290.32522 seconds
> > >           o 2.6.24 + Arjan's patch - 386.34860 seconds
> > 
> > These are *large* differences, making this a very signifcant
> > patch. Much care is needed now.
> 
> and the numbers suggest it's mostly a severe performance regression. 
> That's not what i have expected - ho hum. Apologies for my earlier 
> "please merge it already!" whining.

that's.. not automatic; it depends on what the right thing is :-(
What for sure changes is that who gets to do IO changes. Some of the
tests we ran internally (we didn't publish yet because we saw REALLY
large variations for most of them even without any patch) show for
example that "dbench" got slower. But.. dbench gets slower when things
get more fair, and faster when things get unfair. What conclusion you
draw out of that is a whole different matter and depends on exactly
what the test is doing, and what is the right thing for the OS to do in
terms of who gets to do the IO.

THis makes the patch more tricky than the one line change suggests, and
this is also why I haven't published a ton of data yet; it's hard to
get useful tests for this (and the variation of the 2.6.23+ kernels
makes it even harder to do anything meaningful ;-( )


-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-11-14 17:14         ` Andrew Morton
@ 2007-11-14 17:18           ` Ingo Molnar
  2007-11-14 17:51             ` Arjan van de Ven
  2007-11-14 19:24           ` Alan D. Brunelle
  2007-11-16 16:25           ` Alan D. Brunelle
  2 siblings, 1 reply; 76+ messages in thread
From: Ingo Molnar @ 2007-11-14 17:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alan D. Brunelle, Rik van Riel, arjan, jens.axboe, linux-kernel


* Andrew Morton <akpm@linux-foundation.org> wrote:

> ooh, more performance testing.  Thanks
> 
> >     * The overwriter task (on an 8GiB file), average over 10 runs:
> >           o 2.6.24 - 300.88226 seconds
> >           o 2.6.24 + Arjan's patch - 403.85505 seconds
> > 
> >     * The read-a-different-kernel-tree task, average over 10 runs:
> >           o 2.6.24 - 46.8145945549 seconds
> >           o 2.6.24 + Arjan's patch - 39.6430601119 seconds
> > 
> >     * The large-linear-read task (on an 8GiB file), average over 10 runs:
> >           o 2.6.24 - 290.32522 seconds
> >           o 2.6.24 + Arjan's patch - 386.34860 seconds
> 
> These are *large* differences, making this a very signifcant patch.  
> Much care is needed now.

and the numbers suggest it's mostly a severe performance regression. 
That's not what i have expected - ho hum. Apologies for my earlier 
"please merge it already!" whining.

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
       [not found]       ` <473B18BA.5000709@hp.com>
@ 2007-11-14 17:14         ` Andrew Morton
  2007-11-14 17:18           ` Ingo Molnar
                             ` (2 more replies)
  0 siblings, 3 replies; 76+ messages in thread
From: Andrew Morton @ 2007-11-14 17:14 UTC (permalink / raw)
  To: Alan D. Brunelle; +Cc: Rik van Riel, arjan, jens.axboe, mingo, linux-kernel


(cc lkml restored, with permission)

On Wed, 14 Nov 2007 10:48:10 -0500 "Alan D. Brunelle" <Alan.Brunelle@hp.com> wrote:

> Andrew Morton wrote:
> > On Mon, 15 Oct 2007 16:13:15 -0400
> > Rik van Riel <riel@redhat.com> wrote:
> >
> >   
> >> Since you have been involved a lot with ext3 development,
> >> which kinds of workloads do you think will show a performance
> >> degradation with Arjan's patch?   What should I test?
> >>     
> >
> > Gee.  Expect the unexpected ;)
> >
> > One problem might be when kjournald is doing its ordered-mode data
> > writeback at the start of commit.  That writeout will now be
> > higher-priority and might impact other tasks which are doing synchronous
> > file overwrites (ie: no commits) or O_DIRECT reads or writes or just plain
> > old reads.
> >
> > If the aggregate number of seeks over the long term is the same as before
> > then of course the overall throughput should be the same, in which case the
> > impact might only be upon latency.  However if for some reason the total
> > number of seeks is increased then there will be throughput impacts as well.
> >
> > So as a starting point I guess one could set up a
> > copy-a-kernel-tree-in-a-loop running in the background and then see what
> > impact that has upon a large-linear-read, upon a
> > read-a-different-kernel-tree and upon some database-style multithreaded
> > O_DIRECT reader/overwriter.
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> >   
> Hi folks -
> 
> I noticed this thread recently (sorry, month late and a dollar short), 
> but it was very interesting to me, and as I had some cycles yesterday I 
> did a quick run of what Andrew suggested here (using 2.6.24-rc2 w/out 
> and then w/ Ajran's patch). After doing the runs last night I'm not 
> overly happy with the test setup, but before redoing that, I thought I'd 
> ask to make sure that this patch was still being considered?

I'd consider its status to be "might be a good idea, more performance
testing needed".

> And, btw, the results from the first pass were rather mixed -

ooh, more performance testing.  Thanks

>     * The overwriter task (on an 8GiB file), average over 10 runs:
>           o 2.6.24 - 300.88226 seconds
>           o 2.6.24 + Arjan's patch - 403.85505 seconds
> 
>     * The read-a-different-kernel-tree task, average over 10 runs:
>           o 2.6.24 - 46.8145945549 seconds
>           o 2.6.24 + Arjan's patch - 39.6430601119 seconds
> 
>     * The large-linear-read task (on an 8GiB file), average over 10 runs:
>           o 2.6.24 - 290.32522 seconds
>           o 2.6.24 + Arjan's patch - 386.34860 seconds

These are *large* differences, making this a very signifcant patch.  Much
care is needed now.

Could you expand a bit on what you're testing here?  I think that in one
process you're doing a continuous copy-a-kernel-tree and in the other
process you're the above three things, yes?

I guess the other things we should look at are the impact on the
continuously-copy-a-kernel-tree process and also the overall IO throughput.
 These things will of course be related.  If the overall system-wide IO
throughput increases with the patch then we probably have a no-brainer.  If
(as I suspect) the overall IO throughput is decreased then this will be a
difficult call.

> This was performed on a 4-way IA64 w/ 4GiB RAM w/ a FC disk containing 
> an Ext3 (4KiB block) FS. The reason I woke up this morning and wasn't 
> too happy with the benchmarking harness is that a kernel source tree is 
> less than 1GiB and size, and thus the back ground task of reading it 
> might all fit in the page cache. To do this right, I'd use another tree 
> that has >4GiB of data stored in it. (And likewise, I'd change the 
> read-a-different-kernel-tree task to use a larger tree as well.)

hm, yes.  Back in the days when I used to do useful things I'd do most
testing of this sort on 256MB, 128MB or even 64MB machines.  So that data
would get tossed out of cache quickly so that I could use smaller working
sets so that the tests could be executed faster.

> But before I do that, I want to make sure that the path is still being 
> considered.

Sure, it hasn't been ruled out.  Especially as those time deltas you're
measuring are so large.  We haven't seen changes in IO throughput like that
in years.  We just need to work out if they're net-positive or net-negative ;)

This will end up being a pretty large hunk of work I expect.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-10-22  9:40         ` Ingo Molnar
@ 2007-10-22  9:49           ` Andrew Morton
  0 siblings, 0 replies; 76+ messages in thread
From: Andrew Morton @ 2007-10-22  9:49 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Jens Axboe, Arjan van de Ven, linux-kernel

On Mon, 22 Oct 2007 11:40:14 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > > so lets just goddamn apply this _trivial_ patch. This isnt an 
> > > intrusive 1000 line rewrite that is hard to revert. If it causes any 
> > > bandwidth problems, it will be just as trivial to undo. If we do 
> > > anything else we just stiffle the still young and very much 
> > > under-represented "lets fix latencies that bothers people" movement. 
> > > If anything we need _positive_ discrimination for latency related 
> > > fixes (which treatment this fix does not need at all - all it needs 
> > > is _equal_ footing with the countless bandwidth patches that go into 
> > > the kernel all the time), otherwise it will never take off and 
> > > become as healthy as bandwidth optimizations. Ok?
> > 
> > I think the situation is that we've asked for some additional 
> > what-can-be-hurt-by-this testing.
> >
> > Yes, we could sling it out there and wait for the reports.  But often 
> > that's a pretty painful process and regressions can be discovered too 
> > late for us to do anything about them.
> 
> reverting this oneliner is trivial. Finding bandwidth problems and 
> tracking them down to this oneliner change is relatively easy too. 
> Finding latency problems and fixing them is _not_ trivial.
> 
> Boot up a Linux desktop and start OOo or firefox, and measure the time 
> it takes to start the app up. 10-20 seconds on a top-of-the-line 
> quad-core 3.2 GHz system - which is a shame. Same box can do in excess 
> of 1GB/sec block IO. Yes, one could blame the apps but in reality most 
> of the blame is mostly on the kernel side. We do not make bloat and 
> latency suckage apparent enough to user-space (due to lack of 
> intelligent instrumentation), we make latencies hard to fix, we have an 
> acceptance bias towards bandwidth fixes (because they are easier to 
> measure and justify) - and that's all what it takes to let such a 
> situation get out of control.
> 
> and i can bring up the scheduler as an example. CFS broke the bandwidth 
> performance of one particular app and it took only a few days to get it 
> back under control. But it was months to get good latency behavior out 
> of the scheduler. And that is with the help of excellent scheduler 
> instrumentation. In the IO space the latency situation is much, much 
> worse. Really.
> 

None of which is an argument for simply not bothering to do a bit more
developer testing before merging.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-10-22  9:23       ` Andrew Morton
  2007-10-22  9:27         ` Ingo Molnar
@ 2007-10-22  9:40         ` Ingo Molnar
  2007-10-22  9:49           ` Andrew Morton
  1 sibling, 1 reply; 76+ messages in thread
From: Ingo Molnar @ 2007-10-22  9:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, Arjan van de Ven, linux-kernel


* Andrew Morton <akpm@linux-foundation.org> wrote:

> > so lets just goddamn apply this _trivial_ patch. This isnt an 
> > intrusive 1000 line rewrite that is hard to revert. If it causes any 
> > bandwidth problems, it will be just as trivial to undo. If we do 
> > anything else we just stiffle the still young and very much 
> > under-represented "lets fix latencies that bothers people" movement. 
> > If anything we need _positive_ discrimination for latency related 
> > fixes (which treatment this fix does not need at all - all it needs 
> > is _equal_ footing with the countless bandwidth patches that go into 
> > the kernel all the time), otherwise it will never take off and 
> > become as healthy as bandwidth optimizations. Ok?
> 
> I think the situation is that we've asked for some additional 
> what-can-be-hurt-by-this testing.
>
> Yes, we could sling it out there and wait for the reports.  But often 
> that's a pretty painful process and regressions can be discovered too 
> late for us to do anything about them.

reverting this oneliner is trivial. Finding bandwidth problems and 
tracking them down to this oneliner change is relatively easy too. 
Finding latency problems and fixing them is _not_ trivial.

Boot up a Linux desktop and start OOo or firefox, and measure the time 
it takes to start the app up. 10-20 seconds on a top-of-the-line 
quad-core 3.2 GHz system - which is a shame. Same box can do in excess 
of 1GB/sec block IO. Yes, one could blame the apps but in reality most 
of the blame is mostly on the kernel side. We do not make bloat and 
latency suckage apparent enough to user-space (due to lack of 
intelligent instrumentation), we make latencies hard to fix, we have an 
acceptance bias towards bandwidth fixes (because they are easier to 
measure and justify) - and that's all what it takes to let such a 
situation get out of control.

and i can bring up the scheduler as an example. CFS broke the bandwidth 
performance of one particular app and it took only a few days to get it 
back under control. But it was months to get good latency behavior out 
of the scheduler. And that is with the help of excellent scheduler 
instrumentation. In the IO space the latency situation is much, much 
worse. Really.

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-10-22  9:23       ` Andrew Morton
@ 2007-10-22  9:27         ` Ingo Molnar
  2007-10-22  9:40         ` Ingo Molnar
  1 sibling, 0 replies; 76+ messages in thread
From: Ingo Molnar @ 2007-10-22  9:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, Arjan van de Ven, linux-kernel


* Andrew Morton <akpm@linux-foundation.org> wrote:

> > but what bothers me even more is the large picture. Linux's 
> > development is still fundamentally skewed towards bandwidth (which 
> > goes up with hardware advances anyway), while the focus on latencies 
> > is very lacking (which users do care about much more and which 
> > usually does _not_ improve with improved hardware), so i cannot see 
> > why we shouldnt apply this. Reminds me of the illogical, almost 
> > superstitious resistence against the relatime patch. (which is not 
> > in 2.6.24 mind you - killed for good)
> 
> Try `mount -o relatime' and prepare to be surprised ;)

i mean this one:

  http://people.redhat.com/mingo/relatime-patches/improve-relatime.patch

this actually makes relatime practical.

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-10-22  9:10     ` Ingo Molnar
@ 2007-10-22  9:23       ` Andrew Morton
  2007-10-22  9:27         ` Ingo Molnar
  2007-10-22  9:40         ` Ingo Molnar
  0 siblings, 2 replies; 76+ messages in thread
From: Andrew Morton @ 2007-10-22  9:23 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Jens Axboe, Arjan van de Ven, linux-kernel

On Mon, 22 Oct 2007 11:10:57 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > > Seems a pretty fundamental change which could do with some careful 
> > > benchmarking, methinks.
> > > 
> > > See, your patch amounts to "do more seeks to improve one test case". 
> > > Surely other testcases will worsen.  What are they?
> > 
> > Yes, completely agree! I think Arjans patch makes a heap of sense, but 
> > some numbers would be great to see.
> 
> Arjan gave the relevant hard numbers:
> 
> | With latencytop, I noticed that the (in memory) atime updates during a 
> | kernel build had latencies of 600 msec or longer [...]
> |
> | With this patch, the latencies for atime updates (and similar 
> | operation) go down by a factor of 3x to 4x !
> 
> atime update latencies went down by a factor of 3x-4x ...
> 
> but what bothers me even more is the large picture. Linux's development 
> is still fundamentally skewed towards bandwidth (which goes up with 
> hardware advances anyway), while the focus on latencies is very lacking 
> (which users do care about much more and which usually does _not_ 
> improve with improved hardware), so i cannot see why we shouldnt apply 
> this. Reminds me of the illogical, almost superstitious resistence 
> against the relatime patch. (which is not in 2.6.24 mind you - killed 
> for good)

Try `mount -o relatime' and prepare to be surprised ;)

> if bandwidth hurts anywhere, it will be pointed out and fixed, we've got 
> like tons of bandwidth benchmarks and it's _easy_ to fix bandwidth 
> problems. But _finally_ we now have desktop latency tools, hard numbers 
> and patches that fix them, but what do we do ... we put up extra 
> roadblocks??
> 
> so lets just goddamn apply this _trivial_ patch. This isnt an intrusive 
> 1000 line rewrite that is hard to revert. If it causes any bandwidth 
> problems, it will be just as trivial to undo. If we do anything else we 
> just stiffle the still young and very much under-represented "lets fix 
> latencies that bothers people" movement. If anything we need _positive_ 
> discrimination for latency related fixes (which treatment this fix does 
> not need at all - all it needs is _equal_ footing with the countless 
> bandwidth patches that go into the kernel all the time), otherwise it 
> will never take off and become as healthy as bandwidth optimizations. 
> Ok?
> 

I think the situation is that we've asked for some additional
what-can-be-hurt-by-this testing.

Yes, we could sling it out there and wait for the reports.  But often
that's a pretty painful process and regressions can be discovered too late
for us to do anything about them.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-10-15 19:28   ` Jens Axboe
@ 2007-10-22  9:10     ` Ingo Molnar
  2007-10-22  9:23       ` Andrew Morton
  0 siblings, 1 reply; 76+ messages in thread
From: Ingo Molnar @ 2007-10-22  9:10 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, Arjan van de Ven, linux-kernel


* Jens Axboe <jens.axboe@oracle.com> wrote:

> > Seems a pretty fundamental change which could do with some careful 
> > benchmarking, methinks.
> > 
> > See, your patch amounts to "do more seeks to improve one test case". 
> > Surely other testcases will worsen.  What are they?
> 
> Yes, completely agree! I think Arjans patch makes a heap of sense, but 
> some numbers would be great to see.

Arjan gave the relevant hard numbers:

| With latencytop, I noticed that the (in memory) atime updates during a 
| kernel build had latencies of 600 msec or longer [...]
|
| With this patch, the latencies for atime updates (and similar 
| operation) go down by a factor of 3x to 4x !

atime update latencies went down by a factor of 3x-4x ...

but what bothers me even more is the large picture. Linux's development 
is still fundamentally skewed towards bandwidth (which goes up with 
hardware advances anyway), while the focus on latencies is very lacking 
(which users do care about much more and which usually does _not_ 
improve with improved hardware), so i cannot see why we shouldnt apply 
this. Reminds me of the illogical, almost superstitious resistence 
against the relatime patch. (which is not in 2.6.24 mind you - killed 
for good)

if bandwidth hurts anywhere, it will be pointed out and fixed, we've got 
like tons of bandwidth benchmarks and it's _easy_ to fix bandwidth 
problems. But _finally_ we now have desktop latency tools, hard numbers 
and patches that fix them, but what do we do ... we put up extra 
roadblocks??

so lets just goddamn apply this _trivial_ patch. This isnt an intrusive 
1000 line rewrite that is hard to revert. If it causes any bandwidth 
problems, it will be just as trivial to undo. If we do anything else we 
just stiffle the still young and very much under-represented "lets fix 
latencies that bothers people" movement. If anything we need _positive_ 
discrimination for latency related fixes (which treatment this fix does 
not need at all - all it needs is _equal_ footing with the countless 
bandwidth patches that go into the kernel all the time), otherwise it 
will never take off and become as healthy as bandwidth optimizations. 
Ok?

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-10-15 20:13   ` Rik van Riel
@ 2007-10-15 21:12     ` Andrew Morton
       [not found]       ` <473B18BA.5000709@hp.com>
  0 siblings, 1 reply; 76+ messages in thread
From: Andrew Morton @ 2007-10-15 21:12 UTC (permalink / raw)
  To: Rik van Riel; +Cc: arjan, linux-kernel, jens.axboe, mingo

On Mon, 15 Oct 2007 16:13:15 -0400
Rik van Riel <riel@redhat.com> wrote:

> Since you have been involved a lot with ext3 development,
> which kinds of workloads do you think will show a performance
> degradation with Arjan's patch?   What should I test?

Gee.  Expect the unexpected ;)

One problem might be when kjournald is doing its ordered-mode data
writeback at the start of commit.  That writeout will now be
higher-priority and might impact other tasks which are doing synchronous
file overwrites (ie: no commits) or O_DIRECT reads or writes or just plain
old reads.

If the aggregate number of seeks over the long term is the same as before
then of course the overall throughput should be the same, in which case the
impact might only be upon latency.  However if for some reason the total
number of seeks is increased then there will be throughput impacts as well.

So as a starting point I guess one could set up a
copy-a-kernel-tree-in-a-loop running in the background and then see what
impact that has upon a large-linear-read, upon a
read-a-different-kernel-tree and upon some database-style multithreaded
O_DIRECT reader/overwriter.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-10-15 18:47 ` Andrew Morton
  2007-10-15 19:28   ` Jens Axboe
@ 2007-10-15 20:13   ` Rik van Riel
  2007-10-15 21:12     ` Andrew Morton
  1 sibling, 1 reply; 76+ messages in thread
From: Rik van Riel @ 2007-10-15 20:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Arjan van de Ven, linux-kernel, jens.axboe, mingo

On Mon, 15 Oct 2007 11:47:38 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Mon, 15 Oct 2007 10:46:47 -0700
> Arjan van de Ven <arjan@infradead.org> wrote:
> 
> > 
> > Subject: Give kjournald a IOPRIO_CLASS_RT io priority
> > From: Arjan van de Ven <arjan@linux.intel.com>

> Seems a pretty fundamental change which could do with some careful
> benchmarking, methinks.

FWIW, I have marked the kjournald processes on my system
realtime with "rtprio -c 1 `pidof kjournald`" and the
usual desktop stalls that plague my system have not yet
happened this afternoon.

> See, your patch amounts to "do more seeks to improve one test case". 
> Surely other testcases will worsen.  What are they?

The big problem I have seen here is that processes end up
waiting on kjournald to do something, and kjournald is
waiting due to the IO scheduler.

This can cause a lot of low IO (high IO priority) processes
to indirectly get stuck behind a few high IO (low priority)
processes.

Since you have been involved a lot with ext3 development,
which kinds of workloads do you think will show a performance
degradation with Arjan's patch?   What should I test?

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-10-15 18:47 ` Andrew Morton
@ 2007-10-15 19:28   ` Jens Axboe
  2007-10-22  9:10     ` Ingo Molnar
  2007-10-15 20:13   ` Rik van Riel
  1 sibling, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2007-10-15 19:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Arjan van de Ven, linux-kernel, mingo

On Mon, Oct 15 2007, Andrew Morton wrote:
> On Mon, 15 Oct 2007 10:46:47 -0700
> Arjan van de Ven <arjan@infradead.org> wrote:
> 
> > 
> > Subject: Give kjournald a IOPRIO_CLASS_RT io priority
> > From: Arjan van de Ven <arjan@linux.intel.com>
> > 
> > With latencytop, I noticed that the (in memory) atime updates during a
> > kernel build had latencies of 600 msec or longer; this is obviously not so
> > nice behavior. Other EXT3 journal related operations had similar or even
> > longer latencies.
> > 
> > Digging into this a bit more, it appears to be an interaction between EXT3
> > and CFQ in that CFQ tries to be fair to everyone, including kjournald.
> > However, in reality, kjournald is "special" in that it does a lot of journal
> > work and effectively this leads to a twisted kind of "mass priority
> > inversion" type of behavior.
> > 
> > The good news is that CFQ already has the infrastructure to make certain
> > processes special... JBD just wasn't using that quite yet.
> > 
> > The patch below makes kjournald of the IOPRIO_CLASS_RT priority to break
> > this priority inversion behavior. With this patch, the latencies for atime
> > updates (and similar operation) go down by a factor of 3x to 4x !
> > 
> 
> Seems a pretty fundamental change which could do with some careful
> benchmarking, methinks.
> 
> See, your patch amounts to "do more seeks to improve one test case". 
> Surely other testcases will worsen.  What are they?

Yes, completely agree! I think Arjans patch makes a heap of sense, but
some numbers would be great to see.

> > Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> > 
> > 
> > diff -purN linux-2.6.23-rc9.org/fs/jbd/journal.c linux-2.6.23-rc9.lt/fs/jbd/journal.c
> > --- linux-2.6.23-rc9.org/fs/jbd/journal.c	2007-10-02 05:24:52.000000000 +0200
> > +++ linux-2.6.23-rc9.lt/fs/jbd/journal.c	2007-10-14 00:06:55.000000000 +0200
> > @@ -35,6 +35,7 @@
> >  #include <linux/kthread.h>
> >  #include <linux/poison.h>
> >  #include <linux/proc_fs.h>
> > +#include <linux/ioprio.h>
> >  
> >  #include <asm/uaccess.h>
> >  #include <asm/page.h>
> > @@ -131,6 +132,8 @@ static int kjournald(void *arg)
> >  	printk(KERN_INFO "kjournald starting.  Commit interval %ld seconds\n",
> >  			journal->j_commit_interval / HZ);
> >  
> > +	current->ioprio =  (IOPRIO_CLASS_RT << IOPRIO_CLASS_SHIFT) | 4;
> > +
> 
> Might be worth a code comment?

It should not be merged as-is, instead I'll provide a function to do
this. It should also set current->io_context->ioprio_changed. Since no
IO has been done yet at this point it doesn't matter. But we should cut
a piece of set_task_ioprio() out and provide that as a kernel helper for
this sort of thing.

Even just writing it as:

current->ioprio =  (IOPRIO_CLASS_RT << IOPRIO_CLASS_SHIFT) | IOPRIO_NORM;

would be more readable. Or perhaps this would suffice, given the above
restriction that IO hasn't been done yet:

current->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, IOPRIO_NORM);

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [patch] Give kjournald a IOPRIO_CLASS_RT io priority
  2007-10-15 17:46 [patch] " Arjan van de Ven
@ 2007-10-15 18:47 ` Andrew Morton
  2007-10-15 19:28   ` Jens Axboe
  2007-10-15 20:13   ` Rik van Riel
  0 siblings, 2 replies; 76+ messages in thread
From: Andrew Morton @ 2007-10-15 18:47 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel, jens.axboe, mingo

On Mon, 15 Oct 2007 10:46:47 -0700
Arjan van de Ven <arjan@infradead.org> wrote:

> 
> Subject: Give kjournald a IOPRIO_CLASS_RT io priority
> From: Arjan van de Ven <arjan@linux.intel.com>
> 
> With latencytop, I noticed that the (in memory) atime updates during a
> kernel build had latencies of 600 msec or longer; this is obviously not so
> nice behavior. Other EXT3 journal related operations had similar or even
> longer latencies.
> 
> Digging into this a bit more, it appears to be an interaction between EXT3
> and CFQ in that CFQ tries to be fair to everyone, including kjournald.
> However, in reality, kjournald is "special" in that it does a lot of journal
> work and effectively this leads to a twisted kind of "mass priority
> inversion" type of behavior.
> 
> The good news is that CFQ already has the infrastructure to make certain
> processes special... JBD just wasn't using that quite yet.
> 
> The patch below makes kjournald of the IOPRIO_CLASS_RT priority to break
> this priority inversion behavior. With this patch, the latencies for atime
> updates (and similar operation) go down by a factor of 3x to 4x !
> 

Seems a pretty fundamental change which could do with some careful
benchmarking, methinks.

See, your patch amounts to "do more seeks to improve one test case". 
Surely other testcases will worsen.  What are they?

> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> 
> 
> diff -purN linux-2.6.23-rc9.org/fs/jbd/journal.c linux-2.6.23-rc9.lt/fs/jbd/journal.c
> --- linux-2.6.23-rc9.org/fs/jbd/journal.c	2007-10-02 05:24:52.000000000 +0200
> +++ linux-2.6.23-rc9.lt/fs/jbd/journal.c	2007-10-14 00:06:55.000000000 +0200
> @@ -35,6 +35,7 @@
>  #include <linux/kthread.h>
>  #include <linux/poison.h>
>  #include <linux/proc_fs.h>
> +#include <linux/ioprio.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/page.h>
> @@ -131,6 +132,8 @@ static int kjournald(void *arg)
>  	printk(KERN_INFO "kjournald starting.  Commit interval %ld seconds\n",
>  			journal->j_commit_interval / HZ);
>  
> +	current->ioprio =  (IOPRIO_CLASS_RT << IOPRIO_CLASS_SHIFT) | 4;
> +

Might be worth a code comment?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [patch] Give kjournald a IOPRIO_CLASS_RT io priority
@ 2007-10-15 17:46 Arjan van de Ven
  2007-10-15 18:47 ` Andrew Morton
  0 siblings, 1 reply; 76+ messages in thread
From: Arjan van de Ven @ 2007-10-15 17:46 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, jens.axboe, mingo


Subject: Give kjournald a IOPRIO_CLASS_RT io priority
From: Arjan van de Ven <arjan@linux.intel.com>

With latencytop, I noticed that the (in memory) atime updates during a
kernel build had latencies of 600 msec or longer; this is obviously not so
nice behavior. Other EXT3 journal related operations had similar or even
longer latencies.

Digging into this a bit more, it appears to be an interaction between EXT3
and CFQ in that CFQ tries to be fair to everyone, including kjournald.
However, in reality, kjournald is "special" in that it does a lot of journal
work and effectively this leads to a twisted kind of "mass priority
inversion" type of behavior.

The good news is that CFQ already has the infrastructure to make certain
processes special... JBD just wasn't using that quite yet.

The patch below makes kjournald of the IOPRIO_CLASS_RT priority to break
this priority inversion behavior. With this patch, the latencies for atime
updates (and similar operation) go down by a factor of 3x to 4x !


Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>


diff -purN linux-2.6.23-rc9.org/fs/jbd/journal.c linux-2.6.23-rc9.lt/fs/jbd/journal.c
--- linux-2.6.23-rc9.org/fs/jbd/journal.c	2007-10-02 05:24:52.000000000 +0200
+++ linux-2.6.23-rc9.lt/fs/jbd/journal.c	2007-10-14 00:06:55.000000000 +0200
@@ -35,6 +35,7 @@
 #include <linux/kthread.h>
 #include <linux/poison.h>
 #include <linux/proc_fs.h>
+#include <linux/ioprio.h>
 
 #include <asm/uaccess.h>
 #include <asm/page.h>
@@ -131,6 +132,8 @@ static int kjournald(void *arg)
 	printk(KERN_INFO "kjournald starting.  Commit interval %ld seconds\n",
 			journal->j_commit_interval / HZ);
 
+	current->ioprio =  (IOPRIO_CLASS_RT << IOPRIO_CLASS_SHIFT) | 4;
+
 	/*
 	 * And now, wait forever for commit wakeup events.
 	 */

^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2008-10-09  8:48 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-10-02  3:00 [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority Arjan van de Ven
2008-10-02  4:56 ` Andrew Morton
2008-10-02  6:27   ` Jens Axboe
2008-10-02  6:55     ` Andrew Morton
2008-10-02  7:45       ` Jens Axboe
2008-10-02  8:03         ` Andrew Morton
2008-10-02  8:22           ` Jens Axboe
2008-10-02  8:43             ` Andrew Morton
2008-10-02  8:46               ` Jens Axboe
2008-10-02 12:04           ` Theodore Tso
2008-10-02 13:16             ` Arjan van de Ven
2008-10-02 13:46               ` Theodore Tso
2008-10-02 14:33                 ` Arjan van de Ven
2008-10-04 14:12                   ` Theodore Tso
2008-10-04 17:14                     ` Joseph Fannin
2008-10-04 21:27                       ` Theodore Tso
2008-10-02 13:12       ` Arjan van de Ven
2008-10-02 20:24         ` Andrew Morton
2008-10-03  4:01           ` Arjan van de Ven
2008-10-03  4:23             ` Arjan van de Ven
2008-10-03  4:40               ` Andrew Morton
2008-10-03  4:43                 ` Arjan van de Ven
2008-10-03  4:50                   ` Andrew Morton
2008-10-03  5:00                     ` Arjan van de Ven
2008-10-03  5:24                       ` Andrew Morton
2008-10-03 17:21                         ` Arjan van de Ven
2008-10-09  3:00                         ` Theodore Tso
2008-10-09  3:38                           ` Andrew Morton
2008-10-03  4:45                 ` Arjan van de Ven
2008-10-02  6:57   ` Andi Kleen
2008-10-02  7:55     ` Jens Axboe
2008-10-02  9:33       ` Dave Chinner
2008-10-02  9:45         ` Jens Axboe
2008-10-02 13:14           ` Arjan van de Ven
2008-10-02 13:27             ` Jens Axboe
2008-10-02 13:36               ` Arjan van de Ven
2008-10-02 13:47                 ` Jens Axboe
2008-10-02 14:26                   ` Arjan van de Ven
2008-10-02 16:42                     ` Jens Axboe
2008-10-02 19:04           ` Arjan van de Ven
2008-10-02 19:22             ` Jens Axboe
2008-10-02 21:37               ` Andrew Morton
2008-10-02 23:58                 ` Dave Chinner
2008-10-03  0:06                   ` Andrew Morton
2008-10-03  0:20                     ` Andrew Morton
2008-10-02 13:05   ` Arjan van de Ven
2008-10-02 17:11     ` Jens Axboe
     [not found] <bimJN-4cO-5@gated-at.bofh.it>
     [not found] ` <biosl-6bq-9@gated-at.bofh.it>
     [not found]   ` <biqkw-aK-3@gated-at.bofh.it>
     [not found]     ` <birgx-1pQ-9@gated-at.bofh.it>
     [not found]       ` <bisPe-3xx-9@gated-at.bofh.it>
     [not found]         ` <bisYW-3HQ-13@gated-at.bofh.it>
2008-10-02 15:32           ` Bodo Eggert
2008-10-02 23:34             ` Dave Chinner
2008-10-04  7:45               ` Aaron Carroll
2008-10-06  3:18                 ` Dave Chinner
2008-10-07 18:06                   ` Jens Axboe
2008-10-07 22:22                     ` Dave Chinner
2008-10-09  8:48                       ` Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2007-10-15 17:46 [patch] " Arjan van de Ven
2007-10-15 18:47 ` Andrew Morton
2007-10-15 19:28   ` Jens Axboe
2007-10-22  9:10     ` Ingo Molnar
2007-10-22  9:23       ` Andrew Morton
2007-10-22  9:27         ` Ingo Molnar
2007-10-22  9:40         ` Ingo Molnar
2007-10-22  9:49           ` Andrew Morton
2007-10-15 20:13   ` Rik van Riel
2007-10-15 21:12     ` Andrew Morton
     [not found]       ` <473B18BA.5000709@hp.com>
2007-11-14 17:14         ` Andrew Morton
2007-11-14 17:18           ` Ingo Molnar
2007-11-14 17:51             ` Arjan van de Ven
2007-11-14 18:55               ` Ingo Molnar
2007-11-14 19:43               ` Alan D. Brunelle
2007-11-14 19:24           ` Alan D. Brunelle
2007-11-14 19:50             ` Arjan van de Ven
2007-11-14 19:56             ` Alan D. Brunelle
2007-11-16 16:25           ` Alan D. Brunelle
2007-11-16 16:40             ` Alan D. Brunelle
2007-11-16 18:35             ` Ray Lee
2007-11-16 18:39               ` Alan D. Brunelle

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).