linux-kernel.vger.kernel.org archive mirror
* [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
@ 2015-09-11 19:37 Chris Mason
  2015-09-11 20:02 ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-11 19:37 UTC (permalink / raw)
  To: Linus Torvalds, LKML, linux-fsdevel; +Cc: Josef Bacik, Dave Chinner, neilb

Linus, this is the plugging problem I mentioned in my btrfs pull.  It
impacts only MD raid10 and btrfs raid5/6, and I'm not wild about the
patch. But I wanted to at least send in the basic fix for rc1 so this
doesn't cause bigger problems for early testers:

Commit d353d7587 added a plug/finish_plug pair to writeback_sb_inodes,
but writeback_sb_inodes has a horrible secret...it's called with the
wb->list_lock held.

Btrfs raid5/6 and MD raid10 have horrible secrets of their own...they
both do allocations in their unplug callbacks.

None of the options to fix it are very pretty.  We don't want to kick
off workers for all of these unplugs, and the lock doesn't look hot
enough to justify bigger restructuring.

[ 2854.025042] BUG: sleeping function called from invalid context at mm/page_alloc.c:3189
[ 2854.041366] in_atomic(): 1, irqs_disabled(): 0, pid: 145562, name: kworker/u66:15
[ 2854.056813] INFO: lockdep is turned off.
[ 2854.064870] CPU: 13 PID: 145562 Comm: kworker/u66:15 Not tainted 4.2.0-mason+ #1
[ 2854.080082] Hardware name: ZTSYSTEMS Echo Ridge T4  /A9DRPF-10D, BIOS 1.07 05/10/2012
[ 2854.096211] Workqueue: writeback wb_workfn (flush-btrfs-244)
[ 2854.107821]  ffffffff81a2bbee ffff880ee09a7598 ffffffff813307bb ffff880ee09a7598
[ 2854.123162]  ffff881010d1ca00 ffff880ee09a75c8 ffffffff81086615 0000000000000000
[ 2854.138556]  0000000000000000 0000000000000c75 ffffffff81a2bbee ffff880ee09a75f8
[ 2854.153936] Call Trace:
[ 2854.181101]  [<ffffffff81086722>] __might_sleep+0x52/0x90
[ 2854.192136]  [<ffffffff8116d2b4>] __alloc_pages_nodemask+0x344/0xbe0
[ 2854.229682]  [<ffffffff811b54aa>] alloc_pages_current+0x10a/0x1e0
[ 2854.255508]  [<ffffffffa0663f19>] full_stripe_write+0x59/0xc0 [btrfs]
[ 2854.268600]  [<ffffffffa0663fb9>] __raid56_parity_write+0x39/0x60 [btrfs]
[ 2854.282385]  [<ffffffffa06640fb>] run_plug+0x11b/0x140 [btrfs]
[ 2854.294259]  [<ffffffffa0664143>] btrfs_raid_unplug+0x23/0x70 [btrfs]
[ 2854.307334]  [<ffffffff81307622>] blk_flush_plug_list+0x82/0x1f0
[ 2854.319542]  [<ffffffff813077c4>] blk_finish_plug+0x34/0x50
[ 2854.330878]  [<ffffffff812079c2>] writeback_sb_inodes+0x122/0x580
[ 2854.343256]  [<ffffffff81208016>] wb_writeback+0x136/0x4e0

Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Jens Axboe <axboe@fb.com>
---
 fs/fs-writeback.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ae0f438..07c9c50 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1539,7 +1539,9 @@ static long writeback_sb_inodes(struct super_block *sb,
 				break;
 		}
 	}
+	spin_unlock(&wb->list_lock);
 	blk_finish_plug(&plug);
+	spin_lock(&wb->list_lock);
 	return wrote;
 }
 
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 19:37 [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() Chris Mason
@ 2015-09-11 20:02 ` Linus Torvalds
  2015-09-11 20:37   ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-11 20:02 UTC (permalink / raw)
  To: Chris Mason, LKML, linux-fsdevel, Josef Bacik, Dave Chinner,
	Neil Brown, Jan Kara, Christoph Hellwig

I hate this fix.

On Fri, Sep 11, 2015 at 12:37 PM, Chris Mason <clm@fb.com> wrote:
> Linus, this is the plugging problem I mentioned in my btrfs pull.  It
> impacts only MD raid10 and btrfs raid5/6, and I'm not wild about the
> patch. But I wanted to at least send in the basic fix for rc1 so this
> doesn't cause bigger problems for early testers:
>
> Commit d353d7587 added a plug/finish_plug pair to writeback_sb_inodes,
> but writeback_sb_inodes has a horrible secret...it's called with the
> wb->list_lock held.

Quite frankly, just dropping and retaking the lock around the
blk_finish_plug() is just disgusting. The whole "drop and retake lock"
pattern in general is a bad idea, because it can so easily break the
caller (because now the lock no longer covers things over the call).

Yes, in this case we already do something similar in
writeback_single_inode(), so I guess the argument is that it doesn't
make things much worse, and that the caller already cannot depend on
the lock being held. True, but no less disgusting for that. So we
could do this, but in this case I don't think there's any good
_reason_ for doing that disgusting thing.

How about we instead:

 (a) revert that commit d353d7587 as broken (because it clearly is)

 (b) add a big honking comment about the fact that we hold 'list_lock'
in writeback_sb_inodes()

 (c) move the plugging up to wb_writeback() and writeback_inodes_wb()
_outside_ the spinlock.

because that way we not only avoid the ugliness, it should also be
more effective at plugging things since we gather _all_ the writeback
rather than just one superblock.

Let's not paper over a completely broken commit. Let's just admit that
commit d353d7587 was pure and utter shite, and rather than try to fix
up the mistake, make it all better!

Anyway, I will start by reverting that commit, and adding the comment.
I'm more than happy to take the patch that moves the plugging, but
since that one was about performance rather than correctness, I think
it would be good to just re-verify the numbers.

Dave?

                 Linus


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 20:02 ` Linus Torvalds
@ 2015-09-11 20:37   ` Linus Torvalds
  2015-09-11 20:40     ` Josef Bacik
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-11 20:37 UTC (permalink / raw)
  To: Chris Mason, LKML, linux-fsdevel, Josef Bacik, Dave Chinner,
	Neil Brown, Jan Kara, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 971 bytes --]

On Fri, Sep 11, 2015 at 1:02 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> How about we instead:
>
>  (a) revert that commit d353d7587 as broken (because it clearly is)
>
>  (b) add a big honking comment about the fact that we hold 'list_lock'
> in writeback_sb_inodes()
>
>  (c) move the plugging up to wb_writeback() and writeback_inodes_wb()
> _outside_ the spinlock.

Ok, I've done (a) and (b) in my tree. And attached is the totally
untested patch for (c). It looks ObviouslyCorrect(tm), but since this
is a performance issue, I'm not going to commit it without some more
ACK's from people.

I obviously think this is a *much* better approach than dropping and
retaking the lock, but there might be something silly I'm missing.

For example, maybe we want to unplug and replug around the
"inode_sleep_on_writeback()" in wb_writeback()? So while the revert
was a no-brainer, this one I really want people to think about.

                       Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1102 bytes --]

 fs/fs-writeback.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d8ea7ed411b2..587ac08eabb6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1546,12 +1546,15 @@ static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
 		.range_cyclic	= 1,
 		.reason		= reason,
 	};
+	struct blk_plug plug;
 
+	blk_start_plug(&plug);
 	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
 		queue_io(wb, &work);
 	__writeback_inodes_wb(wb, &work);
 	spin_unlock(&wb->list_lock);
+	blk_finish_plug(&plug);
 
 	return nr_pages - work.nr_pages;
 }
@@ -1579,10 +1582,12 @@ static long wb_writeback(struct bdi_writeback *wb,
 	unsigned long oldest_jif;
 	struct inode *inode;
 	long progress;
+	struct blk_plug plug;
 
 	oldest_jif = jiffies;
 	work->older_than_this = &oldest_jif;
 
+	blk_start_plug(&plug);
 	spin_lock(&wb->list_lock);
 	for (;;) {
 		/*
@@ -1662,6 +1667,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		}
 	}
 	spin_unlock(&wb->list_lock);
+	blk_finish_plug(&plug);
 
 	return nr_pages - work->nr_pages;
 }


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 20:37   ` Linus Torvalds
@ 2015-09-11 20:40     ` Josef Bacik
  2015-09-11 21:04       ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Josef Bacik @ 2015-09-11 20:40 UTC (permalink / raw)
  To: Linus Torvalds, Chris Mason, LKML, linux-fsdevel, Dave Chinner,
	Neil Brown, Jan Kara, Christoph Hellwig

On 09/11/2015 04:37 PM, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 1:02 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> How about we instead:
>>
>>   (a) revert that commit d353d7587 as broken (because it clearly is)
>>
>>   (b) add a big honking comment about the fact that we hold 'list_lock'
>> in writeback_sb_inodes()
>>
>>   (c) move the plugging up to wb_writeback() and writeback_inodes_wb()
>> _outside_ the spinlock.
>
> Ok, I've done (a) and (b) in my tree. And attached is the totally
> untested patch for (c). It looks ObviouslyCorrect(tm), but since this
> is a performance issue, I'm not going to commit it without some more
> ACK's from people.
>
> I obviously think this is a *much* better approach than dropping and
> retaking the lock, but there might be something silly I'm missing.
>
> For example, maybe we want to unplug and replug around the
> "inode_sleep_on_writeback()" in wb_writeback()? So while the revert
> was a no-brainer, this one I really want people to think about.

So we talked about this when we were trying to figure out a solution.
The problem with this approach is now we have a plug that covers
multiple super blocks (__writeback_inodes_wb loops through the sbs and
starts writeback), which is likely to give us crappier performance than
no plug at all.  Thanks,

Josef



* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 20:40     ` Josef Bacik
@ 2015-09-11 21:04       ` Linus Torvalds
  2015-09-11 22:06         ` Linus Torvalds
  2015-09-11 23:06         ` Chris Mason
  0 siblings, 2 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-09-11 21:04 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Chris Mason, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
	Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 1:40 PM, Josef Bacik <jbacik@fb.com> wrote:
>
> So we talked about this when we were trying to figure out a solution. The
> problem with this approach is now we have a plug that covers multiple super
> blocks (__writeback_inodes_wb loops through the sbs and starts writeback),
> which is likely to give us crappier performance than no plug at all.

Why would that be? Either they are on separate disks, and the IO is
all independent anyway, and at most it got started by some (small)
CPU-amount later. Actual throughput should be the same. No?

Or the filesystems are on the same disk, in which case it should
presumably be a win to submit the IO together.

Of course, actual numbers would be the deciding factor if this really
is noticeable. But "cleaner code and saner locking" is definitely an
issue at least for me.

                Linus


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 21:04       ` Linus Torvalds
@ 2015-09-11 22:06         ` Linus Torvalds
  2015-09-11 23:16           ` Chris Mason
  2015-09-11 23:06         ` Chris Mason
  1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-11 22:06 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Chris Mason, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
	Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 2:04 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Of course, actual numbers would be the deciding factor if this really
> is noticeable. But "cleaner code and saner locking" is definitely an
> issue at least for me.

Anyway, I'll hold off pushing out the revert too for a while, in the
hope that we'll have actual numbers for or against whatever the
particular solution should be.

I do get the feeling that the whole wb->list_lock needs more loving.
For example, there's that locked_inode_to_wb_and_lock_list() thing
(which might be better off trying to just do "trylock" on it, but
that's a separate issue) showing lock inversion worries.

Maybe we should *not* get that wb->list_lock early at all, and nest it
inside the inode spinlocks, and just move the list_lock locking down a
lot (ie not even try to hold it over big functions that then are
forced to release it anyway).

For example, realistically, it looks like the longest we ever *really*
hold that lock is at the top of the loop of writeback_sb_inodes() -
maybe we could just explicitly have a function that does "find the
first inode that matches this sb and needs writeout activity", and
literally only hold the lock over that function. And *not* take
the lock in the caller at all?

The callers seem to want that lock mainly because they do that
"list_empty(&wb->b_io)" test. But the way our doubly linked lists
work, "list_empty()" is actually something that can be done racily
without the lock held...

That said, I doubt anybody really wants to touch this, so at least for
now we're stuck with either the "plug outside the lock" or the "drop
and retake lock" options. It really would be lovely to have numbers
either way.

<insert princess Leia gif>
"Dave, you're our only hope"

                   Linus


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 21:04       ` Linus Torvalds
  2015-09-11 22:06         ` Linus Torvalds
@ 2015-09-11 23:06         ` Chris Mason
  2015-09-11 23:13           ` Linus Torvalds
  1 sibling, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-11 23:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josef Bacik, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
	Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 02:04:13PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 1:40 PM, Josef Bacik <jbacik@fb.com> wrote:
> >
> > So we talked about this when we were trying to figure out a solution. The
> > problem with this approach is now we have a plug that covers multiple super
> > blocks (__writeback_inodes_wb loops through the sbs and starts writeback),
> > which is likely to give us crappier performance than no plug at all.
> 
> Why would that be? Either they are on separate disks, and the IO is
> all independent anyway, and at most it got started by some (small)
> CPU-amount later. Actual throughput should be the same. No?
> 
> Or the filesystems are on the same disk, in which case it should
> presumably be a win to submit the IO together.
> 
> Of course, actual numbers would be the deciding factor if this really
> is noticeable. But "cleaner code and saner locking" is definitely an
> issue at least for me.

Originally I was worried about the latency impact of holding the
plugs over more than one super with high end flash.  I just didn't want
to hold onto the IO for longer than we had to.

But, since this isn't really latency sensitive anyway, if we find we're
not keeping the flash pipelines full the right answer is to short
circuit the plugging in general.  I'd agree actual throughput should be
the same.

But benchmarking is the best choice; I'll be able to reproduce
Dave's original results without too much trouble.  Our thinking for this
duct tape patch was the lock wasn't very hot and this was the best
immediate compromise between the bug and the perf improvement.

Happy to sign up for pushing the lock higher if that's what people would
rather see.

-chris



* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 23:06         ` Chris Mason
@ 2015-09-11 23:13           ` Linus Torvalds
  0 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-09-11 23:13 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
	Dave Chinner, Neil Brown, Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 4:06 PM, Chris Mason <clm@fb.com> wrote:
>
> Originally I was worried about the latency impact of holding the
> plugs over more than one super with high end flash.  I just didn't want
> to hold onto the IO for longer than we had to.
>
> But, since this isn't really latency sensitive anyway, if we find we're
> not keeping the flash pipelines full the right answer is to short
> circuit the plugging in general.  I'd agree actual throughput should be
> the same.

Yeah, this only triggers for system-wide writeback, so I don't see
that it should be latency-sensitive at that level, afaik.

But hey, it's IO, and I've been surprised by magic pattern
sensitivities before...

                  Linus


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 22:06         ` Linus Torvalds
@ 2015-09-11 23:16           ` Chris Mason
  2015-09-11 23:36             ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-11 23:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josef Bacik, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
	Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 03:06:18PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 2:04 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Of course, actual numbers would be the deciding factor if this really
> > is noticeable. But "cleaner code and saner locking" is definitely an
> > issue at least for me.
> 
> Anyway, I'll hold off pushing out the revert too for a while, in the
> hope that we'll have actual numbers for or against whatever the
> particular solution should be.
> 
> I do get the feeling that the whole wb->list_lock needs more loving.
> For example, there's that locked_inode_to_wb_and_lock_list() thing
> (which might be better off trying to just do "trylock" on it, but
> that's a separate issue) showing lock inversion worries.

At the very least, it's kind of sad how many of us were surprised to
find the lock held when Dave's patch was unplugging.  If this bug
slipped through, more are going to.  It's also true that N-1 of those people
were really surprised about scheduling unplug functions, so maybe we
can't put all the blame on wb->list_lock.

> 
> Maybe we should *not* get that wb->list_lock early at all, and nest it
> inside the inode spinlocks, and just move the list_lock locking down a
> lot (ie not even try to hold it over big functions that then are
> forced to releasing it anyway).
> 
> For example, realistically, it looks like the longest we ever *really*
> hold that lock is at the top of the loop of writeback_sb_inodes() -
> maybe we could just explicitly have a function that does "find the
> first inode that matches this sb and needs writeout activity", and
> literally only take hold the lock over that function. And *not* take
> the lock in the caller at all?
> 
> The callers seem to want that lock mainly because they do that
> "list_empty(&wb->b_io)" test. But the way our doubly linked lists
> work, "list_empty()" is actually something that can be done racily
> without the lock held...
> 
> That said, I doubt anybody really wants to touch this, so at least for
> now we're stuck with either the "plug outside the lock" or the "drop
> and retake lock" options. It really would be loveyl to haev numbers
> either way.

For 4.3 timeframes, what runs do you want to see numbers for:

1) revert
2) my hack
3) plug over multiple sbs (on different devices)
4) ?

-chris


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 23:16           ` Chris Mason
@ 2015-09-11 23:36             ` Linus Torvalds
  2015-09-12  0:52               ` Linus Torvalds
  2015-09-12 23:00               ` Chris Mason
  0 siblings, 2 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-09-11 23:36 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
	Dave Chinner, Neil Brown, Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 4:16 PM, Chris Mason <clm@fb.com> wrote:
>
> For 4.3 timeframes, what runs do you want to see numbers for:
>
> 1) revert
> 2) my hack
> 3) plug over multiple sbs (on different devices)
> 4) ?

Just 2 or 3.

I don't think the plain revert is all that interesting, and I think
the "anything else" is far too late for this merge window.

So we'll go with either (2) your patch (which I obviously don't
_like_, but apart from the ugliness I don't think there's anything
technically wrong with), or with (3) the "plug across a bigger area".

So the only issue with (3) is whether that's just "revert plus the
patch I sent out", or whether we should unplug/replug over the "wait
synchronously for an inode" case (iow, the
"inode_sleep_on_writeback()").  The existing plug code (that has the
spinlock issue) already has a "wait on inode" case, and did *not*
unplug over that call, but broadening the plugging further now ends up
having two of those "wait synchronously on inode".

Are we really ok with waiting synchronously for an inode while holding
the plug? No chance of deadlock (waiting for IO that we've plugged)?
That issue is true even of the current code, though, and I have _not_
really thought that through, it's just a worry.

                Linus


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 23:36             ` Linus Torvalds
@ 2015-09-12  0:52               ` Linus Torvalds
  2015-09-12  2:15                 ` Chris Mason
  2015-09-12 23:00               ` Chris Mason
  1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-12  0:52 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
	Dave Chinner, Neil Brown, Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 4:36 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Are we really ok with waiting synchronously for an inode while holding
> the plug? No chance of deadlock (waiting for IO that we've plugged)?
> That issue is true even of the current code, though, and I have _not_
> really thought that through, it's just a worry.

Never mind. We still flush the plug on explicit scheduling events.  I
wonder why I thought we got rid of that. Some kind of "senior moment",
I guess.

            Linus


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-12  0:52               ` Linus Torvalds
@ 2015-09-12  2:15                 ` Chris Mason
  2015-09-12  2:27                   ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-12  2:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josef Bacik, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
	Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 05:52:27PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 4:36 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Are we really ok with waiting synchronously for an inode while holding
> > the plug? No chance of deadlock (waiting for IO that we've plugged)?
> > That issue is true even of the current code, though, and I have _not_
> > really thought that through, it's just a worry.
> 
> Never mind. We still flush the plug on explicit scheduling events.  I
> wonder why I thought we got rid of that. Some kind of "senior moment",

But flushing on schedule is a little different.  It ends up calling
blk_schedule_flush_plug() which will hand off work to kblockd through
blk_run_queue_async()

Not a huge deal, but if we're scheduling to wait for that IO, we should
really run the plug ourselves so that we're not waiting for kblockd too.

-chris


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-12  2:15                 ` Chris Mason
@ 2015-09-12  2:27                   ` Linus Torvalds
  0 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-09-12  2:27 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
	Dave Chinner, Neil Brown, Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 7:15 PM, Chris Mason <clm@fb.com> wrote:
>
> But flushing on schedule is a little different.  It ends up calling
> blk_schedule_flush_plug() which will hand off work to kblockd through
> blk_run_queue_async()

I was more worried about some actual deadlock from the changes.  And
blk_schedule_flush_plug() is fine in that it doesn't actually remove
the plug, it just schedules the currently plugged pending IO, so the
IO will start from waiting on the inode, but the plug will still
remain for the rest of the writeback, and it all looks like it should
be fine.

And the reason we use kblockd is simple: stack usage. The reschedule
can happen pretty deep on the stack, we don't actually want to
necessarily then cause much more stack use through things like md/raid
allocating new requests etc.

So it all looks fine to me.

Btw, very tangentially related: grepping a bit shows that
"blk_flush_plug()" isn't actually used anywhere any more. Can we get
rid of that?

                Linus


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-11 23:36             ` Linus Torvalds
  2015-09-12  0:52               ` Linus Torvalds
@ 2015-09-12 23:00               ` Chris Mason
  2015-09-12 23:29                 ` Linus Torvalds
  2015-09-13 23:12                 ` Dave Chinner
  1 sibling, 2 replies; 54+ messages in thread
From: Chris Mason @ 2015-09-12 23:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josef Bacik, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
	Jan Kara, Christoph Hellwig

On Fri, Sep 11, 2015 at 04:36:39PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 4:16 PM, Chris Mason <clm@fb.com> wrote:
> >
> > For 4.3 timeframes, what runs do you want to see numbers for:
> >
> > 1) revert
> > 2) my hack
> > 3) plug over multiple sbs (on different devices)
> > 4) ?
> 
> Just 2 or 3.
> 
> I don't think the plain revert is all that interesting, and I think
> the "anything else" is far too late for this merge window.

I did the plain revert as well, just to have a baseline.  This box is a
little different from Dave's.   Bare metal two socket box (E5-2660 v2 @
2.20Ghz) with 144GB of ram.  I have two pcie flash devices, one nvme and
one fusionio, and I put one FS on each device (two mounts total).

The test created 1.6M files, 4K each.  I used Dave's fs_mark command
line, spread out over 16 directories from each mounted filesystem.  In
btrfs they are spread over subvolumes to cut down lock contention.

I need to change around the dirty ratios more to smooth out the IO, and
I had trouble with both XFS and btrfs getting runs that were not CPU
bound.  I included the time to run sync at the end of the run because
the results were not very consistent without it.

The XFS runs generally had one CPU pegged at 100%, and I think this is
throwing off the results.  On Monday I'll redo them with two (four?)
filesystems per flash device, hopefully that'll break things up.

The btrfs runs generally had all the CPUs pegged at 100%.  I switched to
mount -o nodatasum and squeezed out an extra 50K files/sec at much lower
CPU utilization.

               wall time      fs_mark files/sec    bytes written/sec

XFS:
baseline v4.2:  5m6s           118,578              272MB/s
Dave's patch:   4m46s          151,421              294MB/s
my hack:        5m5s           150,714              275MB/s
Linus plug:     5m15s          147,735              266MB/s


Btrfs (nodatasum):
baseline v4.2:  4m39s          242,643              313MB/s	
Dave's patch:   3m46s          252,452              389MB/s
my hack:        3m48s          257,924              379MB/s
Linus plug:     3m58s          247,528              369MB/s

Bottom line, not as conclusive as I'd like.  My hack doesn't seem to
hurt, but FS internals are consuming enough CPU that this lock just
isn't showing up.

Linus' plug patch is consistently slower, and I don't have a great
explanation.  My guesses: not keeping the flash pipelines full, or the
imbalance between the different speed flash is averaging the overall
result down, or its my kblockd vs explicit unplug handwaving from
yesterday.

So, next step is either more runs on flash or grab a box with a bunch of
spindles.  I'd rather do the spindle runs, I agree with Dave that his
patch should help much more on actual drives.

-chris



* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-12 23:00               ` Chris Mason
@ 2015-09-12 23:29                 ` Linus Torvalds
  2015-09-12 23:46                   ` Chris Mason
  2015-09-13 23:12                 ` Dave Chinner
  1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-12 23:29 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
	Dave Chinner, Neil Brown, Jan Kara, Christoph Hellwig

On Sat, Sep 12, 2015 at 4:00 PM, Chris Mason <clm@fb.com> wrote:
>
> I did the plain revert as well, just to have a baseline.

Ahh, I ended up not expecting you to get this done until after rc1 was
out, so in the meantime I just merged my fix rather than leave the
expected scheduling-while-atomic problem in place.

And just as well that you did a baseline, since apparently the numbers
are all over the map. I don't see how your hack and dave's original
can _possibly_ differ that much, but they clearly did on your xfs
test. So there's probably huge variance that depends on random
details.

I'll leave things as they are until we have something that looks a bit
more believable ;)

               Linus


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-12 23:29                 ` Linus Torvalds
@ 2015-09-12 23:46                   ` Chris Mason
  2015-09-13 13:12                     ` Chris Mason
  0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-12 23:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josef Bacik, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
	Jan Kara, Christoph Hellwig

On Sat, Sep 12, 2015 at 04:29:06PM -0700, Linus Torvalds wrote:
> On Sat, Sep 12, 2015 at 4:00 PM, Chris Mason <clm@fb.com> wrote:
> >
> > I did the plain revert as well, just to have a baseline.
> 
> Ahh, I ended up not expecting you to get this done until after rc1 was
> out, so I in the meantime just merged my fix instead rather than leave
> the expected scheduling-while-atomic problem.

Yeah, I wasn't sure I'd be able to do the runs, but it was a rainy
afternoon and this was more fun than cleaning.  Really glad something
got in for rc1 either way.

> 
> And just as well that you did a baseline, since apparently the numbers
> are all over the map. I don't see how your hack and dave's original
> can _possibly_ differ that much, but they clearly did on your xfs
> test. So there's probably huge variance that depends on random
> details.

I don't think the XFS numbers can be trusted too much since it was
basically bottlenecked behind that single pegged CPU.  It was bouncing
around and I couldn't quite track it down to a process name (or perf
profile).

The btrfs numbers were much more consistent, but your patch is still a
win over plain 4.2.

> 
> I'll leave things as they are until we have something that looks a bit
> more believable ;)

We can build from here, thanks Linus.

-chris


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-12 23:46                   ` Chris Mason
@ 2015-09-13 13:12                     ` Chris Mason
  2015-09-13 22:56                       ` Dave Chinner
  0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-13 13:12 UTC (permalink / raw)
  To: Linus Torvalds, Josef Bacik, LKML, linux-fsdevel, Dave Chinner,
	Neil Brown, Jan Kara, Christoph Hellwig

On Sat, Sep 12, 2015 at 07:46:32PM -0400, Chris Mason wrote:
> I don't think the XFS numbers can be trusted too much since it was
> basically bottlenecked behind that single pegged CPU.  It was bouncing
> around and I couldn't quite track it down to a process name (or perf
> profile).

I'll do more runs Monday, but I was able to grab a perf profile of the
pegged XFS CPU.  It was just the writeback worker thread, and it
hit btrfs differently because we defer more of this stuff to endio
workers, effectively spreading it out over more CPUs.

With 4 mount points instead of 2, XFS goes from 140K files/sec to 250K.
Here's one of the profiles, but it bounced around a lot so I wouldn't
use this to actually tune anything:

   11.42%  kworker/u82:61  [kernel.kallsyms]  [k] _raw_spin_lock
            |
            ---_raw_spin_lock
               |
               |--83.43%-- xfs_extent_busy_trim
               |          xfs_alloc_compute_aligned
               |          |
               |          |--99.92%-- xfs_alloc_ag_vextent_near
               |          |          xfs_alloc_ag_vextent
               |          |          xfs_alloc_vextent
               |          |          xfs_bmap_btalloc
               |          |          xfs_bmap_alloc
               |          |          xfs_bmapi_write
               |          |          xfs_iomap_write_allocate
               |          |          xfs_map_blocks
               |          |          xfs_vm_writepage
               |          |          __writepage
               |          |          write_cache_pages
               |          |          generic_writepages
               |          |          xfs_vm_writepages
               |          |          do_writepages
               |          |          __writeback_single_inode
               |          |          writeback_sb_inodes
               |          |          __writeback_inodes_wb
               |          |          wb_writeback
               |          |          wb_do_writeback
               |          |          wb_workfn
               |          |          process_one_work
               |          |          worker_thread
               |          |          kthread
               |          |          ret_from_fork
               |           --0.08%-- [...]
               |



* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-13 13:12                     ` Chris Mason
@ 2015-09-13 22:56                       ` Dave Chinner
  0 siblings, 0 replies; 54+ messages in thread
From: Dave Chinner @ 2015-09-13 22:56 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Jan Kara, Christoph Hellwig

On Sun, Sep 13, 2015 at 09:12:44AM -0400, Chris Mason wrote:
> On Sat, Sep 12, 2015 at 07:46:32PM -0400, Chris Mason wrote:
> > I don't think the XFS numbers can be trusted too much since it was
> > basically bottlenecked behind that single pegged CPU.  It was bouncing
> > around and I couldn't quite track it down to a process name (or perf
> > profile).
> 
> I'll do more runs Monday, but I was able to grab a perf profile of the
> pegged XFS CPU.  It was just the writeback worker thread, and it
> hit btrfs differently because we defer more of this stuff to endio
> workers, effectively spreading it out over more CPUs.
> 
> With 4 mount points instead of 2, XFS goes from 140K files/sec to 250K.
> Here's one of the profiles, but it bounced around a lot so I wouldn't
> use this to actually tune anything:

mkfs.xfs -d agcount=64 ....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-12 23:00               ` Chris Mason
  2015-09-12 23:29                 ` Linus Torvalds
@ 2015-09-13 23:12                 ` Dave Chinner
  2015-09-14 20:06                   ` Linus Torvalds
  1 sibling, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2015-09-13 23:12 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Jan Kara, Christoph Hellwig

On Sat, Sep 12, 2015 at 07:00:27PM -0400, Chris Mason wrote:
> On Fri, Sep 11, 2015 at 04:36:39PM -0700, Linus Torvalds wrote:
> > On Fri, Sep 11, 2015 at 4:16 PM, Chris Mason <clm@fb.com> wrote:
> > >
> > > For 4.3 timeframes, what runs do you want to see numbers for:
> > >
> > > 1) revert
> > > 2) my hack
> > > 3) plug over multiple sbs (on different devices)
> > > 4) ?
> > 
> > Just 2 or 3.
> > 
> > I don't think the plain revert is all that interesting, and I think
> > the "anything else" is far too late for this merge window.
> 
> I did the plain revert as well, just to have a baseline.  This box is a
> little different from Dave's.   Bare metal two socket box (E5-2660 v2 @
> 2.20GHz) with 144GB of RAM.  I have two PCIe flash devices, one NVMe and
> one FusionIO, and I put one FS on each device (two mounts total).
> 
> The test created 1.6M files, 4K each.  I used Dave's fs_mark command
> line, spread out over 16 directories from each mounted filesystem.  In
> btrfs they are spread over subvolumes to cut down lock contention.
> 
> I need to change around the dirty ratios more to smooth out the IO, and
> I had trouble with both XFS and btrfs getting runs that were not CPU
> bound.  I included the time to run sync at the end of the run because
> the results were not very consistent without it.
> 
> The XFS runs generally had one CPU pegged at 100%, and I think this is
> throwing off the results.  On Monday I'll redo them with two (four?)
> filesystems per flash device, hopefully that'll break things up.
> 
> The btrfs runs generally had all the CPUs pegged at 100%.  I switched to
> mount -o nodatasum and squeezed out an extra 50K files/sec at much lower
> CPU utilization.
> 
>                wall time      fs_mark files/sec    bytes written/sec
> 
> XFS:
> baseline v4.2:  5m6s           118,578              272MB/s
> Dave's patch:   4m46s          151,421              294MB/s
> my hack:        5m5s           150,714              275MB/s
> Linus plug:     5m15s          147,735              266MB/s
> 
> 
> Btrfs (nodatasum):
> baseline v4.2:  4m39s          242,643              313MB/s	
> Dave's patch:   3m46s          252,452              389MB/s
> my hack:        3m48s          257,924              379MB/s
> Linus plug:     3m58s          247,528              369MB/s

Really need to run these numbers on slower disks where block layer
merging makes a difference to performance. The high level plugging
improves performance on spinning disks by a huge amount on XFS
because the merging reduces the number of IOs issued to disk by 2
orders of magnitude. Plugging makes comparatively little difference
on devices that can sustain extremely high IOPS and hence sink the
tens to hundreds of thousand individual 4k IOs that this workload
generates through writeback.

i.e. while throughput increases, that's not the numbers that matters
here - it's the change in write IO behaviour that needs to be
categorised and measured by the different patches...

(I'm on holidays, so I'm not going to get to this any time soon)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-13 23:12                 ` Dave Chinner
@ 2015-09-14 20:06                   ` Linus Torvalds
  2015-09-16 15:16                     ` Chris Mason
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-14 20:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chris Mason, Josef Bacik, LKML, linux-fsdevel, Neil Brown,
	Jan Kara, Christoph Hellwig

On Sun, Sep 13, 2015 at 4:12 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Really need to run these numbers on slower disks where block layer
> merging makes a difference to performance.

Yeah. We've seen plugging and io schedulers not make much difference
for high-performance flash (although I think the people who argued
that noop should generally be used for non-rotating media were wrong,
I think - the elevator ends up still being critical to merging, and
while merging isn't a life-or-death situation, it tends to still
help).

For rotating rust with nasty seek times, the plugging is likely to
make the biggest difference.

               Linus


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-14 20:06                   ` Linus Torvalds
@ 2015-09-16 15:16                     ` Chris Mason
  2015-09-16 19:58                       ` Jan Kara
  0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-16 15:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Josef Bacik, LKML, linux-fsdevel, Neil Brown,
	Jan Kara, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 1990 bytes --]

On Mon, Sep 14, 2015 at 01:06:25PM -0700, Linus Torvalds wrote:
> On Sun, Sep 13, 2015 at 4:12 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Really need to run these numbers on slower disks where block layer
> > merging makes a difference to performance.
> 
> Yeah. We've seen plugging and io schedulers not make much difference
> for high-performance flash (although I think the people who argued
> that noop should generally be used for non-rotating media were wrong,
> I think - the elevator ends up still being critical to merging, and
> while merging isn't a life-or-death situation, it tends to still
> help).


Yeah, my big concern was that holding the plug longer would result in
lower overall perf because we weren't keeping the flash busy.  So I
started with the flash boxes to make sure we weren't regressing past 4.2
levels at least.

I'm still worried about that, but this probably isn't the right
benchmark to show it.  And if it's really a problem, it'll happen
everywhere we plug and not just here.

> 
> For rotating rust with nasty seek times, the plugging is likely to
> make the biggest difference.

For rotating storage, I grabbed a big box and did the fs_mark run
against 8 spindles.  These are all behind a megaraid card as jbods, so I
flipped the card's cache to write-through.

I changed around the run a bit, making enough files for fs_mark to run
for ~10 minutes, and I took out the sync.  I ran only XFS to cut down on
the iterations, and after the fs_mark run, I did a short 30-second run with
blktrace in the background to capture the io sizes.

v4.2:    178K files/sec
Chinner: 192K files/sec
Mason:   192K files/sec
Linus:   193K files/sec

I added support to iowatcher to graph IO size, and attached the graph.

Short version, Linus' patch still gives bigger IOs and similar perf to
Dave's original.  I should have done the blktrace runs for 60 seconds
instead of 30, I suspect that would even out the average sizes between
the three patches.

-chris

[-- Attachment #2: fs_mark.png --]
[-- Type: application/octet-stream, Size: 34363 bytes --]


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-16 15:16                     ` Chris Mason
@ 2015-09-16 19:58                       ` Jan Kara
  2015-09-16 20:00                         ` Chris Mason
  0 siblings, 1 reply; 54+ messages in thread
From: Jan Kara @ 2015-09-16 19:58 UTC (permalink / raw)
  To: Chris Mason
  Cc: Linus Torvalds, Dave Chinner, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Jan Kara, Christoph Hellwig

On Wed 16-09-15 11:16:21, Chris Mason wrote:
> On Mon, Sep 14, 2015 at 01:06:25PM -0700, Linus Torvalds wrote:
> > On Sun, Sep 13, 2015 at 4:12 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > Really need to run these numbers on slower disks where block layer
> > > merging makes a difference to performance.
> > 
> > Yeah. We've seen plugging and io schedulers not make much difference
> > for high-performance flash (although I think the people who argued
> > that noop should generally be used for non-rotating media were wrong,
> > I think - the elevator ends up still being critical to merging, and
> > while merging isn't a life-or-death situation, it tends to still
> > help).
> 
> 
> Yeah, my big concern was that holding the plug longer would result in
> lower overall perf because we weren't keeping the flash busy.  So I
> started with the flash boxes to make sure we weren't regressing past 4.2
> levels at least.
> 
> I'm still worried about that, but this probably isn't the right
> benchmark to show it.  And if it's really a problem, it'll happen
> everywhere we plug and not just here.
> 
> > 
> > For rotating rust with nasty seek times, the plugging is likely to
> > make the biggest difference.
> 
> For rotating storage, I grabbed a big box and did the fs_mark run
> against 8 spindles.  These are all behind a megaraid card as jbods, so I
> flipped the card's cache to write-through.
> 
> I changed around the run a bit, making enough files for fs_mark to run
> for ~10 minutes, and I took out the sync.  I ran only XFS to cut down on
> the iterations, and after the fs_mark run, I did a short 30-second run with
> blktrace in the background to capture the io sizes.
> 
> v4.2:    178K files/sec
> Chinner: 192K files/sec
> Mason:   192K files/sec
> Linus:   193K files/sec
> 
> I added support to iowatcher to graph IO size, and attached the graph.
> 
> Short version, Linus' patch still gives bigger IOs and similar perf to
> Dave's original.  I should have done the blktrace runs for 60 seconds
> instead of 30, I suspect that would even out the average sizes between
> the three patches.

Thanks for the data Chris. So I guess we are fine with what's currently in,
right?

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-16 19:58                       ` Jan Kara
@ 2015-09-16 20:00                         ` Chris Mason
  2015-09-16 22:07                           ` Dave Chinner
  0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-16 20:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linus Torvalds, Dave Chinner, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig

On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > Short version, Linus' patch still gives bigger IOs and similar perf to
> > Dave's original.  I should have done the blktrace runs for 60 seconds
> > instead of 30, I suspect that would even out the average sizes between
> > the three patches.
> 
> Thanks for the data Chris. So I guess we are fine with what's currently in,
> right?

Looks like it works well to me.

-chris



* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-16 20:00                         ` Chris Mason
@ 2015-09-16 22:07                           ` Dave Chinner
  2015-09-17  0:37                             ` Dave Chinner
  0 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2015-09-16 22:07 UTC (permalink / raw)
  To: Chris Mason, Jan Kara, Linus Torvalds, Josef Bacik, LKML,
	linux-fsdevel, Neil Brown, Christoph Hellwig

On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote:
> On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> > On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > > Short version, Linus' patch still gives bigger IOs and similar perf to
> > > Dave's original.  I should have done the blktrace runs for 60 seconds
> > > instead of 30, I suspect that would even out the average sizes between
> > > the three patches.
> > 
> > Thanks for the data Chris. So I guess we are fine with what's currently in,
> > right?
> 
> Looks like it works well to me.

Graph looks good, though I'll confirm it on my test rig once I get
out from under the pile of email and other stuff that is queued up
after being away for a week...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-16 22:07                           ` Dave Chinner
@ 2015-09-17  0:37                             ` Dave Chinner
  2015-09-17  1:12                               ` Linus Torvalds
  2015-09-17  3:48                               ` Chris Mason
  0 siblings, 2 replies; 54+ messages in thread
From: Dave Chinner @ 2015-09-17  0:37 UTC (permalink / raw)
  To: Chris Mason, Jan Kara, Linus Torvalds, Josef Bacik, LKML,
	linux-fsdevel, Neil Brown, Christoph Hellwig
  Cc: Tejun Heo

[cc Tejun]

On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote:
> > On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> > > On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > > > Short version, Linus' patch still gives bigger IOs and similar perf to
> > > > Dave's original.  I should have done the blktrace runs for 60 seconds
> > > > instead of 30, I suspect that would even out the average sizes between
> > > > the three patches.
> > > 
> > > Thanks for the data Chris. So I guess we are fine with what's currently in,
> > > right?
> > 
> > Looks like it works well to me.
> 
> Graph looks good, though I'll confirm it on my test rig once I get
> out from under the pile of email and other stuff that is queued up
> after being away for a week...

I ran some tests in the background while reading other email.....

TL;DR: Results look really bad - not only is the plugging
problematic, baseline writeback performance has regressed
significantly. We need to revert the plugging changes until the
underlying writeback performance regressions are sorted out.

In more detail, these tests were run on my usual 16p/16GB RAM
performance test VM with storage set up as described here:

http://permalink.gmane.org/gmane.linux.kernel/1768786

The test:

$ ~/tests/fsmark-10-4-test-xfs.sh
meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0
data     =                       bsize=4096   blocks=134217727500, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=131072, version=2
         =                       sectsz=512   sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

#  ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d  /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d  /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d  /mnt/scratch/6  -d  /mnt/scratch/7
#       Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     0        80000         4096     106938.0           543310
     0       160000         4096     102922.7           476362
     0       240000         4096     107182.9           538206
     0       320000         4096     107871.7           619821
     0       400000         4096      99255.6           622021
     0       480000         4096     103217.8           609943
     0       560000         4096      96544.2           640988
     0       640000         4096     100347.3           676237
     0       720000         4096      87534.8           483495
     0       800000         4096      72577.5          2556920
     0       880000         4096      97569.0           646996

<RAM fills here, sustained performance is now dependent on writeback>

     0       960000         4096      80147.0           515679
     0      1040000         4096     100394.2           816979
     0      1120000         4096      91466.5           739009
     0      1200000         4096      85868.1           977506
     0      1280000         4096      89691.5           715207
     0      1360000         4096      52547.5           712810
     0      1440000         4096      47999.1           685282
     0      1520000         4096      47894.3           697261
     0      1600000         4096      47549.4           789977
     0      1680000         4096      40029.2           677885
     0      1760000         4096      16637.4         12804557
     0      1840000         4096      16883.6         24295975
     0      1920000         4096      18852.1         22264137
     0      2000000         4096      27107.3         10310408
     0      2080000         4096      23654.1         19823567
     0      2160000         4096      24588.9         19769484
     0      2240000         4096      13959.3         40168163
     0      2320000         4096      26239.2         17806813
     0      2400000         4096      17858.5         24596715
     0      2480000         4096      12072.9         38177650
     0      2560000         4096       9989.7         43413729

<This is Real Bad. It gets significantly worse as the test goes on:>

     0      6880000         4096      11773.4         16470400
     0      6960000         4096       5556.5         27325343
     0      7040000         4096      13816.6         22440991
     0      7120000         4096       7322.4         42086686
     0      7200000         4096       8033.0         31182907
     0      7280000         4096      10589.4         30876729
     0      7360000         4096      15844.5         19903105
     0      7440000         4096       9187.7         35041718
     0      7520000         4096      11800.9         12254920
     0      7600000         4096      20049.9         13052020
     0      7680000         4096      17444.9         12055262
     0      7760000         4096      25900.6         10633885
     0      7840000         4096       6398.8         12014901
     0      7920000         4096      12147.4         14655678

<...>

     0      9360000         4096      19406.0          5874193
     0      9440000         4096       7261.3         14542806
     0      9520000         4096      14978.7         21921186
     0      9600000         4096      12656.7         10539894

real    10m9.240s
user    0m51.817s
sys     13m23.874s

A quick reading of the IO behaviour shows an average of about 1000
IOPS and 100MB/s, which matches roughly with the average files/s
created. There are no obvious peaks or gaps in the IO throughput, but
CPU usage has regular gaps where it falls to nothing and the only
"usage" is IO wait time.

Reverting just Linus' patch, so no plugging at all in the code
(i.e. the baseline numbers):

FSUse%        Count         Size    Files/sec     App Overhead
     0        80000         4096     109017.5           586268
     0       160000         4096      93548.1           553333
     0       240000         4096     104141.7           817182
     0       320000         4096     112846.2           597494
     0       400000         4096      98628.6           641524
     0       480000         4096      86586.1          1274195
     0       560000         4096      95965.0           776961
     0       640000         4096      84585.9           583330
     0       720000         4096      93787.1           695658
     0       800000         4096      94112.7           645925
     0       880000         4096      92259.2           792950

<RAM fills here, sustained performance is now dependent on writeback>

     0       960000         4096      62936.9           763642
     0      1040000         4096      22484.8           737471
     0      1120000         4096      22038.6           706791
     0      1200000         4096      24096.5           680878
     0      1280000         4096      23973.1           839826
     0      1360000         4096      24761.9           779486
     0      1440000         4096      25467.7           976022
     0      1520000         4096      25338.7           935552
     0      1600000         4096      27484.8           683351
     0      1680000         4096      24383.2          3039781
     0      1760000         4096      24857.0          2657008

<maintains 13-26k files/s all the way to the end .... >

     0      9040000         4096      21438.6          4869317
     0      9120000         4096      25387.1           642198
     0      9200000         4096      26815.9           727763
     0      9280000         4096      25175.5          4458811
     0      9360000         4096      19832.8           982122
     0      9440000         4096      17643.8           805599
     0      9520000         4096      13923.4          7819152
     0      9600000         4096      18101.3          1071163

real    7m7.926s
user    0m57.976s
sys     14m14.620s

The baseline of no plugging is a full 3 minutes faster than the
plugging behaviour of Linus' patch. The IO behaviour demonstrates
that, sustaining between 25-30,000 IOPS and throughput of
130-150MB/s.  Hence, while Linus' patch does change the IO patterns,
it does not result in a performance improvement like the original
plugging patch did.

So I went back and had a look at my original patch, which I've been
using locally for a couple of years and was similar to the original
commit. It has this description from when I last updated the perf
numbers from testing done on 3.17:

| Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem,
| metadata CRCs enabled.
| 
| Test:
| 
| $ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
| /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
| /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
| /mnt/scratch/6  -d  /mnt/scratch/7
| 
| Result:
|                 wall    sys     create rate     Physical write IO
|                 time    CPU     (avg files/s)    IOPS   Bandwidth
|                 -----   -----   -------------   ------  ---------
| unpatched       5m54s   15m32s  32,500+/-2200   28,000  150MB/s
| patched         3m19s   13m28s  52,900+/-1800    1,500  280MB/s
| improvement     -43.8%  -13.3%    +62.7%        -94.6%  +86.6%

IOWs, what we are seeing here is that the baseline writeback
performance has regressed quite significantly since I took these
numbers back on 3.17.  I'm running on exactly the same test setup;
the only difference is the kernel and so the current kernel baseline
is ~20% slower than the baseline numbers I have in my patch.

So, let's put the original [broken] plugging back in:

.....
     0      8960000         4096      28420.0          2129129
     0      9040000         4096      26863.0          6893094
     0      9120000         4096      46721.4           705927
     0      9200000         4096      46384.0          1081627
     0      9280000         4096      39692.3          2148938
     0      9360000         4096      41976.0          3540489
     0      9440000         4096      28025.1         10364440
     0      9520000         4096      38580.2          2457019
     0      9600000         4096      33526.1          7661316

real    5m12.114s
user    0m50.294s
sys     12m54.425s

While we get back that regression and then some, it's still
much slower than the high level plugging on slightly older kernels.
And the performance is, again, highly variable through the midpoint
of the test run. e.g:

....
     0      4720000         4096      39224.4          9401401
     0      4800000         4096      25342.9         18889665
     0      4880000         4096      14841.7         36668633
     0      4960000         4096      27624.6         17731873
     0      5040000         4096      37874.8         11390831
     0      5120000         4096      29009.2         12331965
     0      5200000         4096      49568.0          4346746
     0      5280000         4096      83074.5           877142
     0      5360000         4096      32498.8          5401470
     0      5440000         4096      55468.1          3188117
     0      5520000         4096      19156.0          4112120
     0      5600000         4096      44894.0          3186559
....

We have a low of ~15k files/s and a high of 83k files/s whilst
averaging about 38k files/s.

IOWs, ignoring the plugging, the recent changes to the writeback
code have really screwed up writeback performance and changed the
way it behaves from a throughput consistency point of view.  I'd
suggest we should start by reverting all the plugging changes for
now, because there are other problems these plugging changes have
exposed in the current writeback code that need to be fixed first...

Cheers

Dave.

PS: I was going to run this on an ext4 filesystem to check it
wasn't just XFS seeing this, but that reminded me why I don't run
this test on ext4:

FSUse%        Count         Size    Files/sec     App Overhead
     5        80000         4096     211668.3           534504
     5       160000         4096     191731.8           532701
     5       240000         4096     196251.2           489772
     5       320000         4096     207597.1           464032
     5       400000         4096     121616.0           601156
     5       480000         4096     197729.9           573925

<Really, really fast in memory, but background writeback starts here>

     5       560000         4096     128698.4           546362
     5       640000         4096     137904.3           543718
     5       720000         4096      63680.9           498110
     5       800000         4096      25825.9          1538213
     5       880000         4096      32708.7           632235

<dirty page threshold (20% RAM) is hit here. performance sucks from
 here on in (still 10GB of free RAM) because the writeback IO
 pattern is a random 4k write pattern>

     5       960000         4096       6940.2          7250666
     5      1040000         4096       7474.8           711872
     5      1120000         4096       4127.5         14034167
     5      1200000         4096       4822.1           752398
....

Hence it takes around an hour to do a single run of this test on
ext4 because its delayed allocation algorithm does not linearise
writes to small files with similar inode locality like XFS does, and
that's the behaviour that high level plugging really helps. IOWs,
this specific workload does not demonstrate any significant
performance difference on ext4 with the different plugging
algorithms because it does random IO rather than sequential IO.

I also ran it on btrfs, but it's pushing peaks of 15-25,000 IOPS
every 30s regardless of whether there is plugging at a high level or
not, so again btrfs behaviour is pretty much unaffected by the high
level plugging on this workload....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-17  0:37                             ` Dave Chinner
@ 2015-09-17  1:12                               ` Linus Torvalds
  2015-09-17  2:14                                 ` Dave Chinner
  2015-09-17  3:48                               ` Chris Mason
  1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-17  1:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On Wed, Sep 16, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> TL;DR: Results look really bad - not only is the plugging
> problematic, baseline writeback performance has regressed
> significantly.

Dave, if you're testing my current -git, the other performance issue
might still be the spinlock thing.

I haven't gotten the -tip pull for the spinlock regression yet, and
since you were the only one who noticed, I didn't consider it critical
enough to bypass the normal schedule, so -rc1 still has the crazy-bad
test-and-set loop for spinlocks.

So it's waiting in the locking fix branch (and is cc'd for stable),
but hasn't reached me yet.

If you want to test that theory out, you can make virt_spin_lock()
just unconditionally return false to disable the bad cmpxchg loop for
now.

The plugging IO pauses are interesting, though. Plugging really
*shouldn't* cause that kind of pauses, _regardless_ of what level it
happens on, so I wonder if the patch ends up just exposing some really
basic problem that just normally goes hidden.

Can you match up the IO wait times with just *where* it is waiting? Is
it waiting for that inode I_SYNC thing in inode_sleep_on_writeback()?

But it does sound like we should just revert the whole plugging for
now, if only because "it has odd effects".

                Linus


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-17  1:12                               ` Linus Torvalds
@ 2015-09-17  2:14                                 ` Dave Chinner
  2015-09-17 19:39                                   ` Linus Torvalds
  2015-09-17 23:03                                   ` Dave Chinner
  0 siblings, 2 replies; 54+ messages in thread
From: Dave Chinner @ 2015-09-17  2:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On Wed, Sep 16, 2015 at 06:12:29PM -0700, Linus Torvalds wrote:
> On Wed, Sep 16, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > TL;DR: Results look really bad - not only is the plugging
> > problematic, baseline writeback performance has regressed
> > significantly.
> 
> Dave, if you're testing my current -git, the other performance issue
> might still be the spinlock thing.

I have the fix as the first commit in my local tree - it'll remain
there until I get a conflict after an update. :)

> The plugging IO pauses are interesting, though. Plugging really
> *shouldn't* cause that kind of pauses, _regardless_ of what level it
> happens on, so I wonder if the patch ends up just exposing some really
> basic problem that just normally goes hidden.

Right, that's what I suspect - it didn't happen on older kernels,
but we've just completely reworked the writeback code for the
control group awareness since I last looked really closely at
this...

> Can you match up the IO wait times with just *where* it is
> waiting? Is it waiting for that inode I_SYNC thing in
> inode_sleep_on_writeback()?

I'll do some more investigation.

> But it does sound like we should just revert the whole plugging for
> now, if only because "it has odd effects".

Yup - we can add it again next merge window once we get to the
bottom of whatever is going on and have had time to test the new
code properly.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-17  0:37                             ` Dave Chinner
  2015-09-17  1:12                               ` Linus Torvalds
@ 2015-09-17  3:48                               ` Chris Mason
  2015-09-17  4:30                                 ` Dave Chinner
  1 sibling, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-17  3:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote:
> [cc Tejun]
> 
> On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> > On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote:
> > > On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> > > > On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > > > > Short version, Linus' patch still gives bigger IOs and similar perf to
> > > > > Dave's original.  I should have done the blktrace runs for 60 seconds
> > > > > instead of 30, I suspect that would even out the average sizes between
> > > > > the three patches.
> > > > 
> > > > Thanks for the data Chris. So I guess we are fine with what's currently in,
> > > > right?
> > > 
> > > Looks like it works well to me.
> > 
> > Graph looks good, though I'll confirm it on my test rig once I get
> > out from under the pile of email and other stuff that is queued up
> > after being away for a week...
> 
> I ran some tests in the background while reading other email.....
> 
> TL;DR: Results look really bad - not only is the plugging
> problematic, baseline writeback performance has regressed
> significantly. We need to revert the plugging changes until the
> underlying writeback performance regressions are sorted out.
> 
> In more detail, these tests were run on my usual 16p/16GB RAM
> performance test VM with storage set up as described here:
> 
> http://permalink.gmane.org/gmane.linux.kernel/1768786
> 
> The test:
> 
> $ ~/tests/fsmark-10-4-test-xfs.sh
> meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=0
> data     =                       bsize=4096   blocks=134217727500, imaxpct=1
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal log           bsize=4096   blocks=131072, version=2
>          =                       sectsz=512   sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> #  ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d  /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d  /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d  /mnt/scratch/6  -d  /mnt/scratch/7
> #       Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
> #       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> #       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
> #       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> #       Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
> #       App overhead is time in microseconds spent in the test not doing file writing related system calls.
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      0        80000         4096     106938.0           543310
>      0       160000         4096     102922.7           476362
>      0       240000         4096     107182.9           538206
>      0       320000         4096     107871.7           619821
>      0       400000         4096      99255.6           622021
>      0       480000         4096     103217.8           609943
>      0       560000         4096      96544.2           640988
>      0       640000         4096     100347.3           676237
>      0       720000         4096      87534.8           483495
>      0       800000         4096      72577.5          2556920
>      0       880000         4096      97569.0           646996
> 
> <RAM fills here, sustained performance is now dependent on writeback>

I think too many variables have changed here.

My numbers:

FSUse%        Count         Size    Files/sec     App Overhead
     0       160000         4096     356407.1          1458461
     0       320000         4096     368755.1          1030047
     0       480000         4096     358736.8           992123
     0       640000         4096     361912.5          1009566
     0       800000         4096     342851.4          1004152
     0       960000         4096     358357.2           996014
     0      1120000         4096     338025.8          1004412
     0      1280000         4096     354440.3           997380
     0      1440000         4096     335225.9          1000222
     0      1600000         4096     278786.1          1164962
     0      1760000         4096     268161.4          1205255
     0      1920000         4096     259158.0          1298054
     0      2080000         4096     276939.1          1219411
     0      2240000         4096     252385.1          1245496
     0      2400000         4096     280674.1          1189161
     0      2560000         4096     290155.4          1141941
     0      2720000         4096     280842.2          1179964
     0      2880000         4096     272446.4          1155527
     0      3040000         4096     268827.4          1235095
     0      3200000         4096     251767.1          1250006
     0      3360000         4096     248339.8          1235471
     0      3520000         4096     267129.9          1200834
     0      3680000         4096     257320.7          1244854
     0      3840000         4096     233540.8          1267764
     0      4000000         4096     269237.0          1216324
     0      4160000         4096     249787.6          1291767
     0      4320000         4096     256185.7          1253776
     0      4480000         4096     257849.7          1212953
     0      4640000         4096     253933.9          1181216
     0      4800000         4096     263567.2          1233937
     0      4960000         4096     255666.4          1231802
     0      5120000         4096     257083.2          1282893
     0      5280000         4096     254285.0          1229031
     0      5440000         4096     265561.6          1219472
     0      5600000         4096     266374.1          1229886
     0      5760000         4096     241003.7          1257064
     0      5920000         4096     245047.4          1298330
     0      6080000         4096     254771.7          1257241
     0      6240000         4096     254355.2          1261006
     0      6400000         4096     254800.4          1201074
     0      6560000         4096     262794.5          1234816
     0      6720000         4096     248103.0          1287921
     0      6880000         4096     231397.3          1291224
     0      7040000         4096     227898.0          1285359
     0      7200000         4096     227279.6          1296340
     0      7360000         4096     232561.5          1748248
     0      7520000         4096     231055.3          1169373
     0      7680000         4096     245738.5          1121856
     0      7840000         4096     234961.7          1147035
     0      8000000         4096     243973.0          1152202
     0      8160000         4096     246292.6          1169527
     0      8320000         4096     249433.2          1197921
     0      8480000         4096     222576.0          1253650
     0      8640000         4096     239407.5          1263257
     0      8800000         4096     246037.1          1218109
     0      8960000         4096     242306.5          1293567
     0      9120000         4096     238525.9          3745133
     0      9280000         4096     269869.5          1159541
     0      9440000         4096     266447.1          4794719
     0      9600000         4096     265748.9          1161584
     0      9760000         4096     269067.8          1149918
     0      9920000         4096     248896.2          1164112
     0     10080000         4096     261342.9          1174536
     0     10240000         4096     254778.3          1225425
     0     10400000         4096     257702.2          1211634
     0     10560000         4096     233972.5          1203665
     0     10720000         4096     232647.1          1197486
     0     10880000         4096     242320.6          1203984

I can push the dirty threshold lower to try and make sure we end up in
the hard dirty limits but none of this is going to be related to the
plugging patch.  I do see lower numbers if I let the test run even
longer, but there are a lot of things in the way that can slow it down
as the filesystem gets that big.

I'll try again with lower ratios.

[ ... ]

> The baseline of no plugging is a full 3 minutes faster than the
> plugging behaviour of Linus' patch. The IO behaviour demonstrates
> that, sustaining between 25-30,000 IOPS and throughput of
> 130-150MB/s.  Hence, while Linus' patch does change the IO patterns,
> it does not result in a performance improvement like the original
> plugging patch did.
> 

How consistent is this across runs?

> So I went back and had a look at my original patch, which I've been
> using locally for a couple of years and was similar to the original
> commit. It has this description from when I last updated the perf
> numbers from testing done on 3.17:
> 
> | Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem,
> | metadata CRCs enabled.
> | 
> | Test:
> | 
> | $ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
> | /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
> | /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
> | /mnt/scratch/6  -d  /mnt/scratch/7
> | 
> | Result:
> |                 wall    sys     create rate     Physical write IO
> |                 time    CPU     (avg files/s)    IOPS   Bandwidth
> |                 -----   -----   -------------   ------  ---------
> | unpatched       5m54s   15m32s  32,500+/-2200   28,000  150MB/s
> | patched         3m19s   13m28s  52,900+/-1800    1,500  280MB/s
> | improvement     -43.8%  -13.3%    +62.7%        -94.6%  +86.6%
> 
> IOWs, what we are seeing here is that the baseline writeback
> performance has regressed quite significantly since I took these
> numbers back on 3.17.  I'm running on exactly the same test setup;
> the only difference is the kernel and so the current kernel baseline
> is ~20% slower than the baseline numbers I have in my patch.
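(As a sanity check on the quoted table, the improvement row can be re-derived from the raw columns. A quick self-contained check using only the numbers quoted above; it agrees with the stated percentages to within rounding:)

```c
#include <assert.h>

/* Percentage change from 'before' to 'after'. */
static double pct_change(double before, double after)
{
        return (after - before) / before * 100.0;
}

/* Re-derive the "improvement" row from the raw unpatched/patched
 * columns quoted above. */
static void check_improvement_row(void)
{
        double wall   = pct_change(5 * 60 + 54, 3 * 60 + 19); /* 5m54s -> 3m19s   */
        double create = pct_change(32500, 52900);             /* avg files/s      */
        double iops   = pct_change(28000, 1500);              /* physical IOPS    */
        double bw     = pct_change(150, 280);                 /* MB/s             */

        assert(wall   > -44.0 && wall   < -43.5);   /* ~-43.8% */
        assert(create >  62.5 && create <  63.0);   /* ~+62.8% */
        assert(iops   > -94.7 && iops   < -94.5);   /* ~-94.6% */
        assert(bw     >  86.5 && bw     <  87.0);   /* ~+86.7% */
}
```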

All of this in a VM, I'd much rather see this reproduced on bare metal.
I've had really consistent results with VMs in the past, but there is a
huge amount of code between 3.17 and now.

-chris


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-17  3:48                               ` Chris Mason
@ 2015-09-17  4:30                                 ` Dave Chinner
  2015-09-17 12:13                                   ` Chris Mason
  0 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2015-09-17  4:30 UTC (permalink / raw)
  To: Chris Mason, Jan Kara, Linus Torvalds, Josef Bacik, LKML,
	linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo

On Wed, Sep 16, 2015 at 11:48:59PM -0400, Chris Mason wrote:
> On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote:
> > [cc Tejun]
> > 
> > On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> > #  ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d  /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d  /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d  /mnt/scratch/6  -d  /mnt/scratch/7
> > #       Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
> > #       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> > #       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
> > #       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> > #       Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
> > #       App overhead is time in microseconds spent in the test not doing file writing related system calls.
> > 
> > FSUse%        Count         Size    Files/sec     App Overhead
> >      0        80000         4096     106938.0           543310
> >      0       160000         4096     102922.7           476362
> >      0       240000         4096     107182.9           538206
> >      0       320000         4096     107871.7           619821
> >      0       400000         4096      99255.6           622021
> >      0       480000         4096     103217.8           609943
> >      0       560000         4096      96544.2           640988
> >      0       640000         4096     100347.3           676237
> >      0       720000         4096      87534.8           483495
> >      0       800000         4096      72577.5          2556920
> >      0       880000         4096      97569.0           646996
> > 
> > <RAM fills here, sustained performance is now dependent on writeback>
> 
> I think too many variables have changed here.
> 
> My numbers:
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      0       160000         4096     356407.1          1458461
>      0       320000         4096     368755.1          1030047
>      0       480000         4096     358736.8           992123
>      0       640000         4096     361912.5          1009566
>      0       800000         4096     342851.4          1004152

<snip>

> I can push the dirty threshold lower to try and make sure we end up in
> the hard dirty limits but none of this is going to be related to the
> plugging patch.

The point of this test is to drive writeback as hard as possible,
not to measure how fast we can create files in memory.  i.e. if the
test isn't pushing the dirty limits on your machines, then it really
isn't putting a meaningful load on writeback, and so the plugging
won't make significant difference because writeback isn't IO
bound....

> I do see lower numbers if I let the test run even
> longer, but there are a lot of things in the way that can slow it down
> as the filesystem gets that big.

Sure, that's why I hit the dirty limits early in the test - so it
measures steady state performance before the fs gets to any
significant scalability limits....

> > The baseline of no plugging is a full 3 minutes faster than the
> > plugging behaviour of Linus' patch. The IO behaviour demonstrates
> > that, sustaining between 25-30,000 IOPS and throughput of
> > 130-150MB/s.  Hence, while Linus' patch does change the IO patterns,
> > it does not result in a performance improvement like the original
> > plugging patch did.
> 
> How consistent is this across runs?

That's what I'm trying to work out. I didn't report it until I got
consistently bad results - the numbers I reported were from the
third time I ran the comparison, and they were representative and
reproducible. I also ran my inode creation workload, which is similar
(but has no data writeback, so it doesn't go through the writeback
paths at all), and that shows no change in performance, so this
problem (whatever it is) only manifests itself through data
writeback....

The only measurable change I've noticed in my monitoring graphs is
that there is a lot more iowait time than I normally see, even when
the plugging appears to be working as desired. That's what I'm
trying to track down now, and once I've got to the bottom of that I
should have some idea of where the performance has gone....

As it is, there are a bunch of other things going wrong with
4.3-rc1+ right now that I'm working through - I haven't updated my
kernel tree for 10 days because I've been away on holidays, so I'm
doing my usual "-rc1 is broken again" dance that I do every release
cycle.  (e.g. every second boot hangs because systemd appears to be
waiting for iscsi devices to appear without first starting the iscsi
target daemon. That never happened before today; every new kernel
I've booted today has hung on the first cold boot of the VM.)

> > IOWs, what we are seeing here is that the baseline writeback
> > performance has regressed quite significantly since I took these
> > numbers back on 3.17.  I'm running on exactly the same test setup;
> > the only difference is the kernel and so the current kernel baseline
> > is ~20% slower than the baseline numbers I have in my patch.
> 
> All of this in a VM, I'd much rather see this reproduced on bare metal.
> I've had really consistent results with VMs in the past, but there is a
> huge amount of code between 3.17 and now.

I'm pretty sure it's not the VM - with the locking fix Linus
mentioned in place, everything else I've looked at is within
measurement error compared to 4.2. The only thing that is out of
whack from a performance POV is data
writeback....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-17  4:30                                 ` Dave Chinner
@ 2015-09-17 12:13                                   ` Chris Mason
  0 siblings, 0 replies; 54+ messages in thread
From: Chris Mason @ 2015-09-17 12:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On Thu, Sep 17, 2015 at 02:30:08PM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 11:48:59PM -0400, Chris Mason wrote:
> > On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote:
> > > [cc Tejun]
> > > 
> > > On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> > > #  ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d  /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d  /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d  /mnt/scratch/6  -d  /mnt/scratch/7
> > > #       Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
> > > #       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> > > #       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
> > > #       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> > > #       Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
> > > #       App overhead is time in microseconds spent in the test not doing file writing related system calls.
> > > 
> > > FSUse%        Count         Size    Files/sec     App Overhead
> > >      0        80000         4096     106938.0           543310
> > >      0       160000         4096     102922.7           476362
> > >      0       240000         4096     107182.9           538206
> > >      0       320000         4096     107871.7           619821
> > >      0       400000         4096      99255.6           622021
> > >      0       480000         4096     103217.8           609943
> > >      0       560000         4096      96544.2           640988
> > >      0       640000         4096     100347.3           676237
> > >      0       720000         4096      87534.8           483495
> > >      0       800000         4096      72577.5          2556920
> > >      0       880000         4096      97569.0           646996
> > > 
> > > <RAM fills here, sustained performance is now dependent on writeback>
> > 
> > I think too many variables have changed here.
> > 
> > My numbers:
> > 
> > FSUse%        Count         Size    Files/sec     App Overhead
> >      0       160000         4096     356407.1          1458461
> >      0       320000         4096     368755.1          1030047
> >      0       480000         4096     358736.8           992123
> >      0       640000         4096     361912.5          1009566
> >      0       800000         4096     342851.4          1004152
> 
> <snip>
> 
> > I can push the dirty threshold lower to try and make sure we end up in
> > the hard dirty limits but none of this is going to be related to the
> > plugging patch.
> 
> The point of this test is to drive writeback as hard as possible,
> not to measure how fast we can create files in memory.  i.e. if the
> test isn't pushing the dirty limits on your machines, then it really
> isn't putting a meaningful load on writeback, and so the plugging
> won't make significant difference because writeback isn't IO
> bound....

It does end up IO bound on my rig, just because we do eventually hit
the dirty limits.  Otherwise there would be zero benefit in fs_mark
from any of the patches vs plain v4.2.

But I set up a run last night with dirty_ratio_bytes at 3G and
dirty_background_ratio_bytes at 1.5G.
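(In byte terms that works out as follows - a sketch only, assuming these
correspond to the standard /proc/sys/vm dirty_bytes and
dirty_background_bytes knobs, which take raw byte counts and, when set,
override the percentage-based *_ratio knobs:)

```c
#include <assert.h>

/* Convert a GiB count to the raw byte value the vm.dirty_bytes /
 * vm.dirty_background_bytes sysctls expect. */
static long long gib_to_bytes(long long gib)
{
        return gib * 1024LL * 1024LL * 1024LL;
}

/* 3G hard limit -> 3221225472 bytes; the 1.5G background threshold is
 * exactly half of that: 1610612736 bytes. */
```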

There is definitely variation, but nothing like what you saw:

FSUse%        Count         Size    Files/sec     App Overhead
     0       160000         4096     317427.9          1524951
     0       320000         4096     319723.9          1023874
     0       480000         4096     336696.4          1053884
     0       640000         4096     257113.1          1190851
     0       800000         4096     257644.2          1198054
     0       960000         4096     254896.6          1225610
     0      1120000         4096     241052.6          1203227
     0      1280000         4096     214961.2          1386236
     0      1440000         4096     239985.7          1264659
     0      1600000         4096     232174.3          1310018
     0      1760000         4096     250477.9          1227289
     0      1920000         4096     221500.9          1276223
     0      2080000         4096     235212.1          1284989
     0      2240000         4096     238580.2          1257260
     0      2400000         4096     224182.6          1326821
     0      2560000         4096     234628.7          1236402
     0      2720000         4096     244675.3          1228400
     0      2880000         4096     234364.0          1268408
     0      3040000         4096     229712.6          1306148
     0      3200000         4096     241170.5          1254490
     0      3360000         4096     220487.8          1331456
     0      3520000         4096     215831.7          1313682
     0      3680000         4096     210934.7          1235750
     0      3840000         4096     218435.4          1258077
     0      4000000         4096     232127.7          1271555
     0      4160000         4096     212017.6          1381525
     0      4320000         4096     216309.3          1370558
     0      4480000         4096     239072.4          1269086
     0      4640000         4096     221959.1          1333164
     0      4800000         4096     228396.8          1213160
     0      4960000         4096     225747.5          1318503
     0      5120000         4096     115727.0          1237327
     0      5280000         4096     184171.4          1547357
     0      5440000         4096     209917.8          1380510
     0      5600000         4096     181074.7          1391764
     0      5760000         4096     263516.7          1155172
     0      5920000         4096     236405.8          1239719
     0      6080000         4096     231587.2          1221408
     0      6240000         4096     237118.8          1244272
     0      6400000         4096     236773.2          1201428
     0      6560000         4096     243987.5          1240527
     0      6720000         4096     232428.0          1283265
     0      6880000         4096     234839.9          1209152
     0      7040000         4096     234947.3          1223456
     0      7200000         4096     231463.1          1260628
     0      7360000         4096     226750.3          1290098
     0      7520000         4096     213632.0          1236409
     0      7680000         4096     194710.2          1411595
     0      7840000         4096     213963.1          4146893
     0      8000000         4096     225109.8          1323573
     0      8160000         4096     251322.1          1380271
     0      8320000         4096     220167.2          1159390
     0      8480000         4096     210991.2          1110593
     0      8640000         4096     197922.8          1126072
     0      8800000         4096     203539.3          1143501
     0      8960000         4096     193041.7          1134329
     0      9120000         4096     184667.9          1119222
     0      9280000         4096     165968.7          1172738
     0      9440000         4096     192767.3          1098361
     0      9600000         4096     227115.7          1158097
     0      9760000         4096     232139.8          1264245
     0      9920000         4096     213320.5          1270505
     0     10080000         4096     217013.4          1324569
     0     10240000         4096     227171.6          1308668
     0     10400000         4096     208591.4          1392098
     0     10560000         4096     212006.0          1359188
     0     10720000         4096     213449.3          1352084
     0     10880000         4096     219890.1          1326240
     0     11040000         4096     215907.7          1239180
     0     11200000         4096     214207.2          1334846
     0     11360000         4096     212875.2          1338429
     0     11520000         4096     211690.0          1249519
     0     11680000         4096     217013.0          1262050
     0     11840000         4096     204730.1          1205087
     0     12000000         4096     191146.9          1188635
     0     12160000         4096     207844.6          1157033
     0     12320000         4096     208857.7          1168111
     0     12480000         4096     198256.4          1388368
     0     12640000         4096     214996.1          1305412
     0     12800000         4096     212332.9          1357814
     0     12960000         4096     210325.8          1336127
     0     13120000         4096     200292.1          1282419
     0     13280000         4096     202030.2          1412105
     0     13440000         4096     216553.7          1424076
     0     13600000         4096     218721.7          1298149
     0     13760000         4096     202037.4          1266877
     0     13920000         4096     224032.3          1198159
     0     14080000         4096     206105.6          1336489
     0     14240000         4096     227540.3          1160841
     0     14400000         4096     236921.7          1190394
     0     14560000         4096     229343.3          1147451
     0     14720000         4096     199435.1          1284374
     0     14880000         4096     215177.3          1178542
     0     15040000         4096     206194.1          1170832
     0     15200000         4096     215762.3          1125633
     0     15360000         4096     194511.0          1122947
     0     15520000         4096     179008.5          1292603
     0     15680000         4096     208636.9          1094960
     0     15840000         4096     192173.1          1237891
     0     16000000         4096     212888.9          1111551
     0     16160000         4096     218403.0          1143400
     0     16320000         4096     207260.5          1233526
     0     16480000         4096     202123.2          1151509
     0     16640000         4096     191033.0          1257706
     0     16800000         4096     196865.4          1154520
     0     16960000         4096     210361.2          1128930
     0     17120000         4096     201755.2          1160469
     0     17280000         4096     196946.6          1173529
     0     17440000         4096     199677.8          1165750
     0     17600000         4096     194248.4          1234944
     0     17760000         4096     200027.9          1256599
     0     17920000         4096     206507.0          1166820
     0     18080000         4096     215082.7          1167599
     0     18240000         4096     201475.5          1212202
     0     18400000         4096     208247.6          1252255
     0     18560000         4096     205482.7          1311436
     0     18720000         4096     200111.9          1358784
     0     18880000         4096     200028.3          1351332
     0     19040000         4096     198873.4          1287400
     0     19200000         4096     209609.3          1268400
     0     19360000         4096     203538.6          1249787
     0     19520000         4096     203427.9          1294105
     0     19680000         4096     201905.3          1280714
     0     19840000         4096     209642.9          1283281
     0     20000000         4096     203438.9          1315427
     0     20160000         4096     199690.7          1252267
     0     20320000         4096     185965.2          1398905
     0     20480000         4096     203221.6          1214029
     0     20640000         4096     208654.8          1232679
     0     20800000         4096     212488.6          1298458
     0     20960000         4096     189701.1          1356640
     0     21120000         4096     198522.1          1361240
     0     21280000         4096     203857.3          1263402
     0     21440000         4096     204616.8          1362853
     0     21600000         4096     196310.6          1266710
     0     21760000         4096     203275.4          1391150
     0     21920000         4096     205998.5          1378741
     0     22080000         4096     205434.2          1283787
     0     22240000         4096     195918.0          1415912
     0     22400000         4096     186193.0          1413623
     0     22560000         4096     192911.3          1393471
     0     22720000         4096     203726.3          1264281
     0     22880000         4096     204853.4          1221048
     0     23040000         4096     222803.2          1153031
     0     23200000         4096     198558.6          1346256
     0     23360000         4096     201001.4          1278817
     0     23520000         4096     206225.2          1270440
     0     23680000         4096     190894.2          1425299
     0     23840000         4096     198555.6          1334122
     0     24000000         4096     202386.4          1332157
     0     24160000         4096     205103.1          1313607

> 
> > I do see lower numbers if I let the test run even
> > longer, but there are a lot of things in the way that can slow it down
> > as the filesystem gets that big.
> 
> Sure, that's why I hit the dirty limits early in the test - so it
> measures steady state performance before the fs gets to any
> significant scalability limits....
> 
> > > The baseline of no plugging is a full 3 minutes faster than the
> > > plugging behaviour of Linus' patch. The IO behaviour demonstrates
> > > that, sustaining between 25-30,000 IOPS and throughput of
> > > 130-150MB/s.  Hence, while Linus' patch does change the IO patterns,
> > > it does not result in a performance improvement like the original
> > > plugging patch did.
> > 
> > How consistent is this across runs?
> 
> That's what I'm trying to work out. I didn't report it until I got
> consistently bad results - the numbers I reported were from the
> third time I ran the comparison, and they were representative and
> reproducible. I also ran my inode creation workload, which is similar
> (but has no data writeback so doesn't go through writeback paths at
> all) and that shows no change in performance, so this problem
> (whatever it is) is only manifesting itself through data
> writeback....

The big change between Linus' patch and your patch is that with Linus'
patch, kblockd is probably doing most of the actual unplug work (except
for the last super block in the list).  If a process is waiting for
dirty writeout progress, it has to wait for that context switch to
kblockd.

In the VM, that's going to hurt more than on my big two-socket, mostly
idle machine.

> 
> The only measurable change I've noticed in my monitoring graphs is
> that there is a lot more iowait time than I normally see, even when
> the plugging appears to be working as desired. That's what I'm
> trying to track down now, and once I've got to the bottom of that I
> should have some idea of where the performance has gone....
> 
> As it is, there are a bunch of other things going wrong with
> 4.3-rc1+ right now that I'm working through - I haven't updated my
> kernel tree for 10 days because I've been away on holidays so I'm
> doing my usual "-rc1 is broken again" dance that I do every release
> cycle.  (e.g. every second boot hangs because systemd appears to be
> waiting for iscsi devices to appear without first starting the iscsi
> target daemon.  Never happened before today, every new kernel I've
> booted today has hung on the first cold boot of the VM).

I've been doing 4.2 plus patches because rc1 didn't boot on this strange
box.  Let me nail that down and rerun.

-chris

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-17  2:14                                 ` Dave Chinner
@ 2015-09-17 19:39                                   ` Linus Torvalds
  2015-09-17 22:42                                     ` Chris Mason
  2015-09-17 23:03                                   ` Dave Chinner
  1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-17 19:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On Wed, Sep 16, 2015 at 7:14 PM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> Dave, if you're testing my current -git, the other performance issue
>> might still be the spinlock thing.
>
> I have the fix as the first commit in my local tree - it'll remain
> there until I get a conflict after an update. :)

Ok. I'm happy to report that you should get a conflict now, and that
the spinlock code should work well for your virtualized case again.

No updates on the plugging thing yet, I'll wait a bit and follow this
thread and see if somebody comes up with any explanations or theories
in the hope that we might not need to revert (or at least have a more
targeted change).

                Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-17 19:39                                   ` Linus Torvalds
@ 2015-09-17 22:42                                     ` Chris Mason
  2015-09-17 23:08                                       ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-17 22:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On Thu, Sep 17, 2015 at 12:39:51PM -0700, Linus Torvalds wrote:
> On Wed, Sep 16, 2015 at 7:14 PM, Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> Dave, if you're testing my current -git, the other performance issue
> >> might still be the spinlock thing.
> >
> > I have the fix as the first commit in my local tree - it'll remain
> > there until I get a conflict after an update. :)
> 
> Ok. I'm happy to report that you should get a conflict now, and that
> the spinlock code should work well for your virtualized case again.
> 
> No updates on the plugging thing yet, I'll wait a bit and follow this
> thread and see if somebody comes up with any explanations or theories
> in the hope that we might not need to revert (or at least have a more
> targeted change).

Playing around with the plug a little, most of the unplugs are coming
from the cond_resched_lock().  I'm not really sure why we are doing the
cond_resched() there; we should be doing it before we retake the lock
instead.

This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
files/sec up to 213K.  Average IO size is the same as 4.3-rc1.

It probably won't help Dave, since most of his unplugs should have been
from the cond_resched_lock() too.

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 587ac08..05ed541 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1481,6 +1481,19 @@ static long writeback_sb_inodes(struct super_block *sb,
 		wbc_detach_inode(&wbc);
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
 		wrote += write_chunk - wbc.nr_to_write;
+
+		if (need_resched()) {
+			/*
+			 * we're plugged and don't want to hand off to kblockd
+			 * for the actual unplug work.  But we do want to
+			 * reschedule.  So flush our plug and then
+			 * schedule away
+			 */
+			blk_flush_plug(current);
+			cond_resched();
+		}
+
+
 		spin_lock(&wb->list_lock);
 		spin_lock(&inode->i_lock);
 		if (!(inode->i_state & I_DIRTY_ALL))
@@ -1488,7 +1501,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 		requeue_inode(inode, wb, &wbc);
 		inode_sync_complete(inode);
 		spin_unlock(&inode->i_lock);
-		cond_resched_lock(&wb->list_lock);
+
 		/*
 		 * bail out to wb_writeback() often enough to check
 		 * background threshold and other termination conditions.

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-17  2:14                                 ` Dave Chinner
  2015-09-17 19:39                                   ` Linus Torvalds
@ 2015-09-17 23:03                                   ` Dave Chinner
  2015-09-17 23:13                                     ` Linus Torvalds
  1 sibling, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2015-09-17 23:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On Thu, Sep 17, 2015 at 12:14:53PM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 06:12:29PM -0700, Linus Torvalds wrote:
> > On Wed, Sep 16, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > TL;DR: Results look really bad - not only is the plugging
> > > problematic, baseline writeback performance has regressed
> > > significantly.
> > 
> > Dave, if you're testing my current -git, the other performance issue
> > might still be the spinlock thing.
> 
> I have the fix as the first commit in my local tree - it'll remain
> there until I get a conflict after an update. :)
> 
> > The plugging IO pauses are interesting, though. Plugging really
> > *shouldn't* cause that kind of pauses, _regardless_ of what level it
> > happens on, so I wonder if the patch ends up just exposing some really
> > basic problem that just normally goes hidden.
> 
> Right, that's what I suspect - it didn't happen on older kernels,
> but we've just completely reworked the writeback code for the
> control group awareness since I last looked really closely at
> this...
> 
> > Can you match up the IO wait times with just *where* it is
> > waiting? Is it waiting for that inode I_SYNC thing in
> > inode_sleep_on_writeback()?
> 
> I'll do some more investigation.

Ok, I'm happy to report there is actually nothing wrong with the
plugging code that is in your tree.  I finally tracked the problem I
was seeing down to a misbehaving RAID controller.[*]

With that problem sorted:

kernel		files/s		wall time
3.17		32500		5m54s
4.3-noplug	34400		5m25s
3.17-plug	52900		3m19s
4.3-badplug	60540		3m24s
4.3-rc1		56600		3m23s

So the 3.17/4.3-noplug baselines show no regression - 4.3 is slightly
faster. All the plugging variants show roughly the same improvement
and IO behaviour. These numbers are reproducible and there are no
weird performance inconsistencies during any of the 4.3-rc1 kernel
runs.  Hence my numbers and observed behaviour now align with
Chris' results, and so I think we can say the reworked high level
plugging is behaving as we expected it to.

Cheers,

Dave.

[*] It seems to have a dodgy battery connector, and so has been
"losing" battery backup and changing the cache mode of the HBA from
write back to write through. This results in changing from NVRAM
performance to SSD native performance and back again.  A small
vibration would cause the connection to the battery to reconnect and
the controller would switch back to writeback mode. The few log
entries in the BIOS showed changes in status between a few seconds
apart to minutes apart - enough for the cache status to change
several times during a 5-10 minute benchmark run.

I didn't notice the hardware was playing up because it wasn't
triggering the machine alert indicator through the bios like it's
supposed to and so the visible and audible alarms were not being
triggered, nor was the BMC logging the raid controller cache status
changes.

In the end, I noticed it by chance - during a low level test the
behaviour changed very obviously as one of my dogs ran past the
rack.  I unplugged everything inside the server, plugged it all back
in, powered it back up and fiddled with cables until I found what
was causing the problem. And having done this, the BMC is now
sending warnings and the audible alarm is working when the battery
is disconnected... :/
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-17 22:42                                     ` Chris Mason
@ 2015-09-17 23:08                                       ` Linus Torvalds
  2015-09-17 23:56                                         ` Chris Mason
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-17 23:08 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Dave Chinner, Jan Kara, Josef Bacik,
	LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo

On Thu, Sep 17, 2015 at 3:42 PM, Chris Mason <clm@fb.com> wrote:
>
> Playing around with the plug a little, most of the unplugs are coming
> from the cond_resched_lock().  Not really sure why we are doing the
> cond_resched() there, we should be doing it before we retake the lock
> instead.
>
> This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
> files/sec up to 213K.  Average IO size is the same as 4.3-rc1.

Ok, so at least for you, part of the problem really ends up being that
there's a mix of the "synchronous" unplugging (by the actual explicit
"blk_finish_plug(&plug);") and the writeback that is handed off to
kblockd_workqueue.

I'm not seeing why that should be an issue. Sure, there's some CPU
overhead to context switching, but I don't see that it should be that
big of a deal.

I wonder if there is something more serious wrong with the kblockd_workqueue.

                    Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-17 23:03                                   ` Dave Chinner
@ 2015-09-17 23:13                                     ` Linus Torvalds
  0 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-09-17 23:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On Thu, Sep 17, 2015 at 4:03 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Ok, I'm happy to report there is actually nothing wrong with the
> plugging code that is in your tree.  I finally tracked the problem I
> was seeing down to a misbehaving RAID controller.[*]

Hey, that's great.

Just out of interest, try the patch that Chris just sent that turns
the unplugging synchronous for the special "cond_resched()" case.

If that helps your case too, let's just do it, even while we wonder
why kblockd_workqueue messes up so noticeably.

Because regardless of the kblockd_workqueue questions, I think it's
nicer to avoid the cond_resched_lock(), and do the rescheduling while
we're not holding any locks.

              Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-17 23:08                                       ` Linus Torvalds
@ 2015-09-17 23:56                                         ` Chris Mason
  2015-09-18  0:37                                           ` Dave Chinner
  0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-17 23:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On Thu, Sep 17, 2015 at 04:08:19PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 3:42 PM, Chris Mason <clm@fb.com> wrote:
> >
> > Playing around with the plug a little, most of the unplugs are coming
> > from the cond_resched_lock().  Not really sure why we are doing the
> > cond_resched() there, we should be doing it before we retake the lock
> > instead.
> >
> > This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
> > files/sec up to 213K.  Average IO size is the same as 4.3-rc1.
> 
> Ok, so at least for you, part of the problem really ends up being that
> there's a mix of the "synchronous" unplugging (by the actual explicit
> "blk_finish_plug(&plug);") and the writeback that is handed off to
> kblockd_workqueue.
> 
> I'm not seeing why that should be an issue. Sure, there's some CPU
> overhead to context switching, but I don't see that it should be that
> big of a deal.
> 
> I wonder if there is something more serious wrong with the kblockd_workqueue.

I'm driving the box pretty hard, it's right on the line between CPU
bound and IO bound.  So I've got 32 fs_mark processes banging away and
32 CPUs (16 really, with hyperthreading).

They are popping in and out of balance_dirty_pages() so I have high CPU
utilization alternating with high IO wait times.  There are no reads at
all, so all of these waits are for buffered writes.

People in balance_dirty_pages are indirectly waiting on the unplug, so
maybe the context switch overhead on a loaded box is enough to explain
it.  We've definitely gotten more than 9% by inlining small synchronous
items in btrfs in the past, but those were more explicitly synchronous.

I know it's painfully hand-wavy.  I don't see any other users of the
kblockd workqueues, and the perf profiles don't jump out at me.  I'll
feel better about the patch if Dave confirms any gains.

-chris


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-17 23:56                                         ` Chris Mason
@ 2015-09-18  0:37                                           ` Dave Chinner
  2015-09-18  1:50                                             ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2015-09-18  0:37 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Jan Kara, Josef Bacik, LKML,
	linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo

On Thu, Sep 17, 2015 at 07:56:47PM -0400, Chris Mason wrote:
> On Thu, Sep 17, 2015 at 04:08:19PM -0700, Linus Torvalds wrote:
> > On Thu, Sep 17, 2015 at 3:42 PM, Chris Mason <clm@fb.com> wrote:
> > >
> > > Playing around with the plug a little, most of the unplugs are coming
> > > from the cond_resched_lock().  Not really sure why we are doing the
> > > cond_resched() there, we should be doing it before we retake the lock
> > > instead.
> > >
> > > This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
> > > files/sec up to 213K.  Average IO size is the same as 4.3-rc1.
> > 
> > Ok, so at least for you, part of the problem really ends up being that
> > there's a mix of the "synchronous" unplugging (by the actual explicit
> > "blk_finish_plug(&plug);") and the writeback that is handed off to
> > kblockd_workqueue.
> >
> > I'm not seeing why that should be an issue. Sure, there's some CPU
> > overhead to context switching, but I don't see that it should be that
> > big of a deal.

It may well change the dispatch order of enough IOs for it to be
significant on an IO bound device.

> > I wonder if there is something more serious wrong with the kblockd_workqueue.
> 
> I'm driving the box pretty hard, it's right on the line between CPU
> bound and IO bound.  So I've got 32 fs_mark processes banging away and
> 32 CPUs (16 really, with hyperthreading).

I'm only using 8 threads right now, so I have ~6-7 idle CPUs on this
workload. Hence if it's CPU load related, I probably won't see any
change in behaviour.

> They are popping in and out of balance_dirty_pages() so I have high CPU
> utilization alternating with high IO wait times.  There no reads at all,
> so all of these waits are for buffered writes.
> 
> People in balance_dirty_pages are indirectly waiting on the unplug, so
> maybe the context switch overhead on a loaded box is enough to explain
> it.  We've definitely gotten more than 9% by inlining small synchronous
> items in btrfs in the past, but those were more explicitly synchronous.
> 
> I know it's painfully hand wavy.  I don't see any other users of the
> kblockd workqueues, and the perf profiles don't jump out at me.  I'll
> feel better about the patch if Dave confirms any gains.

In outright performance on my test machine, the difference in
files/s is noise. However, the consistency looks to be substantially
improved and the context switch rate is now running at under
3,000/sec. Numbers, including the std deviation of the files/s
number output during the fsmark run (averaged across 3 separate
benchmark runs):

			files/s		std-dev		wall time
4.3-rc1-noplug		34400		2.0e04		5m25s
4.3-rc1			56600		2.3e04		3m23s
4.3-rc1-flush		56079		1.4e04		3m14s

std-dev is well down, and the improvement in wall time is large
enough to be significant.

Looks good to me.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-18  0:37                                           ` Dave Chinner
@ 2015-09-18  1:50                                             ` Linus Torvalds
  2015-09-18  5:40                                               ` Dave Chinner
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-18  1:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

[-- Attachment #1: Type: text/plain, Size: 2457 bytes --]

On Thu, Sep 17, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >
>> > I'm not seeing why that should be an issue. Sure, there's some CPU
>> > overhead to context switching, but I don't see that it should be that
>> > big of a deal.
>
> It may well change the dispatch order of enough IOs for it to be
> significant on an IO bound device.

Hmm. Maybe. We obviously try to order the IOs a bit by inode, and I
could see the use of a workqueue maybe changing that sufficiently. But
it still sounds a bit unlikely.

And in fact, I think I have a better explanation.

> In outright performance on my test machine, the difference in
> files/s is noise. However, the consistency looks to be substantially
> improved and the context switch rate is now running at under
> 3,000/sec.

Hmm. I don't recall seeing you mention how many context switches per
second you had before. What is it down from?

However, I think I may have found something more interesting here.

The fact is that a *normal* schedule will trigger that whole
blk_schedule_flush_plug(), but a cond_resched() or a cond_resched_lock()
doesn't actually do a normal schedule at all. Those trigger a
*preemption* schedule.

And a preemption schedule does not trigger that unplugging at all.
Why? A kernel "preemption" very much tries to avoid touching thread
state, because the whole point is that normally we may be preempting
threads in random places, so we don't run things like
sched_submit_work(), because the thread may be in the middle of
*creating* that work, and we don't want to introduce races. The
preemption scheduling can also be done with "task->state" set to
sleeping, and it won't actually sleep.

Now, for the explicit schedules like "cond_resched()" and
"cond_resched_lock()", those races obviously don't exist, but
they happen to share the same preemption scheduling logic.

So it turns out that as far as I can see, the whole "cond_resched()"
will not start any IO at all, and it will just be left on the thread
plug until we schedule back to the thread.

So I don't think this has anything to do with kblockd_workqueue. I
don't think it even gets to that point.

I may be missing something, but just to humor me, can you test the
attached patch *without* Chris's patch to do explicit plugging? This
should make cond_resched() and cond_resched_lock() run the unplugging.

It may be entirely broken, I haven't thought this entirely through.

                   Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 633 bytes --]

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 97d276ff1edb..388ea9e7ab8a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4548,6 +4548,7 @@ SYSCALL_DEFINE0(sched_yield)
 int __sched _cond_resched(void)
 {
 	if (should_resched(0)) {
+		sched_submit_work(current);
 		preempt_schedule_common();
 		return 1;
 	}
@@ -4572,9 +4573,10 @@ int __cond_resched_lock(spinlock_t *lock)
 
 	if (spin_needbreak(lock) || resched) {
 		spin_unlock(lock);
-		if (resched)
+		if (resched) {
+			sched_submit_work(current);
 			preempt_schedule_common();
-		else
+		} else
 			cpu_relax();
 		ret = 1;
 		spin_lock(lock);

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-18  1:50                                             ` Linus Torvalds
@ 2015-09-18  5:40                                               ` Dave Chinner
  2015-09-18  6:04                                                 ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2015-09-18  5:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On Thu, Sep 17, 2015 at 06:50:29PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> >
> >> > I'm not seeing why that should be an issue. Sure, there's some CPU
> >> > overhead to context switching, but I don't see that it should be that
> >> > big of a deal.
> >
> > It may well change the dispatch order of enough IOs for it to be
> > significant on an IO bound device.
> 
> Hmm. Maybe. We obviously try to order the IO's a bit by inode, and I
> could see the use of a workqueue maybe changing that sufficiently. But
> it still sounds a bit unlikely.
> 
> And in fact, I think I have a better explanation.
> 
> > In outright performance on my test machine, the difference in
> > files/s is noise. However, the consistency looks to be substantially
> > improved and the context switch rate is now running at under
> > 3,000/sec.
> 
> Hmm. I don't recall seeing you mention how many context switches per
> second you had before. What is it down from?

Around 4000-4500/sec, so there's not a huge amount of context
switches going on. It's not like btrfs, where this workload
generates over 250,000 context switches/sec....

> However, I think I may have found something more interesting here.
> 
> The fact is that a *normal* schedule will trigger that whole
> blk_schedule_flush_plug(), but a cond_resched() or a cond_resched_lock()
> doesn't actually do a normal schedule at all. Those trigger a
> *preemption* schedule.

Ok, makes sense - the plug is not being flushed as we switch away,
but Chris' patch makes it do that.

> So it turns out that as far as I can see, the whole "cond_resched()"
> will not start any IO at all, and it will just be left on the thread
> plug until we schedule back to the thread.
> 
> So I don't think this has anything to do with kblockd_workqueue. I
> don't think it even gets to that point.
> 
> I may be missing something, but just to humor me, can you test the
> attached patch *without* Chris's patch to do explicit plugging? This
> should make cond_resched() and cond_resched_lock() run the unplugging.

Context switches go back to the 4-4500/sec range. Otherwise
behaviour and performance is indistinguishable from Chris' patch.

Cheers,

Dave.


PS: just hit another "did this just get broken in 4.3-rc1" issue - I
can't run blktrace while there's an IO load because:

$ sudo blktrace -d /dev/vdc
BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
....

[  641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
[  641.425952] CPU: 13 PID: 11163 Comm: blktrace Not tainted 4.3.0-rc1-dgc+ #430
[  641.427321] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[  641.429066]  0000000000000005 ffff8800b731faa0 ffffffff81785609 00000000002040d0
[  641.430585]  ffff8800b731fb28 ffffffff81183688 0000000000000000 ffffffff824926f8
[  641.432143]  ffff8800b731faf0 0000000000000042 0000000000000010 0000000000000100
[  641.433695] Call Trace:
[  641.434196]  [<ffffffff81785609>] dump_stack+0x4b/0x72
[  641.435219]  [<ffffffff81183688>] warn_alloc_failed+0xd8/0x130
[  641.436402]  [<ffffffff811863bb>] __alloc_pages_nodemask+0x66b/0x7d0
[  641.437684]  [<ffffffff811c0e19>] cache_grow.constprop.61+0xc9/0x2d0
[  641.438933]  [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
[  641.440240]  [<ffffffff811424f8>] relay_open+0x68/0x2c0
[  641.441299]  [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
[  641.442492]  [<ffffffff8115e041>] blk_trace_setup+0x51/0x90
[  641.443586]  [<ffffffff8115e229>] blk_trace_ioctl+0xb9/0x110
[  641.444702]  [<ffffffff811e586c>] ? mntput_no_expire+0x2c/0x1a0
[  641.446419]  [<ffffffff8176fa58>] blkdev_ioctl+0x528/0x690
[  641.447489]  [<ffffffff811fd37d>] block_ioctl+0x3d/0x50
[  641.449194]  [<ffffffff811d93da>] do_vfs_ioctl+0x2ba/0x490
[  641.450491]  [<ffffffff811d9629>] SyS_ioctl+0x79/0x90
[  641.451575]  [<ffffffff81d800ae>] entry_SYSCALL_64_fastpath+0x12/0x71
[  641.453231] Mem-Info:
[  641.453714] active_anon:11651 inactive_anon:2411 isolated_anon:0
                active_file:8456 inactive_file:2070698 isolated_file:0
                unevictable:1011 dirty:374605 writeback:1176 unstable:0
                slab_reclaimable:1514989 slab_unreclaimable:233692
                mapped:9332 shmem:2250 pagetables:1891 bounce:0
                free:24305 free_pcp:1448 free_cma:0
[  641.461950] DMA free:15908kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  641.470017] lowmem_reserve[]: 0 2960 16005 16005
[  641.471041] DMA32 free:57076kB min:2992kB low:3740kB high:4488kB active_anon:8692kB inactive_anon:2572kB active_file:6224kB inactive_file:1507352kB unevictable:1004kB isolated(anon):0kB isolated(file):0kB present:3129212kB managed:3032864kB mlocked:1004kB dirty:275336kB writeback:896kB mapped:6800kB shmem:1836kB slab_reclaimable:1086292kB slab_unreclaimable:183168kB kernel_stack:23600kB pagetables:1372kB unstable:0kB bounce:0kB free_pcp:3440kB local_pcp:16kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  641.479677] lowmem_reserve[]: 0 0 13044 13044
[  641.480888] Normal free:18040kB min:13188kB low:16484kB high:19780kB active_anon:38264kB inactive_anon:7072kB active_file:27600kB inactive_file:6775516kB unevictable:3040kB isolated(anon):0kB isolated(file):0kB present:13631488kB managed:13357908kB mlocked:3040kB dirty:1223624kB writeback:5144kB mapped:30528kB shmem:7164kB slab_reclaimable:4975584kB slab_unreclaimable:755076kB kernel_stack:44912kB pagetables:6192kB unstable:0kB bounce:0kB free_pcp:3432kB local_pcp:88kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  641.492392] lowmem_reserve[]: 0 0 0 0
[  641.493170] DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[  641.496035] DMA32: 6584*4kB (UEM) 3783*8kB (UM) 17*16kB (M) 9*32kB (M) 3*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 57352kB
[  641.498705] Normal: 3372*4kB (UE) 393*8kB (UEM) 37*16kB (UEM) 33*32kB (UEM) 6*64kB (EM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18664kB
[  641.501479] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  641.503019] 2082056 total pagecache pages
[  641.503759] 0 pages in swap cache
[  641.504403] Swap cache stats: add 0, delete 0, find 0/0
[  641.505374] Free swap  = 497976kB
[  641.506002] Total swap = 497976kB
[  641.506622] 4194173 pages RAM
[  641.507171] 0 pages HighMem/MovableOnly
[  641.507878] 92503 pages reserved

This is from:

(gdb) l *(relay_open+0x68)
0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
577                     return NULL;
578             if (subbuf_size > UINT_MAX / n_subbufs)
579                     return NULL;
580
581             chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
582             if (!chan)
583                     return NULL;
584
585             chan->version = RELAYFS_CHANNEL_VERSION;
586             chan->n_subbufs = n_subbufs;

and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
that fails here....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-18  5:40                                               ` Dave Chinner
@ 2015-09-18  6:04                                                 ` Linus Torvalds
  2015-09-18  6:06                                                   ` Linus Torvalds
                                                                     ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-09-18  6:04 UTC (permalink / raw)
  To: Dave Chinner, Jens Axboe
  Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Ok, makes sense - the plug is not being flushed as we switch away,
> but Chris' patch makes it do that.

Yup.

And I actually think Chris' patch is better than the one I sent out
(but maybe the scheduler people should take a look at the behavior of
cond_resched()). I just wanted you to test that to verify the
behavior.

The fact that Chris' patch ends up lowering the context switches
(because it does the unplugging directly) is also an argument for his
approach.

I just wanted to understand the oddity with kblockd_workqueue. And I
think that's solved.

> Context switches go back to the 4-4500/sec range. Otherwise
> behaviour and performance is indistinguishable from Chris' patch.

.. this was exactly what I wanted to hear. So it sounds like we have
no odd unexplained behavior left in this area.

Which is not to say that there wouldn't be room for improvement, but
it just makes me much happier about the state of these patches to feel
like we understand what was going on.

> PS: just hit another "did this just get broken in 4.3-rc1" issue - I
> can't run blktrace while there's a IO load because:
>
> $ sudo blktrace -d /dev/vdc
> BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
> Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
> ....
>
> [  641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
> [  641.438933]  [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
> [  641.440240]  [<ffffffff811424f8>] relay_open+0x68/0x2c0
> [  641.441299]  [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
>
> gdb) l *(relay_open+0x68)
> 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
> 577                     return NULL;
> 578             if (subbuf_size > UINT_MAX / n_subbufs)
> 579                     return NULL;
> 580
> 581             chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
> 582             if (!chan)
> 583                     return NULL;
> 584
> 585             chan->version = RELAYFS_CHANNEL_VERSION;
> 586             chan->n_subbufs = n_subbufs;
>
> and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
> and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
> that fails here....

Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)?
From a quick check, none of this code seems to be new.

That said, having that

        struct rchan_buf *buf[NR_CPUS];

in "struct rchan" really is something we should fix. We really should
strive to not allocate things by CONFIG_NR_CPU's, but by the actual
real CPU count.

This looks to be mostly Jens' code, and much of it harkens back to 2006. Jens?

                    Linus


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-18  6:04                                                 ` Linus Torvalds
@ 2015-09-18  6:06                                                   ` Linus Torvalds
  2015-09-18 14:21                                                     ` Jens Axboe
  2015-09-18 13:16                                                   ` Chris Mason
  2015-09-18 22:17                                                   ` Dave Chinner
  2 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-18  6:06 UTC (permalink / raw)
  To: Dave Chinner, Jens Axboe
  Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

Gaah, my mailer autocompleted Jens' email with an old one..

Sorry for the repeat email with the correct address.

On Thu, Sep 17, 2015 at 11:04 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> PS: just hit another "did this just get broken in 4.3-rc1" issue - I
>> can't run blktrace while there's a IO load because:
>>
>> $ sudo blktrace -d /dev/vdc
>> BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
>> Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
>> ....
>>
>> [  641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
>> [  641.438933]  [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
>> [  641.440240]  [<ffffffff811424f8>] relay_open+0x68/0x2c0
>> [  641.441299]  [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
>>
>> gdb) l *(relay_open+0x68)
>> 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
>> 577                     return NULL;
>> 578             if (subbuf_size > UINT_MAX / n_subbufs)
>> 579                     return NULL;
>> 580
>> 581             chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
>> 582             if (!chan)
>> 583                     return NULL;
>> 584
>> 585             chan->version = RELAYFS_CHANNEL_VERSION;
>> 586             chan->n_subbufs = n_subbufs;
>>
>> and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
>> and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
>> that fails here....
>
> Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)?
> From a quick check, none of this code seems to be new.
>
> That said, having that
>
>         struct rchan_buf *buf[NR_CPUS];
>
> in "struct rchan" really is something we should fix. We really should
> strive to not allocate things by CONFIG_NR_CPU's, but by the actual
> real CPU count.
>
> This looks to be mostly Jens' code, and much of it harkens back to 2006. Jens?
>
>                     Linus


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-18  6:04                                                 ` Linus Torvalds
  2015-09-18  6:06                                                   ` Linus Torvalds
@ 2015-09-18 13:16                                                   ` Chris Mason
  2015-09-18 14:23                                                     ` Jens Axboe
  2015-09-18 22:17                                                   ` Dave Chinner
  2 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-18 13:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Jens Axboe, Jan Kara, Josef Bacik, LKML,
	linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo

On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Ok, makes sense - the plug is not being flushed as we switch away,
> > but Chris' patch makes it do that.
> 
> Yup.

Huh, that does make much more sense, thanks Linus.  I'm wondering where
else I've assumed that cond_resched() unplugged.

> 
> And I actually think Chris' patch is better than the one I sent out
> (but maybe the scheduler people should take a look at the behavior of
> cond_resched()), I just wanted you to test that to verify the
> behavior.

Ok, I'll fix up the description and comments and send out.

-chris


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-18  6:06                                                   ` Linus Torvalds
@ 2015-09-18 14:21                                                     ` Jens Axboe
  0 siblings, 0 replies; 54+ messages in thread
From: Jens Axboe @ 2015-09-18 14:21 UTC (permalink / raw)
  To: Linus Torvalds, Dave Chinner
  Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On 09/18/2015 12:06 AM, Linus Torvalds wrote:
> Gaah, my mailer autocompleted Jens' email with an old one..
>
> Sorry for the repeat email with the correct address.
>
> On Thu, Sep 17, 2015 at 11:04 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>
>>> PS: just hit another "did this just get broken in 4.3-rc1" issue - I
>>> can't run blktrace while there's a IO load because:
>>>
>>> $ sudo blktrace -d /dev/vdc
>>> BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
>>> Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
>>> ....
>>>
>>> [  641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
>>> [  641.438933]  [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
>>> [  641.440240]  [<ffffffff811424f8>] relay_open+0x68/0x2c0
>>> [  641.441299]  [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
>>>
>>> gdb) l *(relay_open+0x68)
>>> 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
>>> 577                     return NULL;
>>> 578             if (subbuf_size > UINT_MAX / n_subbufs)
>>> 579                     return NULL;
>>> 580
>>> 581             chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
>>> 582             if (!chan)
>>> 583                     return NULL;
>>> 584
>>> 585             chan->version = RELAYFS_CHANNEL_VERSION;
>>> 586             chan->n_subbufs = n_subbufs;
>>>
>>> and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
>>> and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
>>> that fails here....
>>
>> Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)?
>>  From a quick check, none of this code seems to be new.
>>
>> That said, having that
>>
>>          struct rchan_buf *buf[NR_CPUS];
>>
>> in "struct rchan" really is something we should fix. We really should
>> strive to not allocate things by CONFIG_NR_CPU's, but by the actual
>> real CPU count.
>>
>> This looks to be mostly Jens' code, and much of it harkens back to 2006. Jens?

The relayfs code mostly came out of IBM, but yes, that alloc doesn't 
look nice. It's not a regression, though; I don't think that has 
changed in years. I'll take a stab at fixing this.

-- 
Jens Axboe



* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-18 13:16                                                   ` Chris Mason
@ 2015-09-18 14:23                                                     ` Jens Axboe
  2015-09-18 15:32                                                       ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2015-09-18 14:23 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Dave Chinner, Jan Kara, Josef Bacik,
	LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo

On 09/18/2015 07:16 AM, Chris Mason wrote:
> On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote:
>> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>
>>> Ok, makes sense - the plug is not being flushed as we switch away,
>>> but Chris' patch makes it do that.
>>
>> Yup.
>
> Huh, that does make much more sense, thanks Linus.  I'm wondering where
> else I've assumed that cond_resched() unplugged.
>
>>
>> And I actually think Chris' patch is better than the one I sent out
>> (but maybe the scheduler people should take a look at the behavior of
>> cond_resched()), I just wanted you to test that to verify the
>> behavior.

It makes no sense for a preemption schedule to NOT unplug; the fact that 
it doesn't is news to me as well. It was never the intent of 
unplug-on-schedule to NOT unplug for certain schedule-out events; that 
seems like very odd behavior.

-- 
Jens Axboe



* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-18 14:23                                                     ` Jens Axboe
@ 2015-09-18 15:32                                                       ` Linus Torvalds
  2015-09-18 15:59                                                         ` Peter Zijlstra
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-18 15:32 UTC (permalink / raw)
  To: Jens Axboe, Ingo Molnar, Peter Zijlstra, Frederic Weisbecker
  Cc: Chris Mason, Dave Chinner, Jan Kara, Josef Bacik, LKML,
	linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo

On Fri, Sep 18, 2015 at 7:23 AM, Jens Axboe <axboe@fb.com> wrote:
>
> It makes no sense for preemption schedule to NOT unplug, the fact that it
> doesn't is news to me as well. It was never the intent of the
> unplug-on-schedule to NOT unplug for certain schedule out events, that seems
> like very odd behavior.

Actually, even a *full* schedule doesn't unplug, unless the process is
going to sleep. See sched_submit_work(), which will only call the
unplugging if the process is actually going to sleep (ok, so it's a
bit subtle if you don't know the state rules, but it's the
"!tsk->state" check there).

So preemption and cond_resched() isn't _that_ odd. We've basically
treated a non-sleeping schedule as a no-op for the task work.

The thinking was probably that it might be better to delay starting
the IO in case we get scheduled back quickly, and we're obviously not
actually _sleeping_, so it's likely not too bad.

Now, that's probably bogus, and I think that we should perhaps just
make the rule be that "if we actually switch to another task, we run
blk_schedule_flush_plug()".

But it should be noted that that really *does* introduce a lot of new
potential races. Traditionally, our block layer plugging has been
entirely thread-synchronous, and would never happen asynchronously.
But with preemption, that "switch to another thread" really *does*
happen asynchronously.

So making things always happen on task switch is actually fairly
dangerous, and potentially adds the need for much more synchronization
for the IO submission.

What we possibly *could* make the scheduler rule be:

 - if it's not an actual PREEMPT_ACTIVE (ie in a random place)

 - _and_ we actually switch to another thread

 - _then_ do the whole blk_schedule_flush_plug(tsk) thing.

adding some scheduler people to the explicit cc list.

That said, the "cond_resched[_lock]()" functions currently always set
PREEMPT_ACTIVE (indirectly - they use preempt_schedule_common()), so
even though those are synchronous, right now they *look* asynchronous
to the scheduler, so we'd still have to sort that out.

Ingo/Peter/Frederic? Comments?

                        Linus


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-18 15:32                                                       ` Linus Torvalds
@ 2015-09-18 15:59                                                         ` Peter Zijlstra
  2015-09-18 16:02                                                           ` Peter Zijlstra
  2015-09-18 16:12                                                           ` Linus Torvalds
  0 siblings, 2 replies; 54+ messages in thread
From: Peter Zijlstra @ 2015-09-18 15:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, Chris Mason,
	Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On Fri, Sep 18, 2015 at 08:32:15AM -0700, Linus Torvalds wrote:

> Actually, even a *full* schedule doesn't unplug, unless the process is
> going to sleep. See sched_submit_work(), which  will only call the
> unplugging if the process is actually going to sleep (ok, so it's a
> bit subtle if you don't know the state rules, but it's the
> "!tsk->state" check there)
> 
> So preemption and cond_resched() isn't _that_ odd. We've basically
> treated a non-sleeping schedule as a no-op for the task work.
> 
> The thinking was probably that it might be better to delay starting
> the IO in case we get scheduled back quickly, and we're obviously not
> actually _sleeping_, so it's likely not too bad.
> 
> Now, that's probably bogus, and I think that we should perhaps just
> make the rule be that "if we actually switch to another task, we run
> blk_schedule_flush_plug()".

That gets to be a tad tricky. We must either do it while holding
scheduler locks (and I would really want to avoid calling block layer
code with those held) or when the new task is already scheduled (and
that would suck too, because then we account stuff to the wrong task).

> But it should be noted that that really *does* introduce a lot of new
> potential races. Traditionally, our block layer plugging has been
> entirely thread-synchronous, and would never happen asynchronously.
> But with preemption, that "switch to another thread" really *does*
> happen asynchronously.
> 
> So making things always happen on task switch is actually fairly
> dangerous, and potentially adds the need for much more synchronization
> for the IO submission.

Can't say I'm a fan of it :/

> What we possibly *could* make the scheduler rule be:
> 
>  - if it's not an actual PREEMPT_ACTIVE (ie in a random place)

PREEMPT_ACTIVE is actually a recursion flag, 'random place' does not
factor into it. That is, most 'random places' will not have
PREEMPT_ACTIVE set.

>  - _and_ we actually switch to another thread

This is 'hard', as per the above.

>  - _then_ do the whole blk_schedule_flush_plug(tsk) thing.
> 
> adding some scheduler people to the explicit cc list.
> 
> That said, the "cond_resched[_lock]()" functions currently always set
> PREEMPT_ACTIVE (indirectly - they use preempt_schedule_common()), so
> even though those are synchronous, right now they *look* asynchronous
> to the scheduler, so we'd still have to sort that out.

See PREEMPT_ACTIVE being a recursion flag, we set it there so we won't
preempt while we're already scheduling.


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-18 15:59                                                         ` Peter Zijlstra
@ 2015-09-18 16:02                                                           ` Peter Zijlstra
  2015-09-18 16:12                                                           ` Linus Torvalds
  1 sibling, 0 replies; 54+ messages in thread
From: Peter Zijlstra @ 2015-09-18 16:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, Chris Mason,
	Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On Fri, Sep 18, 2015 at 05:59:56PM +0200, Peter Zijlstra wrote:
> >  - if it's not an actual PREEMPT_ACTIVE (ie in a random place)
> 
> PREEMPT_ACTIVE is actually a recursion flag, 'random place' does not
> factor into it. That is, most 'random places' will not have
> PREEMPT_ACTIVE set.

Ah, _not_ PREEMPT_ACTIVE, its late and I can't read no moar.


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-18 15:59                                                         ` Peter Zijlstra
  2015-09-18 16:02                                                           ` Peter Zijlstra
@ 2015-09-18 16:12                                                           ` Linus Torvalds
  2015-09-28 14:47                                                             ` Peter Zijlstra
  1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-18 16:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, Chris Mason,
	Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On Fri, Sep 18, 2015 at 8:59 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> See PREEMPT_ACTIVE being a recursion flag, we set it there so we won't
> preempt while we're already scheduling.

PREEMPT_ACTIVE does more than that. It really is a sign that "this is
not synchronous". It causes the scheduler to ignore the current task
flags (because it might already be TASK_SLEEPING, but we aren't
_actually_ ready to sleep yet) etc.

So no. It's not "you can't be preempted during scheduling". That's the
*normal* preempt count, and all scheduling calls end up setting that
some way (ie "schedule()" just does preempt_disable()).

So I disagree with your notion that it's a recursion flag. It is
absolutely nothing of the sort. It gets set by preemption - and,
somewhat illogically, by cond_resched().

The fact that cond_resched() sets it is *probably* because some of the
callers end up calling it from page fault paths etc, and the same
"ignore TASK_SLEEPING etc" rules apply. But it does mean that
"cond_resched()" is a bit misleading as a name. It's really a
"cond_preempt()".

                 Linus


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-18  6:04                                                 ` Linus Torvalds
  2015-09-18  6:06                                                   ` Linus Torvalds
  2015-09-18 13:16                                                   ` Chris Mason
@ 2015-09-18 22:17                                                   ` Dave Chinner
  2015-09-21  9:24                                                     ` Jan Kara
  2 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2015-09-18 22:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Chris Mason, Jan Kara, Josef Bacik, LKML,
	linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo

On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Ok, makes sense - the plug is not being flushed as we switch away,
> > but Chris' patch makes it do that.
> 
> Yup.
> 
> And I actually think Chris' patch is better than the one I sent out
> (but maybe the scheduler people should take a look at the behavior of
> cond_resched()), I just wanted you to test that to verify the
> behavior.
> 
> The fact that Chris' patch ends up lowering the context switches
> (because it does the unplugging directly) is also an argument for his
> approach.
> 
> I just wanted to understand the oddity with kblockd_workqueue. And I
> think that's solved.
> 
> > Context switches go back to the 4-4500/sec range. Otherwise
> > behaviour and performance is indistinguishable from Chris' patch.
> 
> .. this was exactly what I wanted to hear. So it sounds like we have
> no odd unexplained behavior left in this area.
> 
> Which is not to say that there wouldn't be room for improvement, but
> it just makes me much happier about the state of these patches to feel
> like we understand what was going on.
> 
> > PS: just hit another "did this just get broken in 4.3-rc1" issue - I
> > can't run blktrace while there's a IO load because:
> >
> > $ sudo blktrace -d /dev/vdc
> > BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
> > Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
> > ....
> >
> > [  641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
> > [  641.438933]  [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
> > [  641.440240]  [<ffffffff811424f8>] relay_open+0x68/0x2c0
> > [  641.441299]  [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
> >
> > gdb) l *(relay_open+0x68)
> > 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
> > 577                     return NULL;
> > 578             if (subbuf_size > UINT_MAX / n_subbufs)
> > 579                     return NULL;
> > 580
> > 581             chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
> > 582             if (!chan)
> > 583                     return NULL;
> > 584
> > 585             chan->version = RELAYFS_CHANNEL_VERSION;
> > 586             chan->n_subbufs = n_subbufs;
> >
> > and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
> > and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
> > that fails here....
> 
> Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)?
> From a quick check, none of this code seems to be new.

Yes, I always build MAX_SMP kernels for testing, because XFS is
often used on such machines and so I want to find issues exactly
like this in my testing rather than on customer machines... :/

> That said, having that
> 
>         struct rchan_buf *buf[NR_CPUS];
> 
> in "struct rchan" really is something we should fix. We really should
> strive to not allocate things by CONFIG_NR_CPU's, but by the actual
> real CPU count.

*nod*. But it doesn't fix the problem of the memory allocation
failing when there are still gigabytes of immediately reclaimable
memory available in the page cache. If this is failing under page
cache memory pressure, then we're going to be doing an awful lot
more falling back to vmalloc in the filesystem code where large
allocations like this are done: extended attribute buffers, for
example, are order-5 and are used a lot for things like backups,
which also tend to produce significant page cache memory pressure.

Hence I'm tending towards there being a memory reclaim behaviour
regression, not so much worrying about whether this specific
allocation is optimal or not.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-18 22:17                                                   ` Dave Chinner
@ 2015-09-21  9:24                                                     ` Jan Kara
  2015-09-21 20:21                                                       ` Andrew Morton
  0 siblings, 1 reply; 54+ messages in thread
From: Jan Kara @ 2015-09-21  9:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Jens Axboe, Chris Mason, Jan Kara, Josef Bacik,
	LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo,
	linux-mm, Andrew Morton

On Sat 19-09-15 08:17:14, Dave Chinner wrote:
> On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote:
> > On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > PS: just hit another "did this just get broken in 4.3-rc1" issue - I
> > > can't run blktrace while there's a IO load because:
> > >
> > > $ sudo blktrace -d /dev/vdc
> > > BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
> > > Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
> > > ....
> > >
> > > [  641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
> > > [  641.438933]  [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
> > > [  641.440240]  [<ffffffff811424f8>] relay_open+0x68/0x2c0
> > > [  641.441299]  [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
> > >
> > > gdb) l *(relay_open+0x68)
> > > 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
> > > 577                     return NULL;
> > > 578             if (subbuf_size > UINT_MAX / n_subbufs)
> > > 579                     return NULL;
> > > 580
> > > 581             chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
> > > 582             if (!chan)
> > > 583                     return NULL;
> > > 584
> > > 585             chan->version = RELAYFS_CHANNEL_VERSION;
> > > 586             chan->n_subbufs = n_subbufs;
> > >
> > > and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
> > > and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
> > > that fails here....
> > 
> > Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)?
> > From a quick check, none of this code seems to be new.
> 
> Yes, I always build MAX_SMP kernels for testing, because XFS is
> often used on such machines and so I want to find issues exactly
> like this in my testing rather than on customer machines... :/
> 
> > That said, having that
> > 
> >         struct rchan_buf *buf[NR_CPUS];
> > 
> > in "struct rchan" really is something we should fix. We really should
> > strive to not allocate things by CONFIG_NR_CPU's, but by the actual
> > real CPU count.
> 
> *nod*. But it doesn't fix the problem of the memory allocation
> failing when there's still gigabytes of immediately reclaimable
> memory available in the page cache. If this is failing under page
> cache memory pressure, then we're going to be doing an awful lot
> more falling back to vmalloc in the filesystem code where large
> allocations like this are done e.g. extended attribute buffers are
> order-5, and used a lot when doing things like backups which tend to
> also produce significant page cache memory pressure.
> 
> Hence I'm tending towards there being a memory reclaim behaviour
> regression, not so much worrying about whether this specific
> allocation is optimal or not.

Yup, looks like a regression in reclaim. Added linux-mm folks to CC.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-21  9:24                                                     ` Jan Kara
@ 2015-09-21 20:21                                                       ` Andrew Morton
  0 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2015-09-21 20:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Linus Torvalds, Jens Axboe, Chris Mason,
	Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig,
	Tejun Heo, linux-mm

On Mon, 21 Sep 2015 11:24:29 +0200 Jan Kara <jack@suse.cz> wrote:

> On Sat 19-09-15 08:17:14, Dave Chinner wrote:
> > On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote:
> > > On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > > PS: just hit another "did this just get broken in 4.3-rc1" issue - I
> > > > can't run blktrace while there's an IO load because:
> > > >
> > > > $ sudo blktrace -d /dev/vdc
> > > > BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
> > > > Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
> > > > ....
> > > >
> > > > [  641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
> > > > [  641.438933]  [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
> > > > [  641.440240]  [<ffffffff811424f8>] relay_open+0x68/0x2c0
> > > > [  641.441299]  [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
> > > >
> > > > (gdb) l *(relay_open+0x68)
> > > > 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
> > > > 577                     return NULL;
> > > > 578             if (subbuf_size > UINT_MAX / n_subbufs)
> > > > 579                     return NULL;
> > > > 580
> > > > 581             chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
> > > > 582             if (!chan)
> > > > 583                     return NULL;
> > > > 584
> > > > 585             chan->version = RELAYFS_CHANNEL_VERSION;
> > > > 586             chan->n_subbufs = n_subbufs;
> > > >
> > > > and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
> > > > and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
> > > > that fails here....
> > > 
> > > Hm. Have you always had MAX_SMP (and the NR_CPUS==8192 that it causes)?
> > > From a quick check, none of this code seems to be new.
> > 
> > Yes, I always build MAX_SMP kernels for testing, because XFS is
> > often used on such machines and so I want to find issues exactly
> > like this in my testing rather than on customer machines... :/
> > 
> > > That said, having that
> > > 
> > >         struct rchan_buf *buf[NR_CPUS];
> > > 
> > > in "struct rchan" really is something we should fix. We really should
> > > strive to not allocate things by CONFIG_NR_CPUS, but by the actual
> > > real CPU count.
> > 
> > *nod*. But it doesn't fix the problem of the memory allocation
> > failing when there's still gigabytes of immediately reclaimable
> > memory available in the page cache. If this is failing under page
> > cache memory pressure, then we're going to be doing an awful lot
> > more falling back to vmalloc in the filesystem code where large
> > allocations like this are done e.g. extended attribute buffers are
> > order-5, and used a lot when doing things like backups which tend to
> > also produce significant page cache memory pressure.
> > 
> > Hence I'm tending towards there being a memory reclaim behaviour
> > regression, not so much worrying about whether this specific
> > allocation is optimal or not.
> 
> Yup, looks like a regression in reclaim. Added linux-mm folks to CC.

That's going to be hard to find.  Possibly Vlastimil's 5-patch series
"mm, compaction: more robust check for scanners meeting", possibly
Joonsoo's "mm/compaction: correct to flush migrated pages if pageblock
skip happens".  But probably something else :(

Teach relay.c about alloc_percpu()?


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-18 16:12                                                           ` Linus Torvalds
@ 2015-09-28 14:47                                                             ` Peter Zijlstra
  2015-09-28 16:08                                                               ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2015-09-28 14:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, Chris Mason,
	Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On Fri, Sep 18, 2015 at 09:12:38AM -0700, Linus Torvalds wrote:
> 
> So I disagree with your notion that it's a recursion flag. It is
> absolutely nothing of the sort.

OK, agreed. I had it classed under recursion in my head, clearly I
indexed it sloppily. In any case I have a patch that kills off
PREEMPT_ACTIVE entirely.

I just have to clean it up, benchmark, split and write changelogs. But
it should be forthcoming 'soon'. As is, it boots..

> It gets set by preemption - and,
> somewhat illogically, by cond_resched().

I suspect that was done to make cond_resched() (voluntary preemption)
more robust and only have a single preemption path/logic. But all that
was done well before I got involved.


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-28 14:47                                                             ` Peter Zijlstra
@ 2015-09-28 16:08                                                               ` Linus Torvalds
  2015-09-29  7:55                                                                 ` Ingo Molnar
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-28 16:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, Chris Mason,
	Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo

On Mon, Sep 28, 2015 at 10:47 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> It gets set by preemption - and,
>> somewhat illogically, by cond_resched().
>
> I suspect that was done to make cond_resched() (voluntary preemption)
> more robust and only have a single preemption path/logic. But all that
> was done well before I got involved.

So I think it's actually the name that is bad, not necessarily the behavior.

We tend to put "cond_resched()" (and particularly
"cond_resched_lock()") in some fairly awkward places, and it's not
always entirely clear that task->state == TASK_RUNNING there.

So the preemptive behavior of not *really* putting the task to sleep
may actually be the right one. But it is rather non-intuitive given
the name - because "cond_resched()" basically is not at all equivalent
to "if (need_resched()) schedule()", which you'd kind of expect.

An explicit schedule will actually act on the task->state, and make us
go to sleep. "cond_resched()"  really is just a "voluntary preemption
point". And I think it would be better if it got named that way.

            Linus


* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
  2015-09-28 16:08                                                               ` Linus Torvalds
@ 2015-09-29  7:55                                                                 ` Ingo Molnar
  0 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2015-09-29  7:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Jens Axboe, Frederic Weisbecker, Chris Mason,
	Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
	Neil Brown, Christoph Hellwig, Tejun Heo


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, Sep 28, 2015 at 10:47 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >> It gets set by preemption - and,
> >> somewhat illogically, by cond_resched().
> >
> > I suspect that was done to make cond_resched() (voluntary preemption)
> > more robust and only have a single preemption path/logic. But all that
> > was done well before I got involved.
> 
> So I think it's actually the name that is bad, not necessarily the behavior.
> 
> We tend to put "cond_resched()" (and particularly
> "cond_resched_lock()") in some fairly awkward places, and it's not
> always entirely clear that task->state == TASK_RUNNING there.
> 
> So the preemptive behavior of not *really* putting the task to sleep
> may actually be the right one. But it is rather non-intuitive given
> the name - because "cond_resched()" basically is not at all equivalent
> to "if (need_resched()) schedule()", which you'd kind of expect.
> 
> An explicit schedule will actually act on the task->state, and make us
> go to sleep. "cond_resched()"  really is just a "voluntary preemption
> point". And I think it would be better if it got named that way.

cond_preempt() perhaps? That would allude to preempt_schedule() and such, and 
would make it clearer that it's supposed to be an invariant on the sleep state 
(which schedule() is not).

Thanks,

	Ingo


end of thread, other threads:[~2015-09-29  7:55 UTC | newest]

Thread overview: 54+ messages
-- links below jump to the message on this page --
2015-09-11 19:37 [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() Chris Mason
2015-09-11 20:02 ` Linus Torvalds
2015-09-11 20:37   ` Linus Torvalds
2015-09-11 20:40     ` Josef Bacik
2015-09-11 21:04       ` Linus Torvalds
2015-09-11 22:06         ` Linus Torvalds
2015-09-11 23:16           ` Chris Mason
2015-09-11 23:36             ` Linus Torvalds
2015-09-12  0:52               ` Linus Torvalds
2015-09-12  2:15                 ` Chris Mason
2015-09-12  2:27                   ` Linus Torvalds
2015-09-12 23:00               ` Chris Mason
2015-09-12 23:29                 ` Linus Torvalds
2015-09-12 23:46                   ` Chris Mason
2015-09-13 13:12                     ` Chris Mason
2015-09-13 22:56                       ` Dave Chinner
2015-09-13 23:12                 ` Dave Chinner
2015-09-14 20:06                   ` Linus Torvalds
2015-09-16 15:16                     ` Chris Mason
2015-09-16 19:58                       ` Jan Kara
2015-09-16 20:00                         ` Chris Mason
2015-09-16 22:07                           ` Dave Chinner
2015-09-17  0:37                             ` Dave Chinner
2015-09-17  1:12                               ` Linus Torvalds
2015-09-17  2:14                                 ` Dave Chinner
2015-09-17 19:39                                   ` Linus Torvalds
2015-09-17 22:42                                     ` Chris Mason
2015-09-17 23:08                                       ` Linus Torvalds
2015-09-17 23:56                                         ` Chris Mason
2015-09-18  0:37                                           ` Dave Chinner
2015-09-18  1:50                                             ` Linus Torvalds
2015-09-18  5:40                                               ` Dave Chinner
2015-09-18  6:04                                                 ` Linus Torvalds
2015-09-18  6:06                                                   ` Linus Torvalds
2015-09-18 14:21                                                     ` Jens Axboe
2015-09-18 13:16                                                   ` Chris Mason
2015-09-18 14:23                                                     ` Jens Axboe
2015-09-18 15:32                                                       ` Linus Torvalds
2015-09-18 15:59                                                         ` Peter Zijlstra
2015-09-18 16:02                                                           ` Peter Zijlstra
2015-09-18 16:12                                                           ` Linus Torvalds
2015-09-28 14:47                                                             ` Peter Zijlstra
2015-09-28 16:08                                                               ` Linus Torvalds
2015-09-29  7:55                                                                 ` Ingo Molnar
2015-09-18 22:17                                                   ` Dave Chinner
2015-09-21  9:24                                                     ` Jan Kara
2015-09-21 20:21                                                       ` Andrew Morton
2015-09-17 23:03                                   ` Dave Chinner
2015-09-17 23:13                                     ` Linus Torvalds
2015-09-17  3:48                               ` Chris Mason
2015-09-17  4:30                                 ` Dave Chinner
2015-09-17 12:13                                   ` Chris Mason
2015-09-11 23:06         ` Chris Mason
2015-09-11 23:13           ` Linus Torvalds
