* [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
@ 2015-09-11 19:37 Chris Mason
2015-09-11 20:02 ` Linus Torvalds
0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-11 19:37 UTC (permalink / raw)
To: Linus Torvalds, LKML, linux-fsdevel; +Cc: Josef Bacik, Dave Chinner, neilb
Linus, this is the plugging problem I mentioned in my btrfs pull. It
impacts only MD raid10 and btrfs raid5/6, and I'm not wild about the
patch. But I wanted to at least send in the basic fix for rc1 so this
doesn't cause bigger problems for early testers:
Commit d353d7587 added a plug/finish_plug pair to writeback_sb_inodes,
but writeback_sb_inodes has a horrible secret...it's called with the
wb->list_lock held.
Btrfs raid5/6 and MD raid10 have horrible secrets of their own...they
both do allocations in their unplug callbacks.
None of the options to fix it are very pretty. We don't want to kick
off workers for all of these unplugs, and the lock doesn't look hot
enough to justify bigger restructuring.
[ 2854.025042] BUG: sleeping function called from invalid context at mm/page_alloc.c:3189
[ 2854.041366] in_atomic(): 1, irqs_disabled(): 0, pid: 145562, name: kworker/u66:15
[ 2854.056813] INFO: lockdep is turned off.
[ 2854.064870] CPU: 13 PID: 145562 Comm: kworker/u66:15 Not tainted 4.2.0-mason+ #1
[ 2854.080082] Hardware name: ZTSYSTEMS Echo Ridge T4 /A9DRPF-10D, BIOS 1.07 05/10/2012
[ 2854.096211] Workqueue: writeback wb_workfn (flush-btrfs-244)
[ 2854.107821] ffffffff81a2bbee ffff880ee09a7598 ffffffff813307bb ffff880ee09a7598
[ 2854.123162] ffff881010d1ca00 ffff880ee09a75c8 ffffffff81086615 0000000000000000
[ 2854.138556] 0000000000000000 0000000000000c75 ffffffff81a2bbee ffff880ee09a75f8
[ 2854.153936] Call Trace:
[ 2854.181101] [<ffffffff81086722>] __might_sleep+0x52/0x90
[ 2854.192136] [<ffffffff8116d2b4>] __alloc_pages_nodemask+0x344/0xbe0
[ 2854.229682] [<ffffffff811b54aa>] alloc_pages_current+0x10a/0x1e0
[ 2854.255508] [<ffffffffa0663f19>] full_stripe_write+0x59/0xc0 [btrfs]
[ 2854.268600] [<ffffffffa0663fb9>] __raid56_parity_write+0x39/0x60 [btrfs]
[ 2854.282385] [<ffffffffa06640fb>] run_plug+0x11b/0x140 [btrfs]
[ 2854.294259] [<ffffffffa0664143>] btrfs_raid_unplug+0x23/0x70 [btrfs]
[ 2854.307334] [<ffffffff81307622>] blk_flush_plug_list+0x82/0x1f0
[ 2854.319542] [<ffffffff813077c4>] blk_finish_plug+0x34/0x50
[ 2854.330878] [<ffffffff812079c2>] writeback_sb_inodes+0x122/0x580
[ 2854.343256] [<ffffffff81208016>] wb_writeback+0x136/0x4e0
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Jens Axboe <axboe@fb.com>
---
fs/fs-writeback.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ae0f438..07c9c50 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1539,7 +1539,9 @@ static long writeback_sb_inodes(struct super_block *sb,
break;
}
}
+ spin_unlock(&wb->list_lock);
blk_finish_plug(&plug);
+ spin_lock(&wb->list_lock);
return wrote;
}
--
1.8.1
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-11 19:37 [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() Chris Mason
@ 2015-09-11 20:02 ` Linus Torvalds
2015-09-11 20:37 ` Linus Torvalds
0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-11 20:02 UTC (permalink / raw)
To: Chris Mason, LKML, linux-fsdevel, Josef Bacik, Dave Chinner,
Neil Brown, Jan Kara, Christoph Hellwig
I hate this fix.
On Fri, Sep 11, 2015 at 12:37 PM, Chris Mason <clm@fb.com> wrote:
> Linus, this is the plugging problem I mentioned in my btrfs pull. It
> impacts only MD raid10 and btrfs raid5/6, and I'm not wild about the
> patch. But I wanted to at least send in the basic fix for rc1 so this
> doesn't cause bigger problems for early testers:
>
> Commit d353d7587 added a plug/finish_plug pair to writeback_sb_inodes,
> but writeback_sb_inodes has a horrible secret...it's called with the
> wb->list_lock held.
Quite frankly, just dropping and retaking the lock around the
blk_finish_plug() is just disgusting. The whole "drop and retake lock"
pattern in general is a bad idea, because it can so easily break the
caller (because now the lock no longer covers things over the call).
Yes, in this case we already do something similar in
writeback_single_inode(), so I guess the argument is that it doesn't
make things much worse, and that the caller already cannot depend on
the lock being held. True, but no less disgusting for that. So we
could do this, but in this case I don't think there's any good
_reason_ for doing that disgusting thing.
How about we instead:
(a) revert that commit d353d7587 as broken (because it clearly is)
(b) add a big honking comment about the fact that we hold 'list_lock'
in writeback_sb_inodes()
(c) move the plugging up to wb_writeback() and writeback_inodes_wb()
_outside_ the spinlock.
because that way we not only avoid the ugliness, it should also be
more effective at plugging things since we gather _all_ the writeback
rather than just one superblock.
Let's not paper over a completely broken commit. Let's just admit that
commit d353d7587 was pure and utter shite, and rather than try to fix
up the mistake, make it all better!
Anyway, I will start by reverting that commit, and adding the comment.
I'm more than happy to take the patch that moves the plugging, but
since that one was about performance rather than correctness, I think
it would be good to just re-verify the numbers.
Dave?
Linus
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-11 20:02 ` Linus Torvalds
@ 2015-09-11 20:37 ` Linus Torvalds
2015-09-11 20:40 ` Josef Bacik
0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-11 20:37 UTC (permalink / raw)
To: Chris Mason, LKML, linux-fsdevel, Josef Bacik, Dave Chinner,
Neil Brown, Jan Kara, Christoph Hellwig
[-- Attachment #1: Type: text/plain, Size: 971 bytes --]
On Fri, Sep 11, 2015 at 1:02 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> How about we instead:
>
> (a) revert that commit d353d7587 as broken (because it clearly is)
>
> (b) add a big honking comment about the fact that we hold 'list_lock'
> in writeback_sb_inodes()
>
> (c) move the plugging up to wb_writeback() and writeback_inodes_wb()
> _outside_ the spinlock.
Ok, I've done (a) and (b) in my tree. And attached is the totally
untested patch for (c). It looks ObviouslyCorrect(tm), but since this
is a performance issue, I'm not going to commit it without some more
ACK's from people.
I obviously think this is a *much* better approach than dropping and
retaking the lock, but there might be something silly I'm missing.
For example, maybe we want to unplug and replug around the
"inode_sleep_on_writeback()" in wb_writeback()? So while the revert
was a no-brainer, this one I really want people to think about.
Linus
[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1102 bytes --]
fs/fs-writeback.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d8ea7ed411b2..587ac08eabb6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1546,12 +1546,15 @@ static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
.range_cyclic = 1,
.reason = reason,
};
+ struct blk_plug plug;
+ blk_start_plug(&plug);
spin_lock(&wb->list_lock);
if (list_empty(&wb->b_io))
queue_io(wb, &work);
__writeback_inodes_wb(wb, &work);
spin_unlock(&wb->list_lock);
+ blk_finish_plug(&plug);
return nr_pages - work.nr_pages;
}
@@ -1579,10 +1582,12 @@ static long wb_writeback(struct bdi_writeback *wb,
unsigned long oldest_jif;
struct inode *inode;
long progress;
+ struct blk_plug plug;
oldest_jif = jiffies;
work->older_than_this = &oldest_jif;
+ blk_start_plug(&plug);
spin_lock(&wb->list_lock);
for (;;) {
/*
@@ -1662,6 +1667,7 @@ static long wb_writeback(struct bdi_writeback *wb,
}
}
spin_unlock(&wb->list_lock);
+ blk_finish_plug(&plug);
return nr_pages - work->nr_pages;
}
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-11 20:37 ` Linus Torvalds
@ 2015-09-11 20:40 ` Josef Bacik
2015-09-11 21:04 ` Linus Torvalds
0 siblings, 1 reply; 54+ messages in thread
From: Josef Bacik @ 2015-09-11 20:40 UTC (permalink / raw)
To: Linus Torvalds, Chris Mason, LKML, linux-fsdevel, Dave Chinner,
Neil Brown, Jan Kara, Christoph Hellwig
On 09/11/2015 04:37 PM, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 1:02 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> How about we instead:
>>
>> (a) revert that commit d353d7587 as broken (because it clearly is)
>>
>> (b) add a big honking comment about the fact that we hold 'list_lock'
>> in writeback_sb_inodes()
>>
>> (c) move the plugging up to wb_writeback() and writeback_inodes_wb()
>> _outside_ the spinlock.
>
> Ok, I've done (a) and (b) in my tree. And attached is the totally
> untested patch for (c). It looks ObviouslyCorrect(tm), but since this
> is a performance issue, I'm not going to commit it without some more
> ACK's from people.
>
> I obviously think this is a *much* better approach than dropping and
> retaking the lock, but there might be something silly I'm missing.
>
> For example, maybe we want to unplug and replug around the
> "inode_sleep_on_writeback()" in wb_writeback()? So while the revert
> was a no-brainer, this one I really want people to think about.
So we talked about this when we were trying to figure out a solution.
The problem with this approach is now we have a plug that covers
multiple super blocks (__writeback_inodes_wb loops through the sb's
and starts writeback), which is likely to give us crappier performance than
no plug at all. Thanks,
Josef
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-11 20:40 ` Josef Bacik
@ 2015-09-11 21:04 ` Linus Torvalds
2015-09-11 22:06 ` Linus Torvalds
2015-09-11 23:06 ` Chris Mason
0 siblings, 2 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-09-11 21:04 UTC (permalink / raw)
To: Josef Bacik
Cc: Chris Mason, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
Jan Kara, Christoph Hellwig
On Fri, Sep 11, 2015 at 1:40 PM, Josef Bacik <jbacik@fb.com> wrote:
>
> So we talked about this when we were trying to figure out a solution. The
> problem with this approach is now we have a plug that covers multiple super
> blocks (__writeback_inodes_wb loops through the sb's and starts writeback),
> which is likely to give us crappier performance than no plug at all.
Why would that be? Either they are on separate disks, and the IO is
all independent anyway, and at most it got started by some (small)
CPU-amount later. Actual throughput should be the same. No?
Or the filesystems are on the same disk, in which case it should
presumably be a win to submit the IO together.
Of course, actual numbers would be the deciding factor if this really
is noticeable. But "cleaner code and saner locking" is definitely an
issue at least for me.
Linus
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-11 21:04 ` Linus Torvalds
@ 2015-09-11 22:06 ` Linus Torvalds
2015-09-11 23:16 ` Chris Mason
2015-09-11 23:06 ` Chris Mason
1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-11 22:06 UTC (permalink / raw)
To: Josef Bacik
Cc: Chris Mason, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
Jan Kara, Christoph Hellwig
On Fri, Sep 11, 2015 at 2:04 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Of course, actual numbers would be the deciding factor if this really
> is noticeable. But "cleaner code and saner locking" is definitely an
> issue at least for me.
Anyway, I'll hold off pushing out the revert too for a while, in the
hope that we'll have actual numbers for or against whatever the
particular solution should be.
I do get the feeling that the whole wb->list_lock needs more loving.
For example, there's that locked_inode_to_wb_and_lock_list() thing
(which might be better off trying to just do "trylock" on it, but
that's a separate issue) showing lock inversion worries.
Maybe we should *not* get that wb->list_lock early at all, and nest it
inside the inode spinlocks, and just move the list_lock locking down a
lot (ie not even try to hold it over big functions that then are
forced to releasing it anyway).
For example, realistically, it looks like the longest we ever *really*
hold that lock is at the top of the loop of writeback_sb_inodes() -
maybe we could just explicitly have a function that does "find the
first inode that matches this sb and needs writeout activity", and
literally only hold the lock over that function. And *not* take
the lock in the caller at all?
The callers seem to want that lock mainly because they do that
"list_empty(&wb->b_io)" test. But the way our doubly linked lists
work, "list_empty()" is actually something that can be done racily
without the lock held...
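Concretely, that narrow-locking idea might look something like this. This is a rough pseudocode sketch; next_writeout_inode() is a hypothetical helper, not anything that exists in fs/fs-writeback.c:

```c
/* Hypothetical: hold wb->list_lock only while picking the next inode,
 * instead of across all of writeback_sb_inodes(). */
static struct inode *next_writeout_inode(struct bdi_writeback *wb,
					 struct super_block *sb)
{
	struct inode *inode = NULL;

	spin_lock(&wb->list_lock);
	if (!list_empty(&wb->b_io)) {
		inode = wb_inode(wb->b_io.prev);
		/* ... skip or redirty inodes from other supers, take
		 * i_lock, move the chosen inode off b_io, etc ... */
	}
	spin_unlock(&wb->list_lock);
	return inode;	/* caller writes it out with no list_lock held */
}
```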
That said, I doubt anybody really wants to touch this, so at least for
now we're stuck with either the "plug outside the lock" or the "drop
and retake lock" options. It really would be lovely to have numbers
either way.
<insert princess leia gif>
"Dave, you're our only hope"
Linus
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-11 21:04 ` Linus Torvalds
2015-09-11 22:06 ` Linus Torvalds
@ 2015-09-11 23:06 ` Chris Mason
2015-09-11 23:13 ` Linus Torvalds
1 sibling, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-11 23:06 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josef Bacik, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
Jan Kara, Christoph Hellwig
On Fri, Sep 11, 2015 at 02:04:13PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 1:40 PM, Josef Bacik <jbacik@fb.com> wrote:
> >
> > So we talked about this when we were trying to figure out a solution. The
> > problem with this approach is now we have a plug that covers multiple super
> > blocks (__writeback_inodes_wb loops through the sb's and starts writeback),
> > which is likely to give us crappier performance than no plug at all.
>
> Why would that be? Either they are on separate disks, and the IO is
> all independent anyway, and at most it got started by some (small)
> CPU-amount later. Actual throughput should be the same. No?
>
> Or the filesystems are on the same disk, in which case it should
> presumably be a win to submit the IO together.
>
> Of course, actual numbers would be the deciding factor if this really
> is noticeable. But "cleaner code and saner locking" is definitely an
> issue at least for me.
Originally I was worried about the latency impact of holding the
plugs over more than one super with high end flash. I just didn't want
to hold onto the IO for longer than we had to.
But, since this isn't really latency sensitive anyway, if we find we're
not keeping the flash pipelines full the right answer is to short
circuit the plugging in general. I'd agree actual throughput should be
the same.
But benchmarking is the best choice, I'll be able to reproduce
Dave's original results without too much trouble. Our thinking for this
duct tape patch was the lock wasn't very hot and this was the best
immediate compromise between the bug and the perf improvement.
Happy to sign up for pushing the lock higher if that's what people would
rather see.
-chris
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-11 23:06 ` Chris Mason
@ 2015-09-11 23:13 ` Linus Torvalds
0 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-09-11 23:13 UTC (permalink / raw)
To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
Dave Chinner, Neil Brown, Jan Kara, Christoph Hellwig
On Fri, Sep 11, 2015 at 4:06 PM, Chris Mason <clm@fb.com> wrote:
>
> Originally I was worried about the latency impact of holding the
> plugs over more than one super with high end flash. I just didn't want
> to hold onto the IO for longer than we had to.
>
> But, since this isn't really latency sensitive anyway, if we find we're
> not keeping the flash pipelines full the right answer is to short
> circuit the plugging in general. I'd agree actual throughput should be
> the same.
Yeah, this only triggers for system-wide writeback, so I don't see
that it should be latency-sensitive at that level, afaik.
But hey, it's IO, and I've been surprised by magic pattern
sensitivites before...
Linus
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-11 22:06 ` Linus Torvalds
@ 2015-09-11 23:16 ` Chris Mason
2015-09-11 23:36 ` Linus Torvalds
0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-11 23:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josef Bacik, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
Jan Kara, Christoph Hellwig
On Fri, Sep 11, 2015 at 03:06:18PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 2:04 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Of course, actual numbers would be the deciding factor if this really
> > is noticeable. But "cleaner code and saner locking" is definitely an
> > issue at least for me.
>
> Anyway, I'll hold off pushing out the revert too for a while, in the
> hope that we'll have actual numbers for or against whatever the
> particular solution should be.
>
> I do get the feeling that the whole wb->list_lock needs more loving.
> For example, there's that locked_inode_to_wb_and_lock_list() thing
> (which might be better off trying to just do "trylock" on it, but
> that's a separate issue) showing lock inversion worries.
At the very least, it's kind of sad how many of us were surprised to
find the lock held when Dave's patch was unplugging. If this bug
slipped through, more are going to. It's also true that N-1 of those people
were really surprised about scheduling unplug functions, so maybe we
can't put all the blame on wb->list_lock.
>
> Maybe we should *not* get that wb->list_lock early at all, and nest it
> inside the inode spinlocks, and just move the list_lock locking down a
> lot (ie not even try to hold it over big functions that then are
> forced to releasing it anyway).
>
> For example, realistically, it looks like the longest we ever *really*
> hold that lock is at the top of the loop of writeback_sb_inodes() -
> maybe we could just explicitly have a function that does "find the
> first inode that matches this sb and needs writeout activity", and
> literally only hold the lock over that function. And *not* take
> the lock in the caller at all?
>
> The callers seem to want that lock mainly because they do that
> "list_empty(&wb->b_io)" test. But the way our doubly linked lists
> work, "list_empty()" is actually something that can be done racily
> without the lock held...
>
> That said, I doubt anybody really wants to touch this, so at least for
> now we're stuck with either the "plug outside the lock" or the "drop
> and retake lock" options. It really would be loveyl to haev numbers
> either way.
For 4.3 timeframes, what runs do you want to see numbers for:
1) revert
2) my hack
3) plug over multiple sbs (on different devices)
4) ?
-chris
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-11 23:16 ` Chris Mason
@ 2015-09-11 23:36 ` Linus Torvalds
2015-09-12 0:52 ` Linus Torvalds
2015-09-12 23:00 ` Chris Mason
0 siblings, 2 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-09-11 23:36 UTC (permalink / raw)
To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
Dave Chinner, Neil Brown, Jan Kara, Christoph Hellwig
On Fri, Sep 11, 2015 at 4:16 PM, Chris Mason <clm@fb.com> wrote:
>
> For 4.3 timeframes, what runs do you want to see numbers for:
>
> 1) revert
> 2) my hack
> 3) plug over multiple sbs (on different devices)
> 4) ?
Just 2 or 3.
I don't think the plain revert is all that interesting, and I think
the "anything else" is far too late for this merge window.
So we'll go with either (2) your patch (which I obviously don't
_like_, but apart from the ugliness I don't think there's anything
technically wrong with), or with (3) the "plug across a bigger area".
So the only issue with (3) is whether that's just "revert plus the
patch I sent out", or whether we should unplug/replug over the "wait
synchronously for an inode" case (iow, the
"inode_sleep_on_writeback()"). The existing plug code (that has the
spinlock issue) already has a "wait on inode" case, and did *not*
unplug over that call, but broadening the plugging further now ends up
having two of those "wait synchronously on inode".
Are we really ok with waiting synchronously for an inode while holding
the plug? No chance of deadlock (waiting for IO that we've plugged)?
That issue is true even of the current code, though, and I have _not_
really thought that through, it's just a worry.
Linus
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-11 23:36 ` Linus Torvalds
@ 2015-09-12 0:52 ` Linus Torvalds
2015-09-12 2:15 ` Chris Mason
2015-09-12 23:00 ` Chris Mason
1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-12 0:52 UTC (permalink / raw)
To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
Dave Chinner, Neil Brown, Jan Kara, Christoph Hellwig
On Fri, Sep 11, 2015 at 4:36 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Are we really ok with waiting synchronously for an inode while holding
> the plug? No chance of deadlock (waiting for IO that we've plugged)?
> That issue is true even of the current code, though, and I have _not_
> really thought that through, it's just a worry.
Never mind. We still flush the plug on explicit scheduling events. I
wonder why I thought we got rid of that. Some kind of "senior moment",
I guess.
Linus
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-12 0:52 ` Linus Torvalds
@ 2015-09-12 2:15 ` Chris Mason
2015-09-12 2:27 ` Linus Torvalds
0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-12 2:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josef Bacik, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
Jan Kara, Christoph Hellwig
On Fri, Sep 11, 2015 at 05:52:27PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 4:36 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Are we really ok with waiting synchronously for an inode while holding
> > the plug? No chance of deadlock (waiting for IO that we've plugged)?
> > That issue is true even of the current code, though, and I have _not_
> > really thought that through, it's just a worry.
>
> Never mind. We still flush the plug on explicit scheduling events. I
> wonder why I thought we got rid of that. Some kind of "senior moment",
But flushing on schedule is a little different. It ends up calling
blk_schedule_flush_plug() which will hand off work to kblockd through
blk_run_queue_async()
Not a huge deal, but if we're scheduling to wait for that IO, we should
really run the plug ourselves so that we're not waiting for kblockd too.
-chris
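For reference, the split Chris describes is visible in that era's block core. A simplified sketch (details elided, based on ~4.2 block/blk-core.c):

```c
/* Simplified from ~4.2 block/blk-core.c (sketch, not complete) */
static void queue_unplugged(struct request_queue *q, unsigned int depth,
			    bool from_schedule)
{
	if (from_schedule)
		blk_run_queue_async(q);	/* punt submission to kblockd */
	else
		__blk_run_queue(q);	/* submit in this context */
	/* ... */
}

void blk_finish_plug(struct blk_plug *plug)
{
	blk_flush_plug_list(plug, false);	/* explicit unplug: run now */
	/* ... */
}

/* kernel/sched/core.c, on a blocking context switch:
 *   blk_schedule_flush_plug(tsk) -> blk_flush_plug_list(plug, true) */
```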
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-12 2:15 ` Chris Mason
@ 2015-09-12 2:27 ` Linus Torvalds
0 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-09-12 2:27 UTC (permalink / raw)
To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
Dave Chinner, Neil Brown, Jan Kara, Christoph Hellwig
On Fri, Sep 11, 2015 at 7:15 PM, Chris Mason <clm@fb.com> wrote:
>
> But flushing on schedule is a little different. It ends up calling
> blk_schedule_flush_plug() which will hand off work to kblockd through
> blk_run_queue_async()
I was more worried about some actual deadlock from the changes. And
blk_schedule_flush_plug() is fine in that it doesn't actually remove
the plug, it just schedules the currently plugged pending IO, so the
IO will start from waiting on the inode, but the plug will still
remain for the rest of the writeback, and it all looks like it should
be fine.
And the reason we use kblockd is simple: stack usage. The reschedule
can happen pretty deep on the stack, we don't actually want to
necessarily then cause much more stack use through things like md/raid
allocating new requests etc.
So it all looks fine to me.
Btw, very tangentially related: grepping a bit shows that
"blk_flush_plug()" isn't actually used anywhere any more. Can we get
rid of that?
Linus
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-11 23:36 ` Linus Torvalds
2015-09-12 0:52 ` Linus Torvalds
@ 2015-09-12 23:00 ` Chris Mason
2015-09-12 23:29 ` Linus Torvalds
2015-09-13 23:12 ` Dave Chinner
1 sibling, 2 replies; 54+ messages in thread
From: Chris Mason @ 2015-09-12 23:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josef Bacik, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
Jan Kara, Christoph Hellwig
On Fri, Sep 11, 2015 at 04:36:39PM -0700, Linus Torvalds wrote:
> On Fri, Sep 11, 2015 at 4:16 PM, Chris Mason <clm@fb.com> wrote:
> >
> > For 4.3 timeframes, what runs do you want to see numbers for:
> >
> > 1) revert
> > 2) my hack
> > 3) plug over multiple sbs (on different devices)
> > 4) ?
>
> Just 2 or 3.
>
> I don't think the plain revert is all that interesting, and I think
> the "anything else" is far too late for this merge window.
I did the plain revert as well, just to have a baseline. This box is a
little different from Dave's. Bare metal two socket box (E5-2660 v2 @
2.20Ghz) with 144GB of ram. I have two pcie flash devices, one nvme and
one fusionio, and I put a one FS on each device (two mounts total).
The test created 1.6M files, 4K each. I used Dave's fs_mark command
line, spread out over 16 directories from each mounted filesystem. In
btrfs they are spread over subvolumes to cut down lock contention.
I need to change around the dirty ratios more to smooth out the IO, and
I had trouble with both XFS and btrfs getting runs that were not CPU
bound. I included the time to run sync at the end of the run because
the results were not very consistent without it.
The XFS runs generally had one CPU pegged at 100%, and I think this is
throwing off the results. On Monday I'll redo them with two (four?)
filesystems per flash device, hopefully that'll break things up.
The btrfs runs generally had all the CPUs pegged at 100%. I switched to
mount -o nodatasum and squeezed out an extra 50K files/sec at much lower
CPU utilization.
                 wall time   fs_mark files/sec   bytes written/sec
XFS:
baseline v4.2:   5m6s        118,578             272MB/s
Dave's patch:    4m46s       151,421             294MB/s
my hack:         5m5s        150,714             275MB/s
Linus plug:      5m15s       147,735             266MB/s

Btrfs (nodatasum):
baseline v4.2:   4m39s       242,643             313MB/s
Dave's patch:    3m46s       252,452             389MB/s
my hack:         3m48s       257,924             379MB/s
Linus plug:      3m58s       247,528             369MB/s
Bottom line, not as conclusive as I'd like. My hack doesn't seem to
hurt, but FS internals are consuming enough CPU that this lock just
isn't showing up.
Linus' plug patch is consistently slower, and I don't have a great
explanation. My guesses: not keeping the flash pipelines full, or the
imbalance between the different speed flash is averaging the overall
result down, or its my kblockd vs explicit unplug handwaving from
yesterday.
So, next step is either more runs on flash or grab a box with a bunch of
spindles. I'd rather do the spindle runs, I agree with Dave that his
patch should help much more on actual drives.
-chris
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-12 23:00 ` Chris Mason
@ 2015-09-12 23:29 ` Linus Torvalds
2015-09-12 23:46 ` Chris Mason
2015-09-13 23:12 ` Dave Chinner
1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-12 23:29 UTC (permalink / raw)
To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
Dave Chinner, Neil Brown, Jan Kara, Christoph Hellwig
On Sat, Sep 12, 2015 at 4:00 PM, Chris Mason <clm@fb.com> wrote:
>
> I did the plain revert as well, just to have a baseline.
Ahh, I ended up not expecting you to get this done until after rc1 was
out, so I in the meantime just merged my fix instead rather than leave
the expected scheduling-while-atomic problem.
And just as well that you did a baseline, since apparently the numbers
are all over the map. I don't see how your hack and dave's original
can _possibly_ differ that much, but they clearly did on your xfs
test. So there's probably huge variance that depends on random
details.
I'll leave things as they are until we have something that looks a bit
more believable ;)
Linus
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-12 23:29 ` Linus Torvalds
@ 2015-09-12 23:46 ` Chris Mason
2015-09-13 13:12 ` Chris Mason
0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-12 23:46 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josef Bacik, LKML, linux-fsdevel, Dave Chinner, Neil Brown,
Jan Kara, Christoph Hellwig
On Sat, Sep 12, 2015 at 04:29:06PM -0700, Linus Torvalds wrote:
> On Sat, Sep 12, 2015 at 4:00 PM, Chris Mason <clm@fb.com> wrote:
> >
> > I did the plain revert as well, just to have a baseline.
>
> Ahh, I ended up not expecting you to get this done until after rc1 was
> out, so I in the meantime just merged my fix instead rather than leave
> the expected scheduling-while-atomic problem.
Yeah, I wasn't sure I'd be able to do the runs, but it was a rainy
afternoon and this was more fun than cleaning. Really glad something
got in for rc1 either way.
>
> And just as well that you did a baseline, since apparently the numbers
> are all over the map. I don't see how your hack and dave's original
> can _possibly_ differ that much, but they clearly did on your xfs
> test. So there's probably huge variance that depends on random
> details.
I don't think the XFS numbers can be trusted too much since it was
basically bottlenecked behind that single pegged CPU. It was bouncing
around and I couldn't quite track it down to a process name (or perf
profile).
The btrfs numbers were much more consistent, but your patch is still a
win over plain 4.2.
>
> I'll leave things as they are until we have something that looks a bit
> more believable ;)
We can build from here, thanks Linus.
-chris
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-12 23:46 ` Chris Mason
@ 2015-09-13 13:12 ` Chris Mason
2015-09-13 22:56 ` Dave Chinner
0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-13 13:12 UTC (permalink / raw)
To: Linus Torvalds, Josef Bacik, LKML, linux-fsdevel, Dave Chinner,
Neil Brown, Jan Kara, Christoph Hellwig
On Sat, Sep 12, 2015 at 07:46:32PM -0400, Chris Mason wrote:
> I don't think the XFS numbers can be trusted too much since it was
> basically bottlenecked behind that single pegged CPU. It was bouncing
> around and I couldn't quite track it down to a process name (or perf
> profile).
I'll do more runs Monday, but I was able to grab a perf profile of the
pegged XFS CPU. It was just the writeback worker thread, and it
hit btrfs differently because we defer more of this stuff to endio
workers, effectively spreading it out over more CPUs.
With 4 mount points instead of 2, XFS goes from 140K files/sec to 250K.
Here's one of the profiles, but it bounced around a lot so I wouldn't
use this to actually tune anything:
11.42% kworker/u82:61 [kernel.kallsyms] [k] _raw_spin_lock
|
---_raw_spin_lock
|
|--83.43%-- xfs_extent_busy_trim
| xfs_alloc_compute_aligned
| |
| |--99.92%-- xfs_alloc_ag_vextent_near
| | xfs_alloc_ag_vextent
| | xfs_alloc_vextent
| | xfs_bmap_btalloc
| | xfs_bmap_alloc
| | xfs_bmapi_write
| | xfs_iomap_write_allocate
| | xfs_map_blocks
| | xfs_vm_writepage
| | __writepage
| | write_cache_pages
| | generic_writepages
| | xfs_vm_writepages
| | do_writepages
| | __writeback_single_inode
| | writeback_sb_inodes
| | __writeback_inodes_wb
| | wb_writeback
| | wb_do_writeback
| | wb_workfn
| | process_one_work
| | worker_thread
| | kthread
| | ret_from_fork
| --0.08%-- [...]
|
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-13 13:12 ` Chris Mason
@ 2015-09-13 22:56 ` Dave Chinner
0 siblings, 0 replies; 54+ messages in thread
From: Dave Chinner @ 2015-09-13 22:56 UTC (permalink / raw)
To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Jan Kara, Christoph Hellwig
On Sun, Sep 13, 2015 at 09:12:44AM -0400, Chris Mason wrote:
> On Sat, Sep 12, 2015 at 07:46:32PM -0400, Chris Mason wrote:
> > I don't think the XFS numbers can be trusted too much since it was
> > basically bottlenecked behind that single pegged CPU. It was bouncing
> > around and I couldn't quite track it down to a process name (or perf
> > profile).
>
> I'll do more runs Monday, but I was able to grab a perf profile of the
> pegged XFS CPU. It was just the writeback worker thread, and it
> hit btrfs differently because we defer more of this stuff to endio
> workers, effectively spreading it out over more CPUs.
>
> With 4 mount points instead of 2, XFS goes from 140K files/sec to 250K.
> Here's one of the profiles, but it bounced around a lot so I wouldn't
> use this to actually tune anything:
mkfs.xfs -d agcount=64 ....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-12 23:00 ` Chris Mason
2015-09-12 23:29 ` Linus Torvalds
@ 2015-09-13 23:12 ` Dave Chinner
2015-09-14 20:06 ` Linus Torvalds
1 sibling, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2015-09-13 23:12 UTC (permalink / raw)
To: Chris Mason, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Jan Kara, Christoph Hellwig
On Sat, Sep 12, 2015 at 07:00:27PM -0400, Chris Mason wrote:
> On Fri, Sep 11, 2015 at 04:36:39PM -0700, Linus Torvalds wrote:
> > On Fri, Sep 11, 2015 at 4:16 PM, Chris Mason <clm@fb.com> wrote:
> > >
> > > For 4.3 timeframes, what runs do you want to see numbers for:
> > >
> > > 1) revert
> > > 2) my hack
> > > 3) plug over multiple sbs (on different devices)
> > > 4) ?
> >
> > Just 2 or 3.
> >
> > I don't think the plain revert is all that interesting, and I think
> > the "anything else" is far too late for this merge window.
>
> I did the plain revert as well, just to have a baseline. This box is a
> little different from Dave's. Bare metal two socket box (E5-2660 v2 @
> 2.20Ghz) with 144GB of ram. I have two pcie flash devices, one nvme and
> one fusionio, and I put a one FS on each device (two mounts total).
>
> The test created 1.6M files, 4K each. I used Dave's fs_mark command
> line, spread out over 16 directories from each mounted filesystem. In
> btrfs they are spread over subvolumes to cut down lock contention.
>
> I need to change around the dirty ratios more to smooth out the IO, and
> I had trouble with both XFS and btrfs getting runs that were not CPU
> bound. I included the time to run sync at the end of the run because
> the results were not very consistent without it.
>
> The XFS runs generally had one CPU pegged at 100%, and I think this is
> throwing off the results. On Monday I'll redo them with two (four?)
> filesystems per flash device, hopefully that'll break things up.
>
> The btrfs runs generally had all the CPUs pegged at 100%. I switched to
> mount -o nodatasum and squeezed out an extra 50K files/sec at much lower
> CPU utilization.
>
> wall time fs_mark files/sec bytes written/sec
>
> XFS:
> baseline v4.2: 5m6s 118,578 272MB/s
> Dave's patch: 4m46s 151,421 294MB/s
> my hack: 5m5s 150,714 275MB/s
> Linus plug: 5m15s 147,735 266MB/s
>
>
> Btrfs (nodatasum):
> baseline v4.2: 4m39s 242,643 313MB/s
> Dave's patch: 3m46s 252,452 389MB/s
> my hack: 3m48s 257,924 379MB/s
> Linus plug: 3m58s 247,528 369MB/s
Really need to run these numbers on slower disks where block layer
merging makes a difference to performance. The high level plugging
improves performance on spinning disks by a huge amount on XFS
because the merging reduces the number of IOs issued to disk by 2
orders of magnitude. Plugging makes comparatively little difference
on devices that can sustain extremely high IOPS and hence sink the
tens to hundreds of thousand individual 4k IOs that this workload
generates through writeback.
i.e. while throughput increases, those aren't the numbers that matter
here - it's the change in write IO behaviour that needs to be
categorised and measured by the different patches...
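The merging effect being described here can be sketched with a toy model (this is not the block layer's actual code; the 256KB request cap is an assumed, typical limit): contiguous 4k writes queued up behind a plug collapse into a handful of large requests, while the same writes issued one at a time stay individual IOs.

```python
def merge_queued(writes, max_io=512):
    """Collapse contiguous (sector, nsectors) writes into merged requests.

    Toy model of elevator merging: sectors are 512 bytes, so max_io=512
    sectors caps each merged request at 256KB (an assumed typical limit).
    """
    merged = []
    for sector, n in sorted(writes):
        last_sector, last_n = merged[-1] if merged else (None, 0)
        # Merge if the new write starts where the last request ends and
        # the combined request stays under the size cap.
        if merged and last_sector + last_n == sector and last_n + n <= max_io:
            merged[-1] = (last_sector, last_n + n)
        else:
            merged.append((sector, n))
    return merged

# 1000 back-to-back 4k writes (8 sectors each), the kind of stream that
# writeback of many small files produces.
writes = [(i * 8, 8) for i in range(1000)]
print(len(writes), "queued writes ->", len(merge_queued(writes)), "merged IOs")
```

With the 256KB cap, the 1000 queued 4k writes collapse into 16 requests, close to the two-orders-of-magnitude reduction in issued IOs quoted above.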
(I'm on holidays, so I'm not going to get to this any time soon)
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-13 23:12 ` Dave Chinner
@ 2015-09-14 20:06 ` Linus Torvalds
2015-09-16 15:16 ` Chris Mason
0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-14 20:06 UTC (permalink / raw)
To: Dave Chinner
Cc: Chris Mason, Josef Bacik, LKML, linux-fsdevel, Neil Brown,
Jan Kara, Christoph Hellwig
On Sun, Sep 13, 2015 at 4:12 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Really need to run these numbers on slower disks where block layer
> merging makes a difference to performance.
Yeah. We've seen plugging and io schedulers not make much difference
for high-performance flash (although I think the people who argued
that noop should generally be used for non-rotating media were wrong,
I think - the elevator ends up still being critical to merging, and
while merging isn't a life-or-death situation, it tends to still
help).
For rotating rust with nasty seek times, the plugging is likely to
make the biggest difference.
Linus
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-14 20:06 ` Linus Torvalds
@ 2015-09-16 15:16 ` Chris Mason
2015-09-16 19:58 ` Jan Kara
0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-16 15:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dave Chinner, Josef Bacik, LKML, linux-fsdevel, Neil Brown,
Jan Kara, Christoph Hellwig
On Mon, Sep 14, 2015 at 01:06:25PM -0700, Linus Torvalds wrote:
> On Sun, Sep 13, 2015 at 4:12 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Really need to run these numbers on slower disks where block layer
> > merging makes a difference to performance.
>
> Yeah. We've seen plugging and io schedulers not make much difference
> for high-performance flash (although I think the people who argued
> that noop should generally be used for non-rotating media were wrong,
> I think - the elevator ends up still being critical to merging, and
> while merging isn't a life-or-death situation, it tends to still
> help).
Yeah, my big concern was that holding the plug longer would result in
lower overall perf because we weren't keeping the flash busy. So I
started with the flash boxes to make sure we weren't regressing past 4.2
levels at least.
I'm still worried about that, but this probably isn't the right
benchmark to show it. And if it's really a problem, it'll happen
everywhere we plug and not just here.
>
> For rotating rust with nasty seek times, the plugging is likely to
> make the biggest difference.
For rotating storage, I grabbed a big box and did the fs_mark run
against 8 spindles. These are all behind a megaraid card as jbods, so I
flipped the card's cache to write-through.
I changed around the run a bit, making enough files for fs_mark to run
for ~10 minutes, and I took out the sync. I ran only xfs to cut down on
the iterations, and after the fs_mark run, I did a short 30 second run with
blktrace in the background to capture the io sizes.
v4.2: 178K files/sec
Chinner: 192K files/sec
Mason: 192K files/sec
Linus: 193K files/sec
I added support to iowatcher to graph IO size, and attached the graph.
Short version, Linus' patch still gives bigger IOs and similar perf to
Dave's original. I should have done the blktrace runs for 60 seconds
instead of 30, I suspect that would even out the average sizes between
the three patches.
-chris
[-- Attachment #2: fs_mark.png --]
[-- Type: application/octet-stream, Size: 34363 bytes --]
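The average-IO-size measurement from those blktrace runs can be sketched as a small parser over blkparse's text output (the field layout is assumed from blkparse's typical default format and should be checked against your version; the sample lines are fabricated for illustration):

```python
import re

# Fabricated blkparse-style output: "dev cpu seq time pid action rwbs sector + nsectors [proc]".
# "C" events are completions; "W"-prefixed RWBS fields are writes.
SAMPLE = """\
8,16  0  101  1.000000  955  C  W  2048 + 512 [0]
8,16  0  102  1.010000  955  C  W  2560 + 1024 [0]
8,16  1  103  1.020000  955  C  WS 4096 + 8 [0]
"""

def avg_write_kb(blkparse_text):
    """Average completed-write size in KB (sectors are 512 bytes, so KB = sectors / 2)."""
    pat = re.compile(r"\sC\s+W\S*\s+(\d+)\s\+\s(\d+)")
    sizes = [int(m.group(2)) // 2 for m in pat.finditer(blkparse_text)]
    return sum(sizes) / len(sizes) if sizes else 0.0

print(avg_write_kb(SAMPLE))
```

Run over a longer trace window, the same computation would show whether the average sizes really do even out between the three patches.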
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-16 15:16 ` Chris Mason
@ 2015-09-16 19:58 ` Jan Kara
2015-09-16 20:00 ` Chris Mason
0 siblings, 1 reply; 54+ messages in thread
From: Jan Kara @ 2015-09-16 19:58 UTC (permalink / raw)
To: Chris Mason
Cc: Linus Torvalds, Dave Chinner, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Jan Kara, Christoph Hellwig
On Wed 16-09-15 11:16:21, Chris Mason wrote:
> On Mon, Sep 14, 2015 at 01:06:25PM -0700, Linus Torvalds wrote:
> > On Sun, Sep 13, 2015 at 4:12 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > Really need to run these numbers on slower disks where block layer
> > > merging makes a difference to performance.
> >
> > Yeah. We've seen plugging and io schedulers not make much difference
> > for high-performance flash (although I think the people who argued
> > that noop should generally be used for non-rotating media were wrong,
> > I think - the elevator ends up still being critical to merging, and
> > while merging isn't a life-or-death situation, it tends to still
> > help).
>
>
> Yeah, my big concern was that holding the plug longer would result in
> lower overall perf because we weren't keeping the flash busy. So I
> started with the flash boxes to make sure we weren't regressing past 4.2
> levels at least.
>
> I'm still worried about that, but this probably isn't the right
> benchmark to show it. And if it's really a problem, it'll happen
> everywhere we plug and not just here.
>
> >
> > For rotating rust with nasty seek times, the plugging is likely to
> > make the biggest difference.
>
> For rotating storage, I grabbed a big box and did the fs_mark run
> against 8 spindles. These are all behind a megaraid card as jbods, so I
> flipped the card's cache to write-through.
>
> I changed around the run a bit, making enough files for fs_mark to run
> for ~10 minutes, and I took out the sync. I ran only xfs to cut down on
> the iterations, and after the fs_mark run, I did a short 30 second run with
> blktrace in the background to capture the io sizes.
>
> v4.2: 178K files/sec
> Chinner: 192K files/sec
> Mason: 192K files/sec
> Linus: 193K files/sec
>
> I added support to iowatcher to graph IO size, and attached the graph.
>
> Short version, Linus' patch still gives bigger IOs and similar perf to
> Dave's original. I should have done the blktrace runs for 60 seconds
> instead of 30, I suspect that would even out the average sizes between
> the three patches.
Thanks for the data Chris. So I guess we are fine with what's currently in,
right?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-16 19:58 ` Jan Kara
@ 2015-09-16 20:00 ` Chris Mason
2015-09-16 22:07 ` Dave Chinner
0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-16 20:00 UTC (permalink / raw)
To: Jan Kara
Cc: Linus Torvalds, Dave Chinner, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig
On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > Short version, Linus' patch still gives bigger IOs and similar perf to
> > Dave's original. I should have done the blktrace runs for 60 seconds
> > instead of 30, I suspect that would even out the average sizes between
> > the three patches.
>
> Thanks for the data Chris. So I guess we are fine with what's currently in,
> right?
Looks like it works well to me.
-chris
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-16 20:00 ` Chris Mason
@ 2015-09-16 22:07 ` Dave Chinner
2015-09-17 0:37 ` Dave Chinner
0 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2015-09-16 22:07 UTC (permalink / raw)
To: Chris Mason, Jan Kara, Linus Torvalds, Josef Bacik, LKML,
linux-fsdevel, Neil Brown, Christoph Hellwig
On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote:
> On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> > On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > > Short version, Linus' patch still gives bigger IOs and similar perf to
> > > Dave's original. I should have done the blktrace runs for 60 seconds
> > > instead of 30, I suspect that would even out the average sizes between
> > > the three patches.
> >
> > Thanks for the data Chris. So I guess we are fine with what's currently in,
> > right?
>
> Looks like it works well to me.
Graph looks good, though I'll confirm it on my test rig once I get
out from under the pile of email and other stuff that is queued up
after being away for a week...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-16 22:07 ` Dave Chinner
@ 2015-09-17 0:37 ` Dave Chinner
2015-09-17 1:12 ` Linus Torvalds
2015-09-17 3:48 ` Chris Mason
0 siblings, 2 replies; 54+ messages in thread
From: Dave Chinner @ 2015-09-17 0:37 UTC (permalink / raw)
To: Chris Mason, Jan Kara, Linus Torvalds, Josef Bacik, LKML,
linux-fsdevel, Neil Brown, Christoph Hellwig
Cc: Tejun Heo
[cc Tejun]
On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote:
> > On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> > > On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > > > Short version, Linus' patch still gives bigger IOs and similar perf to
> > > > Dave's original. I should have done the blktrace runs for 60 seconds
> > > > instead of 30, I suspect that would even out the average sizes between
> > > > the three patches.
> > >
> > > Thanks for the data Chris. So I guess we are fine with what's currently in,
> > > right?
> >
> > Looks like it works well to me.
>
> Graph looks good, though I'll confirm it on my test rig once I get
> out from under the pile of email and other stuff that is queued up
> after being away for a week...
I ran some tests in the background while reading other email.....
TL;DR: Results look really bad - not only is the plugging
problematic, baseline writeback performance has regressed
significantly. We need to revert the plugging changes until the
underlying writeback performance regressions are sorted out.
In more detail, these tests were run on my usual 16p/16GB RAM
performance test VM with storage set up as described here:
http://permalink.gmane.org/gmane.linux.kernel/1768786
The test:
$ ~/tests/fsmark-10-4-test-xfs.sh
meta-data=/dev/vdc isize=512 agcount=500, agsize=268435455 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0
data = bsize=4096 blocks=134217727500, imaxpct=1
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=131072, version=2
= sectsz=512 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
# ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
# Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse% Count Size Files/sec App Overhead
0 80000 4096 106938.0 543310
0 160000 4096 102922.7 476362
0 240000 4096 107182.9 538206
0 320000 4096 107871.7 619821
0 400000 4096 99255.6 622021
0 480000 4096 103217.8 609943
0 560000 4096 96544.2 640988
0 640000 4096 100347.3 676237
0 720000 4096 87534.8 483495
0 800000 4096 72577.5 2556920
0 880000 4096 97569.0 646996
<RAM fills here, sustained performance is now dependent on writeback>
0 960000 4096 80147.0 515679
0 1040000 4096 100394.2 816979
0 1120000 4096 91466.5 739009
0 1200000 4096 85868.1 977506
0 1280000 4096 89691.5 715207
0 1360000 4096 52547.5 712810
0 1440000 4096 47999.1 685282
0 1520000 4096 47894.3 697261
0 1600000 4096 47549.4 789977
0 1680000 4096 40029.2 677885
0 1760000 4096 16637.4 12804557
0 1840000 4096 16883.6 24295975
0 1920000 4096 18852.1 22264137
0 2000000 4096 27107.3 10310408
0 2080000 4096 23654.1 19823567
0 2160000 4096 24588.9 19769484
0 2240000 4096 13959.3 40168163
0 2320000 4096 26239.2 17806813
0 2400000 4096 17858.5 24596715
0 2480000 4096 12072.9 38177650
0 2560000 4096 9989.7 43413729
<This is Real Bad. It gets significantly worse as the test goes on:>
0 6880000 4096 11773.4 16470400
0 6960000 4096 5556.5 27325343
0 7040000 4096 13816.6 22440991
0 7120000 4096 7322.4 42086686
0 7200000 4096 8033.0 31182907
0 7280000 4096 10589.4 30876729
0 7360000 4096 15844.5 19903105
0 7440000 4096 9187.7 35041718
0 7520000 4096 11800.9 12254920
0 7600000 4096 20049.9 13052020
0 7680000 4096 17444.9 12055262
0 7760000 4096 25900.6 10633885
0 7840000 4096 6398.8 12014901
0 7920000 4096 12147.4 14655678
<...>
0 9360000 4096 19406.0 5874193
0 9440000 4096 7261.3 14542806
0 9520000 4096 14978.7 21921186
0 9600000 4096 12656.7 10539894
real 10m9.240s
user 0m51.817s
sys 13m23.874s
A quick reading of the IO behaviour shows an average of about 1000
IOPS and 100MB/s, which matches roughly with the average files/s
created. There's no obvious peaks and gaps in the IO throughput, but
CPU usage has regular gaps where it falls to nothing and the only
"usage" is IO wait time.
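A back-of-envelope check of how those observed figures hang together, using the 4096-byte file size from the fs_mark command line above (the ~100MB/s and ~1000 IOPS values are the rough observations quoted here):

```python
bandwidth = 100 * 1024 * 1024   # ~100MB/s observed writeback throughput
iops = 1000                     # ~1000 IOPS observed
file_size = 4096                # fs_mark -s 4096

# Average IO size implied by the observations: ~102KB, i.e. roughly 25
# 4k files per IO.
print("avg IO size:", bandwidth // iops // 1024, "KB")

# Ceiling on files/sec set by writeback bandwidth alone: 25,600, which the
# measured 10-27k files/sec stays under, as expected.
print("implied files/sec ceiling:", bandwidth // file_size)
```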
Reverting just Linus' patch, so no plugging at all in the code
(i.e. the baseline numbers):
FSUse% Count Size Files/sec App Overhead
0 80000 4096 109017.5 586268
0 160000 4096 93548.1 553333
0 240000 4096 104141.7 817182
0 320000 4096 112846.2 597494
0 400000 4096 98628.6 641524
0 480000 4096 86586.1 1274195
0 560000 4096 95965.0 776961
0 640000 4096 84585.9 583330
0 720000 4096 93787.1 695658
0 800000 4096 94112.7 645925
0 880000 4096 92259.2 792950
<RAM fills here, sustained performance is now dependent on writeback>
0 960000 4096 62936.9 763642
0 1040000 4096 22484.8 737471
0 1120000 4096 22038.6 706791
0 1200000 4096 24096.5 680878
0 1280000 4096 23973.1 839826
0 1360000 4096 24761.9 779486
0 1440000 4096 25467.7 976022
0 1520000 4096 25338.7 935552
0 1600000 4096 27484.8 683351
0 1680000 4096 24383.2 3039781
0 1760000 4096 24857.0 2657008
<maintains 13-26k files/s all the way to the end .... >
0 9040000 4096 21438.6 4869317
0 9120000 4096 25387.1 642198
0 9200000 4096 26815.9 727763
0 9280000 4096 25175.5 4458811
0 9360000 4096 19832.8 982122
0 9440000 4096 17643.8 805599
0 9520000 4096 13923.4 7819152
0 9600000 4096 18101.3 1071163
real 7m7.926s
user 0m57.976s
sys 14m14.620s
The baseline of no plugging is a full 3 minutes faster than the
plugging behaviour of Linus' patch. The IO behaviour demonstrates
that, sustaining between 25-30,000 IOPS and throughput of
130-150MB/s. Hence, while Linus' patch does change the IO patterns,
it does not result in a performance improvement like the original
plugging patch did.
So I went back and had a look at my original patch, which I've been
using locally for a couple of years and was similar to the original
commit. It has this description from when I last updated the perf
numbers from testing done on 3.17:
| Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem,
| metadata CRCs enabled.
|
| Test:
|
| $ ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d
| /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d
| /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d
| /mnt/scratch/6 -d /mnt/scratch/7
|
| Result:
| wall sys create rate Physical write IO
| time CPU (avg files/s) IOPS Bandwidth
| ----- ----- ------------- ------ ---------
| unpatched 5m54s 15m32s 32,500+/-2200 28,000 150MB/s
| patched 3m19s 13m28s 52,900+/-1800 1,500 280MB/s
| improvement -43.8% -13.3% +62.7% -94.6% +86.6%
IOWs, what we are seeing here is that the baseline writeback
performance has regressed quite significantly since I took these
numbers back on 3.17. I'm running on exactly the same test setup;
the only difference is the kernel and so the current kernel baseline
is ~20% slower than the baseline numbers I have in my patch.
So, let's put the original [broken] plugging back in:
.....
0 8960000 4096 28420.0 2129129
0 9040000 4096 26863.0 6893094
0 9120000 4096 46721.4 705927
0 9200000 4096 46384.0 1081627
0 9280000 4096 39692.3 2148938
0 9360000 4096 41976.0 3540489
0 9440000 4096 28025.1 10364440
0 9520000 4096 38580.2 2457019
0 9600000 4096 33526.1 7661316
real 5m12.114s
user 0m50.294s
sys 12m54.425s
While we get back that regression and then some more, it's still
much slower than the high level plugging on slightly older kernels.
And the performance is, again, highly variable through the midpoint
of the test run. e.g:
....
0 4720000 4096 39224.4 9401401
0 4800000 4096 25342.9 18889665
0 4880000 4096 14841.7 36668633
0 4960000 4096 27624.6 17731873
0 5040000 4096 37874.8 11390831
0 5120000 4096 29009.2 12331965
0 5200000 4096 49568.0 4346746
0 5280000 4096 83074.5 877142
0 5360000 4096 32498.8 5401470
0 5440000 4096 55468.1 3188117
0 5520000 4096 19156.0 4112120
0 5600000 4096 44894.0 3186559
....
We have a low of ~15k files/s and a high of 83k files/s whilst
averaging about 38k files/s.
IOWs, ignoring the plugging, the recent changes to the writeback
code have really screwed up writeback performance and changed the
way it behaves from a throughput consistency point of view. I'd
suggest we should start by reverting all the plugging changes for
now, because there are other problems these plugging changes have
exposed in the current writeback code that need to be fixed first...
Cheers
Dave.
PS: I was going to run this on an ext4 filesystem to check it
wasn't just XFS seeing this, but that reminded me why I don't run
this test on ext4:
FSUse% Count Size Files/sec App Overhead
5 80000 4096 211668.3 534504
5 160000 4096 191731.8 532701
5 240000 4096 196251.2 489772
5 320000 4096 207597.1 464032
5 400000 4096 121616.0 601156
5 480000 4096 197729.9 573925
<Really, really fast in memory, but background writeback starts here>
5 560000 4096 128698.4 546362
5 640000 4096 137904.3 543718
5 720000 4096 63680.9 498110
5 800000 4096 25825.9 1538213
5 880000 4096 32708.7 632235
<dirty page threshold (20% RAM) is hit here. performance sucks from
here on in (still 10GB of free RAM) because the writeback IO
pattern is a random 4k write pattern>
5 960000 4096 6940.2 7250666
5 1040000 4096 7474.8 711872
5 1120000 4096 4127.5 14034167
5 1200000 4096 4822.1 752398
....
Hence it takes around an hour to do a single run of this test on
ext4 because its delayed allocation algorithm does not linearise
writes to small files with similar inode locality like XFS does, and
that's the behaviour that high level plugging really helps. IOWs,
this specific workload does not demonstrate any significant
performance difference on ext4 with the different plugging
algorithms because it does random IO rather than sequential IO.
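The sequential-vs-random distinction being drawn here can be illustrated with a toy seek-distance model (purely illustrative numbers, not a filesystem simulation): the same set of 4k file blocks written in disk-offset order costs almost no head movement on a spinning disk, while the same blocks written in an uncorrelated order costs orders of magnitude more.

```python
import random

def total_seek(sectors):
    """Sum of absolute head movement between consecutive IOs (toy disk model)."""
    return sum(abs(b - a) for a, b in zip(sectors, sectors[1:]))

random.seed(0)
blocks = list(range(0, 100000, 8))               # 4k file blocks spread across the disk
sequential = sorted(blocks)                      # writes linearised by disk offset
scattered = random.sample(blocks, len(blocks))   # the same writes in uncorrelated order

print(total_seek(sequential), "vs", total_seek(scattered))
```

Sequential issue only ever moves the head forward, so its total travel is just the span of the blocks; the scattered ordering makes a long random jump between almost every pair of IOs.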
I also ran it on btrfs, but it's pushing peaks of 15-25,000 IOPS
every 30s regardless of whether there is plugging at a high level or
not, so again btrfs behaviour is pretty much unaffected by the high
level plugging on this workload....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-17 0:37 ` Dave Chinner
@ 2015-09-17 1:12 ` Linus Torvalds
2015-09-17 2:14 ` Dave Chinner
2015-09-17 3:48 ` Chris Mason
1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-17 1:12 UTC (permalink / raw)
To: Dave Chinner
Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
On Wed, Sep 16, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> TL;DR: Results look really bad - not only is the plugging
> problematic, baseline writeback performance has regressed
> significantly.
Dave, if you're testing my current -git, the other performance issue
might still be the spinlock thing.
I haven't gotten the -tip pull for the spinlock regression yet, and
since you were the only one who noticed, I didn't consider it critical
enough to bypass the normal schedule, so -rc1 still has the crazy-bad
test-and-set loop for spinlocks.
So it's waiting in the locking fix branch (and is cc'd for stable),
but hasn't reached me yet.
If you want to test that theory out, you can make virt_spin_lock()
just unconditionally return false to disable the bad cmpxchg loop for
now.
The plugging IO pauses are interesting, though. Plugging really
*shouldn't* cause that kind of pauses, _regardless_ of what level it
happens on, so I wonder if the patch ends up just exposing some really
basic problem that just normally goes hidden.
Can you match up the IO wait times with just *where* it is waiting? Is
it waiting for that inode I_SYNC thing in inode_sleep_on_writeback()?
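One low-tech way to answer that question is to periodically capture the flusher thread's kernel stack (e.g. `for i in $(seq 200); do cat /proc/<pid>/stack; echo ---; sleep 0.05; done > stacks.txt`, which needs root) and count which sleep site tops the samples. A sketch of the aggregation step, run here on a canned sample (the frames below are illustrative, not captured output):

```python
from collections import Counter

# Canned /proc/<pid>/stack samples separated by "---"; in practice this would
# be read from the stacks.txt collected by the sampling loop above.
SAMPLES = """\
[<0>] inode_sleep_on_writeback+0x54/0x90
[<0>] wb_writeback+0x1d0/0x2c0
---
[<0>] inode_sleep_on_writeback+0x54/0x90
[<0>] wb_writeback+0x1d0/0x2c0
---
[<0>] io_schedule+0x16/0x40
[<0>] wait_on_page_bit+0x120/0x160
"""

def top_sleep_sites(text):
    """Count the innermost frame of each sampled stack, most frequent first."""
    tops = []
    for block in text.split("---"):
        lines = [l for l in block.strip().splitlines() if l]
        if lines:
            # Take the symbol name of the innermost frame, dropping the offset.
            tops.append(lines[0].split()[-1].split("+")[0])
    return Counter(tops).most_common()

print(top_sleep_sites(SAMPLES))
```

If the I_SYNC theory is right, inode_sleep_on_writeback would dominate the counts during the stalls.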
But it does sound like we should just revert the whole plugging for
now, if only because "it has odd effects".
Linus
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-17 1:12 ` Linus Torvalds
@ 2015-09-17 2:14 ` Dave Chinner
2015-09-17 19:39 ` Linus Torvalds
2015-09-17 23:03 ` Dave Chinner
0 siblings, 2 replies; 54+ messages in thread
From: Dave Chinner @ 2015-09-17 2:14 UTC (permalink / raw)
To: Linus Torvalds
Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
On Wed, Sep 16, 2015 at 06:12:29PM -0700, Linus Torvalds wrote:
> On Wed, Sep 16, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > TL;DR: Results look really bad - not only is the plugging
> > problematic, baseline writeback performance has regressed
> > significantly.
>
> Dave, if you're testing my current -git, the other performance issue
> might still be the spinlock thing.
I have the fix as the first commit in my local tree - it'll remain
there until I get a conflict after an update. :)
> The plugging IO pauses are interesting, though. Plugging really
> *shouldn't* cause that kind of pauses, _regardless_ of what level it
> happens on, so I wonder if the patch ends up just exposing some really
> basic problem that just normally goes hidden.
Right, that's what I suspect - it didn't happen on older kernels,
but we've just completely reworked the writeback code for the
control group awareness since I last looked really closely at
this...
> Can you match up the IO wait times with just *where* it is
> waiting? Is it waiting for that inode I_SYNC thing in
> inode_sleep_on_writeback()?
I'll do some more investigation.
> But it does sound like we should just revert the whole plugging for
> now, if only because "it has odd effects".
Yup - we can add it again next merge window once we get to the
bottom of whatever is going on and have had time to test the new
code properly.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-17 0:37 ` Dave Chinner
2015-09-17 1:12 ` Linus Torvalds
@ 2015-09-17 3:48 ` Chris Mason
2015-09-17 4:30 ` Dave Chinner
1 sibling, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-17 3:48 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote:
> [cc Tejun]
>
> On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> > On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote:
> > > On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> > > > On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > > > > Short version, Linus' patch still gives bigger IOs and similar perf to
> > > > > Dave's original. I should have done the blktrace runs for 60 seconds
> > > > > instead of 30, I suspect that would even out the average sizes between
> > > > > the three patches.
> > > >
> > > > Thanks for the data Chris. So I guess we are fine with what's currently in,
> > > > right?
> > >
> > > Looks like it works well to me.
> >
> > Graph looks good, though I'll confirm it on my test rig once I get
> > out from under the pile of email and other stuff that is queued up
> > after being away for a week...
>
> I ran some tests in the background while reading other email.....
>
> TL;DR: Results look really bad - not only is the plugging
> problematic, baseline writeback performance has regressed
> significantly. We need to revert the plugging changes until the
> underlying writeback performance regressions are sorted out.
>
> In more detail, these tests were run on my usual 16p/16GB RAM
> performance test VM with storage set up as described here:
>
> http://permalink.gmane.org/gmane.linux.kernel/1768786
>
> The test:
>
> $ ~/tests/fsmark-10-4-test-xfs.sh
> meta-data=/dev/vdc isize=512 agcount=500, agsize=268435455 blks
> = sectsz=512 attr=2, projid32bit=1
> = crc=1 finobt=1, sparse=0
> data = bsize=4096 blocks=134217727500, imaxpct=1
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0 ftype=1
> log =internal log bsize=4096 blocks=131072, version=2
> = sectsz=512 sunit=1 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
>
> # ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
> # Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
> # Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> # Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
> # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> # Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
> # App overhead is time in microseconds spent in the test not doing file writing related system calls.
>
> FSUse% Count Size Files/sec App Overhead
> 0 80000 4096 106938.0 543310
> 0 160000 4096 102922.7 476362
> 0 240000 4096 107182.9 538206
> 0 320000 4096 107871.7 619821
> 0 400000 4096 99255.6 622021
> 0 480000 4096 103217.8 609943
> 0 560000 4096 96544.2 640988
> 0 640000 4096 100347.3 676237
> 0 720000 4096 87534.8 483495
> 0 800000 4096 72577.5 2556920
> 0 880000 4096 97569.0 646996
>
> <RAM fills here, sustained performance is now dependent on writeback>
I think too many variables have changed here.
My numbers:
FSUse% Count Size Files/sec App Overhead
0 160000 4096 356407.1 1458461
0 320000 4096 368755.1 1030047
0 480000 4096 358736.8 992123
0 640000 4096 361912.5 1009566
0 800000 4096 342851.4 1004152
0 960000 4096 358357.2 996014
0 1120000 4096 338025.8 1004412
0 1280000 4096 354440.3 997380
0 1440000 4096 335225.9 1000222
0 1600000 4096 278786.1 1164962
0 1760000 4096 268161.4 1205255
0 1920000 4096 259158.0 1298054
0 2080000 4096 276939.1 1219411
0 2240000 4096 252385.1 1245496
0 2400000 4096 280674.1 1189161
0 2560000 4096 290155.4 1141941
0 2720000 4096 280842.2 1179964
0 2880000 4096 272446.4 1155527
0 3040000 4096 268827.4 1235095
0 3200000 4096 251767.1 1250006
0 3360000 4096 248339.8 1235471
0 3520000 4096 267129.9 1200834
0 3680000 4096 257320.7 1244854
0 3840000 4096 233540.8 1267764
0 4000000 4096 269237.0 1216324
0 4160000 4096 249787.6 1291767
0 4320000 4096 256185.7 1253776
0 4480000 4096 257849.7 1212953
0 4640000 4096 253933.9 1181216
0 4800000 4096 263567.2 1233937
0 4960000 4096 255666.4 1231802
0 5120000 4096 257083.2 1282893
0 5280000 4096 254285.0 1229031
0 5440000 4096 265561.6 1219472
0 5600000 4096 266374.1 1229886
0 5760000 4096 241003.7 1257064
0 5920000 4096 245047.4 1298330
0 6080000 4096 254771.7 1257241
0 6240000 4096 254355.2 1261006
0 6400000 4096 254800.4 1201074
0 6560000 4096 262794.5 1234816
0 6720000 4096 248103.0 1287921
0 6880000 4096 231397.3 1291224
0 7040000 4096 227898.0 1285359
0 7200000 4096 227279.6 1296340
0 7360000 4096 232561.5 1748248
0 7520000 4096 231055.3 1169373
0 7680000 4096 245738.5 1121856
0 7840000 4096 234961.7 1147035
0 8000000 4096 243973.0 1152202
0 8160000 4096 246292.6 1169527
0 8320000 4096 249433.2 1197921
0 8480000 4096 222576.0 1253650
0 8640000 4096 239407.5 1263257
0 8800000 4096 246037.1 1218109
0 8960000 4096 242306.5 1293567
0 9120000 4096 238525.9 3745133
0 9280000 4096 269869.5 1159541
0 9440000 4096 266447.1 4794719
0 9600000 4096 265748.9 1161584
0 9760000 4096 269067.8 1149918
0 9920000 4096 248896.2 1164112
0 10080000 4096 261342.9 1174536
0 10240000 4096 254778.3 1225425
0 10400000 4096 257702.2 1211634
0 10560000 4096 233972.5 1203665
0 10720000 4096 232647.1 1197486
0 10880000 4096 242320.6 1203984
I can push the dirty threshold lower to try and make sure we end up in
the hard dirty limits but none of this is going to be related to the
plugging patch. I do see lower numbers if I let the test run even
longer, but there are a lot of things in the way that can slow it down
as the filesystem gets that big.
I'll try again with lower ratios.
[ ... ]
> The baseline of no plugging is a full 3 minutes faster than the
> plugging behaviour of Linus' patch. The IO behaviour demonstrates
> that, sustaining between 25-30,000 IOPS and throughput of
> 130-150MB/s. Hence, while Linus' patch does change the IO patterns,
> it does not result in a performance improvement like the original
> plugging patch did.
>
How consistent is this across runs?
> So I went back and had a look at my original patch, which I've been
> using locally for a couple of years and was similar to the original
> commit. It has this description from when I last updated the perf
> numbers from testing done on 3.17:
>
> | Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem,
> | metadata CRCs enabled.
> |
> | Test:
> |
> | $ ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d
> | /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d
> | /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d
> | /mnt/scratch/6 -d /mnt/scratch/7
> |
> | Result:
> | wall sys create rate Physical write IO
> | time CPU (avg files/s) IOPS Bandwidth
> | ----- ----- ------------- ------ ---------
> | unpatched 5m54s 15m32s 32,500+/-2200 28,000 150MB/s
> | patched 3m19s 13m28s 52,900+/-1800 1,500 280MB/s
> | improvement -43.8% -13.3% +62.7% -94.6% +86.6%
>
> IOWs, what we are seeing here is that the baseline writeback
> performance has regressed quite significantly since I took these
> numbers back on 3.17. I'm running on exactly the same test setup;
> the only difference is the kernel and so the current kernel baseline
> is ~20% slower than the baseline numbers I have in my patch.
All of this is in a VM; I'd much rather see it reproduced on bare metal.
I've had really consistent results with VMs in the past, but there is a
huge amount of code between 3.17 and now.
-chris
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-17 3:48 ` Chris Mason
@ 2015-09-17 4:30 ` Dave Chinner
2015-09-17 12:13 ` Chris Mason
0 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2015-09-17 4:30 UTC (permalink / raw)
To: Chris Mason, Jan Kara, Linus Torvalds, Josef Bacik, LKML,
linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo
On Wed, Sep 16, 2015 at 11:48:59PM -0400, Chris Mason wrote:
> On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote:
> > [cc Tejun]
> >
> > On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> > # ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
> > # Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
> > # Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> > # Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
> > # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> > # Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
> > # App overhead is time in microseconds spent in the test not doing file writing related system calls.
> >
> > FSUse% Count Size Files/sec App Overhead
> > 0 80000 4096 106938.0 543310
> > 0 160000 4096 102922.7 476362
> > 0 240000 4096 107182.9 538206
> > 0 320000 4096 107871.7 619821
> > 0 400000 4096 99255.6 622021
> > 0 480000 4096 103217.8 609943
> > 0 560000 4096 96544.2 640988
> > 0 640000 4096 100347.3 676237
> > 0 720000 4096 87534.8 483495
> > 0 800000 4096 72577.5 2556920
> > 0 880000 4096 97569.0 646996
> >
> > <RAM fills here, sustained performance is now dependent on writeback>
>
> I think too many variables have changed here.
>
> My numbers:
>
> FSUse% Count Size Files/sec App Overhead
> 0 160000 4096 356407.1 1458461
> 0 320000 4096 368755.1 1030047
> 0 480000 4096 358736.8 992123
> 0 640000 4096 361912.5 1009566
> 0 800000 4096 342851.4 1004152
<snip>
> I can push the dirty threshold lower to try and make sure we end up in
> the hard dirty limits but none of this is going to be related to the
> plugging patch.
The point of this test is to drive writeback as hard as possible,
not to measure how fast we can create files in memory. i.e. if the
test isn't pushing the dirty limits on your machines, then it really
isn't putting a meaningful load on writeback, and so the plugging
won't make significant difference because writeback isn't IO
bound....
> I do see lower numbers if I let the test run even
> longer, but there are a lot of things in the way that can slow it down
> as the filesystem gets that big.
Sure, that's why I hit the dirty limits early in the test - so it
measures steady state performance before the fs gets to any
significant scalability limits....
> > The baseline of no plugging is a full 3 minutes faster than the
> > plugging behaviour of Linus' patch. The IO behaviour demonstrates
> > that, sustaining between 25-30,000 IOPS and throughput of
> > 130-150MB/s. Hence, while Linus' patch does change the IO patterns,
> > it does not result in a performance improvement like the original
> > plugging patch did.
>
> How consistent is this across runs?
That's what I'm trying to work out. I didn't report it until I got
consistently bad results - the numbers I reported were from the
third time I ran the comparison, and they were representative and
reproducible. I also ran my inode creation workload, which is similar
(but has no data writeback so doesn't go through writeback paths at
all), and that shows no change in performance, so this problem
(whatever it is) is only manifesting itself through data
writeback....
The only measurable change I've noticed in my monitoring graphs is
that there is a lot more iowait time than I normally see, even when
the plugging appears to be working as desired. That's what I'm
trying to track down now, and once I've got to the bottom of that I
should have some idea of where the performance has gone....
As it is, there are a bunch of other things going wrong with
4.3-rc1+ right now that I'm working through - I haven't updated my
kernel tree for 10 days because I've been away on holidays so I'm
doing my usual "-rc1 is broken again" dance that I do every release
cycle. (e.g. every second boot hangs because systemd appears to be
waiting for iscsi devices to appear without first starting the iscsi
target daemon. Never happened before today, every new kernel I've
booted today has hung on the first cold boot of the VM).
> > IOWs, what we are seeing here is that the baseline writeback
> > performance has regressed quite significantly since I took these
> > numbers back on 3.17. I'm running on exactly the same test setup;
> > the only difference is the kernel and so the current kernel baseline
> > is ~20% slower than the baseline numbers I have in my patch.
>
> All of this is in a VM; I'd much rather see it reproduced on bare metal.
> I've had really consistent results with VMs in the past, but there is a
> huge amount of code between 3.17 and now.
I'm pretty sure it's not the VM - with the locking fix in place,
everything else I've looked at (since applying the locking fix Linus
mentioned) is within measurement error compared to 4.2. The only
thing that is out of whack from a performance POV is data
writeback....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-17 4:30 ` Dave Chinner
@ 2015-09-17 12:13 ` Chris Mason
0 siblings, 0 replies; 54+ messages in thread
From: Chris Mason @ 2015-09-17 12:13 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, Linus Torvalds, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
On Thu, Sep 17, 2015 at 02:30:08PM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 11:48:59PM -0400, Chris Mason wrote:
> > On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote:
> > > [cc Tejun]
> > >
> > > On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> > > # ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
> > > # Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
> > > # Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> > > # Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
> > > # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> > > # Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
> > > # App overhead is time in microseconds spent in the test not doing file writing related system calls.
> > >
> > > FSUse% Count Size Files/sec App Overhead
> > > 0 80000 4096 106938.0 543310
> > > 0 160000 4096 102922.7 476362
> > > 0 240000 4096 107182.9 538206
> > > 0 320000 4096 107871.7 619821
> > > 0 400000 4096 99255.6 622021
> > > 0 480000 4096 103217.8 609943
> > > 0 560000 4096 96544.2 640988
> > > 0 640000 4096 100347.3 676237
> > > 0 720000 4096 87534.8 483495
> > > 0 800000 4096 72577.5 2556920
> > > 0 880000 4096 97569.0 646996
> > >
> > > <RAM fills here, sustained performance is now dependent on writeback>
> >
> > I think too many variables have changed here.
> >
> > My numbers:
> >
> > FSUse% Count Size Files/sec App Overhead
> > 0 160000 4096 356407.1 1458461
> > 0 320000 4096 368755.1 1030047
> > 0 480000 4096 358736.8 992123
> > 0 640000 4096 361912.5 1009566
> > 0 800000 4096 342851.4 1004152
>
> <snip>
>
> > I can push the dirty threshold lower to try and make sure we end up in
> > the hard dirty limits but none of this is going to be related to the
> > plugging patch.
>
> The point of this test is to drive writeback as hard as possible,
> not to measure how fast we can create files in memory. i.e. if the
> test isn't pushing the dirty limits on your machines, then it really
> isn't putting a meaningful load on writeback, and so the plugging
> won't make significant difference because writeback isn't IO
> bound....
It does end up IO bound on my rig, just because we do eventually hit the
dirty limits. Otherwise there would be zero benefit in fs_mark from
any patches vs plain v4.2.
But I set up a run last night with dirty_bytes at 3G and
dirty_background_bytes at 1.5G.
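For anyone reproducing this, the byte-based thresholds map onto the vm
sysctls like so (the actual knob names are vm.dirty_bytes and
vm.dirty_background_bytes; run as root - a sketch, not Chris's exact
setup script):

```shell
# Pin dirty thresholds to fixed byte values so writeback kicks in early,
# independent of RAM size. Writing a *_bytes knob clears the paired
# *_ratio knob, and vice versa.
sysctl -w vm.dirty_bytes=$((3 * 1024 * 1024 * 1024))                 # hard limit: 3G
sysctl -w vm.dirty_background_bytes=$((3 * 1024 * 1024 * 1024 / 2))  # background: 1.5G

# Equivalent via procfs:
#   echo 3221225472 > /proc/sys/vm/dirty_bytes
#   echo 1610612736 > /proc/sys/vm/dirty_background_bytes
```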
There is definitely variation, but nothing like what you saw:
FSUse% Count Size Files/sec App Overhead
0 160000 4096 317427.9 1524951
0 320000 4096 319723.9 1023874
0 480000 4096 336696.4 1053884
0 640000 4096 257113.1 1190851
0 800000 4096 257644.2 1198054
0 960000 4096 254896.6 1225610
0 1120000 4096 241052.6 1203227
0 1280000 4096 214961.2 1386236
0 1440000 4096 239985.7 1264659
0 1600000 4096 232174.3 1310018
0 1760000 4096 250477.9 1227289
0 1920000 4096 221500.9 1276223
0 2080000 4096 235212.1 1284989
0 2240000 4096 238580.2 1257260
0 2400000 4096 224182.6 1326821
0 2560000 4096 234628.7 1236402
0 2720000 4096 244675.3 1228400
0 2880000 4096 234364.0 1268408
0 3040000 4096 229712.6 1306148
0 3200000 4096 241170.5 1254490
0 3360000 4096 220487.8 1331456
0 3520000 4096 215831.7 1313682
0 3680000 4096 210934.7 1235750
0 3840000 4096 218435.4 1258077
0 4000000 4096 232127.7 1271555
0 4160000 4096 212017.6 1381525
0 4320000 4096 216309.3 1370558
0 4480000 4096 239072.4 1269086
0 4640000 4096 221959.1 1333164
0 4800000 4096 228396.8 1213160
0 4960000 4096 225747.5 1318503
0 5120000 4096 115727.0 1237327
0 5280000 4096 184171.4 1547357
0 5440000 4096 209917.8 1380510
0 5600000 4096 181074.7 1391764
0 5760000 4096 263516.7 1155172
0 5920000 4096 236405.8 1239719
0 6080000 4096 231587.2 1221408
0 6240000 4096 237118.8 1244272
0 6400000 4096 236773.2 1201428
0 6560000 4096 243987.5 1240527
0 6720000 4096 232428.0 1283265
0 6880000 4096 234839.9 1209152
0 7040000 4096 234947.3 1223456
0 7200000 4096 231463.1 1260628
0 7360000 4096 226750.3 1290098
0 7520000 4096 213632.0 1236409
0 7680000 4096 194710.2 1411595
0 7840000 4096 213963.1 4146893
0 8000000 4096 225109.8 1323573
0 8160000 4096 251322.1 1380271
0 8320000 4096 220167.2 1159390
0 8480000 4096 210991.2 1110593
0 8640000 4096 197922.8 1126072
0 8800000 4096 203539.3 1143501
0 8960000 4096 193041.7 1134329
0 9120000 4096 184667.9 1119222
0 9280000 4096 165968.7 1172738
0 9440000 4096 192767.3 1098361
0 9600000 4096 227115.7 1158097
0 9760000 4096 232139.8 1264245
0 9920000 4096 213320.5 1270505
0 10080000 4096 217013.4 1324569
0 10240000 4096 227171.6 1308668
0 10400000 4096 208591.4 1392098
0 10560000 4096 212006.0 1359188
0 10720000 4096 213449.3 1352084
0 10880000 4096 219890.1 1326240
0 11040000 4096 215907.7 1239180
0 11200000 4096 214207.2 1334846
0 11360000 4096 212875.2 1338429
0 11520000 4096 211690.0 1249519
0 11680000 4096 217013.0 1262050
0 11840000 4096 204730.1 1205087
0 12000000 4096 191146.9 1188635
0 12160000 4096 207844.6 1157033
0 12320000 4096 208857.7 1168111
0 12480000 4096 198256.4 1388368
0 12640000 4096 214996.1 1305412
0 12800000 4096 212332.9 1357814
0 12960000 4096 210325.8 1336127
0 13120000 4096 200292.1 1282419
0 13280000 4096 202030.2 1412105
0 13440000 4096 216553.7 1424076
0 13600000 4096 218721.7 1298149
0 13760000 4096 202037.4 1266877
0 13920000 4096 224032.3 1198159
0 14080000 4096 206105.6 1336489
0 14240000 4096 227540.3 1160841
0 14400000 4096 236921.7 1190394
0 14560000 4096 229343.3 1147451
0 14720000 4096 199435.1 1284374
0 14880000 4096 215177.3 1178542
0 15040000 4096 206194.1 1170832
0 15200000 4096 215762.3 1125633
0 15360000 4096 194511.0 1122947
0 15520000 4096 179008.5 1292603
0 15680000 4096 208636.9 1094960
0 15840000 4096 192173.1 1237891
0 16000000 4096 212888.9 1111551
0 16160000 4096 218403.0 1143400
0 16320000 4096 207260.5 1233526
0 16480000 4096 202123.2 1151509
0 16640000 4096 191033.0 1257706
0 16800000 4096 196865.4 1154520
0 16960000 4096 210361.2 1128930
0 17120000 4096 201755.2 1160469
0 17280000 4096 196946.6 1173529
0 17440000 4096 199677.8 1165750
0 17600000 4096 194248.4 1234944
0 17760000 4096 200027.9 1256599
0 17920000 4096 206507.0 1166820
0 18080000 4096 215082.7 1167599
0 18240000 4096 201475.5 1212202
0 18400000 4096 208247.6 1252255
0 18560000 4096 205482.7 1311436
0 18720000 4096 200111.9 1358784
0 18880000 4096 200028.3 1351332
0 19040000 4096 198873.4 1287400
0 19200000 4096 209609.3 1268400
0 19360000 4096 203538.6 1249787
0 19520000 4096 203427.9 1294105
0 19680000 4096 201905.3 1280714
0 19840000 4096 209642.9 1283281
0 20000000 4096 203438.9 1315427
0 20160000 4096 199690.7 1252267
0 20320000 4096 185965.2 1398905
0 20480000 4096 203221.6 1214029
0 20640000 4096 208654.8 1232679
0 20800000 4096 212488.6 1298458
0 20960000 4096 189701.1 1356640
0 21120000 4096 198522.1 1361240
0 21280000 4096 203857.3 1263402
0 21440000 4096 204616.8 1362853
0 21600000 4096 196310.6 1266710
0 21760000 4096 203275.4 1391150
0 21920000 4096 205998.5 1378741
0 22080000 4096 205434.2 1283787
0 22240000 4096 195918.0 1415912
0 22400000 4096 186193.0 1413623
0 22560000 4096 192911.3 1393471
0 22720000 4096 203726.3 1264281
0 22880000 4096 204853.4 1221048
0 23040000 4096 222803.2 1153031
0 23200000 4096 198558.6 1346256
0 23360000 4096 201001.4 1278817
0 23520000 4096 206225.2 1270440
0 23680000 4096 190894.2 1425299
0 23840000 4096 198555.6 1334122
0 24000000 4096 202386.4 1332157
0 24160000 4096 205103.1 1313607
>
> > I do see lower numbers if I let the test run even
> > longer, but there are a lot of things in the way that can slow it down
> > as the filesystem gets that big.
>
> Sure, that's why I hit the dirty limits early in the test - so it
> measures steady state performance before the fs gets to any
> significant scalability limits....
>
> > > The baseline of no plugging is a full 3 minutes faster than the
> > > plugging behaviour of Linus' patch. The IO behaviour demonstrates
> > > that, sustaining between 25-30,000 IOPS and throughput of
> > > 130-150MB/s. Hence, while Linus' patch does change the IO patterns,
> > > it does not result in a performance improvement like the original
> > > plugging patch did.
> >
> > How consistent is this across runs?
>
> That's what I'm trying to work out. I didn't report it until I got
> consistently bad results - the numbers I reported were from the
> third time I ran the comparison, and they were representative and
> reproducible. I also ran my inode creation workload, which is similar
> (but has no data writeback so doesn't go through writeback paths at
> all), and that shows no change in performance, so this problem
> (whatever it is) is only manifesting itself through data
> writeback....
The big change between Linus' patch and your patch is with Linus kblockd
is probably doing most of the actual unplug work (except for the last
super block in the list). If a process is waiting for dirty writeout
progress, it has to wait for that context switch to kblockd.
In the VM, that's going to hurt more than on my big two-socket, mostly
idle machine.
>
> The only measurable change I've noticed in my monitoring graphs is
> that there is a lot more iowait time than I normally see, even when
> the plugging appears to be working as desired. That's what I'm
> trying to track down now, and once I've got to the bottom of that I
> should have some idea of where the performance has gone....
>
> As it is, there are a bunch of other things going wrong with
> 4.3-rc1+ right now that I'm working through - I haven't updated my
> kernel tree for 10 days because I've been away on holidays so I'm
> doing my usual "-rc1 is broken again" dance that I do every release
> cycle. (e.g. every second boot hangs because systemd appears to be
> waiting for iscsi devices to appear without first starting the iscsi
> target daemon. Never happened before today, every new kernel I've
> booted today has hung on the first cold boot of the VM).
I've been doing 4.2 plus patches because rc1 didn't boot on this strange
box. Let me nail that down and rerun.
-chris
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-17 2:14 ` Dave Chinner
@ 2015-09-17 19:39 ` Linus Torvalds
2015-09-17 22:42 ` Chris Mason
2015-09-17 23:03 ` Dave Chinner
1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-17 19:39 UTC (permalink / raw)
To: Dave Chinner
Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
On Wed, Sep 16, 2015 at 7:14 PM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> Dave, if you're testing my current -git, the other performance issue
>> might still be the spinlock thing.
>
> I have the fix as the first commit in my local tree - it'll remain
> there until I get a conflict after an update. :)
Ok. I'm happy to report that you should get a conflict now, and that
the spinlock code should work well for your virtualized case again.
No updates on the plugging thing yet, I'll wait a bit and follow this
thread and see if somebody comes up with any explanations or theories
in the hope that we might not need to revert (or at least have a more
targeted change).
Linus
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-17 19:39 ` Linus Torvalds
@ 2015-09-17 22:42 ` Chris Mason
2015-09-17 23:08 ` Linus Torvalds
0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-17 22:42 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
On Thu, Sep 17, 2015 at 12:39:51PM -0700, Linus Torvalds wrote:
> On Wed, Sep 16, 2015 at 7:14 PM, Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> Dave, if you're testing my current -git, the other performance issue
> >> might still be the spinlock thing.
> >
> > I have the fix as the first commit in my local tree - it'll remain
> > there until I get a conflict after an update. :)
>
> Ok. I'm happy to report that you should get a conflict now, and that
> the spinlock code should work well for your virtualized case again.
>
> No updates on the plugging thing yet, I'll wait a bit and follow this
> thread and see if somebody comes up with any explanations or theories
> in the hope that we might not need to revert (or at least have a more
> targeted change).
Playing around with the plug a little, most of the unplugs are coming
from the cond_resched_lock(). Not really sure why we are doing the
cond_resched() there, we should be doing it before we retake the lock
instead.
This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
files/sec up to 213K. Average IO size is the same as 4.3-rc1.
It probably won't help Dave, since most of his unplugs should have been
from the cond_resched_lock() too.
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 587ac08..05ed541 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1481,6 +1481,19 @@ static long writeback_sb_inodes(struct super_block *sb,
wbc_detach_inode(&wbc);
work->nr_pages -= write_chunk - wbc.nr_to_write;
wrote += write_chunk - wbc.nr_to_write;
+
+ if (need_resched()) {
+ /*
+ * we're plugged and don't want to hand off to kblockd
+ * for the actual unplug work. But we do want to
+ * reschedule. So flush our plug and then
+ * schedule away
+ */
+ blk_flush_plug(current);
+ cond_resched();
+ }
+
+
spin_lock(&wb->list_lock);
spin_lock(&inode->i_lock);
if (!(inode->i_state & I_DIRTY_ALL))
@@ -1488,7 +1501,7 @@ static long writeback_sb_inodes(struct super_block *sb,
requeue_inode(inode, wb, &wbc);
inode_sync_complete(inode);
spin_unlock(&inode->i_lock);
- cond_resched_lock(&wb->list_lock);
+
/*
* bail out to wb_writeback() often enough to check
* background threshold and other termination conditions.
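For readers not steeped in the block layer, the reason blk_flush_plug(current)
avoids the kblockd handoff is the from_schedule flag on the underlying flush.
Roughly, from the 4.2-era code (paraphrased and simplified, not a verbatim
quote):

```
/* include/linux/blkdev.h (simplified) */
void blk_flush_plug(struct task_struct *tsk)           /* explicit flush */
{
        if (tsk->plug)
                blk_flush_plug_list(tsk->plug, false); /* dispatch inline */
}

void blk_schedule_flush_plug(struct task_struct *tsk)  /* a plugged task
                                                          is blocking */
{
        if (tsk->plug)
                blk_flush_plug_list(tsk->plug, true);  /* defer to kblockd */
}

/* block/blk-core.c, queue_unplugged() (simplified) */
if (from_schedule)
        blk_run_queue_async(q);  /* kick the kblockd workqueue */
else
        __blk_run_queue(q);      /* run the queue in the caller's context */
```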
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-17 2:14 ` Dave Chinner
2015-09-17 19:39 ` Linus Torvalds
@ 2015-09-17 23:03 ` Dave Chinner
2015-09-17 23:13 ` Linus Torvalds
1 sibling, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2015-09-17 23:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
On Thu, Sep 17, 2015 at 12:14:53PM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 06:12:29PM -0700, Linus Torvalds wrote:
> > On Wed, Sep 16, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > TL;DR: Results look really bad - not only is the plugging
> > > problematic, baseline writeback performance has regressed
> > > significantly.
> >
> > Dave, if you're testing my current -git, the other performance issue
> > might still be the spinlock thing.
>
> I have the fix as the first commit in my local tree - it'll remain
> there until I get a conflict after an update. :)
>
> > The plugging IO pauses are interesting, though. Plugging really
> > *shouldn't* cause that kind of pauses, _regardless_ of what level it
> > happens on, so I wonder if the patch ends up just exposing some really
> > basic problem that just normally goes hidden.
>
> Right, that's what I suspect - it didn't happen on older kernels,
> but we've just completely reworked the writeback code for the
> control group awareness since I last looked really closely at
> this...
>
> > Can you match up the IO wait times with just *where* it is
> > waiting? Is it waiting for that inode I_SYNC thing in
> > inode_sleep_on_writeback()?
>
> I'll do some more investigation.
Ok, I'm happy to report there is actually nothing wrong with the
plugging code that is in your tree. I finally tracked the problem I
was seeing down to a misbehaving RAID controller.[*]
With that problem sorted:
kernel files/s wall time
3.17 32500 5m54s
4.3-noplug 34400 5m25s
3.17-plug 52900 3m19s
4.3-badplug 60540 3m24s
4.3-rc1 56600 3m23s
So the 3.17/4.3-noplug baselines show no regression - 4.3 is slightly
faster. All the plugging variants show roughly the same improvement
and IO behaviour. These numbers are reproducible and there are no
weird performance inconsistencies during any of the 4.3-rc1 kernel
runs. Hence my numbers and observed behaviour now align with
Chris' results and so I think we can say the reworked high level
plugging is behaving as we expected it to.
Cheers,
Dave.
[*] It seems to have a dodgy battery connector, and so has been
"losing" battery backup and changing the cache mode of the HBA from
write back to write through. This results in changing from NVRAM
performance to SSD native performance and back again. A small
vibration would cause the connection to the battery to reconnect and
the controller would switch back to writeback mode. The few log
entries in the bios showed status changes anywhere from a few seconds
apart to minutes apart - enough for the cache status to change
several times during a 5-10 minute benchmark run.
I didn't notice the hardware was playing up because it wasn't
triggering the machine alert indicator through the bios like it's
supposed to and so the visible and audible alarms were not being
triggered, nor was the BMC logging the raid controller cache status
changes.
In the end, I noticed it by chance - during a low level test the
behaviour changed very obviously as one of my dogs ran past the
rack. I unplugged everything inside the server, plugged it all back
in, powered it back up and fiddled with cables until I found what
was causing the problem. And having done this, the BMC is now
sending warnings and the audible alarm is working when the battery
is disconnected... :/
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-17 22:42 ` Chris Mason
@ 2015-09-17 23:08 ` Linus Torvalds
2015-09-17 23:56 ` Chris Mason
0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-17 23:08 UTC (permalink / raw)
To: Chris Mason, Linus Torvalds, Dave Chinner, Jan Kara, Josef Bacik,
LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo
On Thu, Sep 17, 2015 at 3:42 PM, Chris Mason <clm@fb.com> wrote:
>
> Playing around with the plug a little, most of the unplugs are coming
> from the cond_resched_lock(). Not really sure why we are doing the
> cond_resched() there, we should be doing it before we retake the lock
> instead.
>
> This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
> files/sec up to 213K. Average IO size is the same as 4.3-rc1.
Ok, so at least for you, part of the problem really ends up being that
there's a mix of the "synchronous" unplugging (by the actual explicit
"blk_finish_plug(&plug);") and the writeback that is handed off to
kblockd_workqueue.
I'm not seeing why that should be an issue. Sure, there's some CPU
overhead to context switching, but I don't see that it should be that
big of a deal.
I wonder if there is something more serious wrong with the kblockd_workqueue.
Linus
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-17 23:03 ` Dave Chinner
@ 2015-09-17 23:13 ` Linus Torvalds
0 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-09-17 23:13 UTC (permalink / raw)
To: Dave Chinner
Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
On Thu, Sep 17, 2015 at 4:03 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Ok, I'm happy to report there is actually nothing wrong with the
> plugging code that is in your tree. I finally tracked the problem I
> was seeing down to a misbehaving RAID controller.[*]
Hey, that's great.
Just out of interest, try the patch that Chris just sent, which makes
the unplugging synchronous for the special "cond_resched()" case.
If that helps your case too, let's just do it, even while we wonder
why kblockd_workqueue messes up so noticeable.
Because regardless of the kblockd_workqueue questions, I think it's
nicer to avoid the cond_resched_lock(), and do the rescheduling while
we're not holding any locks.
Linus
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-17 23:08 ` Linus Torvalds
@ 2015-09-17 23:56 ` Chris Mason
2015-09-18 0:37 ` Dave Chinner
0 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-17 23:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
On Thu, Sep 17, 2015 at 04:08:19PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 3:42 PM, Chris Mason <clm@fb.com> wrote:
> >
> > Playing around with the plug a little, most of the unplugs are coming
> > from the cond_resched_lock(). Not really sure why we are doing the
> > cond_resched() there, we should be doing it before we retake the lock
> > instead.
> >
> > This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
> > files/sec up to 213K. Average IO size is the same as 4.3-rc1.
>
> Ok, so at least for you, part of the problem really ends up being that
> there's a mix of the "synchronous" unplugging (by the actual explicit
> "blk_finish_plug(&plug);") and the writeback that is handed off to
> kblockd_workqueue.
>
> I'm not seeing why that should be an issue. Sure, there's some CPU
> overhead to context switching, but I don't see that it should be that
> big of a deal.
>
> I wonder if there is something more serious wrong with the kblockd_workqueue.
I'm driving the box pretty hard, it's right on the line between CPU
bound and IO bound. So I've got 32 fs_mark processes banging away and
32 CPUs (16 really, with hyperthreading).
They are popping in and out of balance_dirty_pages() so I have high CPU
utilization alternating with high IO wait times. There are no reads at all,
so all of these waits are for buffered writes.
People in balance_dirty_pages are indirectly waiting on the unplug, so
maybe the context switch overhead on a loaded box is enough to explain
it. We've definitely gotten more than 9% by inlining small synchronous
items in btrfs in the past, but those were more explicitly synchronous.
I know it's painfully hand wavy. I don't see any other users of the
kblockd workqueues, and the perf profiles don't jump out at me. I'll
feel better about the patch if Dave confirms any gains.
-chris
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-17 23:56 ` Chris Mason
@ 2015-09-18 0:37 ` Dave Chinner
2015-09-18 1:50 ` Linus Torvalds
0 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2015-09-18 0:37 UTC (permalink / raw)
To: Chris Mason, Linus Torvalds, Jan Kara, Josef Bacik, LKML,
linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo
On Thu, Sep 17, 2015 at 07:56:47PM -0400, Chris Mason wrote:
> On Thu, Sep 17, 2015 at 04:08:19PM -0700, Linus Torvalds wrote:
> > On Thu, Sep 17, 2015 at 3:42 PM, Chris Mason <clm@fb.com> wrote:
> > >
> > > Playing around with the plug a little, most of the unplugs are coming
> > > from the cond_resched_lock(). Not really sure why we are doing the
> > > cond_resched() there, we should be doing it before we retake the lock
> > > instead.
> > >
> > > This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
> > > files/sec up to 213K. Average IO size is the same as 4.3-rc1.
> >
> > Ok, so at least for you, part of the problem really ends up being that
> > there's a mix of the "synchronous" unplugging (by the actual explicit
> > "blk_finish_plug(&plug);") and the writeback that is handed off to
> > kblockd_workqueue.
> >
> > I'm not seeing why that should be an issue. Sure, there's some CPU
> > overhead to context switching, but I don't see that it should be that
> > big of a deal.
It may well change the dispatch order of enough IOs for it to be
significant on an IO bound device.
> > I wonder if there is something more serious wrong with the kblockd_workqueue.
>
> I'm driving the box pretty hard, it's right on the line between CPU
> bound and IO bound. So I've got 32 fs_mark processes banging away and
> 32 CPUs (16 really, with hyperthreading).
I'm only using 8 threads right now, so I have ~6-7 idle CPUs on this
workload. Hence if it's CPU load related, I probably won't see any
change in behaviour.
> They are popping in and out of balance_dirty_pages() so I have high CPU
> utilization alternating with high IO wait times. There no reads at all,
> so all of these waits are for buffered writes.
>
> People in balance_dirty_pages are indirectly waiting on the unplug, so
> maybe the context switch overhead on a loaded box is enough to explain
> it. We've definitely gotten more than 9% by inlining small synchronous
> items in btrfs in the past, but those were more explicitly synchronous.
>
> I know it's painfully hand wavy. I don't see any other users of the
> kblockd workqueues, and the perf profiles don't jump out at me. I'll
> feel better about the patch if Dave confirms any gains.
In outright performance on my test machine, the difference in
files/s is noise. However, the consistency looks to be substantially
improved and the context switch rate is now running at under
3,000/sec. Numbers, including the std deviation of the files/s
number output during the fsmark run (averaged across 3 separate
benchmark runs):
                 files/s   std-dev   wall time
4.3-rc1-noplug    34400    2.0e04      5m25s
4.3-rc1           56600    2.3e04      3m23s
4.3-rc1-flush     56079    1.4e04      3m14s
std-dev is well down, and the improvement in wall time is large
enough to be significant.
Looks good to me.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-18 0:37 ` Dave Chinner
@ 2015-09-18 1:50 ` Linus Torvalds
2015-09-18 5:40 ` Dave Chinner
0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-18 1:50 UTC (permalink / raw)
To: Dave Chinner
Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
[-- Attachment #1: Type: text/plain, Size: 2457 bytes --]
On Thu, Sep 17, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >
>> > I'm not seeing why that should be an issue. Sure, there's some CPU
>> > overhead to context switching, but I don't see that it should be that
>> > big of a deal.
>
> It may well change the dispatch order of enough IOs for it to be
> significant on an IO bound device.
Hmm. Maybe. We obviously try to order the IO's a bit by inode, and I
could see the use of a workqueue maybe changing that sufficiently. But
it still sounds a bit unlikely.
And in fact, I think I have a better explanation.
> In outright performance on my test machine, the difference in
> files/s is noise. However, the consistency looks to be substantially
> improved and the context switch rate is now running at under
> 3,000/sec.
Hmm. I don't recall seeing you mention how many context switches per
second you had before. What is it down from?
However, I think I may have found something more interesting here.
The fact is that a *normal* schedule will trigger that whole
blk_schedule_flush_plug(), but a cond_resched() or a cond_resched_lock()
doesn't actually do a normal schedule at all. Those trigger a
*preemption* schedule.
And a preemption schedule does not trigger that unplugging at all.
Why? A kernel "preemption" very much tries to avoid touching thread
state, because the whole point is that normally we may be preempting
threads in random places, so we don't run things like
sched_submit_work(), because the thread may be in the middle of
*creating* that work, and we don't want to introduce races. The
preemption scheduling can also be done with "task->state" set to
sleeping, and it won't actually sleep.
Now, for the explicit schedules like "cond_resched()" and
"cond_resched_lock()", those races with obviously don't exist, but
they happen to share the same preemption scheduling logic.
So it turns out that as far as I can see, the whole "cond_resched()"
will not start any IO at all, and it will just be left on the thread
plug until we schedule back to the thread.
So I don't think this has anything to do with kblockd_workqueue. I
don't think it even gets to that point.
I may be missing something, but just to humor me, can you test the
attached patch *without* Chris's patch to do explicit plugging? This
should make cond_resched() and cond_resched_lock() run the unplugging.
It may be entirely broken, I haven't thought this entirely through.
Linus
[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 633 bytes --]
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 97d276ff1edb..388ea9e7ab8a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4548,6 +4548,7 @@ SYSCALL_DEFINE0(sched_yield)
int __sched _cond_resched(void)
{
if (should_resched(0)) {
+ sched_submit_work(current);
preempt_schedule_common();
return 1;
}
@@ -4572,9 +4573,10 @@ int __cond_resched_lock(spinlock_t *lock)
if (spin_needbreak(lock) || resched) {
spin_unlock(lock);
- if (resched)
+ if (resched) {
+ sched_submit_work(current);
preempt_schedule_common();
- else
+ } else
cpu_relax();
ret = 1;
spin_lock(lock);
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-18 1:50 ` Linus Torvalds
@ 2015-09-18 5:40 ` Dave Chinner
2015-09-18 6:04 ` Linus Torvalds
0 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2015-09-18 5:40 UTC (permalink / raw)
To: Linus Torvalds
Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
On Thu, Sep 17, 2015 at 06:50:29PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> >
> >> > I'm not seeing why that should be an issue. Sure, there's some CPU
> >> > overhead to context switching, but I don't see that it should be that
> >> > big of a deal.
> >
> > It may well change the dispatch order of enough IOs for it to be
> > significant on an IO bound device.
>
> Hmm. Maybe. We obviously try to order the IO's a bit by inode, and I
> could see the use of a workqueue maybe changing that sufficiently. But
> it still sounds a bit unlikely.
>
> And in fact, I think I have a better explanation.
>
> > In outright performance on my test machine, the difference in
> > files/s is noise. However, the consistency looks to be substantially
> > improved and the context switch rate is now running at under
> > 3,000/sec.
>
> Hmm. I don't recall seeing you mention how many context switches per
> second you had before. What is it down from?
Around 4000-4500/sec, so there's not a huge amount of context
switches going on. It's not like btrfs, where this workload
generates over 250,000 context switches/sec....
> However, I think I may have found something more interesting here.
>
> The fact is that a *normal* schedule will trigger that whole
> blk_schedule_flush_plug(), but a cond_sched() or a cond_sched_lock()
> doesn't actually do a normal schedule at all. Those trigger a
> *preemption* schedule.
Ok, makes sense - the plug is not being flushed as we switch away,
but Chris' patch makes it do that.
> So it turns out that as far as I can see, the whole "cond_resched()"
> will not start any IO at all, and it will just be left on the thread
> plug until we schedule back to the thread.
>
> So I don't think this has anything to do with kblockd_workqueue. I
> don't think it even gets to that point.
>
> I may be missing something, but just to humor me, can you test the
> attached patch *without* Chris's patch to do explicit plugging? This
> should make cond_resched() and cond_resched_lock() run the unplugging.
Context switches go back to the 4-4500/sec range. Otherwise
behaviour and performance is indistinguishable from Chris' patch.
Cheers,
Dave.
PS: just hit another "did this just get broken in 4.3-rc1" issue - I
can't run blktrace while there's an IO load because:
$ sudo blktrace -d /dev/vdc
BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
....
[ 641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
[ 641.425952] CPU: 13 PID: 11163 Comm: blktrace Not tainted 4.3.0-rc1-dgc+ #430
[ 641.427321] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 641.429066] 0000000000000005 ffff8800b731faa0 ffffffff81785609 00000000002040d0
[ 641.430585] ffff8800b731fb28 ffffffff81183688 0000000000000000 ffffffff824926f8
[ 641.432143] ffff8800b731faf0 0000000000000042 0000000000000010 0000000000000100
[ 641.433695] Call Trace:
[ 641.434196] [<ffffffff81785609>] dump_stack+0x4b/0x72
[ 641.435219] [<ffffffff81183688>] warn_alloc_failed+0xd8/0x130
[ 641.436402] [<ffffffff811863bb>] __alloc_pages_nodemask+0x66b/0x7d0
[ 641.437684] [<ffffffff811c0e19>] cache_grow.constprop.61+0xc9/0x2d0
[ 641.438933] [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
[ 641.440240] [<ffffffff811424f8>] relay_open+0x68/0x2c0
[ 641.441299] [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
[ 641.442492] [<ffffffff8115e041>] blk_trace_setup+0x51/0x90
[ 641.443586] [<ffffffff8115e229>] blk_trace_ioctl+0xb9/0x110
[ 641.444702] [<ffffffff811e586c>] ? mntput_no_expire+0x2c/0x1a0
[ 641.446419] [<ffffffff8176fa58>] blkdev_ioctl+0x528/0x690
[ 641.447489] [<ffffffff811fd37d>] block_ioctl+0x3d/0x50
[ 641.449194] [<ffffffff811d93da>] do_vfs_ioctl+0x2ba/0x490
[ 641.450491] [<ffffffff811d9629>] SyS_ioctl+0x79/0x90
[ 641.451575] [<ffffffff81d800ae>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 641.453231] Mem-Info:
[ 641.453714] active_anon:11651 inactive_anon:2411 isolated_anon:0
active_file:8456 inactive_file:2070698 isolated_file:0
unevictable:1011 dirty:374605 writeback:1176 unstable:0
slab_reclaimable:1514989 slab_unreclaimable:233692
mapped:9332 shmem:2250 pagetables:1891 bounce:0
free:24305 free_pcp:1448 free_cma:0
[ 641.461950] DMA free:15908kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 641.470017] lowmem_reserve[]: 0 2960 16005 16005
[ 641.471041] DMA32 free:57076kB min:2992kB low:3740kB high:4488kB active_anon:8692kB inactive_anon:2572kB active_file:6224kB inactive_file:1507352kB unevictable:1004kB isolated(anon):0kB isolated(file):0kB present:3129212kB managed:3032864kB mlocked:1004kB dirty:275336kB writeback:896kB mapped:6800kB shmem:1836kB slab_reclaimable:1086292kB slab_unreclaimable:183168kB kernel_stack:23600kB pagetables:1372kB unstable:0kB bounce:0kB free_pcp:3440kB local_pcp:16kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 641.479677] lowmem_reserve[]: 0 0 13044 13044
[ 641.480888] Normal free:18040kB min:13188kB low:16484kB high:19780kB active_anon:38264kB inactive_anon:7072kB active_file:27600kB inactive_file:6775516kB unevictable:3040kB isolated(anon):0kB isolated(file):0kB present:13631488kB managed:13357908kB mlocked:3040kB dirty:1223624kB writeback:5144kB mapped:30528kB shmem:7164kB slab_reclaimable:4975584kB slab_unreclaimable:755076kB kernel_stack:44912kB pagetables:6192kB unstable:0kB bounce:0kB free_pcp:3432kB local_pcp:88kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 641.492392] lowmem_reserve[]: 0 0 0 0
[ 641.493170] DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[ 641.496035] DMA32: 6584*4kB (UEM) 3783*8kB (UM) 17*16kB (M) 9*32kB (M) 3*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 57352kB
[ 641.498705] Normal: 3372*4kB (UE) 393*8kB (UEM) 37*16kB (UEM) 33*32kB (UEM) 6*64kB (EM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18664kB
[ 641.501479] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 641.503019] 2082056 total pagecache pages
[ 641.503759] 0 pages in swap cache
[ 641.504403] Swap cache stats: add 0, delete 0, find 0/0
[ 641.505374] Free swap = 497976kB
[ 641.506002] Total swap = 497976kB
[ 641.506622] 4194173 pages RAM
[ 641.507171] 0 pages HighMem/MovableOnly
[ 641.507878] 92503 pages reserved
This is from:
(gdb) l *(relay_open+0x68)
0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
577 return NULL;
578 if (subbuf_size > UINT_MAX / n_subbufs)
579 return NULL;
580
581 chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
582 if (!chan)
583 return NULL;
584
585 chan->version = RELAYFS_CHANNEL_VERSION;
586 chan->n_subbufs = n_subbufs;
and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
that fails here....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-18 5:40 ` Dave Chinner
@ 2015-09-18 6:04 ` Linus Torvalds
2015-09-18 6:06 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 54+ messages in thread
From: Linus Torvalds @ 2015-09-18 6:04 UTC (permalink / raw)
To: Dave Chinner, Jens Axboe
Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Ok, makes sense - the plug is not being flushed as we switch away,
> but Chris' patch makes it do that.
Yup.
And I actually think Chris' patch is better than the one I sent out
(but maybe the scheduler people should take a look at the behavior of
cond_resched()), I just wanted you to test that to verify the
behavior.
The fact that Chris' patch ends up lowering the context switches
(because it does the unplugging directly) is also an argument for his
approach.
I just wanted to understand the oddity with kblockd_workqueue. And I
think that's solved.
> Context switches go back to the 4-4500/sec range. Otherwise
> behaviour and performance is indistinguishable from Chris' patch.
.. this was exactly what I wanted to hear. So it sounds like we have
no odd unexplained behavior left in this area.
Which is not to say that there wouldn't be room for improvement, but
it just makes me much happier about the state of these patches to feel
like we understand what was going on.
> PS: just hit another "did this just get broken in 4.3-rc1" issue - I
> can't run blktrace while there's a IO load because:
>
> $ sudo blktrace -d /dev/vdc
> BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
> Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
> ....
>
> [ 641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
> [ 641.438933] [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
> [ 641.440240] [<ffffffff811424f8>] relay_open+0x68/0x2c0
> [ 641.441299] [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
>
> gdb) l *(relay_open+0x68)
> 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
> 577 return NULL;
> 578 if (subbuf_size > UINT_MAX / n_subbufs)
> 579 return NULL;
> 580
> 581 chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
> 582 if (!chan)
> 583 return NULL;
> 584
> 585 chan->version = RELAYFS_CHANNEL_VERSION;
> 586 chan->n_subbufs = n_subbufs;
>
> and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
> and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
> that fails here....
Hm. Have you always had MAX_SMP (and the NR_CPUS==8192 that it causes)?
From a quick check, none of this code seems to be new.
That said, having that
struct rchan_buf *buf[NR_CPUS];
in "struct rchan" really is something we should fix. We really should
strive to not allocate things by CONFIG_NR_CPUS, but by the actual
real CPU count.
This looks to be mostly Jens' code, and much of it harkens back to 2006. Jens?
Linus
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-18 6:04 ` Linus Torvalds
@ 2015-09-18 6:06 ` Linus Torvalds
2015-09-18 14:21 ` Jens Axboe
2015-09-18 13:16 ` Chris Mason
2015-09-18 22:17 ` Dave Chinner
2 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-18 6:06 UTC (permalink / raw)
To: Dave Chinner, Jens Axboe
Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
Gaah, my mailer autocompleted Jens' email with an old one..
Sorry for the repeat email with the correct address.
On Thu, Sep 17, 2015 at 11:04 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> PS: just hit another "did this just get broken in 4.3-rc1" issue - I
>> can't run blktrace while there's a IO load because:
>>
>> $ sudo blktrace -d /dev/vdc
>> BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
>> Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
>> ....
>>
>> [ 641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
>> [ 641.438933] [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
>> [ 641.440240] [<ffffffff811424f8>] relay_open+0x68/0x2c0
>> [ 641.441299] [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
>>
>> gdb) l *(relay_open+0x68)
>> 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
>> 577 return NULL;
>> 578 if (subbuf_size > UINT_MAX / n_subbufs)
>> 579 return NULL;
>> 580
>> 581 chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
>> 582 if (!chan)
>> 583 return NULL;
>> 584
>> 585 chan->version = RELAYFS_CHANNEL_VERSION;
>> 586 chan->n_subbufs = n_subbufs;
>>
>> and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
>> and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
>> that fails here....
>
> Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)?
> From a quick check, none of this code seems to be new.
>
> That said, having that
>
> struct rchan_buf *buf[NR_CPUS];
>
> in "struct rchan" really is something we should fix. We really should
> strive to not allocate things by CONFIG_NR_CPU's, but by the actual
> real CPU count.
>
> This looks to be mostly Jens' code, and much of it harkens back to 2006. Jens?
>
> Linus
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-18 6:04 ` Linus Torvalds
2015-09-18 6:06 ` Linus Torvalds
@ 2015-09-18 13:16 ` Chris Mason
2015-09-18 14:23 ` Jens Axboe
2015-09-18 22:17 ` Dave Chinner
2 siblings, 1 reply; 54+ messages in thread
From: Chris Mason @ 2015-09-18 13:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dave Chinner, Jens Axboe, Jan Kara, Josef Bacik, LKML,
linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo
On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Ok, makes sense - the plug is not being flushed as we switch away,
> > but Chris' patch makes it do that.
>
> Yup.
Huh, that does make much more sense, thanks Linus. I'm wondering where
else I've assumed that cond_resched() unplugged.
>
> And I actually think Chris' patch is better than the one I sent out
> (but maybe the scheduler people should take a look at the behavior of
> cond_resched()), I just wanted you to test that to verify the
> behavior.
Ok, I'll fix up the description and comments and send out.
-chris
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-18 6:06 ` Linus Torvalds
@ 2015-09-18 14:21 ` Jens Axboe
0 siblings, 0 replies; 54+ messages in thread
From: Jens Axboe @ 2015-09-18 14:21 UTC (permalink / raw)
To: Linus Torvalds, Dave Chinner
Cc: Chris Mason, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
On 09/18/2015 12:06 AM, Linus Torvalds wrote:
> Gaah, my mailer autocompleted Jens' email with an old one..
>
> Sorry for the repeat email with the correct address.
>
> On Thu, Sep 17, 2015 at 11:04 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>
>>> PS: just hit another "did this just get broken in 4.3-rc1" issue - I
>>> can't run blktrace while there's a IO load because:
>>>
>>> $ sudo blktrace -d /dev/vdc
>>> BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
>>> Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
>>> ....
>>>
>>> [ 641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
>>> [ 641.438933] [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
>>> [ 641.440240] [<ffffffff811424f8>] relay_open+0x68/0x2c0
>>> [ 641.441299] [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
>>>
>>> gdb) l *(relay_open+0x68)
>>> 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
>>> 577 return NULL;
>>> 578 if (subbuf_size > UINT_MAX / n_subbufs)
>>> 579 return NULL;
>>> 580
>>> 581 chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
>>> 582 if (!chan)
>>> 583 return NULL;
>>> 584
>>> 585 chan->version = RELAYFS_CHANNEL_VERSION;
>>> 586 chan->n_subbufs = n_subbufs;
>>>
>>> and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
>>> and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
>>> that fails here....
>>
>> Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)?
>> From a quick check, none of this code seems to be new.
>>
>> That said, having that
>>
>> struct rchan_buf *buf[NR_CPUS];
>>
>> in "struct rchan" really is something we should fix. We really should
>> strive to not allocate things by CONFIG_NR_CPU's, but by the actual
>> real CPU count.
>>
>> This looks to be mostly Jens' code, and much of it harkens back to 2006. Jens?
The relayfs code mostly came out of IBM, but yes, that alloc doesn't
look nice. Not a regression, though, I don't think that has changed in
years. I'll take a stab at fixing this.
--
Jens Axboe
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-18 13:16 ` Chris Mason
@ 2015-09-18 14:23 ` Jens Axboe
2015-09-18 15:32 ` Linus Torvalds
0 siblings, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2015-09-18 14:23 UTC (permalink / raw)
To: Chris Mason, Linus Torvalds, Dave Chinner, Jan Kara, Josef Bacik,
LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo
On 09/18/2015 07:16 AM, Chris Mason wrote:
> On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote:
>> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>
>>> Ok, makes sense - the plug is not being flushed as we switch away,
>>> but Chris' patch makes it do that.
>>
>> Yup.
>
> Huh, that does make much more sense, thanks Linus. I'm wondering where
> else I've assumed that cond_resched() unplugged.
>
>>
>> And I actually think Chris' patch is better than the one I sent out
>> (but maybe the scheduler people should take a look at the behavior of
>> cond_resched()), I just wanted you to test that to verify the
>> behavior.
It makes no sense for a preemption schedule to NOT unplug, the fact that
it doesn't is news to me as well. It was never the intent of
unplug-on-schedule to skip the unplug for certain schedule-out events,
that seems like very odd behavior.
--
Jens Axboe
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-18 14:23 ` Jens Axboe
@ 2015-09-18 15:32 ` Linus Torvalds
2015-09-18 15:59 ` Peter Zijlstra
0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-18 15:32 UTC (permalink / raw)
To: Jens Axboe, Ingo Molnar, Peter Zijlstra, Frederic Weisbecker
Cc: Chris Mason, Dave Chinner, Jan Kara, Josef Bacik, LKML,
linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo
On Fri, Sep 18, 2015 at 7:23 AM, Jens Axboe <axboe@fb.com> wrote:
>
> It makes no sense for preemption schedule to NOT unplug, the fact that it
> doesn't is news to me as well. It was never the intent of the
> unplug-on-schedule to NOT unplug for certain schedule out events, that seems
> like very odd behavior.
Actually, even a *full* schedule doesn't unplug, unless the process is
going to sleep. See sched_submit_work(), which will only call the
unplugging if the process is actually going to sleep (ok, so it's a
bit subtle if you don't know the state rules, but it's the
"!tsk->state" check there)
So preemption and cond_resched() isn't _that_ odd. We've basically
treated a non-sleeping schedule as a no-op for the task work.
The thinking was probably that it might be better to delay starting
the IO in case we get scheduled back quickly, and we're obviously not
actually _sleeping_, so it's likely not too bad.
Now, that's probably bogus, and I think that we should perhaps just
make the rule be that "if we actually switch to another task, we run
blk_schedule_flush_plug()".
But it should be noted that that really *does* introduce a lot of new
potential races. Traditionally, our block layer plugging has been
entirely thread-synchronous, and would never happen asynchronously.
But with preemption, that "switch to another thread" really *does*
happen asynchronously.
So making things always happen on task switch is actually fairly
dangerous, and potentially adds the need for much more synchronization
for the IO submission.
What we possibly *could* make the scheduler rule be:
- if it's not an actual PREEMPT_ACTIVE (ie in a random place)
- _and_ we actually switch to another thread
- _then_ do the whole blk_schedule_flush_plug(tsk) thing.
adding some scheduler people to the explicit cc list.
That said, the "cond_resched[_lock]()" functions currently always set
PREEMPT_ACTIVE (indirectly - they use preempt_schedule_common()), so
even though those are synchronous, right now they *look* asynchronous
to the scheduler, so we'd still have to sort that out.
Ingo/Peter/Frederic? Comments?
Linus
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-18 15:32 ` Linus Torvalds
@ 2015-09-18 15:59 ` Peter Zijlstra
2015-09-18 16:02 ` Peter Zijlstra
2015-09-18 16:12 ` Linus Torvalds
0 siblings, 2 replies; 54+ messages in thread
From: Peter Zijlstra @ 2015-09-18 15:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, Chris Mason,
Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
On Fri, Sep 18, 2015 at 08:32:15AM -0700, Linus Torvalds wrote:
> Actually, even a *full* schedule doesn't unplug, unless the process is
> going to sleep. See sched_submit_work(), which will only call the
> unplugging if the process is actually going to sleep (ok, so it's a
> bit subtle if you don't know the state rules, but it's the
> "!tsk->state" check there)
>
> So preemption and cond_resched() isn't _that_ odd. We've basically
> treated a non-sleeping schedule as a no-op for the task work.
>
> The thinking was probably that it might be better to delay starting
> the IO in case we get scheduled back quickly, and we're obviously not
> actually _sleeping_, so it's likely not too bad.
>
> Now, that's probably bogus, and I think that we should perhaps just
> make the rule be that "if we actually switch to another task, we run
> blk_schedule_flush_plug()".
That gets to be a tad tricky. We must either do it while holding
scheduler locks (and I would really want to avoid calling block layer
code with those held) or when the new task is already scheduled (and
that would suck too, because then we account stuff to the wrong task).
> But it should be noted that that really *does* introduce a lot of new
> potential races. Traditionally, our block layer plugging has been
> entirely thread-synchronous, and would never happen asynchronously.
> But with preemption, that "switch to another thread" really *does*
> happen asynchronously.
>
> So making things always happen on task switch is actually fairly
> dangerous, and potentially adds the need for much more synchronization
> for the IO submission.
Can't say I'm a fan of it :/
> What we possibly *could* make the scheduler rule be:
>
> - if it's not an actual PREEMPT_ACTIVE (ie in a random place)
PREEMPT_ACTIVE is actually a recursion flag, 'random place' does not
factor into it. That is, most 'random places' will not have
PREEMPT_ACTIVE set.
> - _and_ we actually switch to another thread
This is 'hard', as per the above.
> - _then_ do the whole blk_schedule_flush_plug(tsk) thing.
>
> adding some scheduler people to the explicit cc list.
>
> That said, the "cond_resched[_lock]()" functions currently always set
> PREEMPT_ACTIVE (indirectly - they use preempt_schedule_common()), so
> even though those are synchronous, right now they *look* asynchronous
> to the scheduler, so we'd still have to sort that out.
See PREEMPT_ACTIVE being a recursion flag, we set it there so we won't
preempt while we're already scheduling.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-18 15:59 ` Peter Zijlstra
@ 2015-09-18 16:02 ` Peter Zijlstra
2015-09-18 16:12 ` Linus Torvalds
1 sibling, 0 replies; 54+ messages in thread
From: Peter Zijlstra @ 2015-09-18 16:02 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, Chris Mason,
Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
On Fri, Sep 18, 2015 at 05:59:56PM +0200, Peter Zijlstra wrote:
> > - if it's not an actual PREEMPT_ACTIVE (ie in a random place)
>
> PREEMPT_ACTIVE is actually a recursion flag, 'random place' does not
> factor into it. That is, most 'random places' will not have
> PREEMPT_ACTIVE set.
Ah, _not_ PREEMPT_ACTIVE; it's late and I can't read any more.
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-18 15:59 ` Peter Zijlstra
2015-09-18 16:02 ` Peter Zijlstra
@ 2015-09-18 16:12 ` Linus Torvalds
2015-09-28 14:47 ` Peter Zijlstra
1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-18 16:12 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, Chris Mason,
Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
On Fri, Sep 18, 2015 at 8:59 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> See PREEMPT_ACTIVE being a recursion flag, we set it there so we won't
> preempt while we're already scheduling.
PREEMPT_ACTIVE does more than that. It really is a sign that "this is
not synchronous". It causes the scheduler to ignore the current task
flags (because it might already be TASK_SLEEPING, but we aren't
_actually_ ready to sleep yet) etc.
So no. It's not "you can't be preempted during scheduling". That's the
*normal* preempt count, and all scheduling calls end up setting that
some way (ie "schedule()" just does preempt_disable()).
So I disagree with your notion that it's a recursion flag. It is
absolutely nothing of the sort. It gets set by preemption - and,
somewhat illogically, by cond_resched().
The fact that cond_resched() sets it is *probably* because some of the
callers end up calling it from page fault paths etc, and the same
"ignore TASK_SLEEPING etc" rules apply. But it does mean that
"cond_resched()" is a bit misleading as a name. It's really a
"cond_preempt()".
Linus
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-18 6:04 ` Linus Torvalds
2015-09-18 6:06 ` Linus Torvalds
2015-09-18 13:16 ` Chris Mason
@ 2015-09-18 22:17 ` Dave Chinner
2015-09-21 9:24 ` Jan Kara
2 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2015-09-18 22:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jens Axboe, Chris Mason, Jan Kara, Josef Bacik, LKML,
linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo
On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Ok, makes sense - the plug is not being flushed as we switch away,
> > but Chris' patch makes it do that.
>
> Yup.
>
> And I actually think Chris' patch is better than the one I sent out
> (but maybe the scheduler people should take a look at the behavior of
> cond_resched()), I just wanted you to test that to verify the
> behavior.
>
> The fact that Chris' patch ends up lowering the context switches
> (because it does the unplugging directly) is also an argument for his
> approach.
>
> I just wanted to understand the oddity with kblockd_workqueue. And I
> think that's solved.
>
> > Context switches go back to the 4-4500/sec range. Otherwise
> > behaviour and performance is indistinguishable from Chris' patch.
>
> .. this was exactly what I wanted to hear. So it sounds like we have
> no odd unexplained behavior left in this area.
>
> Which is not to say that there wouldn't be room for improvement, but
> it just makes me much happier about the state of these patches to feel
> like we understand what was going on.
>
> > PS: just hit another "did this just get broken in 4.3-rc1" issue - I
> > can't run blktrace while there's an IO load because:
> >
> > $ sudo blktrace -d /dev/vdc
> > BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
> > Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
> > ....
> >
> > [ 641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
> > [ 641.438933] [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
> > [ 641.440240] [<ffffffff811424f8>] relay_open+0x68/0x2c0
> > [ 641.441299] [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
> >
> > (gdb) l *(relay_open+0x68)
> > 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
> > 577 return NULL;
> > 578 if (subbuf_size > UINT_MAX / n_subbufs)
> > 579 return NULL;
> > 580
> > 581 chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
> > 582 if (!chan)
> > 583 return NULL;
> > 584
> > 585 chan->version = RELAYFS_CHANNEL_VERSION;
> > 586 chan->n_subbufs = n_subbufs;
> >
> > and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
> > and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
> > that fails here....
>
> Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)?
> From a quick check, none of this code seems to be new.
Yes, I always build MAX_SMP kernels for testing, because XFS is
often used on such machines and so I want to find issues exactly
like this in my testing rather than on customer machines... :/
> That said, having that
>
> struct rchan_buf *buf[NR_CPUS];
>
> in "struct rchan" really is something we should fix. We really should
> strive to not allocate things by CONFIG_NR_CPU's, but by the actual
> real CPU count.
*nod*. But it doesn't fix the problem of the memory allocation
failing when there's still gigabytes of immediately reclaimable
memory available in the page cache. If this is failing under page
cache memory pressure, then we're going to be doing an awful lot
more falling back to vmalloc in the filesystem code, where large
allocations like this are done; e.g. extended attribute buffers are
order-5, and they are used a lot when doing things like backups, which
tend to also produce significant page cache memory pressure.
Hence I'm tending towards there being a memory reclaim behaviour
regression, not so much worrying about whether this specific
allocation is optimal or not.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-18 22:17 ` Dave Chinner
@ 2015-09-21 9:24 ` Jan Kara
2015-09-21 20:21 ` Andrew Morton
0 siblings, 1 reply; 54+ messages in thread
From: Jan Kara @ 2015-09-21 9:24 UTC (permalink / raw)
To: Dave Chinner
Cc: Linus Torvalds, Jens Axboe, Chris Mason, Jan Kara, Josef Bacik,
LKML, linux-fsdevel, Neil Brown, Christoph Hellwig, Tejun Heo,
linux-mm, Andrew Morton
On Sat 19-09-15 08:17:14, Dave Chinner wrote:
> On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote:
> > On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > PS: just hit another "did this just get broken in 4.3-rc1" issue - I
> > > can't run blktrace while there's an IO load because:
> > >
> > > $ sudo blktrace -d /dev/vdc
> > > BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
> > > Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
> > > ....
> > >
> > > [ 641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
> > > [ 641.438933] [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
> > > [ 641.440240] [<ffffffff811424f8>] relay_open+0x68/0x2c0
> > > [ 641.441299] [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
> > >
> > > (gdb) l *(relay_open+0x68)
> > > 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
> > > 577 return NULL;
> > > 578 if (subbuf_size > UINT_MAX / n_subbufs)
> > > 579 return NULL;
> > > 580
> > > 581 chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
> > > 582 if (!chan)
> > > 583 return NULL;
> > > 584
> > > 585 chan->version = RELAYFS_CHANNEL_VERSION;
> > > 586 chan->n_subbufs = n_subbufs;
> > >
> > > and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
> > > and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
> > > that fails here....
> >
> > Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)?
> > From a quick check, none of this code seems to be new.
>
> Yes, I always build MAX_SMP kernels for testing, because XFS is
> often used on such machines and so I want to find issues exactly
> like this in my testing rather than on customer machines... :/
>
> > That said, having that
> >
> > struct rchan_buf *buf[NR_CPUS];
> >
> > in "struct rchan" really is something we should fix. We really should
> > strive to not allocate things by CONFIG_NR_CPU's, but by the actual
> > real CPU count.
>
> *nod*. But it doesn't fix the problem of the memory allocation
> failing when there's still gigabytes of immediately reclaimable
> memory available in the page cache. If this is failing under page
> cache memory pressure, then we're going to be doing an awful lot
> more falling back to vmalloc in the filesystem code where large
> allocations like this are done e.g. extended attribute buffers are
> order-5, and used a lot when doing things like backups which tend to
> also produce significant page cache memory pressure.
>
> Hence I'm tending towards there being a memory reclaim behaviour
> regression, not so much worrying about whether this specific
> allocation is optimal or not.
Yup, looks like a regression in reclaim. Added linux-mm folks to CC.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-21 9:24 ` Jan Kara
@ 2015-09-21 20:21 ` Andrew Morton
0 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2015-09-21 20:21 UTC (permalink / raw)
To: Jan Kara
Cc: Dave Chinner, Linus Torvalds, Jens Axboe, Chris Mason,
Josef Bacik, LKML, linux-fsdevel, Neil Brown, Christoph Hellwig,
Tejun Heo, linux-mm
On Mon, 21 Sep 2015 11:24:29 +0200 Jan Kara <jack@suse.cz> wrote:
> On Sat 19-09-15 08:17:14, Dave Chinner wrote:
> > On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote:
> > > On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > > PS: just hit another "did this just get broken in 4.3-rc1" issue - I
> > > > can't run blktrace while there's an IO load because:
> > > >
> > > > $ sudo blktrace -d /dev/vdc
> > > > BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
> > > > Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
> > > > ....
> > > >
> > > > [ 641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
> > > > [ 641.438933] [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
> > > > [ 641.440240] [<ffffffff811424f8>] relay_open+0x68/0x2c0
> > > > [ 641.441299] [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
> > > >
> > > > (gdb) l *(relay_open+0x68)
> > > > 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
> > > > 577 return NULL;
> > > > 578 if (subbuf_size > UINT_MAX / n_subbufs)
> > > > 579 return NULL;
> > > > 580
> > > > 581 chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
> > > > 582 if (!chan)
> > > > 583 return NULL;
> > > > 584
> > > > 585 chan->version = RELAYFS_CHANNEL_VERSION;
> > > > 586 chan->n_subbufs = n_subbufs;
> > > >
> > > > and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
> > > > and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
> > > > that fails here....
> > >
> > > Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)?
> > > From a quick check, none of this code seems to be new.
> >
> > Yes, I always build MAX_SMP kernels for testing, because XFS is
> > often used on such machines and so I want to find issues exactly
> > like this in my testing rather than on customer machines... :/
> >
> > > That said, having that
> > >
> > > struct rchan_buf *buf[NR_CPUS];
> > >
> > > in "struct rchan" really is something we should fix. We really should
> > > strive to not allocate things by CONFIG_NR_CPU's, but by the actual
> > > real CPU count.
> >
> > *nod*. But it doesn't fix the problem of the memory allocation
> > failing when there's still gigabytes of immediately reclaimable
> > memory available in the page cache. If this is failing under page
> > cache memory pressure, then we're going to be doing an awful lot
> > more falling back to vmalloc in the filesystem code where large
> > allocations like this are done e.g. extended attribute buffers are
> > order-5, and used a lot when doing things like backups which tend to
> > also produce significant page cache memory pressure.
> >
> > Hence I'm tending towards there being a memory reclaim behaviour
> > regression, not so much worrying about whether this specific
> > allocation is optimal or not.
>
> Yup, looks like a regression in reclaim. Added linux-mm folks to CC.
That's going to be hard to find. Possibly Vlastimil's 5-patch series
"mm, compaction: more robust check for scanners meeting", possibly
Joonsoo's "mm/compaction: correct to flush migrated pages if pageblock
skip happens". But probably something else :(
Teach relay.c about alloc_percpu()?
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-18 16:12 ` Linus Torvalds
@ 2015-09-28 14:47 ` Peter Zijlstra
2015-09-28 16:08 ` Linus Torvalds
0 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2015-09-28 14:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, Chris Mason,
Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
On Fri, Sep 18, 2015 at 09:12:38AM -0700, Linus Torvalds wrote:
>
> So I disagree with your notion that it's a recursion flag. It is
> absolutely nothing of the sort.
OK, agreed. I had it classed under recursion in my head, clearly I
indexed it sloppily. In any case I have a patch that kills off
PREEMPT_ACTIVE entirely.
I just have to clean it up, benchmark, split and write changelogs. But
it should be forthcoming 'soon'. As is, it boots..
> It gets set by preemption - and,
> somewhat illogically, by cond_resched().
I suspect that was done to make cond_resched() (voluntary preemption)
more robust and only have a single preemption path/logic. But all that
was done well before I got involved.
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-28 14:47 ` Peter Zijlstra
@ 2015-09-28 16:08 ` Linus Torvalds
2015-09-29 7:55 ` Ingo Molnar
0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2015-09-28 16:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, Chris Mason,
Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
On Mon, Sep 28, 2015 at 10:47 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> It gets set by preemption - and,
>> somewhat illogically, by cond_resched().
>
> I suspect that was done to make cond_resched() (voluntary preemption)
> more robust and only have a single preemption path/logic. But all that
> was done well before I got involved.
So I think it's actually the name that is bad, not necessarily the behavior.
We tend to put "cond_resched()" (and particularly
"cond_resched_lock()") in some fairly awkward places, and it's not
always entirely clear that task->state == TASK_RUNNING there.
So the preemptive behavior of not *really* putting the task to sleep
may actually be the right one. But it is rather non-intuitive given
the name - because "cond_resched()" basically is not at all equivalent
to "if (need_resched()) schedule()", which you'd kind of expect.
An explicit schedule will actually act on the task->state, and make us
go to sleep. "cond_resched()" really is just a "voluntary preemption
point". And I think it would be better if it got named that way.
Linus
* Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
2015-09-28 16:08 ` Linus Torvalds
@ 2015-09-29 7:55 ` Ingo Molnar
0 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2015-09-29 7:55 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, Jens Axboe, Frederic Weisbecker, Chris Mason,
Dave Chinner, Jan Kara, Josef Bacik, LKML, linux-fsdevel,
Neil Brown, Christoph Hellwig, Tejun Heo
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Mon, Sep 28, 2015 at 10:47 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >> It gets set by preemption - and,
> >> somewhat illogically, by cond_resched().
> >
> > I suspect that was done to make cond_resched() (voluntary preemption)
> > more robust and only have a single preemption path/logic. But all that
> > was done well before I got involved.
>
> So I think it's actually the name that is bad, not necessarily the behavior.
>
> We tend to put "cond_resched()" (and particularly
> "cond_resched_lock()") in some fairly awkward places, and it's not
> always entirely clear that task->state == TASK_RUNNING there.
>
> So the preemptive behavior of not *really* putting the task to sleep
> may actually be the right one. But it is rather non-intuitive given
> the name - because "cond_resched()" basically is not at all equivalent
> to "if (need_resched()) schedule()", which you'd kind of expect.
>
> An explicit schedule will actually act on the task->state, and make us
> go to sleep. "cond_resched()" really is just a "voluntary preemption
> point". And I think it would be better if it got named that way.
cond_preempt() perhaps? That would allude to preempt_schedule() and such, and
would make it clearer that it's supposed to be an invariant on the sleep state
(which schedule() is not).
Thanks,
Ingo
end of thread, other threads:[~2015-09-29 7:55 UTC | newest]
Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-11 19:37 [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug() Chris Mason
2015-09-11 20:02 ` Linus Torvalds
2015-09-11 20:37 ` Linus Torvalds
2015-09-11 20:40 ` Josef Bacik
2015-09-11 21:04 ` Linus Torvalds
2015-09-11 22:06 ` Linus Torvalds
2015-09-11 23:16 ` Chris Mason
2015-09-11 23:36 ` Linus Torvalds
2015-09-12 0:52 ` Linus Torvalds
2015-09-12 2:15 ` Chris Mason
2015-09-12 2:27 ` Linus Torvalds
2015-09-12 23:00 ` Chris Mason
2015-09-12 23:29 ` Linus Torvalds
2015-09-12 23:46 ` Chris Mason
2015-09-13 13:12 ` Chris Mason
2015-09-13 22:56 ` Dave Chinner
2015-09-13 23:12 ` Dave Chinner
2015-09-14 20:06 ` Linus Torvalds
2015-09-16 15:16 ` Chris Mason
2015-09-16 19:58 ` Jan Kara
2015-09-16 20:00 ` Chris Mason
2015-09-16 22:07 ` Dave Chinner
2015-09-17 0:37 ` Dave Chinner
2015-09-17 1:12 ` Linus Torvalds
2015-09-17 2:14 ` Dave Chinner
2015-09-17 19:39 ` Linus Torvalds
2015-09-17 22:42 ` Chris Mason
2015-09-17 23:08 ` Linus Torvalds
2015-09-17 23:56 ` Chris Mason
2015-09-18 0:37 ` Dave Chinner
2015-09-18 1:50 ` Linus Torvalds
2015-09-18 5:40 ` Dave Chinner
2015-09-18 6:04 ` Linus Torvalds
2015-09-18 6:06 ` Linus Torvalds
2015-09-18 14:21 ` Jens Axboe
2015-09-18 13:16 ` Chris Mason
2015-09-18 14:23 ` Jens Axboe
2015-09-18 15:32 ` Linus Torvalds
2015-09-18 15:59 ` Peter Zijlstra
2015-09-18 16:02 ` Peter Zijlstra
2015-09-18 16:12 ` Linus Torvalds
2015-09-28 14:47 ` Peter Zijlstra
2015-09-28 16:08 ` Linus Torvalds
2015-09-29 7:55 ` Ingo Molnar
2015-09-18 22:17 ` Dave Chinner
2015-09-21 9:24 ` Jan Kara
2015-09-21 20:21 ` Andrew Morton
2015-09-17 23:03 ` Dave Chinner
2015-09-17 23:13 ` Linus Torvalds
2015-09-17 3:48 ` Chris Mason
2015-09-17 4:30 ` Dave Chinner
2015-09-17 12:13 ` Chris Mason
2015-09-11 23:06 ` Chris Mason
2015-09-11 23:13 ` Linus Torvalds