From: Dave Chinner <david@fromorbit.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Chris Mason <clm@fb.com>, Jan Kara <jack@suse.cz>,
	Josef Bacik <jbacik@fb.com>, LKML <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Neil Brown <neilb@suse.de>, Christoph Hellwig <hch@lst.de>,
	Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
Date: Fri, 18 Sep 2015 15:40:44 +1000	[thread overview]
Message-ID: <20150918054044.GT3902@dastard> (raw)
In-Reply-To: <CA+55aFzXW7t+1v3tmW2sxn-BLpvZ1_Ye6epiPWBeq70FoaSmFQ@mail.gmail.com>

On Thu, Sep 17, 2015 at 06:50:29PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> >
> >> > I'm not seeing why that should be an issue. Sure, there's some CPU
> >> > overhead to context switching, but I don't see that it should be that
> >> > big of a deal.
> >
> > It may well change the dispatch order of enough IOs for it to be
> > significant on an IO bound device.
> 
> Hmm. Maybe. We obviously try to order the IO's a bit by inode, and I
> could see the use of a workqueue maybe changing that sufficiently. But
> it still sounds a bit unlikely.
> 
> And in fact, I think I have a better explanation.
> 
> > In outright performance on my test machine, the difference in
> > files/s is noise. However, the consistency looks to be substantially
> > improved and the context switch rate is now running at under
> > 3,000/sec.
> 
> Hmm. I don't recall seeing you mention how many context switches per
> second you had before. What is it down from?

Around 4000-4500/sec, so there isn't a huge number of context
switches going on. It's not like btrfs, where this workload
generates over 250,000 context switches/sec....

> However, I think I may have found something more interesting here.
> 
> The fact is that a *normal* schedule will trigger that whole
> blk_schedule_flush_plug(), but a cond_sched() or a cond_sched_lock()
> doesn't actually do a normal schedule at all. Those trigger a
> *preemption* schedule.

Ok, makes sense - the plug is not being flushed as we switch away,
but Chris' patch makes it do that.

> So it turns out that as far as I can see, the whole "cond_resched()"
> will not start any IO at all, and it will just be left on the thread
> plug until we schedule back to the thread.
> 
> So I don't think this has anything to do with kblockd_workqueue. I
> don't think it even gets to that point.
> 
> I may be missing something, but just to humor me, can you test the
> attached patch *without* Chris's patch to do explicit plugging? This
> should make cond_resched() and cond_resched_lock() run the unplugging.

Context switches go back to the 4000-4500/sec range. Otherwise
behaviour and performance are indistinguishable from Chris' patch.

Cheers,

Dave.


PS: just hit another "did this just get broken in 4.3-rc1" issue - I
can't run blktrace while there's an IO load because:

$ sudo blktrace -d /dev/vdc
BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
....

[  641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
[  641.425952] CPU: 13 PID: 11163 Comm: blktrace Not tainted 4.3.0-rc1-dgc+ #430
[  641.427321] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[  641.429066]  0000000000000005 ffff8800b731faa0 ffffffff81785609 00000000002040d0
[  641.430585]  ffff8800b731fb28 ffffffff81183688 0000000000000000 ffffffff824926f8
[  641.432143]  ffff8800b731faf0 0000000000000042 0000000000000010 0000000000000100
[  641.433695] Call Trace:
[  641.434196]  [<ffffffff81785609>] dump_stack+0x4b/0x72
[  641.435219]  [<ffffffff81183688>] warn_alloc_failed+0xd8/0x130
[  641.436402]  [<ffffffff811863bb>] __alloc_pages_nodemask+0x66b/0x7d0
[  641.437684]  [<ffffffff811c0e19>] cache_grow.constprop.61+0xc9/0x2d0
[  641.438933]  [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
[  641.440240]  [<ffffffff811424f8>] relay_open+0x68/0x2c0
[  641.441299]  [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
[  641.442492]  [<ffffffff8115e041>] blk_trace_setup+0x51/0x90
[  641.443586]  [<ffffffff8115e229>] blk_trace_ioctl+0xb9/0x110
[  641.444702]  [<ffffffff811e586c>] ? mntput_no_expire+0x2c/0x1a0
[  641.446419]  [<ffffffff8176fa58>] blkdev_ioctl+0x528/0x690
[  641.447489]  [<ffffffff811fd37d>] block_ioctl+0x3d/0x50
[  641.449194]  [<ffffffff811d93da>] do_vfs_ioctl+0x2ba/0x490
[  641.450491]  [<ffffffff811d9629>] SyS_ioctl+0x79/0x90
[  641.451575]  [<ffffffff81d800ae>] entry_SYSCALL_64_fastpath+0x12/0x71
[  641.453231] Mem-Info:
[  641.453714] active_anon:11651 inactive_anon:2411 isolated_anon:0
                active_file:8456 inactive_file:2070698 isolated_file:0
                unevictable:1011 dirty:374605 writeback:1176 unstable:0
                slab_reclaimable:1514989 slab_unreclaimable:233692
                mapped:9332 shmem:2250 pagetables:1891 bounce:0
                free:24305 free_pcp:1448 free_cma:0
[  641.461950] DMA free:15908kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  641.470017] lowmem_reserve[]: 0 2960 16005 16005
[  641.471041] DMA32 free:57076kB min:2992kB low:3740kB high:4488kB active_anon:8692kB inactive_anon:2572kB active_file:6224kB inactive_file:1507352kB unevictable:1004kB isolated(anon):0kB isolated(file):0kB present:3129212kB managed:3032864kB mlocked:1004kB dirty:275336kB writeback:896kB mapped:6800kB shmem:1836kB slab_reclaimable:1086292kB slab_unreclaimable:183168kB kernel_stack:23600kB pagetables:1372kB unstable:0kB bounce:0kB free_pcp:3440kB local_pcp:16kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  641.479677] lowmem_reserve[]: 0 0 13044 13044
[  641.480888] Normal free:18040kB min:13188kB low:16484kB high:19780kB active_anon:38264kB inactive_anon:7072kB active_file:27600kB inactive_file:6775516kB unevictable:3040kB isolated(anon):0kB isolated(file):0kB present:13631488kB managed:13357908kB mlocked:3040kB dirty:1223624kB writeback:5144kB mapped:30528kB shmem:7164kB slab_reclaimable:4975584kB slab_unreclaimable:755076kB kernel_stack:44912kB pagetables:6192kB unstable:0kB bounce:0kB free_pcp:3432kB local_pcp:88kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  641.492392] lowmem_reserve[]: 0 0 0 0
[  641.493170] DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[  641.496035] DMA32: 6584*4kB (UEM) 3783*8kB (UM) 17*16kB (M) 9*32kB (M) 3*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 57352kB
[  641.498705] Normal: 3372*4kB (UE) 393*8kB (UEM) 37*16kB (UEM) 33*32kB (UEM) 6*64kB (EM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18664kB
[  641.501479] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  641.503019] 2082056 total pagecache pages
[  641.503759] 0 pages in swap cache
[  641.504403] Swap cache stats: add 0, delete 0, find 0/0
[  641.505374] Free swap  = 497976kB
[  641.506002] Total swap = 497976kB
[  641.506622] 4194173 pages RAM
[  641.507171] 0 pages HighMem/MovableOnly
[  641.507878] 92503 pages reserved

This is from:

(gdb) l *(relay_open+0x68)
0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
577                     return NULL;
578             if (subbuf_size > UINT_MAX / n_subbufs)
579                     return NULL;
580
581             chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
582             if (!chan)
583                     return NULL;
584
585             chan->version = RELAYFS_CHANNEL_VERSION;
586             chan->n_subbufs = n_subbufs;

and struct rchan has a member struct rchan_buf *buf[NR_CPUS]; with
CONFIG_NR_CPUS=8192 that pointer array alone is 64KiB, hence the
attempt at an order-5 allocation that fails here....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
