Linux-XFS Archive on lore.kernel.org
 help / color / Atom feed
From: Ming Lei <ming.lei@redhat.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Hillf Danton <hdanton@sina.com>,
	Dave Chinner <david@fromorbit.com>,
	linux-block <linux-block@vger.kernel.org>,
	linux-fs <linux-fsdevel@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Christoph Hellwig <hch@lst.de>, Jens Axboe <axboe@kernel.dk>,
	Peter Zijlstra <peterz@infradead.org>,
	Rong Chen <rong.a.chen@intel.com>, Tejun Heo <tj@kernel.org>
Subject: Re: single aio thread is migrated crazily by scheduler
Date: Mon, 2 Dec 2019 10:46:25 +0800
Message-ID: <20191202024625.GD24512@ming.t460p> (raw)
In-Reply-To: <CAKfTPtA23ErKGCEJVmg6vk-QoufkiUM3NbXd31mZmKnuwbTkFw@mail.gmail.com>

On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote:
> On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote:
> >
> >
> > On Sat, 16 Nov 2019 10:40:05 Dave Chinner wrote:
> > > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote:
> > > > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote:
> > > > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote:
> > > > I can reproduce the issue with 4k block size on another RH system, and
> > > > the login info of that system has been shared to you in RH BZ.
> > > >
> > > > 1)
> > > > sysctl kernel.sched_min_granularity_ns=10000000
> > > > sysctl kernel.sched_wakeup_granularity_ns=15000000
> > >
> > > So, these settings definitely influence behaviour.
> > >
> > > If these are set to kernel defaults (4ms and 3ms each):
> > >
> > > sysctl kernel.sched_min_granularity_ns=4000000
> > > sysctl kernel.sched_wakeup_granularity_ns=3000000
> > >
> > > The migration problem largely goes away - the fio task migration
> > > event count goes from ~2,000 a run down to 200/run.
> > >
> > > That indicates that the migration trigger is likely load/timing
> > > based. The analysis below is based on the 10/15ms numbers above,
> > > because it makes it so much easier to reproduce.
> > >
> > > > 2)
> > > > ./xfs_complete 4k
> > > >
> > > > Then you should see 1k~1.5k fio io thread migration in above test,
> > > > either v5.4-rc7(build with rhel8 config) or RHEL 4.18 kernel.
> > >
> > > Almost all the fio task migrations are coming from migration/X
> > > kernel threads. i.e it's the scheduler active balancing that is
> > > causing the fio thread to bounce around.
> > >
> > > This is typical a typical trace, trimmed to remove extraneous noise.
> > > The fio process is running on CPU 10:
> > >
> > >              fio-3185  [010] 50419.285954: sched_stat_runtime:   comm=fio pid=3185 runtime=1004014 [ns] vruntime=27067882290 [ns]
> > >              fio-3185  [010] 50419.286953: sched_stat_runtime:   comm=fio pid=3185 runtime=979458 [ns] vruntime=27068861748 [ns]
> > >              fio-3185  [010] 50419.287998: sched_stat_runtime:   comm=fio pid=3185 runtime=1028471 [ns] vruntime=27069890219 [ns]
> > >              fio-3185  [010] 50419.289973: sched_stat_runtime:   comm=fio pid=3185 runtime=989989 [ns] vruntime=27071836208 [ns]
> > >              fio-3185  [010] 50419.290958: sched_stat_runtime:   comm=fio pid=3185 runtime=963914 [ns] vruntime=27072800122 [ns]
> > >              fio-3185  [010] 50419.291952: sched_stat_runtime:   comm=fio pid=3185 runtime=972532 [ns] vruntime=27073772654 [ns]
> > >
> > > fio consumes CPU for several milliseconds, then:
> > >
> > >              fio-3185  [010] 50419.292935: sched_stat_runtime:   comm=fio pid=3185 runtime=966032 [ns] vruntime=27074738686 [ns]
> > >              fio-3185  [010] 50419.292941: sched_switch:         fio:3185 [120] S ==> kworker/10:0:2763 [120]
> > >     kworker/10:0-2763  [010] 50419.292954: sched_stat_runtime:   comm=kworker/10:0 pid=2763 runtime=13423 [ns] vruntime=27052479694 [ns]
> > >     kworker/10:0-2763  [010] 50419.292956: sched_switch:         kworker/10:0:2763 [120] R ==> fio:3185 [120]
> > >              fio-3185  [010] 50419.293115: sched_waking:         comm=kworker/10:0 pid=2763 prio=120 target_cpu=010
> > >              fio-3185  [010] 50419.293116: sched_stat_runtime:   comm=fio pid=3185 runtime=160370 [ns] vruntime=27074899056 [ns]
> > >              fio-3185  [010] 50419.293118: sched_wakeup:         kworker/10:0:2763 [120] success=1 CPU:010
> > >
> > > A context switch out to a kworker, then 13us later we immediately
> > > switch back to the fio process, and go on running. No doubt
> > > somewhere in what the fio process is doing, we queue up more work to
> > > be run on the cpu, but the fio task keeps running
> > > (due to CONFIG_PREEMPT=n).
> > >
> > >              fio-3185  [010] 50419.293934: sched_stat_runtime:   comm=fio pid=3185 runtime=803135 [ns] vruntime=27075702191 [ns]
> > >              fio-3185  [010] 50419.294936: sched_stat_runtime:   comm=fio pid=3185 runtime=988478 [ns] vruntime=27076690669 [ns]
> > >              fio-3185  [010] 50419.295934: sched_stat_runtime:   comm=fio pid=3185 runtime=982219 [ns] vruntime=27077672888 [ns]
> > >              fio-3185  [010] 50419.296935: sched_stat_runtime:   comm=fio pid=3185 runtime=984781 [ns] vruntime=27078657669 [ns]
> > >              fio-3185  [010] 50419.297934: sched_stat_runtime:   comm=fio pid=3185 runtime=981703 [ns] vruntime=27079639372 [ns]
> > >              fio-3185  [010] 50419.298937: sched_stat_runtime:   comm=fio pid=3185 runtime=990057 [ns] vruntime=27080629429 [ns]
> > >              fio-3185  [010] 50419.299935: sched_stat_runtime:   comm=fio pid=3185 runtime=977554 [ns] vruntime=27081606983 [ns]
> > >
> > > About 6ms later, CPU 0 kicks the active load balancer on CPU 10...
> > >
> > >           <idle>-0     [000] 50419.300014: sched_waking:         comm=migration/10 pid=70 prio=0 target_cpu=010
> > >              fio-3185  [010] 50419.300024: sched_wakeup:         migration/10:70 [0] success=1 CPU:010
> > >              fio-3185  [010] 50419.300026: sched_stat_runtime:   comm=fio pid=3185 runtime=79291 [ns] vruntime=27081686274 [ns]
> > >              fio-3185  [010] 50419.300027: sched_switch:         fio:3185 [120] S ==> migration/10:70 [0]
> > >     migration/10-70    [010] 50419.300032: sched_migrate_task:   comm=fio pid=3185 prio=120 orig_cpu=10 dest_cpu=12
> > >     migration/10-70    [010] 50419.300040: sched_switch:         migration/10:70 [0] D ==> kworker/10:0:2763 [120]
> > >
> > > And 10us later the fio process is switched away, the active load
> > > balancer work is run and migrates the fio process to CPU 12. Then...
> > >
> > >     kworker/10:0-2763  [010] 50419.300048: sched_stat_runtime:   comm=kworker/10:0 pid=2763 runtime=9252 [ns] vruntime=27062908308 [ns]
> > >     kworker/10:0-2763  [010] 50419.300062: sched_switch:         kworker/10:0:2763 [120] R ==> swapper/10:0 [120]
> > >           <idle>-0     [010] 50419.300067: sched_waking:         comm=kworker/10:0 pid=2763 prio=120 target_cpu=010
> > >           <idle>-0     [010] 50419.300069: sched_wakeup:         kworker/10:0:2763 [120] success=1 CPU:010
> > >           <idle>-0     [010] 50419.300071: sched_switch:         swapper/10:0 [120] S ==> kworker/10:0:2763 [120]
> > >     kworker/10:0-2763  [010] 50419.300073: sched_switch:         kworker/10:0:2763 [120] R ==> swapper/10:0 [120]
> > >
> > > The kworker runs for another 10us and the CPU goes idle. Shortly
> > > after this, CPU 12 is woken:
> > >
> > >           <idle>-0     [012] 50419.300113: sched_switch:         swapper/12:0 [120] S ==> fio:3185 [120]
> > >              fio-3185  [012] 50419.300596: sched_waking:         comm=kworker/12:1 pid=227 prio=120 target_cpu=012
> > >              fio-3185  [012] 50419.300598: sched_stat_runtime:   comm=fio pid=3185 runtime=561137 [ns] vruntime=20361153275 [ns]
> > >              fio-3185  [012] 50419.300936: sched_stat_runtime:   comm=fio pid=3185 runtime=326187 [ns] vruntime=20361479462 [ns]
> > >              fio-3185  [012] 50419.301935: sched_stat_runtime:   comm=fio pid=3185 runtime=981201 [ns] vruntime=20362460663 [ns]
> > >              fio-3185  [012] 50419.302935: sched_stat_runtime:   comm=fio pid=3185 runtime=983160 [ns] vruntime=20363443823 [ns]
> > >              fio-3185  [012] 50419.303934: sched_stat_runtime:   comm=fio pid=3185 runtime=983855 [ns] vruntime=20364427678 [ns]
> > >              fio-3185  [012] 50419.304934: sched_stat_runtime:   comm=fio pid=3185 runtime=977757 [ns] vruntime=20365405435 [ns]
> > >              fio-3185  [012] 50419.305948: sched_stat_runtime:   comm=fio pid=3185 runtime=999563 [ns] vruntime=20366404998 [ns]
> > >
> > >
> > > and fio goes on running there. The pattern repeats very soon afterwards:
> > >
> > >           <idle>-0     [000] 50419.314982: sched_waking:         comm=migration/12 pid=82 prio=0 target_cpu=012
> > >              fio-3185  [012] 50419.314988: sched_wakeup:         migration/12:82 [0] success=1 CPU:012
> > >              fio-3185  [012] 50419.314990: sched_stat_runtime:   comm=fio pid=3185 runtime=46342 [ns] vruntime=20375268656 [ns]
> > >              fio-3185  [012] 50419.314991: sched_switch:         fio:3185 [120] S ==> migration/12:82 [0]
> > >     migration/12-82    [012] 50419.314995: sched_migrate_task:   comm=fio pid=3185 prio=120 orig_cpu=12 dest_cpu=5
> > >     migration/12-82    [012] 50419.315001: sched_switch:         migration/12:82 [0] D ==> kworker/12:1:227 [120]
> > >     kworker/12:1-227   [012] 50419.315022: sched_stat_runtime:   comm=kworker/12:1 pid=227 runtime=21453 [ns] vruntime=20359477889 [ns]
> > >     kworker/12:1-227   [012] 50419.315028: sched_switch:         kworker/12:1:227 [120] R ==> swapper/12:0 [120]
> > >           <idle>-0     [005] 50419.315053: sched_switch:         swapper/5:0 [120] S ==> fio:3185 [120]
> > >              fio-3185  [005] 50419.315286: sched_waking:         comm=kworker/5:0 pid=2646 prio=120 target_cpu=005
> > >              fio-3185  [005] 50419.315288: sched_stat_runtime:   comm=fio pid=3185 runtime=287737 [ns] vruntime=33779011507 [ns]
> > >
> > > And fio is now running on CPU 5 - it only ran on CPU 12 for about
> > > 15ms. Hmmm:
> > >
> > > $ grep fio-3185 ~/tmp/sched.out | awk 'BEGIN {totcpu = 0.0; switches = 0.0; prev_waket = 0.0 }/sched_waking/ { cpu = $2; split($3, t, ":"); waket = t[1]; if (cpu != prev_cpu) { t_on_cpu = waket - prev_waket; if (prev_waket) { print "time on CPU", cpu, "was", t_on_cpu; totcpu += t_on_cpu; switches++ } prev_waket = waket; prev_cpu = cpu; } } END { print "switches", switches, "time on cpu", totcpu, "aver time on cpu", (totcpu / switches) } ' | stats --trim-outliers
> > > switches 2211 time on cpu 30.0994 aver time on cpu 0.0136135
> > > time on CPU [0-23(8.8823+/-6.2)] was 0.000331-0.330772(0.0134759+/-0.012)
> > >
> > > Yeah, the fio task averages 13.4ms on any given CPU before being
> > > switched to another CPU. Mind you, the stddev is 12ms, so the range
> > > of how long it spends on any one CPU is pretty wide (330us to
> > > 330ms).
> > >
> > Hey Dave
> >
> > > IOWs, this doesn't look like a workqueue problem at all - this looks
> >
> > Surprised to see you're so sure it has little to do with wq,
> >
> > > like the scheduler is repeatedly making the wrong load balancing
> > > decisions when mixing a very short runtime task (queued work) with a
> > > long runtime task on the same CPU....
> > >
> > and it helps more to know what is driving lb to make decisions like
> > this. Because for 70+ per cent of communters in cities like London it
> > is supposed tube is better than cab on work days, the end_io cb is
> > tweaked to be a lookalike of execute_in_process_context() in the diff
> > with the devoted s_dio_done_wq taken out of account. It's interesting
> > to see what difference lb will make in the tube environment.
> >
> > Hillf
> >
> > > This is not my area of expertise, so I have no idea why this might
> > > be happening. Scheduler experts: is this expected behaviour? What
> > > tunables directly influence the active load balancer (and/or CONFIG
> > > options) to change how aggressive it's behaviour is?
> > >
> > > > Not reproduced the issue with 512 block size on the RH system yet,
> > > > maybe it is related with my kernel config.
> > >
> > > I doubt it - this looks like a load specific corner case in the
> > > scheduling algorithm....
> > >
> > > Cheers,
> > >
> > > Dave.
> > > --
> > > Dave Chinner
> > > david@fromorbit.com
> >
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct
> >                         WRITE_ONCE(dio->submit.waiter, NULL);
> >                         blk_wake_io_task(waiter);
> >                 } else if (dio->flags & IOMAP_DIO_WRITE) {
> > -                       struct inode *inode = file_inode(dio->iocb->ki_filp);
> > -
> >                         INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> > -                       queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> > +                       schedule_work(&dio->aio.work);
> 
> I'm not sure that this will make a real difference because it ends up
> to call queue_work(system_wq, ...) and system_wq is bounded as well so
> the work will still be pinned to a CPU
> Using system_unbound_wq should make a difference because it doesn't
> pin the work on a CPU
>  +                       queue_work(system_unbound_wq, &dio->aio.work);

Indeed, just run a quick test on my KVM guest, looks the following patch
makes a difference:

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 9329ced91f1d..2f4488b0ecec 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb)
 {
        struct workqueue_struct *old;
        struct workqueue_struct *wq = alloc_workqueue("dio/%s",
-                                                     WQ_MEM_RECLAIM, 0,
+                                                     WQ_MEM_RECLAIM |
+                                                     WQ_UNBOUND, 0,
                                                      sb->s_id);


Thanks,
Ming


  reply index

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-14 11:31 Ming Lei
2019-11-14 13:14 ` Peter Zijlstra
2019-11-15  0:09   ` Ming Lei
2019-11-15 14:16     ` Ming Lei
2019-11-14 23:54 ` Dave Chinner
2019-11-15  1:08   ` Ming Lei
2019-11-15  4:56     ` Dave Chinner
2019-11-15  7:08       ` Ming Lei
2019-11-15 23:40         ` Dave Chinner
2019-11-16  6:31           ` Ming Lei
2019-11-18  9:21           ` Peter Zijlstra
2019-11-18 14:54             ` Vincent Guittot
2019-11-18 20:40             ` Dave Chinner
2019-11-20 19:16               ` Peter Zijlstra
2019-11-20 22:03                 ` Phil Auld
2019-11-21  4:12                   ` Ming Lei
2019-11-21 14:12                     ` Phil Auld
2019-11-21 15:02                       ` Boaz Harrosh
2019-11-21 16:19                         ` Jens Axboe
2019-12-09 16:58                           ` Srikar Dronamraju
2019-11-21 22:10                       ` Dave Chinner
2019-11-21 13:29                   ` Peter Zijlstra
2019-11-21 14:21                     ` Phil Auld
2019-12-09 16:51                     ` Srikar Dronamraju
2019-12-09 23:17                       ` Dave Chinner
2019-12-10  3:27                         ` Srikar Dronamraju
2019-12-10  5:43                         ` [PATCH v2] sched/core: Preempt current task in favour of bound kthread Srikar Dronamraju
2019-12-10  9:26                           ` Peter Zijlstra
2019-12-10  9:33                             ` Peter Zijlstra
2019-12-10 10:18                               ` Srikar Dronamraju
2019-12-10 10:16                             ` Srikar Dronamraju
2019-12-10  9:43                           ` Vincent Guittot
2019-12-10 10:11                             ` Srikar Dronamraju
2019-12-10 11:02                               ` Vincent Guittot
2019-12-10 17:23                           ` [PATCH v3] " Srikar Dronamraju
2019-12-11 17:38                             ` [PATCH v4] " Srikar Dronamraju
2019-12-11 22:46                               ` Dave Chinner
2019-12-12 10:10                                 ` Peter Zijlstra
2019-12-12 10:14                                   ` Peter Zijlstra
2019-12-12 10:23                                     ` Peter Zijlstra
2019-12-12 11:20                                       ` Vincent Guittot
2019-12-12 13:12                                         ` Peter Zijlstra
2019-12-12 15:07                                   ` Srikar Dronamraju
2019-12-12 15:15                                     ` Peter Zijlstra
2019-12-13  5:32                                   ` Srikar Dronamraju
2019-11-18 16:26           ` single aio thread is migrated crazily by scheduler Srikar Dronamraju
2019-11-18 21:18             ` Dave Chinner
2019-11-19  8:54               ` Ming Lei
     [not found]         ` <20191128094003.752-1-hdanton@sina.com>
2019-11-28  9:53           ` Vincent Guittot
2019-12-02  2:46             ` Ming Lei [this message]
2019-12-02  4:02               ` Dave Chinner
2019-12-02  4:22                 ` Ming Lei
2019-12-02 13:45                 ` Vincent Guittot
2019-12-02 21:22                   ` Phil Auld
2019-12-03  9:45                     ` Vincent Guittot
2019-12-04 13:50                       ` Ming Lei
2019-12-02 23:53                   ` Dave Chinner
2019-12-03  0:18                     ` Ming Lei
2019-12-03 13:34                     ` Vincent Guittot
2019-12-02  7:39               ` Juri Lelli
2019-12-02  3:08           ` Dave Chinner
     [not found]           ` <20191202090158.15016-1-hdanton@sina.com>
2019-12-02 23:06             ` Dave Chinner
     [not found]             ` <20191203131514.5176-1-hdanton@sina.com>
2019-12-03 22:29               ` Dave Chinner
     [not found]               ` <20191204102903.896-1-hdanton@sina.com>
2019-12-04 22:59                 ` Dave Chinner

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191202024625.GD24512@ming.t460p \
    --to=ming.lei@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=david@fromorbit.com \
    --cc=hch@lst.de \
    --cc=hdanton@sina.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=rong.a.chen@intel.com \
    --cc=tj@kernel.org \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-XFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-xfs/0 linux-xfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-xfs linux-xfs/ https://lore.kernel.org/linux-xfs \
		linux-xfs@vger.kernel.org
	public-inbox-index linux-xfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-xfs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git