Linux-XFS Archive on lore.kernel.org
 help / color / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>,
	Hillf Danton <hdanton@sina.com>,
	linux-block <linux-block@vger.kernel.org>,
	linux-fs <linux-fsdevel@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Christoph Hellwig <hch@lst.de>, Jens Axboe <axboe@kernel.dk>,
	Peter Zijlstra <peterz@infradead.org>,
	Rong Chen <rong.a.chen@intel.com>, Tejun Heo <tj@kernel.org>
Subject: Re: single aio thread is migrated crazily by scheduler
Date: Mon, 2 Dec 2019 15:02:56 +1100
Message-ID: <20191202040256.GE2695@dread.disaster.area> (raw)
In-Reply-To: <20191202024625.GD24512@ming.t460p>

On Mon, Dec 02, 2019 at 10:46:25AM +0800, Ming Lei wrote:
> On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote:
> > On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote:
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct
> > >                         WRITE_ONCE(dio->submit.waiter, NULL);
> > >                         blk_wake_io_task(waiter);
> > >                 } else if (dio->flags & IOMAP_DIO_WRITE) {
> > > -                       struct inode *inode = file_inode(dio->iocb->ki_filp);
> > > -
> > >                         INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> > > -                       queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> > > +                       schedule_work(&dio->aio.work);
> > 
> > I'm not sure that this will make a real difference because it ends up
> > to call queue_work(system_wq, ...) and system_wq is bounded as well so
> > the work will still be pinned to a CPU
> > Using system_unbound_wq should make a difference because it doesn't
> > pin the work on a CPU
> >  +                       queue_work(system_unbound_wq, &dio->aio.work);
> 
> Indeed, just run a quick test on my KVM guest, looks the following patch
> makes a difference:
> 
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 9329ced91f1d..2f4488b0ecec 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb)
>  {
>         struct workqueue_struct *old;
>         struct workqueue_struct *wq = alloc_workqueue("dio/%s",
> -                                                     WQ_MEM_RECLAIM, 0,
> +                                                     WQ_MEM_RECLAIM |
> +                                                     WQ_UNBOUND, 0,
>                                                       sb->s_id);

That's not an answer to the user task migration issue.

That is, all this patch does is trade user task migration when the
CPU is busy for migrating all the queued work off the CPU so the
user task does not get migrated. IOWs, this forces all the queued
work to be migrated rather than the user task. IOWs, it does not
address the issue we've exposed in the scheduler between tasks with
competing CPU affinity scheduling requirements - it just hides the
symptom.

Maintaining CPU affinity across dispatch and completion work has
been proven to be a significant performance win. Right throughout
the IO stack we try to keep this submitter/completion affinity,
and that's the whole point of using a bound wq in the first place:
efficient delayed batch processing of work on the local CPU.

Spewing deferred completion work across every idle CPU in the
machine because the local cpu is temporarily busy is a bad choice,
both from a performance perspective (dirty cacheline bouncing) and
from a power efficiency point of view as it causes CPUs to be taken
out of idle state much more frequently[*].

The fact that the scheduler migrates the user task we use workqueues
for deferred work as they were intended doesn't make this a
workqueue problem. If the answer to this problem is "make all IO
workqueues WQ_UNBOUND" then we are effectively saying "the scheduler
has unfixable problems when mixing bound and unbound work on the
same run queue".

And, besides, what happens when every other CPU is also completely
busy and can't run the work in a timely fashion? We've just moved
the work to some random CPU where we wait to be scheduled instead of
just sitting on the local CPU and waiting....

So, yes, we can work around the -symptoms- we see (frequent user
task migration) by changing the work queue configuration or
bypassing the workqueue for this specific workload. But these only
address the visible symptom and don't take into account the wider
goals of retaining CPU affinity in the IO stack, and they will have
variable scheduling latency and perofrmance and as the overall
system load changes.

So, we can fiddle with workqueues, but it doesn't address the
underlying issue that the scheduler appears to be migrating
non-bound tasks off a busy CPU too easily....

-Dave.

[*] Pay attention to the WQ_POWER_EFFICIENT definition for a work
queue: it's designed for interrupt routines that defer work via work
queues to avoid doing work on otherwise idle CPUs. It does this by
turning the per-cpu wq into an unbound wq so that work gets
scheduled on a non-idle CPUs in preference to the local idle CPU
which can then remain in low power states.

That's the exact opposite of what using WQ_UNBOUND ends up doing in
this IO completion context: it pushes the work out over idle CPUs
rather than keeping them confined on the already busy CPUs where CPU
affinity allows the work to be done quickly. So while WQ_UNBOUND
avoids the user task being migrated frequently, it results in the
work being spread around many more CPUs and we burn more power to do
the same work.

-- 
Dave Chinner
david@fromorbit.com

  reply index

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-14 11:31 Ming Lei
2019-11-14 13:14 ` Peter Zijlstra
2019-11-15  0:09   ` Ming Lei
2019-11-15 14:16     ` Ming Lei
2019-11-14 23:54 ` Dave Chinner
2019-11-15  1:08   ` Ming Lei
2019-11-15  4:56     ` Dave Chinner
2019-11-15  7:08       ` Ming Lei
2019-11-15 23:40         ` Dave Chinner
2019-11-16  6:31           ` Ming Lei
2019-11-18  9:21           ` Peter Zijlstra
2019-11-18 14:54             ` Vincent Guittot
2019-11-18 20:40             ` Dave Chinner
2019-11-20 19:16               ` Peter Zijlstra
2019-11-20 22:03                 ` Phil Auld
2019-11-21  4:12                   ` Ming Lei
2019-11-21 14:12                     ` Phil Auld
2019-11-21 15:02                       ` Boaz Harrosh
2019-11-21 16:19                         ` Jens Axboe
2019-12-09 16:58                           ` Srikar Dronamraju
2019-11-21 22:10                       ` Dave Chinner
2019-11-21 13:29                   ` Peter Zijlstra
2019-11-21 14:21                     ` Phil Auld
2019-12-09 16:51                     ` Srikar Dronamraju
2019-12-09 23:17                       ` Dave Chinner
2019-12-10  3:27                         ` Srikar Dronamraju
2019-12-10  5:43                         ` [PATCH v2] sched/core: Preempt current task in favour of bound kthread Srikar Dronamraju
2019-12-10  9:26                           ` Peter Zijlstra
2019-12-10  9:33                             ` Peter Zijlstra
2019-12-10 10:18                               ` Srikar Dronamraju
2019-12-10 10:16                             ` Srikar Dronamraju
2019-12-10  9:43                           ` Vincent Guittot
2019-12-10 10:11                             ` Srikar Dronamraju
2019-12-10 11:02                               ` Vincent Guittot
2019-12-10 17:23                           ` [PATCH v3] " Srikar Dronamraju
2019-12-11 17:38                             ` [PATCH v4] " Srikar Dronamraju
2019-12-11 22:46                               ` Dave Chinner
2019-12-12 10:10                                 ` Peter Zijlstra
2019-12-12 10:14                                   ` Peter Zijlstra
2019-12-12 10:23                                     ` Peter Zijlstra
2019-12-12 11:20                                       ` Vincent Guittot
2019-12-12 13:12                                         ` Peter Zijlstra
2019-12-12 15:07                                   ` Srikar Dronamraju
2019-12-12 15:15                                     ` Peter Zijlstra
2019-12-13  5:32                                   ` Srikar Dronamraju
2019-11-18 16:26           ` single aio thread is migrated crazily by scheduler Srikar Dronamraju
2019-11-18 21:18             ` Dave Chinner
2019-11-19  8:54               ` Ming Lei
     [not found]         ` <20191128094003.752-1-hdanton@sina.com>
2019-11-28  9:53           ` Vincent Guittot
2019-12-02  2:46             ` Ming Lei
2019-12-02  4:02               ` Dave Chinner [this message]
2019-12-02  4:22                 ` Ming Lei
2019-12-02 13:45                 ` Vincent Guittot
2019-12-02 21:22                   ` Phil Auld
2019-12-03  9:45                     ` Vincent Guittot
2019-12-04 13:50                       ` Ming Lei
2019-12-02 23:53                   ` Dave Chinner
2019-12-03  0:18                     ` Ming Lei
2019-12-03 13:34                     ` Vincent Guittot
2019-12-02  7:39               ` Juri Lelli
2019-12-02  3:08           ` Dave Chinner
     [not found]           ` <20191202090158.15016-1-hdanton@sina.com>
2019-12-02 23:06             ` Dave Chinner
     [not found]             ` <20191203131514.5176-1-hdanton@sina.com>
2019-12-03 22:29               ` Dave Chinner
     [not found]               ` <20191204102903.896-1-hdanton@sina.com>
2019-12-04 22:59                 ` Dave Chinner

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191202040256.GE2695@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=axboe@kernel.dk \
    --cc=hch@lst.de \
    --cc=hdanton@sina.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=ming.lei@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rong.a.chen@intel.com \
    --cc=tj@kernel.org \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-XFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-xfs/0 linux-xfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-xfs linux-xfs/ https://lore.kernel.org/linux-xfs \
		linux-xfs@vger.kernel.org
	public-inbox-index linux-xfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-xfs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git