Linux-XFS Archive on lore.kernel.org
 help / color / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Hillf Danton <hdanton@sina.com>
Cc: Ming Lei <ming.lei@redhat.com>,
	linux-block <linux-block@vger.kernel.org>,
	linux-fs <linux-fsdevel@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Christoph Hellwig <hch@lst.de>, Jens Axboe <axboe@kernel.dk>,
	Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Rong Chen <rong.a.chen@intel.com>, Tejun Heo <tj@kernel.org>
Subject: Re: single aio thread is migrated crazily by scheduler
Date: Wed, 4 Dec 2019 09:29:25 +1100
Message-ID: <20191203222925.GM2695@dread.disaster.area> (raw)
In-Reply-To: <20191203131514.5176-1-hdanton@sina.com>

On Tue, Dec 03, 2019 at 09:15:14PM +0800, Hillf Danton wrote:
> > IOWs, we are trying to ensure that we run the data IO completion on
> > the CPU with that has that data hot in cache. When we are running
> > millions of IOs every second, this matters -a lot-. IRQ steering is
> > just a mechansim that is used to ensure completion processing hits
> > hot caches.
> 
> Along the "CPU affinity" direction, a trade-off is made between CPU
> affinity and cache affinity before lb can bear the ca scheme.
> Completion works are queued in round robin on the CPUs that share
> cache with the submission CPU.
> 
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -143,6 +143,42 @@ static inline void iomap_dio_set_error(s
>  	cmpxchg(&dio->error, 0, ret);
>  }
>  
> +static DEFINE_PER_CPU(int, iomap_dio_bio_end_io_cnt);
> +static DEFINE_PER_CPU(int, iomap_dio_bio_end_io_cpu);
> +#define IOMAP_DIO_BIO_END_IO_BATCH 7
> +
> +static int iomap_dio_cpu_rr(void)
> +{
> +	int *io_cnt, *io_cpu;
> +	int cpu, this_cpu;
> +
> +	io_cnt = get_cpu_ptr(&iomap_dio_bio_end_io_cnt);
> +	io_cpu = this_cpu_ptr(&iomap_dio_bio_end_io_cpu);
> +	this_cpu = smp_processor_id();
> +
> +	if (!(*io_cnt & IOMAP_DIO_BIO_END_IO_BATCH)) {
> +		for (cpu = *io_cpu + 1; cpu < nr_cpu_id; cpu++)
> +			if (cpu == this_cpu ||
> +			    cpus_share_cache(cpu, this_cpu))
> +				goto update_cpu;
> +
> +		for (cpu = 0; cpu < *io_cpu; cpu++)
> +			if (cpu == this_cpu ||
> +			    cpus_share_cache(cpu, this_cpu))
> +				goto update_cpu;

Linear scans like this just don't scale. We can have thousands of
CPUs in a system and maybe only 8 cores that share a local cache.
And we can be completing millions of direct IO writes a second these
days. A linear scan of (thousands - 8) cpu ids every so often is
going to show up as long tail latency for the unfortunate IO that
has to scan those thousands of non-matching CPU IDs to find a
sibling, and we'll be doing that every handful of IOs that are
completed on every CPU.

> +
> +		cpu = this_cpu;
> +update_cpu:
> +		*io_cpu = cpu;
> +	}
> +
> +	(*io_cnt)++;
> +	cpu = *io_cpu;
> +	put_cpu_ptr(&iomap_dio_bio_end_io_cnt);
> +
> +	return cpu;
> +}
> 
>  static void iomap_dio_bio_end_io(struct bio *bio)
>  {
>  	struct iomap_dio *dio = bio->bi_private;
> @@ -158,9 +194,10 @@ static void iomap_dio_bio_end_io(struct
>  			blk_wake_io_task(waiter);
>  		} else if (dio->flags & IOMAP_DIO_WRITE) {
>  			struct inode *inode = file_inode(dio->iocb->ki_filp);
> +			int cpu = iomap_dio_cpu_rr();

IMO, this sort of "limit work to sibling CPU cores" does not belong in
general code. We have *lots* of workqueues that need this treatment,
and it's not viable to add this sort of linear search loop to every
workqueue and place we queue work. Besides....

>  
>  			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> -			queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> +			queue_work_on(cpu, inode->i_sb->s_dio_done_wq, &dio->aio.work);

.... as I've stated before, this *does not solve the scheduler
problem*.  All this does is move the problem to the target CPU
instead of seeing it on the local CPU.

If we really want to hack around the load balancer problems in this
way, then we need to add a new workqueue concurrency management type
with behaviour that lies between the default of bound and WQ_UNBOUND.

WQ_UNBOUND limits scheduling to within a numa node - see
wq_update_unbound_numa() for how it sets up the cpumask attributes
it applies to it's workers - but we need the work to be bound to
within the local cache domain rather than a numa node. IOWs, set up
the kworker task pool management structure with the right attributes
(e.g. cpu masks) to define the cache domains, add all the hotplug
code to make it work with CPU hotplug, then simply apply those
attributes to the kworker task that is selected to execute the work.

This allows the scheduler to migrate the kworker away from the local
run queue without interrupting the currently scheduled task. The
cpumask limits the task is configured with limit the scheduler to
selecting the best CPU within the local cache domain, and we don't
have to bind work to CPUs to get CPU cache friendly work scheduling.
This also avoids overhead of per-queue_work_on() sibling CPU
calculation, and all the code that wants to use this functionality
needs to do is add a single flag at work queue init time (e.g.
WQ_CACHEBOUND).

IOWs, if the task migration behaviour cannot be easily fixed and so
we need work queue users to be more flexible about work placement,
then the solution needed here is "cpu cache local work queue
scheduling" implemented in the work queue infrastructure, not in
every workqueue user.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  parent reply index

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-14 11:31 Ming Lei
2019-11-14 13:14 ` Peter Zijlstra
2019-11-15  0:09   ` Ming Lei
2019-11-15 14:16     ` Ming Lei
2019-11-14 23:54 ` Dave Chinner
2019-11-15  1:08   ` Ming Lei
2019-11-15  4:56     ` Dave Chinner
2019-11-15  7:08       ` Ming Lei
2019-11-15 23:40         ` Dave Chinner
2019-11-16  6:31           ` Ming Lei
2019-11-18  9:21           ` Peter Zijlstra
2019-11-18 14:54             ` Vincent Guittot
2019-11-18 20:40             ` Dave Chinner
2019-11-20 19:16               ` Peter Zijlstra
2019-11-20 22:03                 ` Phil Auld
2019-11-21  4:12                   ` Ming Lei
2019-11-21 14:12                     ` Phil Auld
2019-11-21 15:02                       ` Boaz Harrosh
2019-11-21 16:19                         ` Jens Axboe
2019-12-09 16:58                           ` Srikar Dronamraju
2019-11-21 22:10                       ` Dave Chinner
2019-11-21 13:29                   ` Peter Zijlstra
2019-11-21 14:21                     ` Phil Auld
2019-12-09 16:51                     ` Srikar Dronamraju
2019-12-09 23:17                       ` Dave Chinner
2019-12-10  3:27                         ` Srikar Dronamraju
2019-12-10  5:43                         ` [PATCH v2] sched/core: Preempt current task in favour of bound kthread Srikar Dronamraju
2019-12-10  9:26                           ` Peter Zijlstra
2019-12-10  9:33                             ` Peter Zijlstra
2019-12-10 10:18                               ` Srikar Dronamraju
2019-12-10 10:16                             ` Srikar Dronamraju
2019-12-10  9:43                           ` Vincent Guittot
2019-12-10 10:11                             ` Srikar Dronamraju
2019-12-10 11:02                               ` Vincent Guittot
2019-12-10 17:23                           ` [PATCH v3] " Srikar Dronamraju
2019-12-11 17:38                             ` [PATCH v4] " Srikar Dronamraju
2019-12-11 22:46                               ` Dave Chinner
2019-12-12 10:10                                 ` Peter Zijlstra
2019-12-12 10:14                                   ` Peter Zijlstra
2019-12-12 10:23                                     ` Peter Zijlstra
2019-12-12 11:20                                       ` Vincent Guittot
2019-12-12 13:12                                         ` Peter Zijlstra
2019-12-12 15:07                                   ` Srikar Dronamraju
2019-12-12 15:15                                     ` Peter Zijlstra
2019-12-13  5:32                                   ` Srikar Dronamraju
2019-11-18 16:26           ` single aio thread is migrated crazily by scheduler Srikar Dronamraju
2019-11-18 21:18             ` Dave Chinner
2019-11-19  8:54               ` Ming Lei
     [not found]         ` <20191128094003.752-1-hdanton@sina.com>
2019-11-28  9:53           ` Vincent Guittot
2019-12-02  2:46             ` Ming Lei
2019-12-02  4:02               ` Dave Chinner
2019-12-02  4:22                 ` Ming Lei
2019-12-02 13:45                 ` Vincent Guittot
2019-12-02 21:22                   ` Phil Auld
2019-12-03  9:45                     ` Vincent Guittot
2019-12-04 13:50                       ` Ming Lei
2019-12-02 23:53                   ` Dave Chinner
2019-12-03  0:18                     ` Ming Lei
2019-12-03 13:34                     ` Vincent Guittot
2019-12-02  7:39               ` Juri Lelli
2019-12-02  3:08           ` Dave Chinner
     [not found]           ` <20191202090158.15016-1-hdanton@sina.com>
2019-12-02 23:06             ` Dave Chinner
     [not found]             ` <20191203131514.5176-1-hdanton@sina.com>
2019-12-03 22:29               ` Dave Chinner [this message]
     [not found]               ` <20191204102903.896-1-hdanton@sina.com>
2019-12-04 22:59                 ` Dave Chinner

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191203222925.GM2695@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=axboe@kernel.dk \
    --cc=hch@lst.de \
    --cc=hdanton@sina.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=ming.lei@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rong.a.chen@intel.com \
    --cc=tj@kernel.org \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-XFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-xfs/0 linux-xfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-xfs linux-xfs/ https://lore.kernel.org/linux-xfs \
		linux-xfs@vger.kernel.org
	public-inbox-index linux-xfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-xfs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git