Re: single aio thread is migrated crazily by scheduler

From: Dave Chinner <david@fromorbit.com>
To: Hillf Danton <hdanton@sina.com>
Cc: Ming Lei <ming.lei@redhat.com>,
	linux-block <linux-block@vger.kernel.org>,
	linux-fs <linux-fsdevel@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Christoph Hellwig <hch@lst.de>, Jens Axboe <axboe@kernel.dk>,
	Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Rong Chen <rong.a.chen@intel.com>, Tejun Heo <tj@kernel.org>
Subject: Re: single aio thread is migrated crazily by scheduler
Date: Wed, 4 Dec 2019 09:29:25 +1100	[thread overview]
Message-ID: <20191203222925.GM2695@dread.disaster.area> (raw)
In-Reply-To: <20191203131514.5176-1-hdanton@sina.com>

On Tue, Dec 03, 2019 at 09:15:14PM +0800, Hillf Danton wrote:
> > IOWs, we are trying to ensure that we run the data IO completion on
> > the CPU with that has that data hot in cache. When we are running
> > millions of IOs every second, this matters -a lot-. IRQ steering is
> > just a mechansim that is used to ensure completion processing hits
> > hot caches.
> 
> Along the "CPU affinity" direction, a trade-off is made between CPU
> affinity and cache affinity before lb can bear the ca scheme.
> Completion works are queued in round robin on the CPUs that share
> cache with the submission CPU.
> 
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -143,6 +143,42 @@ static inline void iomap_dio_set_error(s
>  	cmpxchg(&dio->error, 0, ret);
>  }
>  
> +static DEFINE_PER_CPU(int, iomap_dio_bio_end_io_cnt);
> +static DEFINE_PER_CPU(int, iomap_dio_bio_end_io_cpu);
> +#define IOMAP_DIO_BIO_END_IO_BATCH 7
> +
> +static int iomap_dio_cpu_rr(void)
> +{
> +	int *io_cnt, *io_cpu;
> +	int cpu, this_cpu;
> +
> +	io_cnt = get_cpu_ptr(&iomap_dio_bio_end_io_cnt);
> +	io_cpu = this_cpu_ptr(&iomap_dio_bio_end_io_cpu);
> +	this_cpu = smp_processor_id();
> +
> +	if (!(*io_cnt & IOMAP_DIO_BIO_END_IO_BATCH)) {
> +		for (cpu = *io_cpu + 1; cpu < nr_cpu_id; cpu++)
> +			if (cpu == this_cpu ||
> +			    cpus_share_cache(cpu, this_cpu))
> +				goto update_cpu;
> +
> +		for (cpu = 0; cpu < *io_cpu; cpu++)
> +			if (cpu == this_cpu ||
> +			    cpus_share_cache(cpu, this_cpu))
> +				goto update_cpu;

Linear scans like this just don't scale. We can have thousands of
CPUs in a system and maybe only 8 cores that share a local cache.
And we can be completing millions of direct IO writes a second these
days. A linear scan of (thousands - 8) cpu ids every so often is
going to show up as long tail latency for the unfortunate IO that
has to scan those thousands of non-matching CPU IDs to find a
sibling, and we'll be doing that every handful of IOs that are
completed on every CPU.

> +
> +		cpu = this_cpu;
> +update_cpu:
> +		*io_cpu = cpu;
> +	}
> +
> +	(*io_cnt)++;
> +	cpu = *io_cpu;
> +	put_cpu_ptr(&iomap_dio_bio_end_io_cnt);
> +
> +	return cpu;
> +}
> 
>  static void iomap_dio_bio_end_io(struct bio *bio)
>  {
>  	struct iomap_dio *dio = bio->bi_private;
> @@ -158,9 +194,10 @@ static void iomap_dio_bio_end_io(struct
>  			blk_wake_io_task(waiter);
>  		} else if (dio->flags & IOMAP_DIO_WRITE) {
>  			struct inode *inode = file_inode(dio->iocb->ki_filp);
> +			int cpu = iomap_dio_cpu_rr();

IMO, this sort of "limit work to sibling CPU cores" does not belong in
general code. We have *lots* of workqueues that need this treatment,
and it's not viable to add this sort of linear search loop to every
workqueue and place we queue work. Besides....

>  
>  			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> -			queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> +			queue_work_on(cpu, inode->i_sb->s_dio_done_wq, &dio->aio.work);

.... as I've stated before, this *does not solve the scheduler
problem*.  All this does is move the problem to the target CPU
instead of seeing it on the local CPU.

If we really want to hack around the load balancer problems in this
way, then we need to add a new workqueue concurrency management type
with behaviour that lies between the default of bound and WQ_UNBOUND.

WQ_UNBOUND limits scheduling to within a numa node - see
wq_update_unbound_numa() for how it sets up the cpumask attributes
it applies to it's workers - but we need the work to be bound to
within the local cache domain rather than a numa node. IOWs, set up
the kworker task pool management structure with the right attributes
(e.g. cpu masks) to define the cache domains, add all the hotplug
code to make it work with CPU hotplug, then simply apply those
attributes to the kworker task that is selected to execute the work.

This allows the scheduler to migrate the kworker away from the local
run queue without interrupting the currently scheduled task. The
cpumask limits the task is configured with limit the scheduler to
selecting the best CPU within the local cache domain, and we don't
have to bind work to CPUs to get CPU cache friendly work scheduling.
This also avoids overhead of per-queue_work_on() sibling CPU
calculation, and all the code that wants to use this functionality
needs to do is add a single flag at work queue init time (e.g.
WQ_CACHEBOUND).

IOWs, if the task migration behaviour cannot be easily fixed and so
we need work queue users to be more flexible about work placement,
then the solution needed here is "cpu cache local work queue
scheduling" implemented in the work queue infrastructure, not in
every workqueue user.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com