From: Thomas Gleixner <tglx@linutronix.de>
To: Keith Busch <keith.busch@gmail.com>
Cc: Ming Lei <ming.lei@redhat.com>, Long Li <longli@microsoft.com>,
	Jens Axboe <axboe@fb.com>, Sagi Grimberg <sagi@grimberg.me>,
	chenxiang <chenxiang66@hisilicon.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ming Lei <tom.leiming@gmail.com>,
	John Garry <john.garry@huawei.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-nvme <linux-nvme@lists.infradead.org>,
	Keith Busch <keith.busch@intel.com>,
	Ingo Molnar <mingo@redhat.com>, Christoph Hellwig <hch@lst.de>,
	"longli@linuxonhyperv.com" <longli@linuxonhyperv.com>
Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
Date: Thu, 22 Aug 2019 11:48:29 +0200 (CEST)
Message-ID: <alpine.DEB.2.21.1908221143060.1983@nanos.tec.linutronix.de>
In-Reply-To: <CAOSXXT7LVjBqVW14y-pZyUCat3PBPd_nVd_uDahBdhyW+eHmcg@mail.gmail.com>

On Wed, 21 Aug 2019, Keith Busch wrote:
> On Wed, Aug 21, 2019 at 7:34 PM Ming Lei <ming.lei@redhat.com> wrote:
> > On Wed, Aug 21, 2019 at 04:27:00PM +0000, Long Li wrote:
> > > Here is the command to benchmark it:
> > >
> > > fio --bs=4k --ioengine=libaio --iodepth=128 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=120 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1
> > >
> >
> > I can reproduce the issue on one machine (96 cores) with 4 NVMes (32 queues
> > each), so each queue is shared by 3 CPUs.
> >
> > IOPS drops by more than 20% when 'use_threaded_interrupts' is enabled. The
> > fio log shows that CPU context switches increase a lot.
> 
> Interestingly, use_threaded_interrupts shows a marginal improvement on
> my machine with the same fio profile. It was only 5 NVMes, but they have
> one queue per CPU on 112 cores.

Which is not surprising, because the thread and the hard interrupt run on
the same CPU and the only added cost is the small overhead of a context
switch.
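
A quick way to check that placement on a live system is to compare where
the interrupt counts grow with where the irq threads run; a minimal
sketch, assuming the nvme module can be reloaded (queue names and irq
numbers will differ per machine):

  # Enable threaded interrupts via the nvme module parameter
  # (alternatively boot with nvme.use_threaded_interrupts=1):
  modprobe -r nvme
  modprobe nvme use_threaded_interrupts=1

  # Per-CPU interrupt counts for the nvme queue vectors:
  grep nvme /proc/interrupts

  # The matching irq threads are named irq/<N>-nvme...; PSR is the
  # CPU each thread last ran on:
  ps -eLo pid,psr,comm | grep 'irq/'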

The thing is that this really depends on how the scheduler decides to place
the interrupt thread.

If you have a queue serving several CPUs, then depending on the load
situation, allowing multi-CPU affinity for the thread can cause a lot of
task migrations.
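
How wide that affinity is can be read straight from procfs; a hedged
sketch, with <N> standing for one of the nvme vector numbers found in
/proc/interrupts (for managed interrupts these masks are set by the
kernel and cannot be changed from user space):

  # All CPUs the interrupt, and thus its thread, may run on:
  cat /proc/irq/<N>/smp_affinity_list

  # Where the interrupt is actually steered right now, on
  # architectures that report an effective affinity:
  cat /proc/irq/<N>/effective_affinity_list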

But restricting the irq thread to the CPU to which the interrupt is affine
can also starve that CPU. There is no universal rule here.

Tracing should tell.
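
For example, a trace-cmd run along these lines (standard sched and irq
tracepoints; the 30s window and the grep pattern are just illustrative)
would show both the context switch rate and whether the irq threads
bounce between CPUs while fio is running:

  trace-cmd record -e sched:sched_switch -e sched:sched_migrate_task \
                   -e irq:irq_handler_entry -- sleep 30
  trace-cmd report | grep 'irq/'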

Thanks,

	tglx




Thread overview: 30+ messages
2019-08-20  6:14 [PATCH 0/3] fix interrupt swamp in NVMe longli
2019-08-20  6:14 ` [PATCH 1/3] sched: define a function to report the number of context switches on a CPU longli
2019-08-20  9:38   ` Peter Zijlstra
2019-08-21  8:20     ` Long Li
2019-08-21 10:34       ` Peter Zijlstra
2019-08-20  9:39   ` Peter Zijlstra
2019-08-20  6:14 ` [PATCH 2/3] sched: export idle_cpu() longli
2019-08-20  6:14 ` [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts longli
2019-08-20  9:52   ` Peter Zijlstra
2019-08-21  8:37     ` Long Li
2019-08-21 10:35       ` Peter Zijlstra
2019-08-20 17:33   ` Sagi Grimberg
2019-08-21  8:39     ` Long Li
2019-08-21 17:36       ` Long Li
2019-08-21 21:54         ` Sagi Grimberg
2019-08-24  0:13           ` Long Li
2019-08-23  3:21     ` Ming Lei
2019-08-24  0:27       ` Long Li
2019-08-24 12:55         ` Ming Lei
2019-08-20  8:25 ` [PATCH 0/3] fix interrupt swamp in NVMe Ming Lei
2019-08-20  8:59   ` John Garry
2019-08-20 15:05     ` Keith Busch
2019-08-21  7:47     ` Long Li
2019-08-21  9:44       ` Ming Lei
2019-08-21 10:03         ` John Garry
2019-08-21 16:27         ` Long Li
2019-08-22  1:33           ` Ming Lei
2019-08-22  2:00             ` Keith Busch
2019-08-22  2:23               ` Ming Lei
2019-08-22  9:48               ` Thomas Gleixner [this message]
