From: Sagi Grimberg <sagi@grimberg.me>
To: Ming Lei <ming.lei@redhat.com>, Rachit Agarwal <rach4x0r@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>, Qizhe Cai <qc228@cornell.edu>,
	Rachit Agarwal <ragarwal@cornell.edu>,
	linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org,
	Midhul Vuppalapati <mvv25@cornell.edu>,
	Jaehyun Hwang <jaehyun.hwang@cornell.edu>,
	Rachit Agarwal <ragarwal@cs.cornell.edu>,
	Keith Busch <kbusch@kernel.org>,
	Sagi Grimberg <sagi@lightbitslabs.com>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH] iosched: Add i10 I/O Scheduler
Date: Fri, 13 Nov 2020 12:58:10 -0800
Message-ID: <44d5bcb0-689e-50c8-fa8e-a7d2b569f75c@grimberg.me>
In-Reply-To: <20201113145912.GA1074955@T590>


> blk-mq actually has a built-in batching (or sort of) mechanism, which is
> enabled if the hw queue is busy (hctx->dispatch_busy > 0). We use EWMA to
> compute hctx->dispatch_busy, and it is adaptive, even though the
> implementation is quite coarse. But there should be much room to improve, IMO.

You are correct; however, nvme-tcp should be getting to dispatch_busy > 0,
IIUC.
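
For reference, that EWMA lives in blk_mq_update_dispatch_busy() in
block/blk-mq.c. A simplified sketch of the update (the weight/factor values
match the in-tree defines; surrounding details are elided):

    #define BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT  8
    #define BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR  4

    static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx,
                                            bool busy)
    {
            unsigned int ewma = hctx->dispatch_busy;

            if (!ewma && !busy)
                    return;

            /* decay the old value, then mix in the new busy sample */
            ewma *= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT - 1;
            if (busy)
                    ewma += 1 << BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR;
            ewma /= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT;

            hctx->dispatch_busy = ewma;
    }

So dispatch_busy drifts toward 16 while dispatch keeps finding the queue
busy and decays back toward 0 otherwise, which is the adaptive (if coarse)
signal Ming describes.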

> It has been reported that this approach improves single-queue (SQ) high-end
> SCSI SSDs very much [1], and MMC performance improves too [2].
> 
> [1] https://lore.kernel.org/linux-block/3cc3e03901dc1a63ef32e036182521af@mail.gmail.com/
> [2] https://lore.kernel.org/linux-block/CADBw62o9eTQDJ9RvNgEqSpXmg6Xcq=2TxH0Hfxhp29uF2W=TXA@mail.gmail.com/

Yes, the guys paid attention to the MMC-related improvements that you
made.

>> The i10 I/O scheduler builds upon recent work [6]. We have tested the i10 I/O
>> scheduler with the nvme-tcp optimizations [2,3] and batching dispatch [4],
>> varying numbers of cores, varying read/write ratios, and varying request sizes,
>> with both an NVMe SSD and a RAM block device. For NVMe SSDs, the i10 I/O
>> scheduler achieves ~60% improvement in IOPS per core over the "noop" I/O
>> scheduler. These results are available at [5], and many additional results are
>> presented in [6].
> 
> In the case of the none scheduler, the nvme driver basically won't provide any
> queue busy feedback, so the built-in batching dispatch simply doesn't work.

Exactly.

> The kyber scheduler uses I/O latency feedback to throttle and build I/O
> batches; can you compare i10 with kyber on nvme/nvme-tcp?

I assume it should be simple to get; I'll let Rachit/Jaehyun comment.

>> While other schedulers may also batch I/O (e.g., mq-deadline), the optimization
>> target in the i10 I/O scheduler is throughput maximization. Hence there is
>> neither a latency target nor a need for a global tracking context, so a new
>> scheduler is needed rather than building this functionality into an existing
>> scheduler.
>>
>> We currently use fixed default values as batching thresholds (e.g., 16 for #requests,
>> 64KB for #bytes, and 50us for timeout). These default values are based on sensitivity
>> tests in [6]. For our future work, we plan to support adaptive batching according to
> 
> Frankly speaking, hardcoded values like 16 #requests or 64KB may not work
> everywhere, and production environments can be much more complicated than your
> sensitivity tests. If possible, please start with adaptive batching.

That was my feedback as well, for sure. But given that this is a
scheduler one would opt in to anyway, that won't be a must-have
initially. I'm not sure if the guys have made progress with this yet;
I'll let them comment.
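
For readers following the thresholds discussion: the defaults amount to a
doorbell that rings when any of three limits trips. A minimal sketch of the
idea (hypothetical names throughout, not the actual i10 patch code):

    /* Illustrative only -- these macro and field names are made up. */
    #define I10_DEF_BATCH_NR      16              /* #requests threshold */
    #define I10_DEF_BATCH_BYTES   (64 * 1024)     /* #bytes threshold */
    #define I10_DEF_BATCH_USECS   50              /* timeout threshold */

    struct i10_batch {
            unsigned int nr_reqs;   /* requests queued so far */
            unsigned int nr_bytes;  /* bytes queued so far */
    };

    /* Dispatch the whole batch once either size threshold trips; a
     * timer firing after I10_DEF_BATCH_USECS covers the idle case so
     * a trickle of I/O is never held back indefinitely. */
    static bool i10_batch_ready(const struct i10_batch *b)
    {
            return b->nr_reqs >= I10_DEF_BATCH_NR ||
                   b->nr_bytes >= I10_DEF_BATCH_BYTES;
    }

Adaptive batching would presumably adjust these limits at runtime from
feedback (e.g., observed latency or queue depth) rather than using
compile-time constants.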
