Re: [PATCH V4] block: optimize for small block size IO

From: Kent Overstreet <kent.overstreet@gmail.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	Christoph Hellwig <hch@infradead.org>,
	linux-block@vger.kernel.org, Coly Li <colyli@suse.de>,
	Keith Busch <kbusch@kernel.org>,
	linux-bcache@vger.kernel.org
Subject: Re: [PATCH V4] block: optimize for small block size IO
Date: Mon, 4 Nov 2019 21:30:02 -0500
Message-ID: <20191105023002.GC18564@moria.home.lan>
In-Reply-To: <20191105022046.GF11436@ming.t460p>

On Tue, Nov 05, 2019 at 10:20:46AM +0800, Ming Lei wrote:
> On Mon, Nov 04, 2019 at 09:11:30PM -0500, Kent Overstreet wrote:
> > On Tue, Nov 05, 2019 at 09:11:35AM +0800, Ming Lei wrote:
> > > On Mon, Nov 04, 2019 at 01:42:17PM -0500, Kent Overstreet wrote:
> > > > On Mon, Nov 04, 2019 at 11:23:42AM -0700, Jens Axboe wrote:
> > > > > On 11/4/19 11:17 AM, Kent Overstreet wrote:
> > > > > > On Mon, Nov 04, 2019 at 10:15:41AM -0800, Christoph Hellwig wrote:
> > > > > >> On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote:
> > > > > >>> On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote:
> > > > > > >>>> __blk_queue_split() may be a bit heavy for small block size (such as
> > > > > > >>>> 512B, or 4KB) IO, so introduce one flag to decide if this bio includes
> > > > > > >>>> multiple pages. And only try splitting this bio in case the
> > > > > > >>>> multiple page flag is set.
> > > > > >>>
> > > > > >>> So, back in the day I had an alternative approach in mind: get rid of
> > > > > >>> blk_queue_split entirely, by pushing splitting down to the request layer - when
> > > > > >>> we map the bio/request to sgl, just have it map as much as will fit in the sgl
> > > > > >>> and if it doesn't entirely fit bump bi_remaining and leave it on the request
> > > > > >>> queue.
> > > > > >>>
> > > > > >>> This would mean there'd be no need for counting segments at all, and would cut a
> > > > > >>> fair amount of code out of the io path.
> > > > > >>
> > > > > > >> I thought about that too, but it will take a lot more effort.  Mostly
> > > > > >> because md/dm heavily rely on splitting as well.  I still think it is
> > > > > >> worthwhile, it will just take a significant amount of time and we
> > > > > >> should have the quick improvement now.
> > > > > > 
> > > > > > We can do it one driver at a time - driver sets a flag to disable
> > > > > > blk_queue_split(). Obvious one to do first would be nvme since that's where it
> > > > > > shows up the most.
> > > > > > 
> > > > > > > And md/dm do splitting internally, but I'm not so sure they need
> > > > > > blk_queue_split().
> > > > > 
> > > > > I'm a big proponent of doing something like that instead, but it is a
> > > > > lot of work. I absolutely hate the splitting we're doing now, even
> > > > > > though the original "let's work as hard as we can at add page time to get
> > > > > things right" was pretty abysmal as well.
> > > > 
> > > > Last I looked I don't think it was going to be that bad, just needed a bit of
> > > > finesse. We just need to be able to partially process a request in e.g.
> > > > nvme_map_data(), and blk_rq_map_sg() needs to be modified to only map as much as
> > > > will fit instead of popping an assertion.
> > > 
> > > I think it may not be doable.
> > > 
> > > blk_rq_map_sg() is called by drivers and has to work on single request, however
> > > more requests have to be involved if we delay the splitting to blk_rq_map_sg().
> > > Cause splitting means that two bios can't be submitted in single IO request.
> > 
> > Of course it's doable, do I have to show you how?
> 
> No, you don't have to, could you just point out where my words above are wrong?

blk_rq_map_sg() _currently_ works on a single request, but as I said from the
start, this would involve changing it to only process as much of a request
as would fit on an sglist.
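
To make the "map only as much as fits" idea concrete, here is a userspace toy
model. It is only a sketch under stated assumptions: it is not kernel code, it
does not use the real blk_rq_map_sg() interface, and every name in it
(toy_request, toy_map_sg, toy_fully_mapped) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a request: a run of segments plus a cursor. */
struct toy_request {
	size_t nr_segs;   /* total segments in the request */
	size_t next_seg;  /* first segment that has not been mapped yet */
};

/*
 * Map as many of the remaining segments as fit into an sglist that has
 * room for sg_cap entries. Returns the number of segments mapped by this
 * call. If the request did not fit entirely, the cursor stops short of
 * nr_segs and the caller is expected to leave the request queued.
 */
size_t toy_map_sg(struct toy_request *rq, size_t sg_cap)
{
	size_t left, n;

	assert(rq->next_seg <= rq->nr_segs);
	left = rq->nr_segs - rq->next_seg;
	n = left < sg_cap ? left : sg_cap;
	rq->next_seg += n;
	return n;
}

/* Nonzero once every segment of the request has been mapped. */
int toy_fully_mapped(const struct toy_request *rq)
{
	return rq->next_seg == rq->nr_segs;
}
```

In this model a 10-segment request against a 4-entry sglist takes three calls
(4, 4, then 2 segments), and only after the third would the request be taken
off the queue.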

Drivers will have to be modified, but the changes to driver code should be
pretty easy. What will be slightly trickier will be changing blk-mq to handle
requests that are only partially completed; that will be harder than it would
have been before blk-mq, since the old request queue code used to handle
partially completed requests - not much work would have had to be done to that code.

I'm not very familiar with the blk-mq code, so Jens would be better qualified to
say how best to change that code. The basic idea would probably be the same as
how bios now have a refcount - bi_remaining - to track splits/completions. If
requests (in blk-mq land) don't have such a refcount (they don't appear to), it
will have to be added.
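
A rough illustration of that refcount scheme, again as a userspace toy and not
blk-mq code. The struct and function names are hypothetical, and a plain int
is used where real code would need an atomic counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-request counter, modelled on the bi_remaining idea. */
struct toy_rq_ref {
	int remaining;  /* pieces of the request still outstanding */
	bool done;      /* set once the whole request has completed */
};

/* A fresh request counts as one outstanding piece. */
void toy_rq_init(struct toy_rq_ref *rq)
{
	rq->remaining = 1;
	rq->done = false;
}

/*
 * Called when a piece was dispatched but part of the request did not fit
 * and is still waiting to be mapped: one more completion is now expected.
 */
void toy_rq_split(struct toy_rq_ref *rq)
{
	assert(!rq->done);
	rq->remaining++;
}

/*
 * Called when one dispatched piece completes. Returns true only for the
 * completion that finishes the whole request.
 */
bool toy_rq_complete_piece(struct toy_rq_ref *rq)
{
	assert(rq->remaining > 0);
	if (--rq->remaining == 0)
		rq->done = true;
	return rq->done;
}
```

With a request dispatched in three pieces, toy_rq_split() runs twice (after
the first and second piece), so the count reaches 3 and the request is only
marked done on the third completion.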

From a quick glance, blk_mq_complete_request() is where the refcount put will
have to be added. I haven't found where requests are popped off the request
queue in blk-mq land yet - the code will have to be changed to only do that once
the request has been fully mapped and submitted by the driver.