From: Tony Yang
Date: Thu, 7 Sep 2017 23:17:22 +0800
Subject: Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
To: Christoph Hellwig
Cc: Sagi Grimberg, Jens Axboe, Keith Busch, linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
In-Reply-To: <20170828090620.GA4529@lst.de>
References: <20170823175815.3646-1-hch@lst.de> <20170823175815.3646-11-hch@lst.de> <84429b00-300f-a36b-c4ac-b7d930ca6c7e@grimberg.me> <20170828090620.GA4529@lst.de>

Hi, where can I download this patch? Thanks.

2017-08-28 17:06 GMT+08:00 Christoph Hellwig <hch@lst.de>:
> On Mon, Aug 28, 2017 at 10:23:33AM +0300, Sagi Grimberg wrote:
> > This is really taking a lot into the nvme driver.

> This patch adds about 100 lines of code, out of which a quarter is the
> parsing of status codes. I don't think it's all that much.
> > I'm not sure if
> > this approach will be used in other block drivers, but would it
> > make sense to place the block_device node creation,

> I really want to reduce the amount of boilerplate code for creating
> a block queue and I have some ideas for that. But that's not in
> any way related to this multi-path code.

> > the make_request

> That basically just does a lookup (in a data structure we need for
> tracking the siblings inside nvme anyway) and then submits the
> I/O again. The generic code is all in generic_make_request_fast.
> There are two more things which could move into common code, one
> reasonable, the other not:

> (1) the whole requeue_list logic. I thought about moving this
>     to the block layer as I'd also have some other use cases for it,
>     but decided to wait until those materialize to see if I'd really
>     need it.
> (2) turning the logic inside out, e.g. providing a generic make_request
>     and supplying a find_path callback to it. This would require (1)
>     but even then we'd save basically no code, add an indirect call
>     in the fast path and make things harder to read. This actually
>     is how I started out and didn't like it.
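
To make the shape of that lookup-and-resubmit (and of the find_path-callback
variant in (2)) concrete, here is a toy, userspace-style sketch. All types,
names and the trivial path policy below are made up for illustration; this is
not the code from the patch.

/* Toy model of a multipath make_request with a find_path callback. */
#include <stddef.h>
#include <stdio.h>

struct toy_path { const char *name; int usable; };

struct toy_head {
	struct toy_path *paths;   /* sibling paths for one shared namespace */
	size_t nr_paths;
};

typedef struct toy_path *(*find_path_fn)(struct toy_head *head);

/* Driver-supplied policy: here simply the first usable sibling. */
static struct toy_path *first_usable_path(struct toy_head *head)
{
	for (size_t i = 0; i < head->nr_paths; i++)
		if (head->paths[i].usable)
			return &head->paths[i];
	return NULL;
}

/* Look up a path and resubmit the I/O on it (in the kernel this is where
 * the bio would be handed back, e.g. via generic_make_request_fast as the
 * mail describes), or park it for later if no path is usable right now. */
static void toy_make_request(struct toy_head *head, find_path_fn find_path,
			     const char *bio)
{
	struct toy_path *p = find_path(head);

	if (p)
		printf("resubmit %s via %s\n", bio, p->name);
	else
		printf("no usable path for %s, add it to the requeue_list\n", bio);
}

int main(void)
{
	struct toy_path paths[] = { { "nvme0n1", 0 }, { "nvme1n1", 1 } };
	struct toy_head head = { paths, 2 };

	toy_make_request(&head, first_usable_path, "bio0");
	return 0;
}

The indirect call through find_path_fn is exactly the extra fast-path
indirection the mail argues against.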

> > and failover logic

> The big thing about the failover logic is looking at the detailed error
> codes and changing the controller state, all of which is fundamentally
> driver specific. The only thing that could reasonably be common is
> the requeue_list handling as mentioned above. Note that this will
> become even more integrated with the nvme driver once ANA support
> lands.
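
As a rough illustration of that split (decide from the completion status
whether to fail over, park the request on a requeue list, or fail it upward),
here is a toy sketch; the status values are invented placeholders, not the
NVMe status codes the driver actually parses.

/* Toy model of the failover decision at completion time. */
#include <stdbool.h>
#include <stdio.h>

enum toy_status { STS_OK, STS_MEDIA_ERROR, STS_PATH_ERROR };

struct toy_req { const char *name; };

/* Requests that hit a path error get parked here and resubmitted once
 * another path (or the same one, after a reset) becomes usable. */
static struct toy_req *requeue_list[16];
static int requeue_count;

static bool status_needs_failover(enum toy_status sts)
{
	/* The real driver inspects the detailed NVMe status here, and may
	 * also change the controller state; that part is driver specific. */
	return sts == STS_PATH_ERROR;
}

static void complete_request(struct toy_req *rq, enum toy_status sts)
{
	if (status_needs_failover(sts)) {
		requeue_list[requeue_count++] = rq;  /* retry on another path */
		printf("%s: path error, queued for failover\n", rq->name);
	} else if (sts != STS_OK) {
		printf("%s: real I/O error, fail upwards\n", rq->name);
	} else {
		printf("%s: completed\n", rq->name);
	}
}

int main(void)
{
	struct toy_req a = { "req-a" }, b = { "req-b" };

	complete_request(&a, STS_PATH_ERROR);
	complete_request(&b, STS_MEDIA_ERROR);
	return 0;
}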

> > and maybe the path selection in the block layer
> > leaving just the construction of the path mappings in nvme?

> The point is that path selection fundamentally depends on your
> actual protocol. E.g. for NVMe we now look at the controller state,
> as that's what our keep-alive works on and what reflects that status
> in PCIe. We could try to communicate this up, but it would just lead
> to more data structures that get out of sync. And this will become
> much worse with ANA, where we have even more fine-grained state.
> In the end, path selection right now is less than 20 lines of code.
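
A toy version of such a controller-state-based pick might look like the
following; the state names are only approximations for illustration, not the
driver's actual enum, and the policy (first live controller) is an assumption.

/* Toy sketch: pick the first sibling whose controller looks usable. */
#include <stddef.h>
#include <stdio.h>

enum toy_ctrl_state { CTRL_NEW, CTRL_LIVE, CTRL_RESETTING, CTRL_DEAD };

struct toy_ctrl {
	const char *name;
	enum toy_ctrl_state state;  /* driven by keep-alives / PCIe status */
};

static struct toy_ctrl *toy_find_path(struct toy_ctrl *ctrls, size_t n)
{
	/* "Less than 20 lines": take the first controller that is live. */
	for (size_t i = 0; i < n; i++)
		if (ctrls[i].state == CTRL_LIVE)
			return &ctrls[i];
	return NULL;
}

int main(void)
{
	struct toy_ctrl ctrls[] = {
		{ "nvme0", CTRL_RESETTING },
		{ "nvme1", CTRL_LIVE },
	};
	struct toy_ctrl *p = toy_find_path(ctrls, 2);

	printf("selected %s\n", p ? p->name : "none");
	return 0;
}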

> Path selection is complicated when you want non-trivial path selector
> algorithms like round robin or service time, but I'd really avoid having
> those in nvme - we already have multiple queues to go beyond the limits
> of a single queue and use blk-mq for that, and once we're beyond the
> limits of a single transport path for performance reasons I'd much
> rather rely on ANA, NUMA nodes or static partitioning of namespaces
> than trying to make dynamic decisions in the fast path. But if I don't
> get what I want there and we end up needing more complicated algorithms,
> they absolutely should be in common code. In fact one of my earlier
> prototypes moved the dm-mpath path selector algorithms to core block code
> and tried to use them - it just turned out enabling them was pretty much
> pointless.
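
For comparison, a round-robin selector, the kind of policy argued above to
belong in common code if it is ever needed, could look roughly like this toy
sketch (again with made-up types, not dm-mpath's or nvme's actual code):

/* Toy round-robin path selector: rotate a cursor over usable paths. */
#include <stddef.h>
#include <stdio.h>

struct rr_path { const char *name; int live; };

struct rr_state {
	struct rr_path *paths;
	size_t nr;
	size_t next;              /* rotating cursor */
};

static struct rr_path *rr_select(struct rr_state *s)
{
	for (size_t tried = 0; tried < s->nr; tried++) {
		struct rr_path *p = &s->paths[s->next];

		s->next = (s->next + 1) % s->nr;
		if (p->live)
			return p;
	}
	return NULL;
}

int main(void)
{
	struct rr_path paths[] = { { "path0", 1 }, { "path1", 1 }, { "path2", 0 } };
	struct rr_state s = { paths, 3, 0 };

	for (int i = 0; i < 4; i++) {
		struct rr_path *p = rr_select(&s);

		printf("I/O %d -> %s\n", i, p ? p->name : "none");
	}
	return 0;
}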

> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
