From: Damien Le Moal
Subject: Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully
Date: Tue, 15 Sep 2020 04:21:54 +0000
Message-ID: 
References: <20200911215338.44805-1-snitzer@redhat.com> <20200911215338.44805-2-snitzer@redhat.com> <20200914150352.GC14410@redhat.com>
To: Mike Snitzer
Cc: Ming Lei, Vijayendra Suman, "dm-devel@redhat.com", "linux-block@vger.kernel.org"
List-Id: dm-devel.ids

On 2020/09/15 10:10, Damien Le Moal wrote:
> On 2020/09/15 0:04, Mike Snitzer wrote:
>> On Sun, Sep 13 2020 at 8:46pm -0400,
>> Damien Le Moal wrote:
>>
>>> On 2020/09/12 6:53, Mike Snitzer wrote:
>>>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
>>>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
>>>> those operations.
>>>>
>>>> Also, there is no need to avoid blk_max_size_offset() if
>>>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
>>>>
>>>> Signed-off-by: Mike Snitzer
>>>> ---
>>>>  include/linux/blkdev.h | 19 +++++++++++++------
>>>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>> index bb5636cc17b9..453a3d735d66 100644
>>>> --- a/include/linux/blkdev.h
>>>> +++ b/include/linux/blkdev.h
>>>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
>>>>  						  sector_t offset)
>>>>  {
>>>>  	struct request_queue *q = rq->q;
>>>> +	int op;
>>>> +	unsigned int max_sectors;
>>>>
>>>>  	if (blk_rq_is_passthrough(rq))
>>>>  		return q->limits.max_hw_sectors;
>>>>
>>>> -	if (!q->limits.chunk_sectors ||
>>>> -	    req_op(rq) == REQ_OP_DISCARD ||
>>>> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
>>>> -		return blk_queue_get_max_sectors(q, req_op(rq));
>>>> +	op = req_op(rq);
>>>> +	max_sectors = blk_queue_get_max_sectors(q, op);
>>>>
>>>> -	return min(blk_max_size_offset(q, offset),
>>>> -		   blk_queue_get_max_sectors(q, req_op(rq)));
>>>> +	switch (op) {
>>>> +	case REQ_OP_DISCARD:
>>>> +	case REQ_OP_SECURE_ERASE:
>>>> +	case REQ_OP_WRITE_SAME:
>>>> +	case REQ_OP_WRITE_ZEROES:
>>>> +		return max_sectors;
>>>> +	}
>>>
>>> Doesn't this break md devices ? (I think md does use chunk_sectors for the
>>> stride size, no ?)
>>>
>>> As mentioned in my reply to Ming's email, this will allow these commands to
>>> potentially cross over zone boundaries on zoned block devices, which would
>>> be an immediate command failure.
>>
>> Depending on the implementation, it is beneficial to get a large
>> discard (one not constrained by chunk_sectors, e.g. dm-stripe.c's
>> optimization for handling large discards and issuing N discards, one per
>> stripe). Same could apply for other commands.
>>
>> Like all devices, zoned devices should impose command-specific limits in
>> the queue_limits (and not lean on chunk_sectors to do a
>> one-size-fits-all).
>
> Yes, understood. But I think that in the case of md, chunk_sectors is used to
> indicate the boundary between drives for a raid volume. So it does indeed make
> sense to limit the IO size on submission since otherwise, the md driver itself
> would have to split that bio again anyway.
>
>> But that aside, yes, I agree I didn't pay close enough attention to the
>> implications of deferring the splitting of these commands until they
>> were issued to underlying storage. This chunk_sectors early splitting
>> override is a bit of a mess... not quite following the logic given we
>> were supposed to be waiting to split bios as late as possible.
>
> My view is that multipage bvecs (BIOs almost as large as we want) and late
> splitting are beneficial for getting larger effective BIOs sent to the
> device, as having more pages on hand allows bigger segments in the bio
> instead of always having at most PAGE_SIZE per segment. The effect of this is
> very visible with blktrace: a lot of requests end up being much larger than
> the device max_segments * page_size.
>
> However, if there is already a known limit on the BIO size when the BIO is
> being built, it does not make much sense to try to grow a bio beyond that
> limit since it will have to be split by the driver anyway. chunk_sectors is
> one such limit, used for md (I think) to indicate boundaries between drives
> of a raid volume. And we reuse it (abuse it ?) for zoned block devices to
> ensure that no command crosses over a zone boundary, since that triggers
> errors for writes within sequential zones or for reads/writes crossing over
> zones of different types (a conventional->sequential zone boundary).
>
> I may not have the entire picture correctly here, but so far, this is my
> understanding.

And I was wrong :) In light of Ming's comment + a little code refresher
reading, indeed, chunk_sectors will split BIOs so that *requests* do not
exceed that limit, but the initial BIO submission may be much larger
regardless of chunk_sectors.

Ming, I think the point here is that building a large BIO first and splitting
it later (as opposed to limiting the bio size by stopping bio_add_page()) is
more efficient, as there is only one bio submission instead of many, right ?

-- 
Damien Le Moal
Western Digital Research