From: Hannes Reinecke <hare@suse.de>
To: Mike Snitzer <snitzer@redhat.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>,
	linux-block@vger.kernel.org, lsf@lists.linux-foundation.org,
	device-mapper development <dm-devel@redhat.com>,
	hch@lst.de, linux-scsi <linux-scsi@vger.kernel.org>,
	axboe@kernel.dk, Ming Lei <ming.lei@canonical.com>
Subject: Re: bio-based DM multipath is back from the dead [was: Re: Notes from the four separate IO track sessions at LSF/MM]
Date: Fri, 27 May 2016 17:42:06 +0200	[thread overview]
Message-ID: <57486ACE.40707@suse.de> (raw)
In-Reply-To: <20160527144407.GA31394@redhat.com>

On 05/27/2016 04:44 PM, Mike Snitzer wrote:
> On Fri, May 27 2016 at  4:39am -0400,
> Hannes Reinecke <hare@suse.de> wrote:
>
[ .. ]
>> No, the real issue is load-balancing.
>> If you have several paths you have to schedule I/O across all paths,
>> _and_ you should be feeding these paths efficiently.
>
> <snip well known limitation of bio-based mpath load balancing, also
> detailed in the multipath paper I referenced>
>
> Right, as my patch header details, this is the only limitation that
> remains with the reinstated bio-based DM multipath.
>

:-)
And the very reason why we went into request-based multipathing in the 
first place...

>> I was sort-of hoping that with the large bio work from Shaohua we
>
> I think you mean Ming Lei and his multipage biovec work?
>
Errm. Yeah, of course. Apologies.

>> could build bios which would not require any merging, i.e. building
>> bios which would be assembled into a single request per bio.
>> Then the above problem wouldn't exist anymore and we _could_ do
>> scheduling at the bio level.
>> But from what I've gathered this is not always possible (e.g. for
>> btrfs with delayed allocation).
>
> I doubt many people are running btrfs over multipath in production
> but...
>
Hey. There is a company that does ...

> Taking a step back: reinstating bio-based DM multipath is _not_ at the
> expense of request-based DM multipath.  As you can see I've made it so
> that all modes (bio-based, request_fn rq-based, and blk-mq rq-based) are
> supported by a single DM multipath target.  When the transition to
> request-based happened it would've been wise to preserve bio-based but I
> digress...
>
> So, the point is: there isn't any one-size-fits-all DM multipath queue
> mode here.  If a storage config benefits from the request_fn IO
> schedulers (but isn't hurt by .request_fn's queue lock, so slower
> rotational storage?) then use queue_mode=2.  If the storage is connected
> to a large NUMA system and there is some reason to want to use blk-mq
> request_queue at the DM level: use queue_mode=3.  If the storage is
> _really_ fast and doesn't care about extra IO grooming (e.g. sorting and
> merging) then select bio-based using queue_mode=1.
>
> I collected some quick performance numbers against a null_blk device, on
> a single NUMA node system, with various DM layers on top -- the multipath
> runs are only with a single path... fio workload is just 10 sec randread:
>
Which is precisely the point.
Everything's nice and shiny with a single path, as then the above issue 
simply doesn't apply.
Things only start getting interesting if you have _several_ paths.
So the benchmarks only prove that device-mapper doesn't add too much 
overhead; they don't prove that the above point has been addressed...
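
To spell out what I mean by the load-balancing problem, here is a 
throwaway userspace sketch (nothing to do with the actual dm-mpath path 
selectors; all names are made up): with naive per-bio round-robin, 
consecutive bios of a sequential read land on different paths, so the 
queue below each path never sees them back-to-back and cannot merge them.

/* illustration only -- not dm-mpath code */
#include <stdio.h>

#define NR_PATHS 4

struct path {
	const char *name;
	unsigned int bios;
};

/* naive per-bio round-robin selector */
static struct path *choose_path(struct path *paths)
{
	static unsigned int next;

	return &paths[next++ % NR_PATHS];
}

int main(void)
{
	struct path paths[NR_PATHS] = {
		{ "path0" }, { "path1" }, { "path2" }, { "path3" },
	};

	/* eight consecutive 4k bios of one sequential read */
	for (unsigned int i = 0; i < 8; i++) {
		struct path *p = choose_path(paths);

		p->bios++;
		printf("bio %u (sector %u) -> %s\n", i, i * 8, p->name);
	}
	return 0;
}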

[ .. ]
>> Have you found another way of addressing this problem?
>
> No, bio sorting/merging really isn't a problem for DM multipath to
> solve.
>
> Though Jens did say (in the context of one of these dm-crypt bulk mode
> threads) that the block core _could_ grow some additional _minimalist_
> capability for bio merging:
> https://www.redhat.com/archives/dm-devel/2015-November/msg00130.html
>
> I'd like to understand a bit more about what Jens is thinking in that
> area because it could benefit DM thinp as well (though that is using bio
> sorting rather than merging, introduced via commit 67324ea188).
>
> I'm not opposed to any line of future development -- but development
> needs to be driven by observed limitations while testing on _real_
> hardware.
>
In the end, with Ming Lei's multipage bvec work we have essentially 
already moved some merging ability into the bios; during bio_add_page() 
the block layer will already merge contiguous pages into the bio.
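
Roughly like this (a throwaway userspace sketch of the idea, not the 
kernel's struct bio or bio_add_page; all types and names here are made 
up): when the page being added is physically contiguous with the last 
vector, that vector simply grows instead of a new segment being started.

/* illustration only -- simplified stand-ins for bio/bvec */
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SZ 4096u
#define MAX_VECS 16u

struct mini_bvec {
	unsigned long phys;	/* physical address of first byte */
	unsigned int len;	/* length in bytes */
};

struct mini_bio {
	struct mini_bvec vecs[MAX_VECS];
	unsigned int nr_vecs;
};

/* try to extend the last vector; otherwise start a new segment */
static bool mini_add_page(struct mini_bio *bio, unsigned long phys,
			  unsigned int len)
{
	if (bio->nr_vecs) {
		struct mini_bvec *bv = &bio->vecs[bio->nr_vecs - 1];

		if (bv->phys + bv->len == phys) {
			bv->len += len;		/* contiguous: merge */
			return true;
		}
	}
	if (bio->nr_vecs >= MAX_VECS)
		return false;			/* caller must submit and retry */
	bio->vecs[bio->nr_vecs].phys = phys;
	bio->vecs[bio->nr_vecs].len = len;
	bio->nr_vecs++;
	return true;
}

int main(void)
{
	struct mini_bio bio = { .nr_vecs = 0 };

	/* three physically contiguous pages collapse into one vector */
	mini_add_page(&bio, 0x100000, PAGE_SZ);
	mini_add_page(&bio, 0x101000, PAGE_SZ);
	mini_add_page(&bio, 0x102000, PAGE_SZ);
	printf("segments: %u, first length: %u\n", bio.nr_vecs, bio.vecs[0].len);
	return 0;
}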

(I'll probably be yelled at by hch for my ignorance here, but 
nevertheless)
From my POV there are several areas of 'merging' which currently happen:
a) bio merging: combine several consecutive bios into a larger one; 
should be largely addressed by Ming Lei's multipage bvec work.
b) bio sorting: reshuffle bios so that the requests on the request queue 
are ordered 'best' for the underlying hardware (i.e. the actual I/O 
scheduler). Not implemented for mq, and actually of questionable value 
for fast storage. One of the points I'll be testing in the very near 
future; ideally we'll find that it's not _that_ important (compared to 
the previous point), and then we could drop it altogether for mq.
c) clustering: coalescing several consecutive pages/bvecs into a single 
SG element. Obviously this can only happen if you have large enough 
requests, and the only gain is reducing the number of SG elements per 
request. Again of questionable value, as the request itself and the 
amount of data to transfer aren't changed. Another point of performance 
testing on my side (see the sketch below).
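
To illustrate c) (again a throwaway userspace sketch, not the actual SG 
mapping code; all names are made up): contiguous bvecs are folded into 
fewer SG elements at mapping time, but the amount of data in the 
request stays exactly the same.

/* illustration only -- simplified stand-in for SG mapping */
#include <stdio.h>

struct seg {
	unsigned long phys;
	unsigned int len;
};

/* map bvecs to SG elements, coalescing physically contiguous neighbours */
static unsigned int map_sg(const struct seg *bv, unsigned int nr,
			   struct seg *sg)
{
	unsigned int nsg = 0;

	for (unsigned int i = 0; i < nr; i++) {
		if (nsg && sg[nsg - 1].phys + sg[nsg - 1].len == bv[i].phys) {
			sg[nsg - 1].len += bv[i].len;	/* cluster */
			continue;
		}
		sg[nsg++] = bv[i];
	}
	return nsg;
}

int main(void)
{
	struct seg bvecs[] = {
		{ 0x100000, 4096 }, { 0x101000, 4096 },	/* contiguous pair */
		{ 0x200000, 4096 },			/* gap: new element */
	};
	struct seg sg[3];
	unsigned int nsg = map_sg(bvecs, 3, sg);

	printf("3 bvecs, 12288 bytes -> %u SG elements, still 12288 bytes\n", nsg);
	return 0;
}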

So ideally we will find that b) and c) contribute only a small amount 
to the overall performance; then we could easily drop them for MQ and 
concentrate on making bio merging work well.
Then it wouldn't really matter whether we were doing bio-based or 
request-based multipathing, as we'd have a 1:1 bio-to-request 
relationship, and this entire discussion could go away.

Well. Or that's the hope, at least.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

